NLP Stop Words as a Swift Array (or set)
If you’re doing work in Natural Language Processing (NLP), then you’ve probably come across the concept of stop words. Those are common words you’ll filter out before additional processing.
For example, let’s say you have a sentence such as
“Show me the dog.”
Remove the stop words “me” and “the” and what’s left is this:
“Show dog.”
You’ll find libraries and examples of stop word removal for Python. As of January 2023, though, there’s no list of stop words in Apple’s NaturalLanguage framework accessible to developers.
Stop words vary by application and by language. If you’re looking for a starter list of stop words in English, read on.
For Python programmers there’s the Natural Language Toolkit (NLTK) and a free online book called Natural Language Processing with Python. Nice! From there you’ll find the list of NLTK stop words on GitHub.
In the spirit of making Swift programming one quantum easier, here’s a copy-and-pasteable list of NLTK stop words as a Swift array:
let stopWords: [String] = ["i", "me", "my", "myself", "we", "our", "ours",
"ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him",
"his", "himself", "she", "her", "hers", "herself", "it", "its", "itself",
"they", "them", "their", "theirs", "themselves", "what", "which", "who",
"whom", "this", "that", "these", "those", "am", "is", "are", "was", "were",
"be", "been", "being", "have", "has", "had", "having", "do", "does", "did",
"doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until",
"while", "of", "at", "by", "for", "with", "about", "against", "between",
"into", "through", "during", "before", "after", "above", "below", "to",
"from", "up", "down", "in", "out", "on", "off", "over", "under", "again",
"further", "then", "once", "here", "there", "when", "where", "why", "how",
"all", "any", "both", "each", "few", "more", "most", "other", "some", "such",
"no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s",
"t", "can", "will", "just", "don", "should", "now"]
lowercased()
Be sure to convert your text using String.lowercased(). Swift string comparison is case-sensitive, so “The” will not match the stop word “the”. Every word in the list is lowercase, so lowercase all of your input words first.
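Here is a minimal sketch of why lowercasing matters. The helper name removeStopWords(from:using:), the abbreviated sampleStopWords list, and the rule of splitting on non-letter characters are all assumptions made for illustration; they are not part of NLTK or of any Apple framework.

```swift
// Abbreviated list for illustration; use the full stopWords array in practice.
let sampleStopWords: [String] = ["me", "the", "a", "an"]

// Hypothetical helper: lowercase the text, split it on non-letters,
// then drop every word found in the stop word list.
func removeStopWords(from text: String, using stopWords: [String]) -> [String] {
    let words = text.lowercased()
        .split(whereSeparator: { !$0.isLetter })
        .map(String.init)
    return words.filter { !stopWords.contains($0) }
}

let kept = removeStopWords(from: "Show me The dog.", using: sampleStopWords)
print(kept) // ["show", "dog"]

// Without lowercasing, the capitalized "The" slips through.
let raw = ["Show", "me", "The", "dog"].filter { !sampleStopWords.contains($0) }
print(raw) // ["Show", "The", "dog"]
```

Splitting on non-letters also turns “don't” into “don” and “t”, which is probably why both fragments appear in the NLTK list.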
Set<String>
You may want to implement stop words as a Set<String>. Checking whether a set contains an item, or removing an item from it, takes O(1) time on average, whereas Array.contains(_:) has to scan the array.
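As a rough sketch of that idea, you can keep your words in an array and use the set only for membership checks. The stop set below is abbreviated and the sample words are made up for illustration.

```swift
// Abbreviated stop set for illustration; build yours with Set(stopWords).
let stopSet: Set<String> = ["me", "the", "and"]

// Words already lowercased and split, in their original order.
let words = ["show", "me", "the", "dog", "and", "show", "me", "the", "cat"]

// Each contains(_:) call is a hash lookup rather than an array scan.
let filtered = words.filter { !stopSet.contains($0) }
print(filtered) // ["show", "dog", "show", "cat"]
```

Because the words stay in an array, this approach keeps both word order and repeated words.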
For example, the subtract(_:) method allows for quick removal of one set from another set. If the words in your source text are a Set<String>, you can subtract a Set<String> of the stop words.
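A small sketch of set subtraction, using made-up sample sets. subtract(_:) mutates the set in place, while subtracting(_:) returns a new set and leaves the original alone.

```swift
// Abbreviated stop set for illustration.
let stopSet: Set<String> = ["me", "the"]

// subtract(_:) mutates, so the word set must be declared with var.
var wordSet: Set<String> = ["show", "me", "the", "dog"]
wordSet.subtract(stopSet)

// A Set has no defined order, so sort before printing for stable output.
print(wordSet.sorted()) // ["dog", "show"]

// subtracting(_:) is the non-mutating variant.
let original: Set<String> = ["show", "me", "the", "dog"]
let remaining = original.subtracting(stopSet)
print(remaining.sorted()) // ["dog", "show"]
```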
Set<String> does not maintain order of elements. If you want the functionality of sets, but wish to maintain word order, then there are some handy NS types.
NSOrderedSet and NSMutableOrderedSet
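To retain the order of words in text after removing stop words, you might use NSOrderedSet and NSMutableOrderedSet. Keep in mind that these are still sets: a word that appears more than once in the text is kept only at its first position.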
After parsing the original text such as “Show me the dog” into an array of individual words, initialize an NSMutableOrderedSet with the array of words. Then use the method minus(_ other: NSOrderedSet) to remove stop words from your text. Since the NS sets can contain any object, and not just strings, you’ll need code to create a String array and/or String of the text without stop words.
//... parse text into an array of words, [String]
let stopSet = NSOrderedSet(array: stopWords)
let wordSet = NSMutableOrderedSet(array: words)
wordSet.minus(stopSet)
//now wordSet no longer contains stop words
//convert back to text
let arr = wordSet.array.compactMap { $0 as? String }
return arr.joined(separator: " ")
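Although wordSet is declared with let, suggesting it’s immutable, you can call minus(_:) and modify the contents. NSMutableOrderedSet is a class, so let only fixes which object wordSet refers to, not the contents of that object.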
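For a version you can paste and run, here is one way the fragment above might be wrapped in a function. The function name removingStopWords(from:stopWords:) and the parsing step (lowercase, then split on non-letters) are assumptions for illustration; substitute whatever tokenizing you already use.

```swift
import Foundation

// Hypothetical wrapper around the NSMutableOrderedSet steps.
func removingStopWords(from text: String, stopWords: [String]) -> String {
    // Assumed parsing rule: lowercase, then split on non-letter characters.
    let words = text.lowercased()
        .split(whereSeparator: { !$0.isLetter })
        .map(String.init)

    let stopSet = NSOrderedSet(array: stopWords)
    let wordSet = NSMutableOrderedSet(array: words)
    wordSet.minus(stopSet) // removes stop words, keeps the remaining order

    // The NS set holds Any, so cast each element back to String.
    let kept = wordSet.array.compactMap { $0 as? String }
    return kept.joined(separator: " ")
}

print(removingStopWords(from: "Show me the dog.", stopWords: ["me", "the"]))
// show dog

// Repeated words collapse to their first occurrence.
print(removingStopWords(from: "The dog saw the other dog", stopWords: ["the"]))
// dog saw other
```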
Quick Method to Load a File
Here’s my quick method to create the stop words array.
- Copy and paste the NLTK word list into a text file
- In an Xcode playground, add the text file to Resources
- Implement a function to read the text file
- Print the array
- Copy and paste the printed array output, and assign it to a let constant
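import Foundation

// Reads the first .txt resource in the playground's Resources folder
// and returns its non-empty lines as an array of stop words.
public func readStopWords() -> [String] {
    var filenames = [String]()
    let types = ["txt"]
    for type in types {
        filenames.append(contentsOf: Bundle.main.paths(forResourcesOfType: type, inDirectory: nil))
    }
    // Return an empty array instead of crashing if the file is missing or unreadable.
    // Assumes the file is UTF-8 encoded.
    guard let filename = filenames.first,
          let contents = try? String(contentsOfFile: filename, encoding: .utf8) else {
        return []
    }
    // Split on any newline character and drop empty lines.
    return contents.components(separatedBy: .newlines).filter { !$0.isEmpty }
}

let stopWords = readStopWords()
print(stopWords)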
The for type in types loop above is unnecessary. It’s something I copied & pasted from other code I use to read image resources of various types from the Resources folder of the iOS playground. I’m leaving the code there in case I want to read and combine multiple files of stop words, possibly saved in different formats such as .txt, .text, .csv, and so on.
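If you do end up combining several files, the parsing step might look something like the sketch below. Both function names are hypothetical, the strings stand in for file contents you would read yourself, and the sketch assumes each stop word is a single token separated by whitespace, newlines, or commas.

```swift
// Hypothetical helper: split file contents on whitespace and commas,
// so one-word-per-line and comma-separated files both parse.
func parseStopWords(_ contents: String) -> [String] {
    return contents
        .split(whereSeparator: { $0.isWhitespace || $0 == "," })
        .map { $0.lowercased() }
}

// Hypothetical helper: merge the words from several files, dropping duplicates.
func mergeStopWords(_ fileContents: [String]) -> Set<String> {
    var merged = Set<String>()
    for contents in fileContents {
        merged.formUnion(parseStopWords(contents))
    }
    return merged
}

// Stand-ins for the contents of a .txt file and a .csv file.
let txt = "i\nme\nmy\n"
let csv = "me, the, a"

let merged = mergeStopWords([txt, csv])
print(merged.sorted()) // ["a", "i", "me", "my", "the"]
```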