Swift String extensions tokenize() and titleCase()
If you’re working with String variables in Swift, and especially if you’re working with Natural Language Processing (NLP) using the Natural Language framework, you’ll end up splitting sentences into individual words.
If you’ve been splitting your String variables by whitespace and punctuation, this post is for you!
Maybe your chatbot needs to split user input text for comparison to keywords such as “account” or “help.” Perhaps you’re creating a concordance of words from the Discworld series by Terry Pratchett. Or maybe whenever someone says the word “Simon” to your app, your app meows. I’m not here to judge.
Credit goes to developers whose code I found while googling for improvements to my own code. The code below includes links to the sources.
String.tokenize() : split a sentence into words
Søren L Kristiansen wrote a 2015 post, “Three Ways to Enumerate the Words In a String Using Swift,” that explains a key point clearly:
Do Not Separate by Whitespace and Punctuation
Splitting by whitespace and punctuation is exactly what I’ve been doing for an initial implementation, and I was glad to stumble onto Søren’s post.
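To see why that advice matters, here is a minimal sketch of the naive approach. The naiveWords() helper is a hypothetical name used only for this illustration; it is not from Søren’s post.

```swift
import Foundation

/// Hypothetical helper: the naive approach of splitting on whitespace and punctuation.
func naiveWords(_ text: String) -> [String] {
    let separators = CharacterSet.whitespacesAndNewlines.union(.punctuationCharacters)
    return text
        .components(separatedBy: separators)
        .filter { !$0.isEmpty }   // adjacent separators leave empty strings behind
}

let naive = naiveWords("Don't stop, it's on its way.")
print(naive)
// ["Don", "t", "stop", "it", "s", "on", "its", "way"]
```

The apostrophe counts as punctuation, so the contractions are cut into fragments like "Don" and "t" instead of staying whole.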
For Søren’s code to run in Xcode 14.2 and Swift 5.7, a few minor tweaks were necessary:
// Swift 5.7 updates to code from https://medium.com/@sorenlind/three-ways-to-enumerate-the-words-in-a-string-using-swift-7da5504f0062
import Foundation

extension String
{
    /// Split the string into tokens, such as splitting a sentence into words
    public func tokenize() -> [String] {
        // CFString ranges are measured in UTF-16 code units
        let inputRange = CFRangeMake(0, self.utf16.count)
        let flag = UInt(kCFStringTokenizerUnitWord)
        let locale = CFLocaleCopyCurrent()
        let cf = self as CFString
        let tokenizer = CFStringTokenizerCreate(kCFAllocatorDefault, cf, inputRange, flag, locale)
        var tokenType = CFStringTokenizerAdvanceToNextToken(tokenizer)
        var tokens : [String] = []
        // An empty token type means there are no more tokens
        while !tokenType.isEmpty
        {
            let currentTokenRange = CFStringTokenizerGetCurrentTokenRange(tokenizer)
            let substring = self.substringWithRange(aRange: currentTokenRange)
            tokens.append(substring)
            tokenType = CFStringTokenizerAdvanceToNextToken(tokenizer)
        }
        return tokens
    }

    /// Called from tokenize().
    private func substringWithRange(aRange : CFRange) -> String {
        let nsrange = NSMakeRange(aRange.location, aRange.length)
        let substring = (self as NSString).substring(with: nsrange)
        return substring
    }
}
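One detail in tokenize() is worth a quick check: the input range is built from utf16.count rather than count. CFString and NSString measure length in UTF-16 code units, while a Swift String counts Characters, and the two can differ. A small sketch with an arbitrary sample string:

```swift
let text = "👍 ok"

print(text.count)        // 4 Characters
print(text.utf16.count)  // 5 UTF-16 code units, because the emoji takes two
```

Using count here would give a range one unit too short for the string as CFString sees it.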
The tokenize() function is scoped as public. Having a public access level makes working in an Xcode playground a little tidier: create a new file in the Sources folder, copy in a function, and make the function public. Then you can call the function from the playground.
Here’s a quick playground test of tokenize() with a combination of real words, words with apostrophes, incorrect capitalization, and so on.
import Foundation
var texts = [String]()
texts.append("Hello, playground, ol' buddy!")
texts.append("Nelly book? scramozzle scHlump boo: ipseum gypsum lorem est.")
texts.append("Name: name, Aftername: surname")
texts.append("Don't stop thinking about tomorrow, it's on its way. S'mores for the taking?!?")
for text in texts {
    let tokens = text.tokenize()
    print()
    print(text)
    print(tokens)
}
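In the output, note that the apostrophes appear escaped, as in “Don\'t”. The backslash is not part of the token: printing an array shows each string in its debug form, which escapes single quotes. (QWERTY keyboards typically have just one key for what can be two different characters, the apostrophe and the single quote.) Note also that the trailing apostrophe in “ol'” is dropped, leaving “ol”.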
Hello, playground, ol' buddy!
["Hello", "playground", "ol", "buddy"]
Nelly book? scramozzle scHlump boo: ipseum gypsum lorem est.
["Nelly", "book", "scramozzle", "scHlump", "boo", "ipseum", "gypsum", "lorem", "est"]
Name: name, Aftername: surname
["Name", "name", "Aftername", "surname"]
Don't stop thinking about tomorrow, it's on its way. S'mores for the taking?!?
["Don\'t", "stop", "thinking", "about", "tomorrow", "it\'s", "on", "its", "way", "S\'mores", "for", "the", "taking"]
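A quick way to confirm that the backslashes in the printed array are only a display artifact is to print the strings themselves. This is a small sketch using a hard-coded array in place of real tokenize() output:

```swift
let tokens = ["Don't", "it's"]
let backslash: Character = "\\"

print(tokens)                          // array printing shows each element's debug form
print(tokens.joined(separator: " "))   // prints the strings themselves: Don't it's

print(tokens[0].contains(backslash))   // false
print(tokens[0].count)                 // 5
```

The token is five characters long and holds no backslash, so it can be compared directly against keywords.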
String.titleCase() : convert from camelCase to Title Case
You may have variable names in camelCase that you’d like to present in Title Case, with the words split and capitalized.
import Foundation
extension String {
    /// Changes camel case to title case.
    /// https://stackoverflow.com/questions/41292671/separating-camelcase-string-into-space-separated-words-in-swift
    /// For case names:
    /// https://danielmiessler.com/blog/a-list-of-different-case-types/
    /// https://winnercrespo.com/naming-conventions/
    public func titleCase() -> String {
        return self
            .replacingOccurrences(of: "([A-Z])",
                                  with: " $1",
                                  options: .regularExpression,
                                  range: range(of: self))
            .trimmingCharacters(in: .whitespacesAndNewlines)
            .capitalized // If input is in llamaCase
    }
}
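To see what each step in titleCase() contributes, here is a small sketch that runs the regular expression on its own. The spaceBeforeCapitals() helper is a hypothetical name for this illustration only.

```swift
import Foundation

/// Hypothetical helper: only the regex step of titleCase().
func spaceBeforeCapitals(_ s: String) -> String {
    s.replacingOccurrences(of: "([A-Z])", with: " $1", options: .regularExpression)
}

// Brackets make leading whitespace visible.
print("[\(spaceBeforeCapitals("PascalCase"))]")    // [ Pascal Case]
print("[\(spaceBeforeCapitals("variableName"))]")  // [variable Name]
print("variable Name".capitalized)                 // Variable Name
```

Input that starts with a capital picks up a leading space, which is why the trim is there, and input that starts lowercase needs .capitalized to fix the first word.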
The playground test code includes an enum with a static names() function. The names() function generates a list of properly capitalized names of Flipmode Squad members.
import Foundation
// camelCased texts
var texts = [String]()
texts.append("variableName")
texts.append("camelCaseChangeMe")
texts.append("mistYped")
for text in texts {
    let t = text.titleCase()
    print(t)
}
print()
// enum with camelCased names
enum FlipmodeSquad : String, CaseIterable {
    case babySham
    case bustaRhymes
    case lordHaveMercy
    case rahDigga
    case spliffStar

    static func names() -> [String] {
        FlipmodeSquad.allCases.map { $0.rawValue.titleCase() }
    }
}
let fs = FlipmodeSquad.names()
for member in fs {
    print(member)
}
Output
Note that “mistYped,” an intentionally mistyped word with the “y” capitalized, will be split into two words.
Variable Name
Camel Case Change Me
Mist Yped
Baby Sham
Busta Rhymes
Lord Have Mercy
Rah Digga
Spliff Star
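The same rule that splits “mistYped” also splits acronyms: since every capital letter gets its own space, “parseURLString” would come out as “Parse U R L String”. If that matters for your names, one possible variant is sketched below. The function name titleCaseKeepingAcronyms() and both regular expressions are assumptions for illustration, not part of the Stack Overflow answer linked above.

```swift
import Foundation

extension String {
    /// Hypothetical variant of titleCase() that keeps runs of capitals together.
    func titleCaseKeepingAcronyms() -> String {
        let spaced = self
            // Break between an acronym and a following capitalized word: "URLString" -> "URL String"
            .replacingOccurrences(of: "([A-Z]+)([A-Z][a-z])", with: "$1 $2", options: .regularExpression)
            // Break between a lowercase letter or digit and a capital: "parseURL" -> "parse URL"
            .replacingOccurrences(of: "([a-z0-9])([A-Z])", with: "$1 $2", options: .regularExpression)
        // Uppercase only the first character, so "URL" is not lowercased to "Url"
        return spaced.prefix(1).uppercased() + String(spaced.dropFirst())
    }
}

print("parseURLString".titleCaseKeepingAcronyms())  // Parse URL String
print("variableName".titleCaseKeepingAcronyms())    // Variable Name
```

This sketch skips .capitalized on purpose, because .capitalized lowercases everything after the first letter of each word. A mistyped name such as “mistYped” still splits into “Mist Yped”.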
Should the titleCase() function be written as TitleCase()? Hmm.