Swift String extensions tokenize() and titleCase()

Gary Bartos
4 min read · Feb 4, 2023

If you’re working with String variables in Swift, and especially if you’re working with Natural Language Processing (NLP) using the Natural Language framework, you’ll end up splitting sentences into individual words.

If you’ve been splitting your String variables by whitespace and punctuation, this post is for you!

Maybe your chatbot needs to split user input text for comparison to keywords such as “account” or “help.” Perhaps you’re creating a concordance of words from the Discworld series by Terry Pratchett. Or maybe whenever someone says the word “Simon” to your app, your app meows. I’m not here to judge.

“There were other dragons — gold, silver, black, …” and so on. I’m not going to spoil the book if you haven’t read it. Which I recommend. And now Penguin UK has new audio books of the series with voice actors like Bill Nighy and Andy Serkis. Nice!
Quote from The Colour of Magic by Terry Pratchett

Credit goes to the developers whose code I found while googling for improvements to my own code. The code below includes links to the sources.

String.tokenize() : split a sentence into words

Søren L Kristiansen wrote a 2015 post, “Three Ways to Enumerate the Words In a String Using Swift,” that explains a key point clearly:

Do Not Separate by Whitespace and Punctuation

Splitting by whitespace and punctuation is exactly what I’ve been doing for an initial implementation, and I was glad to stumble onto Søren’s post.
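To see why that advice matters, here is a minimal sketch of the naive approach, using only Foundation. The sample sentence is borrowed from the playground test later in this post; the variable names are just for illustration. Because the apostrophe counts as punctuation, contractions get chopped in two.

```swift
import Foundation

let text = "Don't stop thinking about tomorrow, it's on its way."

// Naive approach: treat every whitespace or punctuation character as a separator.
let separators = CharacterSet.whitespacesAndNewlines.union(.punctuationCharacters)
let naiveWords = text
    .components(separatedBy: separators)
    .filter { !$0.isEmpty }   // adjacent separators leave empty strings behind

print(naiveWords)
// "Don't" comes back as "Don" and "t", and "it's" as "it" and "s".
```

The tokenizer-based output further down keeps "Don't" and "it's" as single tokens, which is usually what you want for keyword matching.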

For Søren’s code to run in Xcode 14.2 and Swift 5.7, a few minor tweaks were necessary:

import Foundation

// Swift 5.7 updates to code from https://medium.com/@sorenlind/three-ways-to-enumerate-the-words-in-a-string-using-swift-7da5504f0062
extension String
{
    /// Split the string into tokens, such as splitting a sentence into words
    public func tokenize() -> [String] {
        // CFString ranges are measured in UTF-16 code units
        let inputRange = CFRangeMake(0, self.utf16.count)
        let flag = UInt(kCFStringTokenizerUnitWord)
        let locale = CFLocaleCopyCurrent()

        let cf = self as CFString

        let tokenizer = CFStringTokenizerCreate(kCFAllocatorDefault, cf, inputRange, flag, locale)
        var tokenType = CFStringTokenizerAdvanceToNextToken(tokenizer)
        var tokens : [String] = []

        // An empty token type means there are no more tokens
        while !tokenType.isEmpty
        {
            let currentTokenRange = CFStringTokenizerGetCurrentTokenRange(tokenizer)
            let substring = self.substringWithRange(aRange: currentTokenRange)
            tokens.append(substring)
            tokenType = CFStringTokenizerAdvanceToNextToken(tokenizer)
        }

        return tokens
    }

    /// Called from tokenize().
    private func substringWithRange(aRange : CFRange) -> String {
        let nsrange = NSMakeRange(aRange.location, aRange.length)
        let substring = (self as NSString).substring(with: nsrange)
        return substring
    }
}

The tokenize() function is scoped as public. Having a public access level makes working in an Xcode playground a little tidier: create a new file in the Sources folder, copy in a function, and make the function public. Then you can call the function from the playground.
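For comparison with the Core Foundation approach above, Foundation also offers enumerateSubstrings(in:options:_:) with the .byWords option. The sketch below is not from Søren's post or from my own project code; the function name wordsByEnumeration() is made up for illustration, and results on unusual input may differ from tokenize().

```swift
import Foundation

extension String {
    /// Hypothetical alternative to tokenize(), shown only for comparison.
    /// Collects the substrings that Foundation reports as words.
    public func wordsByEnumeration() -> [String] {
        var words: [String] = []
        enumerateSubstrings(in: startIndex..<endIndex, options: .byWords) { substring, _, _, _ in
            // substring is optional; it is nil only when substrings are not requested
            if let word = substring {
                words.append(word)
            }
        }
        return words
    }
}

print("Hello, playground!".wordsByEnumeration())
```

This version avoids the CFRange to NSRange conversion, at the cost of a closure-based API. Either way, the point stands: let the system find word boundaries instead of splitting on separators yourself.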

Here’s a quick playground test of tokenize() with a combination of real words, words with apostrophes, incorrect capitalization, and so on.

import Foundation

var texts = [String]()

texts.append("Hello, playground, ol' buddy!")
texts.append("Nelly book? scramozzle scHlump boo: ipseum gypsum lorem est.")
texts.append("Name: name, Aftername: surname")
texts.append("Don't stop thinking about tomorrow, it's on its way. S'mores for the taking?!?")

for text in texts {
    let tokens = text.tokenize()

    print()
    print(text)
    print(tokens)
}

In the output, note how the single quote / apostrophe is escaped as \' inside the printed arrays. (QWERTY keyboards typically have just one key for what can be two different characters.)

Hello, playground, ol' buddy!
["Hello", "playground", "ol", "buddy"]

Nelly book? scramozzle scHlump boo: ipseum gypsum lorem est.
["Nelly", "book", "scramozzle", "scHlump", "boo", "ipseum", "gypsum", "lorem", "est"]

Name: name, Aftername: surname
["Name", "name", "Aftername", "surname"]

Don't stop thinking about tomorrow, it's on its way. S'mores for the taking?!?
["Don\'t", "stop", "thinking", "about", "tomorrow", "it\'s", "on", "its", "way", "S\'mores", "for", "the", "taking"]
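The backslash in the output above is a printing detail rather than part of the token. Printing an array shows each element in its debug form, while printing a single string shows it as-is. A small sketch, with a hand-written token list standing in for the result of tokenize():

```swift
// Hand-written stand-in for tokens returned by tokenize()
let tokens = ["Don't", "it's", "S'mores"]

// Printing the whole array uses each element's debug description,
// which is where the escaped apostrophe in the output above comes from.
print(tokens)

// Printing elements one at a time uses the plain description.
for token in tokens {
    print(token)
}

// The stored strings contain no backslash at all.
print(tokens.contains { $0.contains("\\") })   // prints "false"
```

So there is no need to strip backslashes before comparing tokens to keywords.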

String.titleCase() : convert from camelCase to Title Case

You may have variable names in camelCase that you’d like to present in Title Case, with the words split and capitalized.

import Foundation

extension String {
    /// Changes camel case to title case.
    /// https://stackoverflow.com/questions/41292671/separating-camelcase-string-into-space-separated-words-in-swift
    /// For case names:
    /// https://danielmiessler.com/blog/a-list-of-different-case-types/
    /// https://winnercrespo.com/naming-conventions/
    public func titleCase() -> String {
        return self
            .replacingOccurrences(of: "([A-Z])",
                                  with: " $1",
                                  options: .regularExpression,
                                  range: range(of: self))
            .trimmingCharacters(in: .whitespacesAndNewlines)
            .capitalized // If input is in llamaCase
    }
}

Playground test code includes an enum with a static names() function. The names() function generates a list of properly capitalized names of Flipmode Squad members.

import Foundation

// camelCased texts
var texts = [String]()

texts.append("variableName")
texts.append("camelCaseChangeMe")
texts.append("mistYped")


for text in texts {
    let t = text.titleCase()
    print(t)
}

print()

// enum with camelCased names
enum FlipmodeSquad : String, CaseIterable {
    case babySham
    case bustaRhymes
    case lordHaveMercy
    case rahDigga
    case spliffStar

    static func names() -> [String] {
        FlipmodeSquad.allCases.map { $0.rawValue.titleCase() }
    }
}

let fs = FlipmodeSquad.names()

for member in fs {
    print(member)
}

Output

Note that “mistYped,” an intentionally mistyped word with the “y” capitalized, will be split into two words.

Variable Name
Camel Case Change Me
Mist Yped

Baby Sham
Busta Rhymes
Lord Have Mercy
Rah Digga
Spliff Star
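One more thing to watch for: because the pattern puts a space before every capital letter, a name with an acronym such as "userID" comes out of titleCase() as "User I D". If that matters for your names, here is a hedged sketch of a variant that keeps runs of capitals together. The name titleCaseKeepingAcronyms() and both regular expressions are my own illustration, not from the linked Stack Overflow answer.

```swift
import Foundation

extension String {
    /// Hypothetical variant of titleCase() that keeps runs of capitals together.
    public func titleCaseKeepingAcronyms() -> String {
        let spaced = self
            // "parseURLString" -> "parseURL String": split an acronym from the word after it
            .replacingOccurrences(of: "([A-Z]+)([A-Z][a-z])",
                                  with: "$1 $2",
                                  options: .regularExpression)
            // "parseURL String" -> "parse URL String": split a lowercase letter or digit from a capital
            .replacingOccurrences(of: "([a-z0-9])([A-Z])",
                                  with: "$1 $2",
                                  options: .regularExpression)
        // Uppercase only the first character; .capitalized would turn "URL" into "Url".
        return spaced.prefix(1).uppercased() + String(spaced.dropFirst())
    }
}

print("userID".titleCaseKeepingAcronyms())          // prints "User ID"
print("parseURLString".titleCaseKeepingAcronyms())  // prints "Parse URL String"
print("variableName".titleCaseKeepingAcronyms())    // prints "Variable Name"
```

The trade-off is that this variant no longer lowercases stray capitals inside a word, so it only capitalizes the first character of the whole string and trusts the rest of the input.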

Should the titleCase() function be written as TitleCase()? Hmm.

Gary Bartos

Founder of Echobatix, developing assistive technology for the blind. echobatix@gmail.com