Tokenization, the process of grouping text into meaningful chunks like words, is an essential step in natural language processing systems. It makes tasks like classification and clustering much easier, yet it can feel frustratingly arbitrary compared to other parts of the pipeline. Traditional tokenization strategies are particularly problematic in consumer domains, where misspellings, rare words, and multilingual sources make effective tokenization one of the hardest parts of the pipeline to optimize.

Subword tokenization is a recent strategy from machine translation that helps solve these problems by breaking unknown words into “subword units” - strings of characters like ing or eau - that still allow the downstream model to make intelligent decisions on words it doesn’t recognize. In this blog post, we’ll look at subword tokenization and a few common strategies for it, then explore connections with other text analysis techniques and potential ways to improve existing tokenization algorithms.

[Figure: Basic tokenization diagram]

In the abstract, tokenization is all about grouping items in sequential data to make a downstream model’s job easier. For text, the sequence items are characters, but tokenization is useful for all kinds of sequential data. This grouping is valuable in part because it reduces inter-token reference length (the distance between tokens that refer to each other), so models don’t have to remember information from quite so far back in the sequence.

Also valuable is how grouping moves our data into a sparser, more semantically rich space. Individual characters don’t mean much on their own: most words contain an e, so the presence of an e tells us nothing specific. Group characters into words, however, and each token becomes far more meaningful: president, vote, and escape are just a few e-containing words whose meaning becomes explicit after word tokenization. The trade-off is that grouping characters into words dramatically increases the number of dimensions our model has to understand, which can be hard to accommodate when dealing with consumer text data.
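
To make that trade-off concrete, here’s a tiny sketch in plain Python (the example sentence is arbitrary) comparing the two representations: characters give a long sequence over a tiny vocabulary, while words give a short sequence over a much larger, open-ended vocabulary.

```python
import re

text = "The president called for a vote before the escape."

# Character tokenization: long sequence, tiny closed vocabulary.
char_tokens = list(text)

# Word tokenization: much shorter sequence, but an open-ended vocabulary.
word_tokens = re.findall(r"\w+", text.lower())

print(len(char_tokens), len(set(char_tokens)))  # many tokens, few unique symbols
print(len(word_tokens), len(set(word_tokens)))  # few tokens, each more meaningful
```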

The challenge is that many different tokenization strategies exist, and most rely on heuristics or knowledge about specific languages that don’t transfer to others. These strategies also rarely accommodate mistakes in the data, which are common in all but the cleanest text datasets. Tokenization choices have significant impacts on downstream model performance, and can have unforeseen negative effects when normalization, like spelling correction and lowercasing, is too aggressive.

[Figure: Subword tokens diagram]

Subword tokenization is a strategy that embraces the messiness in the data. When a user types something in all caps, that carries information about how they were feeling. When they emphasize words by misspelling them or adding extra letters, that gives you a more nuanced view of the sentiment they intended. Subword tokenization learns its groupings from the data itself and generally preserves this nuance, so downstream models get the more meaningful space provided by word tokenization while still being able to handle unknown words and misspellings by breaking them into subword tokens.

The simplest subword tokenization strategy is character tokenization: taking the text one character at a time without doing any grouping. While extreme, this has proven to be a workable strategy, popularized by Karpathy in his character-level language model and later used by OpenAI in their mLSTM-based model, which achieved state-of-the-art sentiment classification performance using language model pretraining and fine-tuning on very little in-domain data.
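
In code, character tokenization amounts to little more than splitting a string and mapping each character to an integer id. Here’s a minimal sketch (the variable names and the unknown-id convention are just illustrative):

```python
text_corpus = "the service was ecxellent and veeeerrry fast"

# The "vocabulary" is simply every character seen in the training data,
# with id 0 reserved for anything unseen at inference time.
UNK = 0
vocab = {ch: i for i, ch in enumerate(sorted(set(text_corpus)), start=1)}

def char_tokenize(text):
    """Map a string to one integer id per character."""
    return [vocab.get(ch, UNK) for ch in text]

print(char_tokenize("very good"))  # 'g' and 'o' fall back to UNK here
```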

Unfortunately, this simple strategy has a serious drawback: it shifts the burden of grouping characters onto the neural net, which then needs significantly more parameters to perform the conceptual grouping internally instead of being handed the groups up front. This makes both training and inference much more expensive. It is easy to see, though, how the character-tokenized mLSTM was able to achieve such impressive results, since it handles misspellings like veeeerrry and ecxellent just like properly spelled words. Still, character tokenization is rarely used in modern neural nets for tasks like machine translation or text classification, since higher performance can generally be achieved with other strategies.

Byte Pair Encoding

Byte Pair Encoding (BPE) is a very common subword tokenization technique, as it strikes a good balance between performance and tokenization quality. Originally a data compression algorithm, it works by repeatedly counting the most frequent adjacent pairs of symbols in the data and merging them into new entries in a learned vocabulary, which are then used to replace those strings when encoding. While it’s not the most effective compression scheme, it is simple and effective for preprocessing data for neural networks, and has had particular success in machine translation, where unrecognized words are especially problematic.
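
Here’s a minimal sketch of the learning phase, loosely in the style of the original subword-NMT formulation (the function and variable names are mine, not from any particular library): repeatedly count adjacent symbol pairs across the corpus and merge the most frequent pair into a new vocabulary entry.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merges from a {word: frequency} dict.

    Each word starts as a tuple of characters; every iteration merges
    the most frequent adjacent symbol pair across the whole corpus.
    """
    corpus = {tuple(word): freq for word, freq in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with its merged symbol.
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] = freq
        corpus = new_corpus
    return merges

merges = learn_bpe({"lower": 5, "lowest": 3, "newer": 6, "wider": 2}, num_merges=10)
print(merges)  # e.g. starts with ('w', 'e'), then ('we', 'r'), ...
```

Encoding then replaces pairs in new text using these merges in the order they were learned, so each merge both grows the vocabulary and shortens the encoded sequence.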

While naive on its face, BPE learns complete words and phrases given enough data, even for non-whitespace-separated languages like Japanese, which are traditionally very difficult to tokenize well. While it can improve the performance of NLP systems processing messy data, it has a shortcoming related to its eager encoding: since BPE takes the longest recognized string of characters at each step, it can at times produce less-than-ideal tokenizations, and the tokenization of a specific word or phrase can differ depending on where it lies in a sentence. Its relative effectiveness means it is the chosen tokenization strategy for many modern language models, including GPT-2 and, in the closely related WordPiece form, BERT and its derivatives.
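
To illustrate that location-dependence, here’s a toy greedy encoder (a simplification: real BPE implementations apply their learned merge list in order rather than doing raw longest-match, but the eager behavior and its side effects are similar). The vocabulary is made up for the example.

```python
def greedy_encode(text, vocab, max_piece_len=10):
    """Eagerly take the longest known piece at each position,
    falling back to a single character when nothing matches."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_piece_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

# Toy vocabulary where spaces are part of the learned pieces,
# as in byte-level BPE vocabularies.
vocab = {"the", " the", " cat", " saw", " dog"}
print(greedy_encode("the cat saw the dog", vocab))
# ['the', ' cat', ' saw', ' the', ' dog'] -- the same word "the" maps to
# different vocabulary entries depending on where it sits in the sentence.
```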

SentencePiece

SentencePiece (specifically its unigram language model mode) is a less common but very beautiful strategy for subword tokenization. While conceptually similar to BPE, it abandons eager encoding in order to achieve higher-quality tokenization and to reduce the location-dependent errors seen in BPE. The main differences are that SentencePiece treats ambiguity in character grouping as a source of regularization for downstream models during training, and that it uses a simple language model to evaluate the most likely character groupings instead of eagerly picking the longest recognized string as BPE does. This results in very high-quality tokenization, but at a significant performance cost, at times making it the slowest part of an NLP pipeline. While the assumption of ambiguity in tokenization seems natural, the performance trade-off apparently isn’t always worth it: Google opted not to use this strategy in their BERT language model.
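
If you want to try it, the sentencepiece Python package exposes training, deterministic encoding, and sampled encoding roughly along these lines (the flags, file names, and vocabulary size below are illustrative; check the library’s documentation for the exact options):

```python
import sentencepiece as spm

# Train a unigram-LM subword model on a plain-text corpus.
# (corpus.txt is a stand-in for your own data.)
spm.SentencePieceTrainer.Train(
    "--input=corpus.txt --model_prefix=subword --vocab_size=8000 --model_type=unigram"
)

sp = spm.SentencePieceProcessor()
sp.Load("subword.model")

# Deterministic, most-likely segmentation under the learned language model.
print(sp.EncodeAsPieces("the ecxellent service was veeeerrry good"))

# Sampled segmentation: ambiguity between valid segmentations becomes
# a form of regularization ("subword regularization") during training.
print(sp.SampleEncodeAsPieces("the ecxellent service was veeeerrry good", -1, 0.1))
```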

Now that we’ve discussed these different options, some tempting similarities emerge. If we squint a little, all of these subword tokenization strategies seem to share a common framework for understanding and transforming sequential data (sketched in code after the list):

  1. Generate n-grams: scan over your data, looking at every 1-to-N length group of characters.
  2. Count these n-grams, periodically pruning the vocabulary to avoid out-of-memory issues.
  3. Score the n-grams once a large amount of in-domain sequential data (like text) has been counted, and export the highest-scoring n-grams as a vocabulary defining all the recognized groups of characters.
  4. To encode, use this vocabulary to pick the most informative character groupings, emitting those groups as tokens.
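
Here’s a hypothetical skeleton of that framework (none of these names come from a real library); the n-gram length and the scoring function are the hyperparameters that distinguish one strategy from another:

```python
from collections import Counter

def learn_subword_vocab(corpus, max_n=6, vocab_size=1000,
                        prune_below=2, score=lambda ngram, count: count):
    """Skeleton of the shared framework: count character n-grams,
    prune, score, and export the highest-scoring ones as a vocabulary."""
    counts = Counter()
    for text in corpus:
        # 1. Generate every 1-to-max_n length group of characters.
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
        # 2. Periodically prune rare n-grams to keep memory bounded.
        if len(counts) > 10 * vocab_size:
            counts = Counter({k: c for k, c in counts.items() if c >= prune_below})
    # 3. Score the surviving n-grams. BPE-style scoring is just the raw count;
    #    swapping in PMI or a language-model probability changes the strategy.
    scored = sorted(counts, key=lambda g: score(g, counts[g]), reverse=True)
    # 4. Export the highest-scoring n-grams as the recognized vocabulary.
    return set(scored[:vocab_size])

vocab = learn_subword_vocab(["the cat sat on the mat", "the dog sat"],
                            max_n=4, vocab_size=50)
```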

Different subword tokenization strategies change this framework in subtle ways: BPE scores tokens simply by their count and uses eager encoding, while character tokenization simply has a maximum n-gram size of 1. The framework is also strikingly similar to phrase extraction using collocation, which counts word n-grams, scores them, and exports a model in much the same fashion, except that the sequence items are words instead of characters. It turns out that, using a phrase extraction library, you can learn an effective BPE-style vocabulary for non-whitespace-separated languages like Japanese simply by changing the word regex to match single characters, which lends this connection some credibility.
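
One way to approximate that experiment with gensim’s Phrases model is to hand it character lists instead of word lists (the corpus and parameter values here are purely illustrative, and each pass of the phrase model corresponds roughly to one round of BPE-style merges rather than a full vocabulary):

```python
from gensim.models.phrases import Phrases

# Treat each character as a "word"; the phrase model then learns which
# adjacent characters belong together, much like a BPE merge step.
corpus = ["すもももももももものうち", "ももとすももの味はちがう"]
char_sentences = [list(sentence) for sentence in corpus]

# NPMI scoring swaps raw counts for a mutual-information-style score.
char_phrases = Phrases(char_sentences, min_count=1, threshold=0.2, scoring="npmi")

# Characters the model decides to group come back joined with "_".
print(char_phrases[list("すもものうち")])
```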

A finding like this raises the question: should BPE have a more intelligent scoring phase when choosing its vocabulary from n-gram counts? Mutual information as a scoring function proved very valuable for phrase extraction, and it seems it would be useful here as well. Going further, is there an abstract tokenizer we could define that reproduces each of these strategies, and more, via hyperparameters like n-gram length and scoring strategy? Could such an abstract tokenizer have a common implementation across platforms like TensorFlow, PyTorch, and the web?

These are questions that deserve real research, as tokenization choices still feel arbitrary compared to other choices in ML pipelines, given their impact on performance. Hopefully this post has provided some insight into modern tokenization techniques and inspires more work on intelligent tokenization of messy data!