Topic models help us uncover the immense value lurking in the copious text data now flowing through the internet. Unfortunately, most topic models are too vague to help humans dive in and diagnose specific problems.

In this post we’ll create understandable and actionable topic models using partitioned phrase detection, and discuss its application in text analytics systems for summarizing things like forums, reviews, and social media data.

The Goal: Specificity, Summary, and Depth

LDA is the preeminent topic modeling technique, which learns topics as distributions of words - essentially a large set of word clouds, like the one seen above. LDA-like topic models have problems with specificity and understanding, though, which we seek to solve:

Word distributions leave only vague ideas of the “why”

Even with phrases as part of the word distribution, it is hard to tell what the primary drivers of the distribution are. Helping humans make Go/No-Go style decisions quickly during exploration is critical for getting to the bottom of issues - and LDA underperforms here.

Randomness creates non-stationary topics, preventing indexing

Using search and aggregation systems like Elasticsearch is critical for providing exploratory depth, and these systems rely on indexing to summarize and query large sets of data. Topics that move around every time you calculate them mean you have to reindex the whole dataset every time you learn a new model - which is usually not realistic for datasets of interesting size.

Can’t dig into data quickly

These topic models also generally provide only two levels of depth: the topic level and the document level, with nothing in between. This means that once you have a hunch, you have to start reading documents to validate it. Most hunches need only a small amount of validation, though, and we benefit from seeing other correlations above the document level before we commit to reading individual documents.

Phrases, on the other hand, hit all of these points:

  1. A phrase generally has explicit meaning (“car crash”, “doesnt work”)
  2. Phrases are inherently stationary: the words that embody them don’t change
  3. Calculating correlated phrases is easy, and provides explicit context for confident exploration

Learning Phrases

Using phrases as the topic model instead lets us treat the presence of a phrase in a document as membership in that topic. For example, a review from the Amazon reviews dataset, with learned phrases highlighted:

We can see useful keywords in there, like “DVR” and “playback”, but the found phrases like “remote viewing” and “Blue Iris” provide deeper, more specific subjects, while being low enough in cardinality to keep our indexes reasonable.
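Treating phrase presence as topic membership can be sketched in a few lines. This is a minimal illustration, not the post's actual pipeline: the function name `label_phrases`, the hand-written phrase set, and the sample sentence are all made up for the example, and phrases are assumed to be adjacent word pairs.

```python
def label_phrases(tokens, phrases):
    """Return the known phrases found in a tokenized document, in order of first appearance."""
    found = []
    for pair in zip(tokens, tokens[1:]):  # every adjacent word pair
        if pair in phrases and pair not in found:
            found.append(pair)
    return [" ".join(pair) for pair in found]


# Hypothetical phrase set; in practice it would come from a learned model.
phrases = {("remote", "viewing"), ("blue", "iris"), ("port", "forwarding")}

# Made-up sentence standing in for a review.
review = "remote viewing through blue iris needed no extra setup".split()
topics = label_phrases(review, phrases)
print(topics)  # ['remote viewing', 'blue iris']
```

Each returned label can then be indexed as a keyword on the document, so the document counts as a member of those phrase topics.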

To learn and label these phrases we’ll use collocation detection, which works by comparing the frequency of word pairs to the frequencies of their individual words. Simply put, a pair of words will be learned as a phrase - like “remote viewing” - if the two words appear together often enough relative to how often each shows up on its own. This discourages discovery of phrases built on common words like “a” and “the”, which generally don’t add much understanding, and ensures that rare words aren’t penalized for occurring less. gensim, the library we’ll be using to learn phrases, implements this as follows:

phrase?(a, b) = (f(a, b) - min_count) / f(a) / f(b) * N > threshold

Where f(a, b) is the count of words a and b appearing together, f(a) and f(b) are the counts of each word individually, min_count is the minimum pair count worth considering, N is the vocabulary size, and threshold is a score threshold that controls how much more common than the individual words the combination needs to be - generally it’s around 10.
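To make the scoring concrete, here is a small pure-Python sketch of collocation scoring. It is not gensim's code: the function name `learn_phrases` and the toy corpus are illustrative, and it includes the vocabulary-size factor that gensim's default scorer applies, simplified here to the number of distinct words.

```python
from collections import Counter


def learn_phrases(sentences, min_count=5, threshold=10.0):
    """Score adjacent word pairs and keep those scoring above `threshold`.

    `sentences` is an iterable of token lists. Returns {(a, b): score}.
    """
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)  # simplification: distinct words only

    phrases = {}
    for (a, b), pair_count in bigrams.items():
        score = (pair_count - min_count) / unigrams[a] / unigrams[b] * vocab_size
        if score > threshold:
            phrases[(a, b)] = score
    return phrases


# Toy corpus, made up for illustration; the low threshold suits its tiny size.
sentences = [
    "the remote viewing works".split(),
    "remote viewing is the best".split(),
    "the camera is the best".split(),
    "the dvr works".split(),
]
learned = learn_phrases(sentences, min_count=1, threshold=1.0)
print(learned)  # {('remote', 'viewing'): 2.0}
```

Note that “the best” occurs as often as “remote viewing” here, but the high individual count of “the” pulls its score down to 0.8, below the threshold.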

Beyond labeling documents with phrases, we can use significance scoring against the rest of the documents to highlight the phrases that most define a subset of the data, helping humans dig into the driving factors of behavior they want to understand or affect. This strategy of using significance scoring to pick phrases that describe a subset ends up being very versatile, providing valuable summaries even when only tens of documents are being summarized.
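One way to picture significance scoring is to compare how often a phrase appears in the subset with how often it appears overall. The sketch below is an assumption, not the post's implementation: the function name, the toy documents, and the particular score (the absolute difference in rates multiplied by their ratio) were chosen for illustration.

```python
from collections import Counter


def significant_phrases(subset_docs, all_docs):
    """Rank phrases that are over-represented in `subset_docs`.

    Each document is a list of phrase labels; `subset_docs` is expected
    to be drawn from `all_docs`.
    """
    fg = Counter(p for doc in subset_docs for p in set(doc))
    bg = Counter(p for doc in all_docs for p in set(doc))

    scores = {}
    for phrase, fg_count in fg.items():
        fg_rate = fg_count / len(subset_docs)
        bg_rate = bg[phrase] / len(all_docs)
        if bg_rate == 0 or fg_rate <= bg_rate:
            continue  # unseen in background, or not over-represented
        scores[phrase] = (fg_rate - bg_rate) * (fg_rate / bg_rate)
    return sorted(scores, key=scores.get, reverse=True)


# Made-up phrase labels for four reviews; the first two form the subset.
all_docs = [
    ["port forwarding", "static ip"],
    ["port forwarding", "as advertised"],
    ["as advertised", "remote viewing"],
    ["as advertised"],
]
ranked = significant_phrases(all_docs[:2], all_docs)
print(ranked)  # ['port forwarding', 'static ip']
```

“as advertised” is the most common phrase overall, yet it is dropped because it is no more frequent in the subset than in the full collection.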

Learning Phrases With Partitioned Collocation

There’s just one piece left here: collocation starts to provide worse results as it is applied to larger and larger datasets. Generally what happens is that the data scientist uses collocation on a small sample dataset, finds good results, and applies it to the whole dataset. However, upon inspecting the phrase-labeled documents, she finds that they suddenly lack the specific phrases that were most useful in the small dataset experiment.

What is happening here is that the text data is not actually from a single distribution, but from many. The data scientist likely built the small dataset from one category with a relatively consistent word-pair distribution. Unfortunately, applying a single collocation model to all of the categories has the same effect as mixing too many paints: you’re left with muddy results lacking the interesting bits that made each category special.

Fortunately, there is an exceedingly simple solution to this: learning a phrase model for each partition of the data. In learning phrases from the Amazon reviews dataset, partitioning by product category and review rating preserves interesting phrases as you scale to include more categories:

| Reviews | Partitioning | First Document Phrases |
|---|---|---|
| 10k | None | 16 channel, 4 channels, as advertised, blue iris, didn’t expect, port forwarding, remote viewing, static ip, third party, will likely |
| 1000k | None | blue iris, port forwarding, static ip, third party |
| 1000k | Category, Rating | 16 channel, 4 channels, as advertised, blue iris, ip address, port forwarding, remote viewing, static ip, they advertise, third party |

Above we tabulate the data size, partitioning, and phrases that match the first document, demonstrating how phrases are lost as more categories are added. Blue phrases are those learned by the unpartitioned model on the whole dataset, with orange as the others. Showing the phrases labeled inline, colored accordingly:

In this partitioned learning strategy, each document matches at least one partition - for instance, the “Camera” category and the “1 Star” rating - and the review text is added to a dataset for each partition. Then, phrase models are learned for each partition, and combined at the end. Luckily, these partitions are usually obvious to humans, so it is fairly easy to pick the correct partitioning functions ahead of time.
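The partitioned strategy can be sketched as follows. This is a toy illustration under stated assumptions: `learn_phrases` is a simplified pure-Python collocation scorer rather than gensim, the document fields (`category`, `rating`, `tokens`) and the six reviews are made up, and “combined at the end” is taken to mean a union of the per-partition phrase sets.

```python
from collections import Counter, defaultdict


def learn_phrases(sentences, min_count=1, threshold=1.0):
    """Simplified collocation scorer: returns {(a, b): score} above threshold."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    phrases = {}
    for (a, b), n in bigrams.items():
        score = (n - min_count) / unigrams[a] / unigrams[b] * len(unigrams)
        if score > threshold:
            phrases[(a, b)] = score
    return phrases


def partition_keys(doc):
    """Hypothetical partitioning function: one key per category and per rating."""
    return [("category", doc["category"]), ("rating", doc["rating"])]


def learn_partitioned(docs, min_count=1, threshold=1.0):
    """Learn one phrase model per partition, then union the results."""
    partitions = defaultdict(list)
    for doc in docs:
        for key in partition_keys(doc):  # a document lands in several partitions
            partitions[key].append(doc["tokens"])
    combined = set()
    for sentences in partitions.values():
        combined |= set(learn_phrases(sentences, min_count, threshold))
    return combined


# Made-up reviews: "remote" and "viewing" are diluted by the TV category.
raw = [
    ("Camera", 5, "remote viewing works"),
    ("Camera", 4, "remote viewing fails"),
    ("TV", 1, "remote feels cheap"),
    ("TV", 2, "remote broke fast"),
    ("TV", 1, "viewing angle poor"),
    ("TV", 2, "viewing distance short"),
]
docs = [{"category": c, "rating": r, "tokens": t.split()} for c, r, t in raw]

single = set(learn_phrases([d["tokens"] for d in docs], threshold=0.9))
partitioned = learn_partitioned(docs, threshold=0.9)
print(single)       # set()
print(partitioned)  # {('remote', 'viewing')}
```

In this toy data the single model scores “remote viewing” at 0.75 and drops it, while the Camera partition scores it at 1.0 and keeps it, which mirrors the loss of category-specific phrases described above.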

Phrase Significance

The cherry on top for phrase topic modeling comes with significance scoring. With per-phrase significance scoring, a non-technical human can describe some behavior they want to explain via a query to a search engine, and have the results explained with the phrases that show up more in their results than in the documents that don’t match.

It is not the most common phrases that are usually interesting, but rather the phrases that most correlate with the behavior in question. By indexing the documents in Elasticsearch, these correlations can be approximated with the “significant terms” aggregation, showing not just what phrases are special about a category, but also about an arbitrary group of documents that match a query we provide.
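As a rough sketch, a request body for such a search might look like the following, built as a plain Python dict. The field names `text` and `phrases` and the helper name are assumptions for illustration; `phrases` is assumed to be a keyword field holding each document's phrase labels.

```python
import json


def significant_phrases_request(query_text, size=10):
    """Build a search body that explains the matches of `query_text` by phrase."""
    return {
        "size": 0,  # we only want the aggregation, not the hits themselves
        "query": {"match": {"text": query_text}},  # "text" is a hypothetical field
        "aggs": {
            "significant_phrases": {
                "significant_terms": {
                    "field": "phrases",  # hypothetical keyword field of phrase labels
                    "size": size,
                }
            }
        },
    }


body = significant_phrases_request("stopped recording")
print(json.dumps(body, indent=2))
```

The `query` clause can be swapped for any other query, such as a filter on rating or category, and the aggregation will then describe that group of documents instead.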

The Elasticsearch + phrase detection backend can be used to great effect to help humans understand and take action on all kinds of “text with metadata”-type data. It can be applied to help forum managers moderate and understand their community, social media managers understand the subjects their following cares most about, and product managers prioritize a backlog or dig into a drop in sales or ratings.