Subword Tokenization - Handling Misspellings and Multilingual Data

Tokenization, the process of grouping text into meaningful chunks like words, is a foundational step in natural language processing systems. It makes tasks like classification and clustering much easier, but it can be frustratingly arbitrary compared to other parts of the pipeline. In consumer domains, traditional tokenization strategies are particularly problematic: misspellings, rare words, and multilingual sources can make effective tokenization one of the hardest parts of the pipeline to optimize....
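
The excerpt doesn't show the post's actual tokenizer, but a toy, hand-rolled sketch gives the flavor of the subword idea: split each word into the longest pieces found in a small vocabulary, so a misspelled or unseen word still decomposes into known units. The vocabulary and helper below are made up purely for illustration.

```python
# Toy illustration (not from the post): greedy longest-match subword splitting
# over a tiny hand-picked vocabulary. Real systems learn the vocabulary with
# algorithms like byte-pair encoding, but the fallback behaviour is similar:
# an unseen or misspelled word still decomposes into known pieces.
SUBWORDS = {"token", "iz", "ization", "a", "t", "o", "n"}

def subword_tokenize(word, vocab=SUBWORDS):
    """Split a word into the longest known subwords, left to right."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest match first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])          # unknown character falls through as-is
            i += 1
    return pieces

print(subword_tokenize("tokenization"))  # ['token', 'ization']
print(subword_tokenize("tokenizaton"))   # misspelled, still maps onto known pieces
```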

Topic Modeling for Humans with Phrase Detection

Topic models help us uncover the immense value lurking in the copious text data now flowing through the internet. Unfortunately, most topic models are too vague to help humans dive in and diagnose specific problems. In this post we’ll create understandable and actionable topic models using partitioned phrase detection, and discuss its application in text analytics systems for summarizing things like forums, reviews, and social media data. The Goal: Specificity, Summary, and Depth. LDA is the preeminent topic modeling technique; it learns topics as distributions of words - essentially a large set of word clouds....
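
As a rough sketch of the phrase-detection-plus-LDA combination the excerpt describes, here is what the standard gensim pipeline might look like; the toy corpus and parameters are placeholders, and the post's partitioned variant is not reproduced here.

```python
# Minimal sketch: phrase detection feeding an LDA topic model with gensim.
# The tiny corpus and parameter values are illustrative only.
from gensim.models import Phrases, LdaModel
from gensim.corpora import Dictionary

docs = [["customer", "service", "was", "slow"],
        ["great", "customer", "service", "and", "fast", "shipping"],
        ["shipping", "took", "two", "weeks"]]

# Merge frequently co-occurring tokens into phrases like "customer_service".
bigrams = Phrases(docs, min_count=1, threshold=1)
docs = [bigrams[doc] for doc in docs]

# Standard gensim LDA pipeline: dictionary -> bag-of-words corpus -> model.
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)

for topic_id, words in lda.show_topics(num_topics=2, num_words=5, formatted=False):
    print(topic_id, [word for word, _ in words])
```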

Don't Overfit

Overfitting is a subject that isn’t discussed nearly enough. In machine learning, overfitting is when an algorithm learns a model so specialized to its training data that it fails to generalize to new data. This sounds domain-specific, but the idea also describes many mistakes humans make in software, design, and life in general! Looking at how we deal with this problem in machine learning can help us be more systematic about avoiding it in these other arenas....

Kubernetes Cron Container

So, you’ve got a Kubernetes cluster, and a cron task you need to run. Running it on your machine is an obviously bad idea, as is shoehorning it into another machine in your cloud fleet. Wouldn’t it be great to run this daily/hourly/whateverly task in Kubernetes? … Yes! Yes it would be. And you can! It took a little digging, but setting up a cron container is actually very simple:...
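
The post's actual setup isn't shown in this excerpt; for reference, current Kubernetes ships a first-class CronJob resource, and a minimal sketch of creating one with the official Python client might look like this. The names, image, command, and schedule are all placeholders.

```python
# Hedged sketch, not the post's recipe: modern Kubernetes exposes a CronJob
# resource (batch/v1) that covers scheduled containers natively. This uses the
# official `kubernetes` Python client; image, schedule, and names are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

container = client.V1Container(
    name="nightly-task",
    image="example.com/nightly-task:latest",  # placeholder image
    command=["python", "run_task.py"],        # placeholder command
)

cron_job = client.V1CronJob(
    api_version="batch/v1",
    kind="CronJob",
    metadata=client.V1ObjectMeta(name="nightly-task"),
    spec=client.V1CronJobSpec(
        schedule="0 3 * * *",  # daily at 03:00
        job_template=client.V1JobTemplateSpec(
            spec=client.V1JobSpec(
                template=client.V1PodTemplateSpec(
                    spec=client.V1PodSpec(
                        containers=[container],
                        restart_policy="OnFailure",
                    )
                )
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_cron_job(namespace="default", body=cron_job)
```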

Undeleting Kafka Topics

Accidentally deleted a topic, but hadn’t set delete.topic.enable in your server.properties file, and need to undo the topic deletion? Just delete the deletion marker in Zookeeper! Ssh into your Zookeeper machine and start up its CLI with $ZOOKEEPER_HOME/zkCli.sh. Topics marked for deletion are stored at /admin/delete_topics/$YOUR_TOPIC_NAME, so to “undelete” one, run delete /admin/delete_topics/$YOUR_TOPIC_NAME. And there you go, your topic is saved! This was on Kafka 0.8.2 - your mileage may vary on newer versions!...
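
The same znode removal can also be scripted; a small sketch using the kazoo Zookeeper client, with the host and topic name as placeholders:

```python
# Same znode removal as the zkCli.sh command above, scripted with the kazoo
# Zookeeper client; host and topic name are placeholders.
from kazoo.client import KazooClient

zk = KazooClient(hosts="zookeeper-host:2181")
zk.start()
zk.delete("/admin/delete_topics/YOUR_TOPIC_NAME")  # remove the pending deletion marker
zk.stop()
```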

Simple, Clean Python Deploys with Anaconda

Deploying Python projects can be a pain - especially with Python 3.5. Anaconda is the emerging replacement for pip/virtualenv deploys, with its scope expanding past Python packages to binaries like redis. Tragically, its documentation is still… Maturing. With conda deploys being such a huge feature for me, I had to write about it! In this post I’ll talk about deploying on Amazon EC2 AMIs (running Amazon Linux), but the approach should in theory apply to any Unix platform where Anaconda runs....

Reducers - A Productive Stream Processing Pattern

The software industry’s migration to stream processing is well underway. Open source systems like Kafka handle huge throughputs with surprisingly few resources, and aid heavily in decomposing monoliths into microservices. When developers and engineers first step into this world of stream processing, though, there can be some uncertainty: How do you create succinct, resilient, and performant components of this system? How do they come together to form the larger system?...

LDA Alpha and Beta Parameters - The Intuition

Latent Dirichlet Allocation (LDA) is a fantastic tool for topic modeling, but its alpha and beta hyperparameters cause a lot of confusion for those coming to the model for the first time (say, via an open source implementation like Python’s gensim). They can be defined simply, and depend on your symmetry assumption. Symmetric Distribution: If you don’t know whether your LDA distribution is symmetric or asymmetric, it’s most likely symmetric. Here, alpha represents document-topic density - with a higher alpha, documents are made up of more topics, and with a lower alpha, documents contain fewer topics....
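
As a sketch of where these priors surface in gensim's LdaModel (toy corpus, illustrative values; eta is gensim's name for the beta prior):

```python
# Illustrative only: where alpha and eta (the "beta" prior) appear in gensim.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["apple", "banana", "fruit"], ["python", "code", "bug"], ["fruit", "code"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(
    corpus,
    id2word=dictionary,
    num_topics=2,
    alpha="symmetric",  # document-topic prior: higher -> documents mix more topics
    eta="auto",         # topic-word prior ("beta"): higher -> topics spread over more words
)
```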

The Last Analytics Company

The analytics market is crowded - there are countless companies offering nearly identical services. What’s worse, the technical task of recording analytics has become easy: many technologies throughout the stack make collecting, storing, and analyzing granular data accessible to the average software engineer. Somehow, though, a clear winner hasn’t emerged in the analytics space. The stage is set for the last analytics company. The promise of analytics tools is to help gather and analyze data recorded from products and services, yielding invaluable business insights....

Python 3 on Spark - Return of the PYTHONHASHSEED

If you’re anything like me, you’ve been stuck using Python 2 for the last 10 years, and for 8 of them you’ve been trying to switch to 3. Since the release of Spark and PySpark 1.4, Apache has supported Python 3 - fantastic! But then appears the omen of doom, like Death cackling at your final moments: “Exception: Randomness of hash of string should be disabled via PYTHONHASHSEED”. Try in vain you may to set PYTHONHASHSEED: export it on the master?...
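
One workaround that is often suggested - whether it matches the post's eventual fix isn't shown in this excerpt - is to pin the hash seed on every executor through Spark's per-application executor environment; a minimal PySpark sketch:

```python
# Commonly suggested workaround (not necessarily the post's final answer): push
# PYTHONHASHSEED to every executor via Spark's spark.executorEnv.* settings.
# The equivalent line in conf/spark-defaults.conf would be:
#   spark.executorEnv.PYTHONHASHSEED  0
from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("python3-app")                     # placeholder app name
    .set("spark.executorEnv.PYTHONHASHSEED", "0")  # same hash seed on all executors
)
sc = SparkContext(conf=conf)

# Operations that hash strings across partitions (distinct, reduceByKey, ...)
# now see a consistent string hash on every worker.
print(sc.parallelize(["a", "b", "a"]).distinct().collect())
```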