Narrative and Better Bets
One thing I love about Yobi (my job) is one of its core values: make smart bets. It’s one of those kinda reductive statements that requires a lot of context to be practically instructive, but fulfills its role in creating rhetorical focus. Everything we do is a bet. Bets have risks, upsides, and downsides. Different bets interact and inform future bets. Given your portfolio of bets, what is your expected distribution of outcomes? Etc. These aren’t the capped ludic bets of the casino, where all probabilities are known and there are no inter-bet interactions. They are the bets of the natural world, where bets build on each other and can lead to compounding results. Some bets have catastrophic downside (mostly bad bets), and some have incredible upside (some good bets). Narrative has a critical role in bet making: most good options are hidden from almost everyone - imagine the startup opportunity, which relies on key recent tech advancements, changes in market dynamics, a coalescing of talent, a unique framing of the problem, etc. Predicting success (binary or degree) for this venture depends on a lot of information about things that simply haven’t happened yet. ...
Why Write?
Hey all, it’s Stuart, probably showing up in your inbox. I’m back, writing my little thoughts, and would love it if you’d lend an ear. What is it, a blog? A substack? Not exactly - I’m not building an audience, and I’m not promoting anything - but I am hoping to inspire conversation and build meaning / crystallize ideas that are meaningful to me. Practically, I’ll just be sending out emails whenever I write something or put together a photo album of some life experience, and I want to use a direct channel to share it, one not mediated by capitalism. ...
Subword Tokenization - Handling Misspellings and Multilingual Data
This blog post summarizes a talk given by Stuart Axelbrooke at NLSea. Tokenization, the process of grouping text into meaningful chunks like words, is a very important step in natural language processing systems. It makes challenges like classification and clustering much easier, but tokenization can be frustratingly arbitrary compared to other parts of the pipeline. In consumer domains, traditional tokenization strategies are particularly problematic: misspellings, rare words, and multilingual sources can make effective tokenization a very difficult part of the pipeline to optimize. ...
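As a rough illustration of the idea (not the specific scheme from the talk), a subword tokenizer can split words into overlapping character n-grams, so a misspelling like “grat” still shares subword units with “great” instead of becoming an unknown token:

```python
# A minimal sketch of subword tokenization via character n-grams.
# This illustrates the general idea, not the exact approach from the talk.

def char_ngrams(word, n=3):
    """Split a word into overlapping character n-grams with boundary markers."""
    padded = f"<{word}>"
    if len(padded) <= n:
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def tokenize(text, n=3):
    """Tokenize text into subword units, word by word."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(char_ngrams(word, n))
    return tokens

# Misspellings still overlap with the correct form's subword units:
print(char_ngrams("great"))  # ['<gr', 'gre', 'rea', 'eat', 'at>']
print(char_ngrams("grat"))   # ['<gr', 'gra', 'rat', 'at>']
```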
Topic Modeling for Humans with Phrase Detection
Topic models help us uncover the immense value lurking in the copious text data now flowing through the internet. Unfortunately, most topic models are too vague to help humans dive in and diagnose specific problems. In this post we’ll create understandable and actionable topic models using partitioned phrase detection, and discuss its application in text analytics systems for summarizing things like forums, reviews, and social media data.

❌️ (LDA word cloud figure omitted)

Phrase           Significance   Rating
picture quality  0.207          4.331
smart tv         0.146          4.228
living room      0.135          4.482
surround sound   0.132          4.536
✔️

The Goal: Specificity, Summary, and Depth
LDA is the preeminent topic modeling technique, which learns topics as distributions of words - essentially a large set of word clouds, like the one seen above. LDA-like topic models have problems with specificity and understanding, though, which we seek to solve: ...
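The partitioned phrase detection from the post isn’t reproduced here, but as a hedged sketch of the general collocation idea, gensim’s Phrases model can surface multi-word units like “picture quality” from tokenized review text (the toy corpus and parameters below are mine, purely illustrative):

```python
# A rough sketch of phrase detection using gensim's Phrases model.
# This shows the general collocation idea, not the partitioned phrase
# detection approach described in the post itself.
from gensim.models.phrases import Phrases

# Toy corpus of tokenized review sentences (illustrative only).
sentences = [
    ["the", "picture", "quality", "is", "amazing"],
    ["great", "picture", "quality", "for", "the", "price"],
    ["picture", "quality", "and", "surround", "sound", "are", "superb"],
    ["mounted", "the", "smart", "tv", "in", "the", "living", "room"],
    ["smart", "tv", "apps", "load", "quickly"],
]

# Thresholds are set low because the toy corpus is tiny; real data needs tuning.
phrases = Phrases(sentences, min_count=2, threshold=0.1)

print(phrases[["the", "picture", "quality", "is", "superb"]])
# e.g. ['the', 'picture_quality', 'is', 'superb']
```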
Don't Overfit
Overfitting is a subject that isn’t discussed nearly enough. In machine learning, overfitting is when an algorithm learns a model so specialized to its training data that it fails to generalize to new examples. This sounds domain-specific, but the idea can also describe many mistakes humans make in software, design, and life in general! Looking at how we deal with this problem in machine learning can help us be more systematic about avoiding it in these other arenas. ...
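To make the machine learning meaning concrete, here’s a hedged sketch (dataset, model, and parameters are mine, not from the post): an unconstrained decision tree can score perfectly on the data it has seen while doing noticeably worse on held-out data, and that gap is the overfitting.

```python
# A minimal sketch of detecting overfitting: compare performance on the
# training data versus held-out data. Choices here are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, flip_y=0.2,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree memorizes the training set...
deep_tree = DecisionTreeClassifier(max_depth=None, random_state=0)
deep_tree.fit(X_train, y_train)
print("deep tree, train:", deep_tree.score(X_train, y_train))  # ~1.0
print("deep tree, test: ", deep_tree.score(X_test, y_test))    # noticeably lower

# ...while a regularized (shallower) tree trades training accuracy for generalization.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
shallow_tree.fit(X_train, y_train)
print("shallow tree, test:", shallow_tree.score(X_test, y_test))
```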
Kubernetes Cron Container
So, you’ve got a Kubernetes cluster, and a cron task you need to run. Running it on your machine is an obviously bad idea, as is shoehorning it into another machine in your cloud fleet. Wouldn’t it be great to run this daily/hourly/whateverly task in Kubernetes? … Yes! Yes it would be. And you can! It took a little digging, but setting up a cron container is actually very simple: ...
Undeleting Kafka Topics
Accidentally deleted a topic, but hadn’t set delete.topic.enable in your server.properties file, and need to undo the topic deletion? Just delete the deletion marker in Zookeeper! SSH into your Zookeeper machine and start up its CLI with $ZOOKEEPER_HOME/zkCli.sh. Topics marked for deletion are stored in /admin/delete_topics/$YOUR_TOPIC_NAME, so to “undelete” a topic, run delete /admin/delete_topics/$YOUR_TOPIC_NAME. And there you go, your topic is saved! This was written against Kafka 0.8.2 - your mileage may vary for newer versions! ...
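If you’d rather script the same deletion than use zkCli.sh, here’s a hedged sketch using the kazoo Python client (the client choice, Zookeeper address, and topic name are my placeholders - the post itself only describes the CLI route):

```python
# A sketch of removing the pending-deletion marker with the kazoo
# Zookeeper client instead of zkCli.sh. Host and topic are placeholders.
from kazoo.client import KazooClient

TOPIC = "my_topic"  # stands in for $YOUR_TOPIC_NAME
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

deletion_marker = f"/admin/delete_topics/{TOPIC}"
if zk.exists(deletion_marker):
    zk.delete(deletion_marker)  # remove the pending-deletion marker
    print(f"Removed deletion marker for {TOPIC}")
else:
    print(f"No pending deletion found for {TOPIC}")

zk.stop()
```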
Simple, Clean Python Deploys with Anaconda
Deploying Python projects can be a pain - especially with Python 3.5. Anaconda is the emerging replacement for pip/virtualenv deploys, with its scope expanding past Python packages to binaries like redis. Tragically, its documentation is still… maturing. With conda deploys being such a huge feature for me, I had to write about it! In this post I’ll talk about deploying on Amazon EC2 AMIs (running Amazon Linux), but in theory this should apply to all Unix platforms where Anaconda runs. ...
Reducers - A Productive Stream Processing Pattern
The software industry’s migration to stream processing is well underway. Open source systems like Kafka handle huge throughputs with surprisingly few resources, and aid heavily in decomposing monoliths into microservices. When developers and engineers first step into this world of stream processing, though, there can be some uncertainty: How do you create succinct, resilient, and performant components of this system? How do they come together to form the larger system? How do you get answers without querying a database? ...
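As a framework-free sketch of the pattern the title refers to (names and event shapes here are mine, not from the post), a reducer is just a pure function from (state, event) to new state; fold it over a stream and the accumulated state answers questions you’d otherwise ask a database:

```python
# A minimal sketch of the reducer pattern: a pure function folds each
# event into an accumulated state. Event fields are placeholders.
from typing import Dict

Event = Dict[str, str]
State = Dict[str, int]

def reduce_event(state: State, event: Event) -> State:
    """Pure reducer: return a new state; never mutate inputs or do I/O."""
    user = event["user"]
    new_state = dict(state)
    new_state[user] = new_state.get(user, 0) + 1
    return new_state

def run(stream) -> State:
    """Fold the reducer over a stream (in practice, a Kafka consumer loop)."""
    state: State = {}
    for event in stream:
        state = reduce_event(state, event)
    return state

events = [{"user": "ada"}, {"user": "lin"}, {"user": "ada"}]
print(run(events))  # {'ada': 2, 'lin': 1}
```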
LDA Alpha and Beta Parameters - The Intuition
Latent Dirichlet Allocation (LDA) is a fantastic tool for topic modeling, but its alpha and beta hyperparameters cause a lot of confusion for those coming to the model for the first time (say, via an open source implementation like Python’s gensim). They can be defined simply, and depend on your symmetry assumption:

Symmetric Distribution
If you don’t know whether your LDA distribution is symmetric or asymmetric, it’s most likely symmetric. Here, alpha represents document-topic density - with a higher alpha, documents are made up of more topics, and with a lower alpha, documents contain fewer topics. Beta represents topic-word density - with a high beta, topics are made up of most of the words in the corpus, and with a low beta they consist of few words. ...
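In gensim, for instance, alpha is passed directly and beta goes by the name eta. Here’s a hedged sketch of nudging both toward “few topics per document, few words per topic” (the toy corpus and prior values are mine, purely illustrative):

```python
# A sketch of setting the alpha and beta priors in gensim, where beta is
# called `eta`. The corpus is a placeholder; the values are illustrative.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["screen", "bright", "colors"],
        ["battery", "lasts", "days"],
        ["screen", "battery", "great"]]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=2,
    alpha=0.1,   # low alpha: each document draws on few topics
    eta=0.01,    # low beta/eta: each topic concentrates on few words
    random_state=0,
)

for topic_id, words in lda.show_topics(num_topics=2, num_words=3, formatted=False):
    print(topic_id, [w for w, _ in words])
```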