Latent Dirichlet Allocation (LDA) is a fantastic tool for topic modeling, but its alpha and beta hyperparameters confuse many people coming to the model for the first time (say, via an open source implementation like Python’s gensim). They can be defined simply, and their interpretation depends on your symmetry assumption:
If you don’t know whether your LDA distribution is symmetric or asymmetric, it’s most likely symmetric, meaning a single alpha value is shared by all topics and a single beta value by all words. Here, alpha represents document-topic density: with a higher alpha, documents are made up of more topics, and with a lower alpha, documents contain fewer topics. Beta represents topic-word density: with a high beta, topics are made up of most of the words in the corpus, and with a low beta they consist of few words.
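To make the symmetric case concrete, here is a minimal sketch that simulates per-document topic mixtures for a few alpha values and counts how many topics carry real weight in each document. It uses only Python's standard library, drawing Dirichlet samples by normalizing Gamma draws. The function names, the 10-topic setup, and the 5% cutoff for a topic to count as "present" are all assumptions chosen for illustration; nothing here is fitted to a real corpus.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    if total == 0.0:
        # Guard against floating-point underflow with very small alphas.
        return [1.0 / len(alphas)] * len(alphas)
    return [d / total for d in draws]


def mean_active_topics(alpha, num_topics=10, num_docs=2000, threshold=0.05, seed=0):
    """Average number of topics holding more than `threshold` of a document.

    `alpha` is a single value shared by every topic (the symmetric case).
    `threshold` is an arbitrary illustrative cutoff, not part of LDA itself.
    """
    rng = random.Random(seed)
    total_active = 0
    for _ in range(num_docs):
        theta = sample_dirichlet([alpha] * num_topics, rng)
        total_active += sum(1 for p in theta if p > threshold)
    return total_active / num_docs


if __name__ == "__main__":
    for alpha in (0.1, 1.0, 10.0):
        avg = mean_active_topics(alpha)
        print(f"alpha={alpha}: about {avg:.2f} topics per document")
```

Running it should show the count of active topics per document rising as alpha rises. The same experiment applies to beta if you read "documents" as topics and "topics" as words.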
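Asymmetric distributions are similar, but slightly different: each topic has its own alpha value, and a higher alpha for a particular topic gives documents a topic distribution weighted more specifically toward that topic. Likewise, a higher beta for a particular word gives topics a word distribution weighted more specifically toward that word.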
In general, higher alpha values mean documents have more similar topic contents. The same is true for beta, but with topics and words: generally, a high beta will result in topics with more similar word contents. Also, an asymmetric alpha is helpful, whereas an asymmetric beta is largely not.
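A small simulation can illustrate both the asymmetric case and the "more similar" claim. The sketch below gives one topic a larger alpha than the rest, then scales the whole alpha vector up by ten and compares how much that topic's share varies from document to document. As before, it relies only on the standard library; the function names and the specific alpha vectors are hypothetical values picked for the example.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    if total == 0.0:
        # Guard against floating-point underflow with very small alphas.
        return [1.0 / len(alphas)] * len(alphas)
    return [d / total for d in draws]


def topic_share_stats(alphas, topic=0, num_docs=5000, seed=0):
    """Mean and variance of one topic's share across simulated documents."""
    rng = random.Random(seed)
    shares = [sample_dirichlet(alphas, rng)[topic] for _ in range(num_docs)]
    mean = sum(shares) / num_docs
    variance = sum((s - mean) ** 2 for s in shares) / num_docs
    return mean, variance


if __name__ == "__main__":
    # Asymmetric alpha: topic 0 is favored over the other four topics.
    small = [2.0, 0.5, 0.5, 0.5, 0.5]
    # Same proportions, ten times larger overall.
    large = [10.0 * a for a in small]

    for name, alphas in (("small", small), ("large", large)):
        mean, variance = topic_share_stats(alphas)
        print(f"{name} alpha: topic 0 mean share={mean:.3f}, variance={variance:.4f}")
```

In both runs the favored topic's average share should sit near its proportion of the alpha vector, while the larger vector should give a much smaller variance, meaning documents look more alike. If you are using gensim, these priors are set through the `alpha` and `eta` arguments of `LdaModel` (gensim calls beta `eta`), and `alpha` accepts `'symmetric'`, `'asymmetric'`, or `'auto'`; check the documentation for your installed version before relying on the defaults.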
By Stuart Axelbrooke, who does data science and text analytics. You should follow him on Twitter