Topic modeling of GitHub repositories

Topic modeling is the machine learning subdomain devoted to extracting abstract “topics” from a collection of “documents”. Each “document” is represented by a bag-of-words model, that is, the set of words occurring in it together with their frequencies. Since I am Russian, I was introduced to topic modeling through the awesome lectures by Dr. Vorontsov at the Yandex School of Data Analysis (PDF). There are different models for topic modeling, the most famous (but not the best) being Latent Dirichlet Allocation (LDA). Dr. Vorontsov managed to generalize all possible bag-of-words-based topic models into the Additive Regularization of Topic Models (ARTM) methodology. LDA thus becomes a special case of ARTM. The really cool thing about ARTM is that Dr. Vorontsov’s PhD students developed proof-of-concept software, and it is open source: bigartm/bigartm [1, 2].
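To make the bag-of-words representation concrete, here is a minimal sketch in plain Python. The document names, the sample texts, and the `bag_of_words` helper are all made up for illustration; the tokenization is deliberately naive and is not the preprocessing used in the original post.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Map each lowercase word in `text` to the number of times it occurs."""
    # Naive tokenizer (an assumption): runs of ASCII letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


# Hypothetical "documents"; in the post's setting each one would be a repository.
docs = {
    "repo_a": "Topic modeling library for topic extraction",
    "repo_b": "Fast JSON parser",
}

# One bag per document: word order is discarded, only frequencies remain.
bags = {name: bag_of_words(text) for name, text in docs.items()}

print(bags["repo_a"].most_common(3))
```

A topic model only ever sees these per-document word counts, which is why any two texts with the same counts are indistinguishable to it.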


This is a companion discussion topic for the original entry at https://blog.sourced.tech/post/github_topic_modeling/