Splitting millions of source code identifiers with Deep Learning

If you take our Public Git Archive dataset of almost 180,000 Git repositories, check out the latest revision of each, and extract all the identifiers (e.g. variable, function, and class names), you end up with close to 60 million unique strings. They include “FooBar”-, “foo_bar”-, and “foobar”-like concatenations of indivisible identifiers, or “⚛ atoms” as we sometimes call them. Several problems we have worked on require the number of distinct atoms to be as small as possible, for both performance and quality reasons; they include topic modeling of GitHub repositories, identifier embeddings, and even the recent study of file duplication on GitHub. We therefore decided to reduce that number by carefully splitting the initial concatenations. The result was a 64% reduction in the atom vocabulary.
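
To make the three kinds of concatenation concrete, here is a minimal rule-based sketch in Python. It is not the splitter described in the original entry: `heuristic_split` is a hypothetical helper, and lowercasing the atoms is an assumption made for illustration. It only shows that naming conventions such as case changes and underscores are enough for “FooBar” and “foo_bar”, while a “foobar”-like string gives such rules nothing to hold on to.

```python
import re
from typing import List

# Hypothetical rule-based baseline, NOT the approach from the original entry.
# Alternatives, tried in order at each position:
#   1. a run of capitals not followed by a lowercase letter (acronyms, "HTTP")
#   2. an optional capital followed by lowercase letters ("Server", "foo")
#   3. a run of digits
# Anything else (underscores, dashes, dots) acts as a separator.
_TOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def heuristic_split(identifier: str) -> List[str]:
    """Split an identifier on case changes and non-alphanumeric characters."""
    # Lowercasing is an assumption of this sketch, so that "Foo" and "foo"
    # count as the same atom.
    return [token.lower() for token in _TOKEN.findall(identifier)]


for name in ("FooBar", "foo_bar", "HTTPServer", "foobar"):
    print(name, heuristic_split(name))
```

The last case comes back as a single token, `foobar`, because it contains no case change or separator. Concatenations like that are the ones that need something smarter than naming-convention rules.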

This is a companion discussion topic for the original entry at https://blog.sourced.tech/post/idsplit/