2019/11/15 When Deep Learning Met Code Search

The topic of the 2019-11-01T15:00:00Z session is When Deep Learning Met Code Search. Here is the abstract of the paper:

There have been multiple recent proposals on using deep neural networks for code search using natural language. Common across these proposals is the idea of embedding code and natural language queries into real vectors and then using vector distance to approximate semantic correlation between code and the query. Multiple approaches exist for learning these embeddings, including unsupervised techniques, which rely only on a corpus of code examples, and supervised techniques, which use an aligned corpus of paired code and natural language descriptions. The goal of this supervision is to produce embeddings that are more similar for a query and the corresponding desired code snippet. Clearly, there are choices in whether to use supervised techniques at all, and if one does, what sort of network and training to use for supervision. This paper is the first to evaluate these choices systematically. To this end, we assembled implementations of state-of-the-art techniques to run on a common platform, training and evaluation corpora. To explore the design space in network complexity, we also introduced a new design point that is a minimal supervision extension to an existing unsupervised technique. Our evaluation shows that: 1. adding supervision to an existing unsupervised technique can improve performance, though not necessarily by much; 2. simple networks for supervision can be more effective than more sophisticated sequence-based networks for code search; 3. while it is common to use docstrings to carry out supervision, there is a sizeable gap between the effectiveness of docstrings and a more query-appropriate supervision corpus.
The evaluation dataset is now available at arXiv:1908.09804.
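
For anyone new to the topic, here is a minimal sketch of the core idea the abstract describes: both the query and the code snippets are mapped to vectors, and snippets are ranked by how close their vector is to the query vector. This is not the paper's implementation. The 3-dimensional vectors, the snippet names, and the helper functions `cosine_similarity` and `rank_snippets` are made up for illustration, and cosine similarity is just one common choice of vector distance.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_snippets(query_vec, snippet_vecs):
    """Return (name, score) pairs sorted from most to least similar to the query."""
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in snippet_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical hand-written embeddings; a real system would learn these
# from a code corpus (unsupervised) or from paired code and descriptions (supervised).
snippet_vecs = {
    "read_file": [0.9, 0.1, 0.0],
    "sort_list": [0.0, 0.2, 0.9],
    "http_get": [0.1, 0.9, 0.1],
}

# Pretend embedding of a query such as "open a text file".
query_vec = [0.8, 0.2, 0.1]

ranking = rank_snippets(query_vec, snippet_vecs)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With these toy vectors, `read_file` ranks first because its vector points in nearly the same direction as the query vector. The choices the paper evaluates concern how the vectors themselves are learned, not this ranking step.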

:writing_hand: We take notes and prepare the discussion in a public GDoc. You are very welcome to ask questions or share your thoughts in it.

:clock4: The session lasts for one hour between 2019-11-01T15:00:00Z and 2019-11-01T16:00:00Z

:world_map: The reading club takes place online on Zoom or at the source{d} office in Madrid

:information_source: For more details, see our repository on GitHub

Here are the candidates for the next session. Please vote for the one(s) you prefer!

(We went through Slice-based Learning: A Programming Model for Residual Learning in Critical Data Slices internally, so I didn’t put it up as an option in the poll!)

Thanks for voting! We’ll study When Deep Learning Met Code Search :tada:

Given that tomorrow is a bank holiday in many countries, we moved the next session to the 15th of November. See you then!