Abstract: Natural languages like English are rich, complex, and powerful, especially in the hands of masters like Shakespeare and Avvaiyar. Most human utterances, however, are far simpler, much more repetitive and predictable, due to cognitive limitations and the exigencies of daily life. In fact, modern statistical methods can very usefully model these utterances and have enjoyed phenomenal success when applied to speech recognition, natural language translation, question-answering, and text mining and comprehension.
We begin with the conjecture that most software is also natural, in the sense that it is created by humans at work, with all the attendant constraints and limitations — and thus, like natural language, is also likely to be repetitive and predictable. We then ask whether statistical language models can a) usefully model code and b) be leveraged to aid software engineers. Using the widely adopted N-gram model, we present empirical evidence supporting positive answers to both questions. We show that code is highly repetitive; in fact, even more so than natural language. As an example use of the model, we developed a simple code completion engine for Java that, despite its simplicity, already improves on Eclipse's built-in completion capability.
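To make the idea concrete, the sketch below (my own minimal illustration, not the authors' engine) shows how an N-gram view of code can drive completion: it counts token trigrams in a lexed corpus and proposes the most frequent continuation of the two tokens just typed. Class and method names here are hypothetical.

```java
import java.util.*;

/**
 * Minimal sketch of trigram-based code suggestion: count how often each token
 * follows a two-token context, then suggest the most frequent follower.
 */
public class TrigramSuggester {
    // Maps a two-token context to counts of the tokens that followed it.
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    /** Record every (t[i-2], t[i-1]) -> t[i] occurrence in a token stream. */
    public void train(List<String> tokens) {
        for (int i = 2; i < tokens.size(); i++) {
            String context = tokens.get(i - 2) + " " + tokens.get(i - 1);
            counts.computeIfAbsent(context, k -> new HashMap<>())
                  .merge(tokens.get(i), 1, Integer::sum);
        }
    }

    /** Suggest the most frequent token seen after the given two-token context. */
    public Optional<String> suggest(String prev2, String prev1) {
        Map<String, Integer> followers = counts.get(prev2 + " " + prev1);
        if (followers == null) return Optional.empty();
        return followers.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }

    public static void main(String[] args) {
        TrigramSuggester model = new TrigramSuggester();
        // Toy "corpus"; a real model would be trained on a large body of lexed code.
        model.train(Arrays.asList(
                "for", "(", "int", "i", "=", "0", ";",
                "i", "<", "n", ";", "i", "++", ")"));
        System.out.println(model.suggest("(", "int").orElse("?")); // prints "i"
    }
}
```

A production engine would add smoothing, larger contexts, and integration with the IDE's lexical scope, but the core remains this simple frequency model over token sequences.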
Short Bio: Earl Barr is a lecturer at University College London. He received his M.S. (1999) and Ph.D. (2009) degrees, both in Computer Science, from the University of California, Davis. He was awarded the highly competitive I3P Fellowship from the Department of Homeland Security in 2010 and serves as a co-PI on three NSF grants and an Air Force DURIP grant. Dr. Barr's research interests include testing and analysis, empirical software engineering, computer security, and distributed systems. His recent work focuses on testing and analysis of numerical software, automated debugging, defect analysis and prediction, and code obfuscation.