Recent works have concluded that software is more repetitive and predictable, i.e. more natural, than English texts. These works included “simple/artificial” syntax rules in their language models. When we remove SyntaxTokens, we find that code is still repetitive and predictable, but only at levels slightly above English. Furthermore, previous works have compared individual Java programs to general English corpora, such as Gutenberg, which contains a historically wide range of styles and subjects (e.g. Saint Augustine to Oscar Wilde). We perform an additional comparison of technical StackOverflow English discussions with source code and find that this restricted English is about as repetitive as code. Although we find that code is less repetitive than previously thought, we suspect that API code element usage will be repetitive across software projects. For example, a file is opened and closed in the same manner irrespective of domain. When we restrict our n-grams to those contained in the Java API, we find that the entropy is significantly lower than that of the English corpora. Previous works have focused on linear sequences of tokens. When we extract program graphs of 2, 3, and 4 nodes, we see that the abstract graph representation is much more concise and repetitive than the sequential representations of the same code. This suggests that future work should focus on statistical graph models that go beyond linear sequences of tokens. Our anonymous replication package makes our scripts and data available to future researchers and reviewers.
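As a rough illustration of the kind of measurement the abstract describes, the sketch below computes the cross-entropy (bits per token) of an n-gram model before and after syntax tokens are filtered out. It is not taken from the replication package: the `SYNTAX_TOKENS` set, the function names, and the add-one smoothing are all assumptions chosen for brevity, and the paper's actual token list and language model may differ.

```python
import math
from collections import Counter

# Hypothetical syntax-token list; the paper's actual list is not given in the abstract.
SYNTAX_TOKENS = {"(", ")", "{", "}", "[", "]", ";", ",", ".", "="}


def strip_syntax(tokens):
    """Drop tokens that are pure syntax, keeping identifiers, keywords, literals."""
    return [t for t in tokens if t not in SYNTAX_TOKENS]


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def cross_entropy(train, test, n=3):
    """Average bits per n-gram of `test` under an n-gram model fit on `train`.

    Uses add-one (Laplace) smoothing, an assumption made here for simplicity.
    Lower values mean the test tokens are more predictable from the training data.
    """
    vocab_size = len(set(train) | set(test))
    train_grams = ngrams(train, n)
    full_counts = Counter(train_grams)
    # Count each context only where it is followed by a token.
    context_counts = Counter(g[:-1] for g in train_grams)

    test_grams = ngrams(test, n)
    if not test_grams:
        raise ValueError("test sequence is shorter than n")

    total_bits = 0.0
    for g in test_grams:
        p = (full_counts[g] + 1) / (context_counts[g[:-1]] + vocab_size)
        total_bits -= math.log2(p)
    return total_bits / len(test_grams)
```

A comparison in the spirit of the abstract would call `cross_entropy(train, test)` on the raw token stream and again on `strip_syntax(train)` and `strip_syntax(test)`, using held-out files from the same corpus as `test`.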
Wed 29 May. Times are displayed in time zone: Eastern Time (US & Canada)
11:00 - 12:30: Mining of Software Properties and Patterns | Papers / Technical Track / Journal-First Papers at Place du Canada | Chair(s): Julia Rubin (University of British Columbia)
11:00 - 11:20 Talk | Natural Software Revisited | Technical Track | Musfiqur Rahman (Concordia University, Montreal, Canada), Dharani Palani (Concordia University), Peter Rigby (Concordia University, Montreal, Canada)
11:20 - 11:40 Talk | Towards Automating Precision Studies of Clone Detectors | Technical Track | Vaibhav Saini (Microsoft, USA), Farima Farmahinifarahani (University of California at Irvine, USA), Yadong Lu (University of California at Irvine, USA), Di Yang (University of California at Irvine, USA), Pedro Martins (University of California at Irvine, USA), Hitesh Sajnani (Microsoft), Pierre Baldi (University of California at Irvine, USA), Crista Lopes
11:40 - 11:50 Talk | Will This Clone be Short-lived? Towards a Better Understanding of the Characteristics of Short-lived Clones | Journal-First Papers | Patanamon Thongtanunam (The University of Melbourne), Weiyi Shang (Concordia University, Canada), Ahmed E. Hassan (Queen's University)
11:50 - 12:00 Talk | A systematic literature review on bad smells - 5 W's: which, when, what, who, where | Journal-First Papers | Elder Vicente De Paulo Sobrinho (Federal University of Triangulo Mineiro), Andrea De Lucia (University of Salerno), Marcelo De Almeida Maia (Federal University of Uberlandia)
12:00 - 12:10 Talk | Beyond Technical Aspects: How Do Community Smells Influence the Intensity of Code Smells? | Journal-First Papers | Fabio Palomba (University of Zurich), Damian Andrew Tamburri (TU/e), Francesca Arcelli Fontana (University of Milano-Bicocca), Rocco Oliveto (University of Molise), Andy Zaidman (TU Delft), Alexander Serebrenik (Eindhoven University of Technology)
12:10 - 12:20 Talk | On the Nature of Merge Conflicts: a Study of 2,731 Open Source Java Projects Hosted by GitHub | Journal-First Papers | Gleiph Ghiotto (UFJF), Leonardo Murta (Universidade Federal Fluminense (UFF)), Marcio Barros (UNIRIO), Andre van der Hoek (University of California, Irvine)
12:20 - 12:30 Talk | Discussion Period | Papers