Auto-Comment by Deep Learning

Deep API Learning

Xiaodong Gu, Sunghun Kim (The Hong Kong University of Science and Technology)
Hongyu Zhang, Dongmei Zhang (Microsoft Research)

Programming is hard
  • Unfamiliar problems
  • Unfamiliar APIs [Robillard, 2009]

how to parse XML files?
    DocumentBuilderFactory.newInstance
    DocumentBuilderFactory.newDocumentBuilder
    DocumentBuilder.parse

Obtaining API usage sequences based on a query

The Problem?
  • Bag-of-words assumption!
  • Lack of a deep understanding of the semantics of the query

Limitations of IR-based Approaches
  • how to convert string to int
  • how to convert int to string
  • how to convert string to number

    static public Integer str2Int(String str) {
        Integer result = null;
        try {
            result = Integer.parseInt(str);
        } catch (Exception e) {
            String negativeMode = "";
            if (str.indexOf('-') != -1)
                negativeMode = "-";
            str = str.replaceAll("-", "");
            result = Integer.parseInt(negativeMode + str);
        }
        return result;
    }
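The first two example queries contain exactly the same words in a different order, and the third swaps "int" for the related word "number". A minimal sketch of a bag-of-words comparison shows how such a representation treats them. The class and method names are hypothetical and chosen only for illustration; this is not the retrieval code of any of the compared tools.

```java
import java.util.Map;
import java.util.TreeMap;

public class BagOfWordsDemo {
    /** Counts each lowercase word of the query; word order is discarded. */
    static Map<String, Integer> bagOfWords(String query) {
        Map<String, Integer> bag = new TreeMap<>();
        for (String word : query.toLowerCase().trim().split("\\s+")) {
            bag.merge(word, 1, Integer::sum);
        }
        return bag;
    }

    public static void main(String[] args) {
        Map<String, Integer> a = bagOfWords("how to convert string to int");
        Map<String, Integer> b = bagOfWords("how to convert int to string");
        Map<String, Integer> c = bagOfWords("how to convert string to number");
        System.out.println(a.equals(b)); // true: the order of "string" and "int" is lost
        System.out.println(a.equals(c)); // false: "int" and "number" are unrelated tokens
    }
}
```

The two opposite conversions collapse into one bag, while the two queries that ask for nearly the same thing do not match on the word that matters.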

  Limit #1: Cannot distinguish word ordering
  Limit #2: Cannot identify semantically related words

DeepAPI: Learning the Semantics
  Query: how to parse XML files
  API sequence: DocumentBuilderFactory:newInstance DocumentBuilderFactory:newDocumentBuilder DocumentBuilder:parse
  [Figure: query -> DNN Embedding Model -> vector [1.1, 2.3, 0.4, 5.0] -> DNN Language Model -> API sequence]
  Better query understanding (recognizes semantically related words and word ordering)

Background: RNN (Recurrent Neural Network)
  [Figure: an RNN over the input "parse xml file", with input layer (w1, w2, w3), hidden layer (h1, h2, h3), and output layer]
  • Hidden layers are recurrently used for computation.
  • This creates an internal state of the network to record dynamic temporal behavior.

Background: RNN Encoder-Decoder
  A deep learning model for sequence-to-sequence learning.
  • Encoder: an RNN that encodes a sequence of words (the query) into a vector.
  • Decoder: an RNN (language model) that sequentially generates a sequence of words (APIs) based on the (query) vector.
  • Training: minimize the cost function.

RNN Encoder-Decoder Model for API Sequence Generation
  [Figure: the encoder RNN reads the input x1, x2, x3 ("Read Text File") through hidden states h1, h2, h3 into a context vector c. The decoder RNN, with hidden states h1 to h5, takes c and the previous outputs y1 to y4 and emits the outputs y1 to y5: FileReader.new BufferedReader.new BufferedReader.read BufferedReader.close]
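How an encoder RNN folds a query into a single context vector can be sketched with the simplest recurrence, h_t = tanh(W x_t + U h_(t-1)). This is a toy illustration under stated assumptions: the weights, the 2-dimensional word vectors, and all names are made up, and a plain tanh recurrence stands in for the actual trained network.

```java
import java.util.Arrays;

public class TinyRnnEncoder {
    // Hypothetical toy weights: 2-dimensional word vectors, 2 hidden units.
    static final double[][] W = {{0.5, -0.3}, {0.1, 0.8}};  // input-to-hidden
    static final double[][] U = {{0.2, 0.4}, {-0.6, 0.1}};  // hidden-to-hidden

    /** One recurrent step: h_t = tanh(W * x_t + U * h_prev). */
    static double[] step(double[] x, double[] hPrev) {
        double[] h = new double[hPrev.length];
        for (int i = 0; i < h.length; i++) {
            double sum = 0.0;
            for (int j = 0; j < x.length; j++) sum += W[i][j] * x[j];
            for (int j = 0; j < hPrev.length; j++) sum += U[i][j] * hPrev[j];
            h[i] = Math.tanh(sum);
        }
        return h;
    }

    /** Runs the recurrence over all word vectors; the final hidden state is the context vector c. */
    static double[] encode(double[][] wordVectors) {
        double[] h = new double[U.length]; // h_0 = 0
        for (double[] x : wordVectors) h = step(x, h);
        return h;
    }

    public static void main(String[] args) {
        double[][] readTextFile = {{1, 0}, {0, 1}, {1, 1}}; // made-up vectors for "read text file"
        double[][] textReadFile = {{0, 1}, {1, 0}, {1, 1}}; // same words, different order
        System.out.println(Arrays.toString(encode(readTextFile)));
        System.out.println(Arrays.toString(encode(textReadFile)));
    }
}
```

Because each hidden state depends on the previous one, the two orderings end in different context vectors, which is the property a bag-of-words representation lacks.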

Enhancing RNN Encoder-Decoder Model with API Importance
  • Different APIs have different importance for a programming task.
      File.new FileWriter.new Logger.log FileWriter.write
  • Weaken the unimportant APIs:
      IDF-based weighting
      Regularized cost function

System Overview
  [Figure: offline training (code corpus -> natural language annotations and API sequences -> training instances -> training -> RNN Encoder-Decoder); an API-related user query is given to the RNN Encoder-Decoder, which returns suggested API sequences]

Step 1: Preparing a Parallel Corpus

    API sequences (Java)                     Annotations (English)
    ...                                      # copy a file from an inputstream to an outputstream
    URL.new URL.openConnection               # open a url
    File.new File.exists                     # test file exists
    File.renameTo File.delete                # rename a file
    StringBuffer.new StringBuffer.reverse    # reverse a string

  • Collect 442,928 Java projects from GitHub (2008-2014).
  • Parse source files into ASTs using Eclipse JDT.
  • Extract an API sequence and an annotation for each method body (when a Javadoc comment exists).

Extracting API Usage Sequences
  Post-order traversal of each AST:
  • Constructor invocation: new C() => C.new
  • Method call: o.m() => C.m (C is the class of the object o)
  • Parameters: o1.m1(o2.m2(), o3.m3()) => C2.m2-C3.m3-C1.m1
  • A sequence of statements: stmt1; stmt2; ...; stmtt; => s1-s2-...-st
  • Conditional statement: if(stmt1){stmt2;} else{stmt3;} => s1-s2-s3
  • Loop statement: while(stmt1){stmt2;} => s1-s2

    ...
    BufferedReader reader = new BufferedReader(...);
    ...
    while ((line = reader.readLine()) != null)
        ...
    reader.close();

  [Figure: AST of the example. Body > Statement nodes: Variable Declaration (Type BufferedReader, Variable reader, Constructor Invocation), While Statement (Method Invocation readLine on Variable reader, Block Statement)]
  Extracted sequence: BufferedReader.new BufferedReader.readLine BufferedReader.close

Extracting Natural Language Annotations
  The first sentence of a documentation comment.

    /**
     * Copies bytes from a large (over 2GB) InputStream to an OutputStream.
     * This method uses the provided buffer, so there is no need to use a
     * BufferedInputStream.
     * @param input the InputStream to read from
     * . . .
     * @since 2.2
     */
    public static long copyLarge(final InputStream input, final OutputStream output, final byte[] buffer)
            throws IOException {
        long count = 0;
        int n;
        while (EOF != (n = input.read(buffer))) {
            output.write(buffer, 0, n);
            count += n;
        }
        return count;
    }
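The rule "first sentence of a documentation comment" can be sketched as follows. The class and method names are hypothetical, and the normalization steps (dropping parenthetical remarks and lowercasing) are assumptions inferred from the copyLarge example rather than a documented part of the pipeline. A sentence is assumed to end at the first period followed by whitespace or the end of the text.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnnotationExtractor {
    private static final Pattern SENTENCE_END = Pattern.compile("\\.(\\s|$)");

    /** Returns the first sentence of a Javadoc comment as a lowercase annotation. */
    static String firstSentence(String javadoc) {
        String text = javadoc.trim()
                .replaceFirst("^/\\*+", "")            // opening "/**"
                .replaceFirst("\\*+/$", "")            // closing "*/"
                .replaceAll("(?m)^\\s*\\*+", " ")      // leading "*" of each line
                .replaceAll("\\s*\\([^)]*\\)", "")     // assumed: drop parenthetical remarks
                .replaceAll("\\s+", " ")
                .trim();
        if (text.startsWith("@")) return "";            // no description, only block tags
        int tag = text.indexOf(" @");                   // cut at the first block tag
        if (tag >= 0) text = text.substring(0, tag);
        Matcher m = SENTENCE_END.matcher(text);
        if (m.find()) text = text.substring(0, m.start() + 1);
        return text.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String comment = "/**\n"
                + " * Copies bytes from a large (over 2GB) InputStream to an OutputStream.\n"
                + " * This method uses the provided buffer, so there is no need to use a\n"
                + " * BufferedInputStream.\n"
                + " * @param input the InputStream to read from\n"
                + " * @since 2.2\n"
                + " */";
        System.out.println(firstSentence(comment));
        // prints: copies bytes from a large inputstream to an outputstream.
    }
}
```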

  Extracted pair for copyLarge:
    API sequence: InputStream.read OutputStream.write
    Annotation: copies bytes from a large inputstream to an outputstream.

Step 2: Training the RNN Encoder-Decoder Model
  • Data: 7,519,907 pairs
  • Neural network: Bi-GRU, 2 hidden layers, 1,000 hidden units
  • Word embedding dimension: 120
  • Training algorithm: SGD + Adadelta
  • Batch size: 200
  • Hardware: Nvidia K20 GPU

Evaluation
  • RQ1: How accurate is DeepAPI for generating API usage sequences?
  • RQ2: How accurate is DeepAPI under different parameter settings?
  • RQ3: Do the enhanced RNN Encoder-Decoder models improve the accuracy of DeepAPI?

RQ1: How accurate is DeepAPI for generating API usage sequences?
  Automatic evaluation
  • Data set: 7,519,907 snippets with Javadoc comments
  • Training set: 7,509,907 pairs
  • Test set: 10,000 pairs

  Accuracy measure: BLEU
  • Measures the n-gram hits of a candidate sequence against the ground-truth sequence.

RQ1: Comparison methods
  • Code search with pattern mining
      Code search: Lucene
      Summarizing API patterns: UP-Miner [Wang, MSR13]
  • SWIM [Raghothaman, ICSE16]
      Query-to-API mapping: statistical word alignment
      Searching API sequences using the bag of APIs: information retrieval
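The BLEU measure used in the automatic evaluation can be sketched as clipped n-gram precision combined with a brevity penalty. This is a minimal sentence-level version under stated assumptions: uniform weights, no smoothing, and a caller-chosen maximum n-gram order (maxN), since the slides do not state the exact configuration. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BleuSketch {
    /** Counts the n-grams of order n in a token sequence. */
    static Map<String, Integer> ngrams(List<String> tokens, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            counts.merge(String.join(" ", tokens.subList(i, i + n)), 1, Integer::sum);
        }
        return counts;
    }

    /** BLEU = BP * exp(mean over n of log p_n), where p_n is the clipped n-gram precision. */
    static double bleu(List<String> candidate, List<String> reference, int maxN) {
        double logSum = 0.0;
        for (int n = 1; n <= maxN; n++) {
            Map<String, Integer> cand = ngrams(candidate, n);
            Map<String, Integer> ref = ngrams(reference, n);
            int hits = 0;
            int total = 0;
            for (Map.Entry<String, Integer> e : cand.entrySet()) {
                hits += Math.min(e.getValue(), ref.getOrDefault(e.getKey(), 0)); // clip to reference count
                total += e.getValue();
            }
            if (hits == 0) return 0.0; // no smoothing: any empty precision gives 0
            logSum += Math.log((double) hits / total);
        }
        // Brevity penalty: candidates shorter than the reference are penalized.
        double bp = candidate.size() >= reference.size()
                ? 1.0
                : Math.exp(1.0 - (double) reference.size() / candidate.size());
        return bp * Math.exp(logSum / maxN);
    }

    public static void main(String[] args) {
        List<String> reference = Arrays.asList("File.new", "File.exists");
        System.out.println(bleu(Arrays.asList("File.new", "File.exists"), reference, 2));
        System.out.println(bleu(Arrays.asList("File.new", "File.exists", "File.delete"), reference, 2));
        System.out.println(bleu(Arrays.asList("File.exists", "File.new"), reference, 2));
    }
}
```

An exact match scores 1.0, the candidate with an extra API scores about 0.577 (unigram precision 2/3, bigram precision 1/2), and the reordered candidate scores 0.0 because its only bigram does not occur in the reference.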

RQ1: Human evaluation
  • 30 API-related natural language queries:
      17 from Bing search logs
      13 longer queries and queries with semantically related words
  • Accuracy metrics:
      FRank: the rank of the first relevant result in the result list
      Relevancy ratio

RQ1: Examples

  DeepAPI
  • Distinguishes word ordering:
      convert int to string => Integer.toString
      convert string to int => Integer.parseInt
  • Identifies semantically related words:
      save an image to a file => File.new ImageIO.write
      write an image to a file => File.new ImageIO.write
  • Understands longer queries:
      copy a file and save it to your destination path
      play the audio clip at the specified absolute URL

  SWIM
  • Partially matched sequences:
      generate md5 hashcode => Object.hashCode
  • Project-specific results:
      test file exists => File.new, File.exists, File.getName, File.new, File.delete, FileInputStream.new, ...
  • Hard to understand longer queries:
      copy a file and save it to your destination path

RQ2: Accuracy Under Different Parameter Settings
  [Figure: BLEU scores under different numbers of hidden units and word dimensions]

RQ3: Performance of the Enhanced RNN Encoder-Decoder Models
  [Table: BLEU scores of different models (%)]

Conclusion
  • Apply RNN Encoder-Decoder to generate API usage sequences for a given natural language query.
      Recognizes semantically related words
      Recognizes word ordering

Future Work
  • Explore the applications of this model to other problems.
  • Investigate the synthesis of sample code from the generated API sequences.

Thanks!
