A Source Code Similarity Approach Based on Improved Convolutional Neural Networks

XIE Chunli, LIN Jiangxu, LIU Xiaoyang, ZHANG Wenbin, HUANG Junwei

Citation: XIE Chunli, LIN Jiangxu, LIU Xiaoyang, ZHANG Wenbin, HUANG Junwei. A Source Code Similarity Approach Based on Improved Convolutional Neural Networks[J]. Applied Mathematics and Mechanics, 2019, 40(11): 1235-1245. doi: 10.21656/1000-0887.400221


doi: 10.21656/1000-0887.400221
Funding: National Natural Science Foundation of China (61773185; 61877030; 61502212); Qing Lan Project of Jiangsu Higher Education Institutions
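Details
    About the authors:

    XIE Chunli (b. 1979), female, associate professor, Ph.D. (E-mail: xcl_bhb@163.com); LIU Xiaoyang (b. 1979), male, professor, Ph.D. (corresponding author. E-mail: liuxiaoyang1979@gmail.com); ZHANG Wenbin (b. 1976), male, lecturer, M.S. (E-mail: zwbwen@163.com).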

  • CLC number: TP311

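  • Abstract: Source code similarity is the degree to which different code fragments are functionally similar, and it is an important research problem in software engineering. Existing methods mainly compute similarity from the textual and structural perspectives, using statistical features of the code; their main drawback is that they cannot express the semantic features of the code. To address this problem, a source code similarity detection method is proposed that is based on a convolutional neural network fusing statistical information (statistics information for code embedding convolutional neural networks, SICE-CNN). The method first represents the source code through word embedding to obtain word embedding vectors for the code; second, it builds a CNN training model to learn an embedded representation of source code documents; finally, it computes the cosine similarity of each pair of source code files. Experiments show that the method achieves a measurable performance improvement over general word embedding methods and detects the semantic similarity of source code reasonably well.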
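
The three steps named in the abstract (token embedding, a convolutional document representation, cosine similarity of a code pair) can be illustrated with a minimal sketch. This is not the authors' SICE-CNN: the abstract does not specify the network architecture or how the statistical information is fused, so the embedding table `EMBEDDINGS`, the fixed filters `FILTERS`, and all function names below are hypothetical stand-ins chosen only to show the data flow.

```python
import math

# Hypothetical 2-dimensional toy embeddings; real vectors would be learned
# from a code corpus (for example with a word2vec-style model).
EMBEDDINGS = {
    "for": [1.0, 0.0],
    "while": [0.9, 0.1],
    "i": [0.0, 1.0],
    "j": [0.1, 0.9],
    "print": [0.5, 0.5],
}

# Hypothetical fixed filters of width 2; a trained CNN would learn these weights.
FILTERS = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]


def embed(tokens):
    """Step 1: map each code token to its embedding vector."""
    return [EMBEDDINGS[t] for t in tokens]


def conv_max_pool(vectors, filters):
    """Step 2: slide each filter over the token sequence, apply ReLU,
    and keep the maximum activation per filter (max-over-time pooling)."""
    features = []
    for filt in filters:
        width = len(filt)
        activations = []
        for start in range(len(vectors) - width + 1):
            window = vectors[start:start + width]
            total = sum(
                w * x
                for filt_row, vec in zip(filt, window)
                for w, x in zip(filt_row, vec)
            )
            activations.append(max(0.0, total))
        features.append(max(activations))
    return features


def cosine_similarity(a, b):
    """Step 3: cosine of the angle between two document vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for an all-zero vector
    return dot / (norm_a * norm_b)


def code_similarity(tokens_a, tokens_b):
    """Similarity of two token sequences (each at least 2 tokens long)."""
    doc_a = conv_max_pool(embed(tokens_a), FILTERS)
    doc_b = conv_max_pool(embed(tokens_b), FILTERS)
    return cosine_similarity(doc_a, doc_b)
```

Calling `code_similarity(["for", "i", "print"], ["while", "j", "print"])` returns a score between 0 and 1 for these non-negative toy vectors. In a trained model the filters and embeddings would be learned from labeled code pairs rather than fixed by hand.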
Publication history
  • Received:  2019-07-22
  • Revised:  2019-09-23
  • Published:  2019-11-01
