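A Source Code Similarity Approach Based on Improved Convolutional Neural Networks

XIE Chunli; LIN Jiangxu; LIU Xiaoyang; ZHANG Wenbin; HUANG Junwei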

doi:10.21656/1000-0887.400221

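School of Computer Science & Technology, Jiangsu Normal University, Xuzhou, Jiangsu 221116, P.R.China

Funds: The National Natural Science Foundation of China (61773185; 61877030; 61502212); the Qing Lan Project of Jiangsu Province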

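Author biographies:
XIE Chunli (b. 1979), female, associate professor, Ph.D. (E-mail: xcl_bhb@163.com); LIU Xiaoyang (b. 1979), male, professor, Ph.D. (corresponding author. E-mail: liuxiaoyang1979@gmail.com); ZHANG Wenbin (b. 1976), male, lecturer, M.S. (E-mail: zwbwen@163.com).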

CLC number: TP311
Publication history
- Received: 2019-07-22
- Revised: 2019-09-23
- Published: 2019-11-01

Abstract

Abstract: Source code similarity is the degree to which different code fragments are functionally alike, and it is an important research problem in software engineering. Existing methods mainly compute similarity from statistical features of the text and structure of the code; their main drawback is that they cannot express the semantic features of the code. To address this problem, a source code similarity detection method based on a convolutional neural network fused with statistical information (statistics information for code embedding convolutional neural networks, SICE-CNN) is proposed. First, the source code is represented through word embedding to obtain word embedding vectors. Second, a CNN training model is constructed to learn an embedded representation of each source code document. Finally, the cosine similarity of each pair of source code documents is calculated. Experiments show that the proposed method improves performance over general word embedding methods and detects the semantic similarity of source code reasonably well.

Keywords:
- deep learning
- convolutional neural network
- code similarity
- word embedding
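
The three steps named in the abstract (word embedding, a convolutional document representation, cosine similarity) can be illustrated with a minimal sketch. This is not the authors' SICE-CNN: the embeddings and convolution filters below are random and untrained, the statistical-information fusion is omitted because the abstract does not describe it, and every name here (`tokenize`, `build_embeddings`, `conv_max_pool`, `cosine`, `similarity`) is hypothetical. It only shows how data might flow from tokens to a similarity score.

```python
import math
import random


def tokenize(code):
    """Rough lexer (illustrative only): pad punctuation, split on whitespace."""
    for ch in "(){}[];,=+-*/<>":
        code = code.replace(ch, f" {ch} ")
    return code.split()


def build_embeddings(vocab, dim, seed=0):
    """Assumption: random vectors stand in for trained word embeddings."""
    rng = random.Random(seed)
    return {tok: [rng.uniform(-1, 1) for _ in range(dim)] for tok in sorted(vocab)}


def conv_max_pool(vectors, filters, width):
    """1-D convolution over the token axis, ReLU, then max-over-time pooling."""
    dim = len(filters[0]) // width
    # Pad short inputs with zero vectors so at least one window exists.
    padded = vectors + [[0.0] * dim] * max(0, width - len(vectors))
    pooled = []
    for filt in filters:  # each filter is a flat list of width * dim weights
        best = 0.0  # starting at 0 applies the ReLU floor
        for start in range(len(padded) - width + 1):
            window = [x for vec in padded[start:start + width] for x in vec]
            best = max(best, sum(w * x for w, x in zip(filt, window)))
        pooled.append(best)
    return pooled


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def similarity(code_a, code_b, dim=8, n_filters=16, width=3, seed=0):
    """Embed tokens, pool a document vector per snippet, compare by cosine."""
    toks_a, toks_b = tokenize(code_a), tokenize(code_b)
    emb = build_embeddings(set(toks_a) | set(toks_b), dim, seed)
    rng = random.Random(seed + 1)
    # Assumption: untrained random filters; a real model would learn these.
    filters = [[rng.uniform(-1, 1) for _ in range(width * dim)]
               for _ in range(n_filters)]
    doc_a = conv_max_pool([emb[t] for t in toks_a], filters, width)
    doc_b = conv_max_pool([emb[t] for t in toks_b], filters, width)
    return cosine(doc_a, doc_b)


if __name__ == "__main__":
    a = "int add(int a, int b) { return a + b; }"
    b = "int sum(int x, int y) { int r = x + y; return r; }"
    print(similarity(a, a), similarity(a, b))
```

Because the weights are random, the score for two different snippets carries no semantic meaning here; only training on labelled or contextual data, as the abstract describes, would make the document vectors reflect functional similarity.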

