摘要

Eicient extraction of code authorship attributes is key for suc- cessful identiication. However, the extraction of such attributes is very challenging, due to various programming language speciics, the limited number of available code samples per author, and the average code lines per ile, among others. To this end, this work proposes a Deep Learning-based Code Authorship Identiication System (DL-CAIS) for code authorship attribution that facilitates large-scale, language-oblivious, and obfuscation-resilient code au- thorship identiication. The deep learning architecture adopted in this work includes TF-IDF-based deep representation using multi- ple Recurrent Neural Network (RNN) layers and fully-connected layers dedicated to authorship attribution learning. The deep repre- sentation then feeds into a random forest classiier for scalability to de-anonymize the author. Comprehensive experiments are conducted to evaluate DL-CAIS over the entire Google Code Jam (GCJ) dataset across all years (from 2008 to 2016) and over real-world code samples from 1987 public repositories on GitHub. The results of our work show the high accuracy despite requiring a smaller number of iles per author. Namely, we achieve an accuracy of 96% when experimenting with 1,600 authors for GCJ, and 94.38% for the real-world dataset for 745 C programmers. Our system also allows us to identify 8,903 authors, the largest-scale dataset used by far, with an accuracy of 92.3%. Moreover, our technique is resilient to language-speciics, and thus it can identify authors of four pro- gramming languages (e.g., C, C++, Java, and Python), and authors writing in mixed languages (e.g., Java/C++, Python/C++). Finally, our system is resistant to sophisticated obfuscation (e.g., using C Tigress) with an accuracy of 93.42% for a set of 120 authors.

背景

  1. On the one hand, code authorship identiication poses a privacy risk for programmers who wish to remain anonymous, including contributors to open-source projects, activists, and programmers who conduct programming activities on the side. Thus, in turn, this makes code authors identiication a de-anonymization problem.
  2. On the other hand, code authorship identiication is useful for software forensics and security analysts, especially for identifying malicious code (such as malware) pro- grammers; e.g., where such programmers could leave source code in a compromised system for compilation, or where features of programmers could be extracted from decompiled binaries. More- over, authorship identiication of source code is helpful with plagia- rism detection [15], authorship disputes [49], copyright infringe- ment [26], and code integrity investigations [38].
  3. compare:
  4. [20] showed that a simple reverse engineering process of binary iles can generate a pseudo-code that can be treated as a source code for code authorship identiication.

limitations

  1. (i) most software features used in the literature for author identiication are not directly applicable to another language; features extracted in Java cannot be directly used as features in C or in Python for identifying the same author,
  2. (ii) techniques used for extracting code authorship features do not scale well for a large set of authors (see section 2).
  3. (iii) the extracted features are usually large and not all of them are relevant to the identiication task, necessitating an additional procedure for feature evaluation and selection [22].
  4. the deep learning architecture alone does not accurately identify programmers (86.2% accuracy for 1000 programmers).
  • 多个程序员合作开发一个项目
  • 作者身份混淆攻击
  • 进行训练的样本不能过小
  • 代码重用的影响,当大量代码中插入小量恶意代码时无法识别真正的恶意代码作者

目的

  • 检测代码作者身份

方法

  • ecurrent neural network (RNN)
  • TF-IDF (Term Frequency-Inverse Doc- ument Frequency) that is already a well-known tool for textual data analysis
  • The basic idea of TF-IDF is to evaluate the importance of terms in a document in a corpus, where the importance of a term is proportional to the frequency of the term in a document.
  • Random Forest Classiier (RFC) is utilized as a classiier of a TF-IDF-based deep representation extracted by RNN. This approach allows us to utilize both deep learning’s good feature extraction capability and RFC’s large scale classiication capability.
  • The quality of extracted au- thorship attributes signiicantly afects the identiication accuracy and the extent to which the proposed method can scale in terms of the number of authors.
  • The techniques used in this paper are LSTM (Long Short-Term Memory) and Gated Recurrent Units (GRU) that are sorts of Recurrent Neural Network (RNN) among various Deep Neural Networks (DNN).
  • we used the Principal Components Analysis (PCA). The PCA is a statistical tool that is widely used as a visualization technique that relects the diference in observations of multidimen- sional data for the purpose of simplifying further analysis
  • In Figure 2(a), code iles are presented with the initial TF-IDF features, which are insuicient to draw a decision boundary for all programmers. In Figure 2(b), however, the deep representations have increased the margin for decision boundary so distinguishing programmers has become eas- ier. This visualization of the representations space (TF-IDF features and deep representations) illustrates the quality of representations obtained using the deep learning technique.
  • Our authorship identiier is built by feeding a TF-IDF-based deep representation extracted by RNN and then classifying the representation by RFC. This hybrid approach allows us to take advantage of both deep representation’s distinguishing attribute extraction capability and RFC’s large scale classiication capability.
    1. preprocessing,
    2. representation through learning
    3. and classiication
  • Preprocessing. The first phase starts with data preprocessing to handle code samples and generate initial representations. The initial representations of code samples are later fed into a deep learning architecture to learn more distinctive features. Finally, deep repre- sentations of code authorship attributions are used to construct a robust random forest model. Figure 3 illustrates the overall structure of our proposed system. In the irst phase, a straightforward mech- anism is used to represent source code iles based on a weighting scheme commonly used in information retrieval.
  • Representation by Learning. This phase includes learning deep representations of authorship from less distinctive ones. Those representations are learned using an architecture with multiple RNN layers and fully-connected layers.
  • Classiication. After training the deep architecture, the resulting representations are used to construct a random forest classiier with 300 trees grown to the maximum extent.
  • The RNN layers in our deep learning architecture included a dropout regularization technique [44]. In essence, this technique randomly and temporally excludes a number of units on the forward pass and weight updates on the backward pass during the training process. The dropout regularization technique has been shown to enable the neural network to reach better generalization capabilities [44].
  • RNN layers with dropout keep-rate of 0.6, followed by three fully-connected layers with ReLU activation. Each of the fully-connected layers has 1024 units except the last layer, which has 800 units representing the dimensional- ity of code authorship features for a given input file.
  • supervised learning approach
  • Thus, we use the output of layer Lk −1 (where the Lk is the softmax layer) to be the deep representations of code authorship features.
  • TensorFlow’s Adaptive Moment estimation (Adam) [32] with a learning rate of 10−4, and without reducing the learning rate over time.
  • 300 decision trees

特征

  1. Source code authorship identiication is the process of code writer identiication by associating a programmer to a given code based on the programmer’s distinctive stylometric features.
  2. Code authorship identiication relies on extracting features from source code that a programmer produces based on the program- mer’s preferences in structuring codes and naming variables.
  3. First, programming łstylež of program- mers continuously evolves as a result of their education, their ex- perience, their use of certain software engineering paradigms, and their work environment [17].
  4. Second, the programming style of programmers varies from language to another due to external con- straints placed by managers, tools, or even languages. Third, while it is sometimes possible to obtain the source code of programs, sometimes it is not, and the source code is occasionally obfuscated by automatic tools, preventing their recognition.
  5. 8,903 programmers high accuracy (92.3%)
  6. Figure 4(a) shows the impact of feature selection, using four diferent approaches, on the accuracy of our approach using TF-IDF features in identifying code authors. In this experiment, we use 1,000 features to identify authors in a 250 C++ programmers experiment. The results demonstrate a substantial accuracy rate (of over 96%) for the given problem size. In Figure 4(b), we demonstrate the impact of the number of the selected TF-IDF features on the accuracy of the classiier. We note that the accuracy increases up to some value of the number of features after which it decays quickly. The accuracy, even with the smallest number of features, is relatively high.

训练服务器性能

24 cores, one GeForce GTX TITAN X GPU, and 256GB of memory

数据集

Our experiments cover the entire Google Code Jam dataset from 2008 to 2016

The Google Code Jam (GCJ) is an international programming com- petition run by Google since 2008 [1]. At GCJ, programmers from all over the world use several programming languages and develop- ment environments to solve programming problems over multiple rounds. Each round of the competition involves writing a program to solve a small number of problemsÐthree to six, within a ixed amount of time. We evaluate our approach on the source code of solutions to programming problems from GCJ. The most commonly used programming languages at GCJ are C++, Java, Python, and C, in order. Each of those languages has a suicient number of source code samples for each programmer, thus we use them for our evaluation. For a large-scale evaluation, we used the contest code from 2008 to 2016, with general statistics as shown in Table 2. The table shows the number of iles per author across years, with the total number of authors per programming language and the average ile size (lines of code, LoC). For evaluation, we create the following three dataset views (Tables 2ś4):
(1) Dataset 1: includes iles across all years from 2008 to 2016 in a łcross-yearsž view, as shown in Table 2.
(2) Dataset 2: consists of code iles for participants drawn from
2015 and 2016 competitions for four programming languages, as shown in Table 3.
(3) Dataset 3: consists of programmers who wrote in more than
one language (i.e., Java-C++, C-C++, and Python-C++) as shown in Table 4.
Number of Files. In [19], the use of nine files per programmer for
accuracy is recommended. Our approach provides as goodśor even betterśaccuracy with only seven files, as shown in ğ5.3.

效果

  1. language kinds
  2. coding time
  3. training file number
  4. training author number
  5. Obfuscated

Large-Scale and Language-Oblivious Code Authorship Identification相关推荐

  1. Introducing DataFrames in Apache Spark for Large Scale Data Science(中英双语)

    文章标题 Introducing DataFrames in Apache Spark for Large Scale Data Science 一个用于大规模数据科学的API--DataFrame ...

  2. Paper之BigGAN:《Large Scale Gan Training For High Fidelity Natural Image Synthesis》翻译与解读

    Paper之BigGAN:<Large Scale Gan Training For High Fidelity Natural Image Synthesis>翻译与解读 目录 效果 1 ...

  3. 【多标签文本分类】Large Scale Multi-label Text Classification with Semantic Word Vectors

    ·阅读摘要:   本文提出了利用词向量+CNN/词向量+GRU来解决大规模数据下的多标签文本分类问题.   [1] Large Scale Multi-label Text Classificatio ...

  4. Machine Learning week 10 quiz: Large Scale Machine Learning

    Large Scale Machine Learning 5 试题 1. Suppose you are training a logistic regression classifier using ...

  5. Paper之BigGAN:ICLR 2019最新论文《LARGE SCALE GAN TRAINING FOR HIGH FIDELITY NATURAL IMAGE SYNTHESIS》(未完待续)

    Paper之BigGAN:ICLR 2019最新论文<LARGE SCALE GAN TRAINING FOR HIGH FIDELITY NATURAL IMAGE SYNTHESIS> ...

  6. 【原】Coursera—Andrew Ng机器学习—课程笔记 Lecture 17—Large Scale Machine Learning 大规模机器学习...

    Lecture17 Large Scale Machine Learning大规模机器学习 17.1 大型数据集的学习 Learning With Large Datasets 如果有一个低方差的模型 ...

  7. Large scale GAN training for high fidelity natural image synthesis解读

    <Large scale GANtraining for high fidelity natural image synthesis>这篇文章对训练大规模生成对抗网络进行了实验和理论分析, ...

  8. 【SFace】《SFace: An Efficient Network for Face Detection in Large Scale Variations》

    arXiv-2018 文章目录 1 Background and Motivation 2 Advantages / Contributions 3 Method 3.1 Feature Pyrami ...

  9. 【读书笔记】NeurIPS2018的两篇文章:The Tradeoffs of Large Scale Learning和Neural Ordinary Differential Equations

    今天看了 NeurIPS 2018 上的两篇文章,一篇是获得 best paper 的 Neural Ordinary Differential Equations (陈天奇的文章),一篇是获经典论文 ...

最新文章

  1. 回流,重绘的概念,以及他们的区别?
  2. Class.isAssignableFrom(Class clz)与instanceof与Class.isInstance(Object obj) 的区别和联系
  3. 深入理解Kafka(4)-主题与分区
  4. 摄像头分辨率怎么调整_手机摄像头测试:细数手机摄像头由单摄到多摄有哪些变化...
  5. 国内外独立IP行情及网站用独立IP优势面面观
  6. 软件行业资本论[前言,第一节]
  7. DataNode之BlockSender分析
  8. “npm ERR! code ELIFECYCLE npm ERR! errno 1”问题
  9. win7上安装mysql数据库_mysql-windows系统安装mysql数据库
  10. Java Json xml 格式之间转换(一)
  11. delphi VMP加壳 自检保护和硬件ID获取
  12. 【雕爷学编程】Arduino动手做(71)---红外遥控扩展模块
  13. java基本数据类型存放在哪?
  14. Cubieboard1 引导安装 Debian 系统
  15. 西南交大计算机应用基础 第2次作业 主观题目,西南交大网络教育2011-2012学年计算机应用基础第四次作业(主观题)...
  16. 19-Python基础知识学习-----迭代器与生成器
  17. 云效搭建流水线实现自动化构建部署
  18. 使用Weka进行数据可视化
  19. DEVOPS:统一DEV、OPS和QA
  20. 司新颖:当番茄遭遇鸡蛋

热门文章

  1. 关于双硬盘电脑装Windows纯净系统的问题(已解决)
  2. 基于SIMPLE的Navier-Stokes方程解算器matlab代码
  3. mysql的集群和分布式区别_集群和分布式的区别
  4. 重庆市江津区行政边界地图JSON文件精确到城镇街道-geojson-20211208
  5. 杰理-K歌宝-内置MIC的效果其实还不错
  6. win7打开谷歌浏览器死机办法解决
  7. 开源项目推荐:OpenGL之开源库OpenSceneGraph
  8. 电脑店能安装mysql_用U盘给台式机安装Linux系统
  9. php安装软件下载,PHP加密工具下载 IonCube(PHP加密软件) v10.2.0 免费安装版 下载-脚本之家...
  10. 互联网产品设计进阶(5)关于用户体验(边思考边完善)