2003 Demo of COPLINK

On February 24, 2010, in RESEARCH, by RyanLA

This is part of the i2 solution. It’s pretty scary how it spiders the web looking for usage patterns to determine who MIGHT be a criminal.

Authorship Analysis: Authorship analysis attempts to determine the likelihood that a particular author wrote a piece of work, based on characteristics of the author [2]. The essence of this technique is the formation of a set of metrics, or forensics, that remain relatively constant across a large number of writings created by the same person [3]. In the context of cybercrime research, this technique can help determine whether a set of illegal Internet messages belongs to the same user, based on the person’s writing style.

[2] A. Gray, P. Sallis, and S. MacDonell, “Software forensics: Extending authorship analysis techniques to computer programs,” in Proc. 3rd Biannual Conf. Int. Assoc. of Forensic Linguists (IAFL’97), pages 1-8, 1997.
[3] O. de Vel, “Mining e-mail authorship,” in Proc. Workshop on Text Mining, ACM International Conference on Knowledge Discovery and Data Mining (KDD) 2000.

Experiment – Data Collection: An experiment was conducted to test the prediction accuracy of an authorship analysis algorithm. Two types of data were used:

- 70 email messages. 3 students provided 20-30 email messages each. Messages were randomly chosen by their authors and covered a variety of topics.
- 153 newsgroup messages. 3 popular USENET newsgroups related to software trading were selected:
    misc.forsale.computers.other.software
    misc.forsale.computers.pc-specific.software
    misc.forsale.computers.mac-specific.software
  9 users who frequently posted messages in the 3 newsgroups were chosen. Messages posted by these users were manually checked, with help from domain experts, to determine whether they were illegal (i.e. involving sales of pirated software). For each user, 10-30 messages containing illegal content were manually downloaded.

Experiment – Feature Extraction: Previous research suggested that style markers and structural features are good indicators of an author’s style [3]. Three types of message text features were used in this experiment to determine authorship:

- Style Markers (205 features): average sentence length, total number of characters, total number of punctuation marks, etc.
- Structural Features (11 features): has a greeting, has a salutation, position of reply text, number of attachments, etc.
- Content-specific Features (9 features, for newsgroup messages only): has a list of products, position of price (in subject, in body, in list), etc.

Style markers were extracted automatically using programs. Structural and content-specific features were extracted manually.

[3] O. de Vel, “Mining e-mail authorship,” in Proc. Workshop on Text Mining, ACM International Conference on Knowledge Discovery and Data Mining (KDD) 2000.
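To make the idea of automatically extracted style markers concrete, here is a minimal Python sketch that computes three of the markers named above for a single message. The function name `style_markers`, the naive sentence splitter, and the sample message are all assumptions made for illustration; the slides do not describe the actual programs used or the remaining features.

```python
import re
import string


def style_markers(text):
    """Compute a few illustrative style markers for one message (hypothetical helper)."""
    # Naive sentence split: runs of ., ! or ? followed by whitespace or end of text.
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text) if s.strip()]
    words = text.split()
    # Count ASCII punctuation characters only.
    punctuation = sum(1 for ch in text if ch in string.punctuation)
    # Average sentence length measured in whitespace-separated words.
    avg_len = len(words) / len(sentences) if sentences else 0.0
    return {
        "total_chars": len(text),
        "total_punctuation": punctuation,
        "avg_sentence_length": avg_len,
    }


# Made-up sample message, not taken from the experiment's data.
sample = "Hello there. I have two copies for sale! Email me."
markers = style_markers(sample)
```

Each message would be turned into one such feature vector; the experiment used far more markers than the three sketched here.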

Experiment – Classification Results: A Support Vector Machine classifier [4] was used to predict the authorship of the messages based on the extracted features. A 10-fold cross-validation method was used. An improvement in accuracy was observed with different combinations of message features.

[4] C.-W. Hsu and C.-J. Lin, “A comparison of methods for multi-class support vector machines,” IEEE Transactions on Neural Networks, 13, pages 415-425, 2002.
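The 10-fold cross-validation procedure can be sketched in plain Python as follows. Note the assumptions: to keep the example dependency-free, a simple nearest-centroid classifier stands in for the Support Vector Machine, the toy feature vectors and author labels are invented, and every function name here is hypothetical. Only the fold-splitting and scoring logic is meant to mirror the evaluation method described above.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def train_centroids(X, y):
    """Stand-in classifier (NOT an SVM): one mean feature vector per author."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}


def predict(centroids, row):
    """Return the author whose centroid is closest in squared Euclidean distance."""
    def dist2(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda lab: dist2(centroids[lab]))


def cross_validate(X, y, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, and pool the accuracy."""
    correct = 0
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(X)) if i not in held_out]
        model = train_centroids(train_X, train_y)
        correct += sum(predict(model, X[i]) == y[i] for i in test_idx)
    return correct / len(X)


# Invented, well-separated toy data: two "authors" with 10 messages each.
X = [[0.01 * i, 0.0] for i in range(10)] + [[5.0 + 0.01 * i, 5.0] for i in range(10)]
y = ["author_a"] * 10 + ["author_b"] * 10
accuracy = cross_validate(X, y, k=10)
```

Every message is used for testing exactly once, so the pooled accuracy reflects performance on messages the classifier never saw during training. Swapping the stand-in for a real SVM implementation would leave `cross_validate` unchanged apart from the training and prediction calls.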