postgraduate thesis: Document analysis with text mining approaches in digital forensics

Title: Document analysis with text mining approaches in digital forensics
Authors: Yang, Min (楊敏)
Issue Date: 2017
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Yang, M. [楊敏]. (2017). Document analysis with text mining approaches in digital forensics. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: Textual evidence is important to digital investigation because it provides valuable information for criminal analysis. However, discovering valuable information in massive data is challenging. In this dissertation, we employ text mining techniques to analyze textual data in digital forensics. Specifically, we study three problems: information extraction, authorship attribution and sentiment analysis.

First, information extraction (IE) automatically extracts useful information, patterns and trends from massive text data, and it is increasingly important in digital investigation as the volume of potential digital evidence has grown rapidly. We propose a two-stage information extraction framework, which may help digital investigators find evidence more efficiently. In the first stage, we apply a named entity recognition approach to the collected text data to extract personal names, locations and organizations. In the second stage, we use association rule mining to identify relations among the extracted named entities. We validate the effectiveness of the framework on the Enron email dataset. Experimental results show that the proposed framework can help investigators find relevant information in the text data effectively and efficiently.

Second, an increasing number of criminal activities are committed by anonymously spreading falsehoods and illegal content on the Internet. It is difficult to trace and identify criminals in cybercrime investigation. Consequently, automatic authorship attribution of digital data has become essential in digital investigation. Although many achievements have been made, traditional authorship attribution approaches are seldom used in forensic examination because of their low accuracy. In this thesis, we propose a novel authorship attribution model that combines the profile-based approach and the instance-based approach. Instead of asserting that a given text is written by a particular author, our approach aims to reduce the number of candidate authors and narrow the scope of suspects with high accuracy. Our experimental results demonstrate that our algorithm can successfully output a small number of candidate authors with high accuracy.

Finally, people tend to express their emotions on opinion-rich websites, such as online review sites, forums, blogs and microblogging sites. Performing sentiment analysis on these online posts is important to digital investigation since the posts usually represent the senders’ emotional fingerprints. We propose the LCCT (Lexicon-based and Corpus-based, Co-Training) model for semi-supervised sentiment classification. Our method combines lexicon-based learning with corpus-based learning in a unified co-training framework. The proposed model is capable of incorporating both domain-specific and domain-independent knowledge. Compared with state-of-the-art sentiment classification methods, the LCCT model exhibits better performance on different datasets in both English and Chinese.
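
The second stage of the information extraction framework described in the abstract (association rule mining over extracted named entities) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `mine_pair_rules`, the thresholds, and the sample entities are all hypothetical, and the input is assumed to be the per-message entity sets already produced by a named entity recognition step. Only pairwise rules are mined here, using the standard support and confidence measures.

```python
from collections import Counter
from itertools import combinations


def mine_pair_rules(entity_sets, min_support=0.5, min_confidence=0.6):
    """Return (antecedent, consequent, support, confidence) rules for entity pairs.

    Hypothetical helper for illustration; thresholds are arbitrary defaults.
    """
    n_docs = len(entity_sets)
    item_counts = Counter()   # number of messages mentioning each entity
    pair_counts = Counter()   # number of messages mentioning both entities
    for entities in entity_sets:
        unique = sorted(set(entities))
        item_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n_docs
        if support < min_support:
            continue
        # Each frequent pair can yield a rule in either direction.
        for antecedent, consequent in ((a, b), (b, a)):
            confidence = count / item_counts[antecedent]
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, support, confidence))
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0], r[1]))


# Hypothetical output of the entity recognition stage: one set per message.
messages = [
    {"Alice", "Acme Corp", "Houston"},
    {"Alice", "Acme Corp"},
    {"Bob", "Houston"},
    {"Alice", "Acme Corp", "Bob"},
]

for antecedent, consequent, support, confidence in mine_pair_rules(messages):
    print(f"{antecedent} -> {consequent} "
          f"(support={support:.2f}, confidence={confidence:.2f})")
```

In this toy input only the pair "Alice" and "Acme Corp" co-occurs often enough to pass the support threshold, so it is the only relation reported. A real pipeline would likely use a full frequent-itemset algorithm such as Apriori rather than pairs alone.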
Degree: Doctor of Philosophy
Subject: Data mining; Computer crimes - Investigation
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/240676
HKU Library Item ID: b5855006

 

DC Field | Value | Language
dc.contributor.author | Yang, Min | -
dc.contributor.author | 楊敏 | -
dc.date.accessioned | 2017-05-09T23:14:54Z | -
dc.date.available | 2017-05-09T23:14:54Z | -
dc.date.issued | 2017 | -
dc.identifier.citation | Yang, M. [楊敏]. (2017). Document analysis with text mining approaches in digital forensics. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | -
dc.identifier.uri | http://hdl.handle.net/10722/240676 | -
dc.description.abstract | Textual evidence is important to digital investigation because it provides valuable information for criminal analysis. However, discovering valuable information in massive data is challenging. In this dissertation, we employ text mining techniques to analyze textual data in digital forensics. Specifically, we study three problems: information extraction, authorship attribution and sentiment analysis. First, information extraction (IE) automatically extracts useful information, patterns and trends from massive text data, and it is increasingly important in digital investigation as the volume of potential digital evidence has grown rapidly. We propose a two-stage information extraction framework, which may help digital investigators find evidence more efficiently. In the first stage, we apply a named entity recognition approach to the collected text data to extract personal names, locations and organizations. In the second stage, we use association rule mining to identify relations among the extracted named entities. We validate the effectiveness of the framework on the Enron email dataset. Experimental results show that the proposed framework can help investigators find relevant information in the text data effectively and efficiently. Second, an increasing number of criminal activities are committed by anonymously spreading falsehoods and illegal content on the Internet. It is difficult to trace and identify criminals in cybercrime investigation. Consequently, automatic authorship attribution of digital data has become essential in digital investigation. Although many achievements have been made, traditional authorship attribution approaches are seldom used in forensic examination because of their low accuracy. In this thesis, we propose a novel authorship attribution model that combines the profile-based approach and the instance-based approach. Instead of asserting that a given text is written by a particular author, our approach aims to reduce the number of candidate authors and narrow the scope of suspects with high accuracy. Our experimental results demonstrate that our algorithm can successfully output a small number of candidate authors with high accuracy. Finally, people tend to express their emotions on opinion-rich websites, such as online review sites, forums, blogs and microblogging sites. Performing sentiment analysis on these online posts is important to digital investigation since the posts usually represent the senders’ emotional fingerprints. We propose the LCCT (Lexicon-based and Corpus-based, Co-Training) model for semi-supervised sentiment classification. Our method combines lexicon-based learning with corpus-based learning in a unified co-training framework. The proposed model is capable of incorporating both domain-specific and domain-independent knowledge. Compared with state-of-the-art sentiment classification methods, the LCCT model exhibits better performance on different datasets in both English and Chinese. | -
dc.language | eng | -
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | -
dc.relation.ispartof | HKU Theses Online (HKUTO) | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | -
dc.rights | The author retains all proprietary rights, (such as patent rights) and the right to use in future works. | -
dc.subject.lcsh | Data mining | -
dc.subject.lcsh | Computer crimes - Investigation | -
dc.title | Document analysis with text mining approaches in digital forensics | -
dc.type | PG_Thesis | -
dc.identifier.hkul | b5855006 | -
dc.description.thesisname | Doctor of Philosophy | -
dc.description.thesislevel | Doctoral | -
dc.description.thesisdiscipline | Computer Science | -
dc.description.nature | published_or_final_version | -