T. Merten, Bastian Mager, Simone Bürsner, B. Paech
{"title":"将非结构化数据分类为自然语言文本和技术信息","authors":"T. Merten, Bastian Mager, Simone Bürsner, B. Paech","doi":"10.1145/2597073.2597112","DOIUrl":null,"url":null,"abstract":"Software repository data, for example in issue tracking systems, include natural language text and technical information, which includes anything from log files via code snippets to stack traces. \n However, data mining is often only interested in one of the two types e.g. in natural language text when looking at text mining. Regardless of which type is being investigated, any techniques used have to deal with noise caused by fragments of the other type i.e. methods interested in natural language have to deal with technical fragments and vice versa. \n This paper proposes an approach to classify unstructured data, e.g. development documents, into natural language text and technical information using a mixture of text heuristics and agglomerative hierarchical clustering. \n The approach was evaluated using 225 manually annotated text passages from developer emails and issue tracker data. Using white space tokenization as a basis, the overall precision of the approach is 0.84 and the recall is 0.85.","PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"41 1","pages":"300-303"},"PeriodicalIF":0.0000,"publicationDate":"2014-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"Classifying unstructured data into natural language text and technical information\",\"authors\":\"T. Merten, Bastian Mager, Simone Bürsner, B. Paech\",\"doi\":\"10.1145/2597073.2597112\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Software repository data, for example in issue tracking systems, include natural language text and technical information, which includes anything from log files via code snippets to stack traces. \\n However, data mining is often only interested in one of the two types e.g. in natural language text when looking at text mining. Regardless of which type is being investigated, any techniques used have to deal with noise caused by fragments of the other type i.e. methods interested in natural language have to deal with technical fragments and vice versa. \\n This paper proposes an approach to classify unstructured data, e.g. development documents, into natural language text and technical information using a mixture of text heuristics and agglomerative hierarchical clustering. \\n The approach was evaluated using 225 manually annotated text passages from developer emails and issue tracker data. Using white space tokenization as a basis, the overall precision of the approach is 0.84 and the recall is 0.85.\",\"PeriodicalId\":6621,\"journal\":{\"name\":\"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)\",\"volume\":\"41 1\",\"pages\":\"300-303\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-05-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2597073.2597112\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2597073.2597112","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Classifying unstructured data into natural language text and technical information
Software repository data, for example in issue tracking systems, include natural language text and technical information, which includes anything from log files via code snippets to stack traces.
However, data mining is often only interested in one of the two types e.g. in natural language text when looking at text mining. Regardless of which type is being investigated, any techniques used have to deal with noise caused by fragments of the other type i.e. methods interested in natural language have to deal with technical fragments and vice versa.
This paper proposes an approach to classify unstructured data, e.g. development documents, into natural language text and technical information using a mixture of text heuristics and agglomerative hierarchical clustering.
The approach was evaluated using 225 manually annotated text passages from developer emails and issue tracker data. Using white space tokenization as a basis, the overall precision of the approach is 0.84 and the recall is 0.85.