Designing a Modular and Distributed Web Crawler Focused on Unstructured Cybersecurity Intelligence
Authors: Don Jenkins, L. Liebrock, V. Urias
Venue: 2021 International Carnahan Conference on Security Technology (ICCST), published 2021-10-11
DOI: 10.1109/ICCST49569.2021.9717379
Cybersecurity-related information available on the Internet has many use cases. Tasks in natural language processing and machine learning require large amounts of structured, labeled data. However, recent data is in short supply because it is difficult to retrieve, sanitize, and label. Data on the Internet is generally diverse and unstructured, and storing it in a form that is readily usable for research and development is not a straightforward task. We propose architectural considerations for developing a distributed system consisting of web crawlers, web scrapers, and various post-processing components, along with possible implementations of these considerations. Our team developed such a system, which applies structure to open source intelligence data from the Internet and stores it in Splunk, an easily searchable software platform.
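The abstract names three kinds of components (crawlers, scrapers, post-processing) but gives no implementation detail, so the following is only a minimal sketch of how such stages might be decoupled, not the authors' design. Everything in it is an assumption made for illustration: the function names, the in-memory `FAKE_WEB` dictionary standing in for HTTP fetches, the `osint:web` source type, and the use of in-process queues where a distributed deployment would presumably use a message broker. The output records use `sourcetype` and `event` keys, which is one JSON shape Splunk can ingest; nothing is actually sent to Splunk here.

```python
import json
import queue
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical stand-in for the Web (URL -> HTML); a real crawler would fetch over HTTP.
FAKE_WEB = {
    "http://example.test/": '<html><title>Index</title><body><a href="/adv1">Advisory</a></body></html>',
    "http://example.test/adv1": '<html><title>Advisory 1</title><body><p>Buffer overflow in foo.</p><a href="/">home</a></body></html>',
}


class PageParser(HTMLParser):
    """Collects the title, visible text, and outgoing links of one page."""

    def __init__(self):
        super().__init__()
        self.links, self.text, self.title = [], [], ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif data.strip():
            self.text.append(data.strip())


def parse(html):
    parser = PageParser()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser


def crawl(seed, fetch, out_q):
    """Crawler stage: breadth-first traversal that forwards raw pages downstream."""
    seen, frontier = {seed}, [seed]
    while frontier:
        url = frontier.pop(0)
        html = fetch(url)
        if html is None:  # unknown URL in this toy setup
            continue
        out_q.put((url, html))
        for href in parse(html).links:
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                frontier.append(link)


def scrape(in_q, out_q):
    """Scraper stage: turns raw HTML into a structured record."""
    while not in_q.empty():
        url, html = in_q.get()
        page = parse(html)
        out_q.put({"url": url, "title": page.title.strip(), "body": " ".join(page.text)})


def to_events(in_q, sourcetype="osint:web"):
    """Post-processing stage: wraps each record as a JSON event for an indexer."""
    events = []
    while not in_q.empty():
        events.append(json.dumps({"sourcetype": sourcetype, "event": in_q.get()}, sort_keys=True))
    return events


def run_pipeline(seed, fetch=FAKE_WEB.get):
    raw_q, record_q = queue.Queue(), queue.Queue()
    crawl(seed, fetch, raw_q)
    scrape(raw_q, record_q)
    return to_events(record_q)


if __name__ == "__main__":
    for line in run_pipeline("http://example.test/"):
        print(line)
```

Because each stage only reads from one queue and writes to the next, any stage could in principle be replaced or scaled out independently, which is the kind of modularity the abstract alludes to.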