
Proceedings of the 2013 international workshop on Mining unstructured big data using natural language processing: Latest Publications

Session details: Keynote address
Miao A. Chen
DOI: 10.1145/3250039 | Published: 2013-10-28 | Citations: 0
Mining semantics for culturomics: towards a knowledge-based approach
L. Borin, Devdatt P. Dubhashi, Markus Forsberg, Richard Johansson, D. Kokkinakis, P. Nugues
The massive amounts of text data made available through the Google Books digitization project have inspired a new field of big-data textual research. Named culturomics, this field has attracted the attention of a growing number of scholars over recent years. However, initial studies based on these data have been criticized for not referring to relevant work in linguistics and language technology. This paper provides some ideas, thoughts and first steps towards a new culturomics initiative, based this time on Swedish data, which pursues a more knowledge-based approach than previous work in this emerging field. The amount of new Swedish text produced daily, together with older texts being digitized in cultural heritage projects, grows at an accelerating rate. The volumes of text available in digital form have grown far beyond the capacity of human readers, leaving automated semantic processing as the only realistic option for accessing and using the information they contain. The aim of our recently initiated research program is to advance the state of the art in language technology resources and methods for the semantic processing of big Swedish text, focusing on the theoretical and methodological advancement of extracting and correlating information from large volumes of Swedish text using a combination of knowledge-based and statistical methods.
DOI: 10.1145/2513549.2513551 | Published: 2013-10-28 | Citations: 10
Analyzing future communities in growing citation networks
Sukhwan Jung, Aviv Segev
Citation networks contain temporal information about what researchers are interested in at a certain time. A community in such a network is built around either a renowned researcher or a common research field; either way, analyzing how the community will change gives insight into future research trends. The paper proposes methods, based on node and link prediction and community detection, to analyze how communities change over time in the citation network graph without additional external information. Different combinations of the proposed methods are also analyzed. Experiments show that the proposed methods can identify changes in citation communities multiple years into the future, with performance differing according to the analyzed time span. Furthermore, the method produces higher performance when analyzing communities to be disbanded or formed in the future.
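The abstract does not spell out the paper's prediction models. As a hedged illustration of the link-prediction ingredient it mentions, a neighborhood-overlap (Jaccard) score over a citation graph could be sketched as follows; the function name and toy graph are invented for the example, not taken from the paper:

```python
def jaccard_scores(adj):
    """Score unlinked node pairs by citation-neighborhood overlap.

    adj maps each paper to the set of papers it is linked with
    (citing or cited). A higher score suggests a likelier future
    link, and hence papers that may share a community later on.
    """
    nodes = sorted(adj)
    scores = {}
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if v in adj[u]:
                continue  # already linked; nothing to predict
            union = adj[u] | adj[v]
            if union:
                scores[(u, v)] = len(adj[u] & adj[v]) / len(union)
    return scores

# Toy citation graph: D is not yet linked to A or B, but shares neighbor C.
papers = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
scores = jaccard_scores(papers)
```

Community detection on the predicted future graph would then group nodes whose predicted links are dense, which is one way to anticipate communities forming or disbanding.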
DOI: 10.1145/2513549.2513553 | Published: 2013-10-28 | Citations: 31
Review rating prediction based on the content and weighting strong social relation of reviewers
Bing-kun Wang, Yulin Min, Yongfeng Huang, Xing Li, Fangzhao Wu
Review rating is more helpful than review binary classification for many decision processes such as consumption decision-making, company product quality tracking and public opinion mining. In review rating, reviewers are influenced not only by their own subjective feelings, but also by others' ratings of the same product. Existing review rating prediction methods are mainly based on the content of reviews; they consider only the subjective factors of reviewers, not the influence of other people in the reviewers' social relations. Based on this, we propose a review rating prediction method that incorporates the character of a reviewer's social relations, as regularization constraints, into content-based methods. In addition, we propose a method to classify the social relations of reviewers into strong social relations and ordinary social relations. When incorporating the two kinds of social relations into content-based methods, we give strong social relations a higher weight than ordinary ones. Experiments on two real movie review datasets demonstrate that the method of considering different social relations performs better than content-based methods and than a method that treats social relations uniformly.
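The abstract's idea of social relations acting as weighted regularization constraints on a content-based predictor can be sketched minimally as follows. The objective, function name, and weights are assumptions for illustration; the paper's actual model is not reproduced here:

```python
def socially_regularized_ratings(content_pred, relations, lam=0.5,
                                 iters=300, lr=0.05):
    """Refine content-based rating predictions with weighted social ties.

    content_pred: {reviewer: rating predicted from review text alone}
    relations:    {(u, v): weight}, where strong relations get a larger
                  weight than ordinary ones (assumed objective):
        sum_u (r_u - c_u)^2 + lam * sum_{(u,v)} w_uv * (r_u - r_v)^2
    minimized by plain gradient descent.
    """
    r = dict(content_pred)
    for _ in range(iters):
        grad = {u: 2.0 * (r[u] - content_pred[u]) for u in r}
        for (u, v), w in relations.items():
            grad[u] += 2.0 * lam * w * (r[u] - r[v])
            grad[v] += 2.0 * lam * w * (r[v] - r[u])
        for u in r:
            r[u] -= lr * grad[u]
    return r

# A strong tie (weight 2.0) pulls the two reviewers' ratings together.
ratings = socially_regularized_ratings(
    {"u1": 5.0, "u2": 1.0}, {("u1", "u2"): 2.0})
```

With a strong relation the two predictions move toward each other, while a smaller weight for an ordinary relation would leave them closer to the content-only estimates.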
DOI: 10.1145/2513549.2513554 | Published: 2013-10-28 | Citations: 27
Information fusion in taxonomic descriptions
Qin Wei
Providing a single access point to an information system drawn from multiple documents is helpful for biodiversity researchers, as it is in many fields. It not only saves the time of going back and forth between different sources but also provides the opportunity to generate new information from the complementary information in different sources and levels of description. This paper investigates the potential of information fusion techniques in the biodiversity area, since researchers in this domain urgently need information from different sources to verify their decisions. Moreover, the collections in this area are massive; it is not easy, or even possible, for a researcher to manually collect information from different places. The proposed system contains four steps: text segmentation and taxonomic name identification, organ-level and sub-organ-level information extraction, relationship identification, and information fusion. Information fusion is based on seven of the twenty-four relationships in CST (Cross-document Structure Theory).
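The taxonomic name identification step is often bootstrapped with a pattern over Latin binomials. The following is a purely illustrative sketch, not the paper's actual component (which would typically rely on curated name dictionaries); the regex and examples are assumptions:

```python
import re

# Genus (capitalized, possibly abbreviated "Q.") followed by a
# lowercase specific epithet of three or more letters.
TAXON = re.compile(r"\b([A-Z][a-z]+|[A-Z]\.)\s([a-z]{3,})\b")

def find_taxa(text):
    """Return candidate Latin binomials such as 'Quercus alba'."""
    return [" ".join(match) for match in TAXON.findall(text)]

hits = find_taxa("Leaves of Quercus alba are lobed; Q. rubra differs.")
```

A pattern like this over-generates on ordinary capitalized words followed by lowercase ones, which is why production systems validate candidates against taxonomic name authorities before the later extraction and fusion steps.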
DOI: 10.1145/2513549.2513552 | Published: 2013-10-28 | Citations: 3
Sentiment analysis of sentences with modalities
Yang Liu, Xiaohui Yu, Zhongshuai Chen, Bingbing Liu
This paper is concerned with sentiment analysis of sentences with modality. Modality is a commonly occurring linguistic phenomenon. Due to its special characteristics, the sentiment borne by modal sentences may be hard to determine with existing methods. We first present a linguistic analysis of modality, and then identify some valuable features to train a support vector machine classifier to determine the sentiment orientation of such sentences. We show experimental results on sentences with modality extracted from reviews of four different products to illustrate the effectiveness of the proposed method.
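The abstract does not list the paper's features. A hedged sketch of the kind of modality-aware features one might feed to an SVM classifier (the cue lists and feature names here are invented for illustration):

```python
MODALS = {"can", "could", "may", "might", "must",
          "shall", "should", "will", "would"}
NEGATIONS = {"not", "never", "hardly"}

def modality_features(sentence):
    """Toy feature map for sentiment classification of modal sentences."""
    toks = sentence.lower().split()
    feats = {"has_" + m: int(m in toks) for m in MODALS}
    feats["negated"] = int(any(t in NEGATIONS for t in toks))
    # A modal immediately followed by negation ("may not") often
    # weakens or flips the sentiment of the clause.
    feats["modal_then_neg"] = int(any(
        a in MODALS and b in NEGATIONS for a, b in zip(toks, toks[1:])))
    return feats

feats = modality_features("The battery may not last a full day")
```

In practice such indicators would be combined with standard n-gram features and passed to an SVM trainer.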
DOI: 10.1145/2513549.2513556 | Published: 2013-10-28 | Citations: 38
Exploiting topic tracking in real-time tweet streams
Yihong Hong, Yue Fei, Jianwu Yang
Microblogs such as Twitter have become an increasingly popular source of real-time information. Users tend to keep up to date with the development of topics they are interested in. In this paper, we present an effective real-time tweet filtering system that exploits topic tracking in social media streams. We combine a background corpus with a foreground corpus to handle the cold-start problem. We then build a Content Model to describe the characteristics of tweets, in which we use link information to expand tweets' content, aiming to enrich their semantic information, and we also analyze the influence of tweet quality as measured by a group of well-defined symbols. Moreover, a Pseudo Relevance Feedback approach triggered by a fixed-width temporal sliding window is employed to adapt the system to the alteration of topics over time. Experimental results on the Tweet11 corpus indicate that our system achieves good performance on both the T11SU and F-0.5 metrics, and that the proposed system outperforms the best system of the TREC 2012 real-time filtering pilot task.
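The combination of pseudo relevance feedback with a fixed-width temporal window can be sketched minimally as below; the function, thresholds, and toy tweets are assumptions for illustration, not the paper's implementation:

```python
from collections import Counter, deque

def prf_expand(query_terms, window, top_k=3, min_len=4):
    """Expand a topic's query with frequent terms from the
    pseudo-relevant tweets inside the current temporal window."""
    counts = Counter(
        tok for tweet in window for tok in tweet.lower().split()
        if tok not in query_terms and len(tok) >= min_len)
    return set(query_terms) | {t for t, _ in counts.most_common(top_k)}

# Fixed-width window: the oldest tweets fall out as new ones arrive,
# so the expansion follows how the topic's vocabulary drifts.
window = deque(maxlen=100)
window.extend([
    "bbc ipad application released today",
    "the new bbc ipad application looks great",
    "bbc releases an ipad application",
])
expanded = prf_expand({"bbc", "ipad"}, window)
```

Re-running the expansion each time the window slides is one simple way to track a topic's changing terminology over time.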
DOI: 10.1145/2513549.2513555 | Published: 2013-10-28 | Citations: 8
Big data opportunities and challenges for IR, text mining and NLP
Beth Plale
Big Data poses challenges for text analysis and natural language processing due to its characteristic volume, veracity, and velocity. The sheer volume, in terms of number of documents, challenges traditional local repository and index systems for large-scale analysis and mining. Computation, storage and data representation must work together to provide rapid access, search, and mining of the deep knowledge in a large text collection. Text under copyright poses additional barriers to computational access, where analysis has to be separated from human consumption of the original text. Data preprocessing, in most cases, remains a daunting task for big textual data, particularly when data veracity is questionable due to the age of the original materials. Data velocity is the rate of change of the data, but can also be the rate at which changes and corrections are made. The HathiTrust Research Center (HTRC) provides new opportunities for IR, NLP and text mining research. HTRC is the research arm of HathiTrust, a consortium that stewards a digital library of content from research libraries around the country. With close to 11 million volumes in the HathiTrust collection, HTRC aims to provide large-scale computational access and analytics for these text resources. With the goal of facilitating scholars' work, HTRC establishes a cyberinfrastructure of software, staff, and services to help researchers and developers process and mine large-scale textual data effectively and efficiently. The primary users of HTRC are digital humanists, informaticians, and librarians. They have different research backgrounds and expertise, and thus a variety of tools are made available to them. In the HTRC model of computing, computation moves to the data, and services grow up around the corpus to serve the research community. In this manner, the architecture is cloud-based.
Moving algorithms to the data is important because copyrighted content must be protected; a side benefit is that the paradigm frees scholars from worrying about managing a large corpus of data. The text analytics currently supported in HTRC is the SEASR suite of analytical algorithms (www.seasr.org). SEASR algorithms, which are written as workflows, include entity extraction, tag cloud, topic modeling, Naive Bayes, and Date Entities to Simile Timeline. In this talk, I introduce the collections, architecture, and text analytics of HTRC, with a focus on the challenges of a Big Data corpus and what that means for data storage, access, and large-scale computation. HTRC is building a user community to better understand and support researcher needs. It opens many exciting possibilities for NLP, text mining, and IR research: with so large an amount of textual data, many candidate algorithms, and support for researcher-contributed algorithms, many interesting research questions emerge and many interesting results are to follow.
DOI: 10.1145/2513549.2514739 | Published: 2013-10-28 | Citations: 9
Are words enough?: a study on text-based representations and retrieval models for linking pins to online shops
Susana Zoghbi, Ivan Vulic, Marie-Francine Moens
User-generated content offers opportunities to learn about people's interests and hobbies. We can leverage this information to help users find interesting shops and to help businesses find interested users. However, this content is highly noisy and unstructured, as posted on social media sites and blogs. In this work we evaluate different textual representations and retrieval models that aim to make sense of social media data for retail applications. Our task is to link the text of pins (from Pinterest.com) to online shops (formed by clustering Amazon.com's products). Our results show that document representations that combine latent concepts with single words yield the best performance.
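One way to read "combining latent concepts with single words" is a representation that concatenates a bag-of-words block with a latent-concept (topic) block before scoring. The sketch below is an assumption for illustration, not the paper's model; the mixing weight `beta` and the toy data are invented:

```python
import math

def combined_rep(word_counts, concept_props, beta=0.5):
    """Concatenate a bag-of-words block with a latent-concept block.

    beta (assumed, not from the paper) balances the two blocks.
    Keys are namespaced so word ids and concept ids cannot collide.
    """
    rep = {("w", w): (1 - beta) * c for w, c in word_counts.items()}
    rep.update({("z", k): beta * p for k, p in concept_props.items()})
    return rep

def cosine(u, v):
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# A pin and a shop can then match on shared words ("dress") and on
# shared latent concepts even when surface vocabulary differs.
pin = combined_rep({"floral": 1, "dress": 2}, {0: 0.9, 1: 0.1})
shop = combined_rep({"dress": 3, "summer": 1}, {0: 0.8, 1: 0.2})
score = cosine(pin, shop)
```

The concept block lets noisy, short pin texts match shop descriptions through topical overlap that exact word matching would miss.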
{"title":"Are words enough?: a study on text-based representations and retrieval models for linking pins to online shops","authors":"Susana Zoghbi, Ivan Vulic, Marie-Francine Moens","doi":"10.1145/2513549.2513557","DOIUrl":"https://doi.org/10.1145/2513549.2513557","url":null,"abstract":"User-generated content offers opportunities to learn about people's interests and hobbies. We can leverage this information to help users find interesting shops and businesses find interested users. However this content is highly noisy and unstructured as posted on social media sites and blogs. In this work we evaluate different textual representations and retrieval models that aim to make sense of social media data for retail applications. Our task is to link the text of pins (from Pinterest.com) to online shops (formed by clustering Amazon.com's products). Our results show that document representations that combine latent concepts with single words yield the best performance.","PeriodicalId":126426,"journal":{"name":"Proceedings of the 2013 international workshop on Mining unstructured big data using natural language processing","volume":"366 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132948234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 6
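The abstract's best-performing representation combines latent concepts with single words. A minimal sketch of that general idea (assuming TF-IDF word vectors, latent concepts from a truncated SVD as in LSA, and an equal-weight score mix; the paper's actual models and weights are not specified here) might look like:

```python
# Sketch: rank candidate "shops" for a "pin" by mixing word-level and
# latent-concept similarity. The TF-IDF weighting, 2-dimensional LSA
# space, and alpha = 0.5 mix are illustrative assumptions, not the
# paper's configuration.
import numpy as np

def tfidf(docs):
    """Build a simple TF-IDF matrix (rows = documents)."""
    vocab = sorted({w for d in docs for w in d.split()})
    idx = {w: i for i, w in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)))
    for r, d in enumerate(docs):
        for w in d.split():
            tf[r, idx[w]] += 1.0
    df = (tf > 0).sum(axis=0)            # document frequency per term
    return tf * np.log(len(docs) / df)   # tf * idf

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

shops = ["vintage leather handbag", "handmade silver jewelry", "leather wallet men"]
pin = "vintage leather bag"

X = tfidf(shops + [pin])                 # word-level vectors (pin is last row)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt[:2].T                         # project onto 2 latent concepts (LSA)

alpha = 0.5                              # word vs. latent mixing weight (assumed)
scores = [alpha * cosine(X[i], X[-1]) + (1 - alpha) * cosine(Z[i], Z[-1])
          for i in range(len(shops))]
best = shops[int(np.argmax(scores))]
print(best)                              # prints "vintage leather handbag"
```

The word-level term rewards exact vocabulary overlap ("vintage", "leather"), while the latent term can still score shops that share no surface words with the pin; mixing the two is one way to read the abstract's "latent concepts with single words" combination.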
Session details: Paper session
Xiaozhong Liu
DOI: 10.1145/3250040
Citations: 0