Data: Latest Articles (Page 5)

An Ontology-based Collaborative Business Intelligence Framework

IF 2.6 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

Data

Pub Date : 2023-07-04 DOI: 10.48550/arXiv.2307.01568

M. Fahad, J. Darmont

Business Intelligence constitutes a set of methodologies and tools aiming at querying, reporting, on-line analytic processing (OLAP), generating alerts, performing business analytics, etc. When these tasks must be performed collectively by different collaborators, a Collaborative Business Intelligence (CBI) platform is needed. CBI plays a significant role in targeting a common goal among various companies, but it requires them to connect, organize and coordinate with each other to share opportunities while respecting their own autonomy and heterogeneity. This paper presents a CBI platform that democratizes data by allowing BI users to easily connect, share and visualize data among collaborators, obtain actionable answers through collaborative analysis, investigate and make collaborative decisions, and store the analyses along with graphical diagrams and charts in a collaborative ontology knowledge base. Our CBI framework supports and assists information sharing, collaborative decision-making and annotation management beyond the boundaries of individuals and enterprises.

Citations: 0

The Application of Affective Measures in Text-based Emotion Aware Recommender Systems

IF 2.6 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

Data

Pub Date : 2023-05-04 DOI: 10.48550/arXiv.2305.04796

John Kalung Leung, Igor Griva, W. Kennedy, J. Kinser, Sohyun Park, Seoyoon Lee

This paper presents an innovative approach to two problems researchers face in Emotion Aware Recommender Systems (EARS): the difficulty of collecting large volumes of good-quality emotion-tagged datasets, and the lack of an effective way to protect users' emotional data privacy. Without enough good-quality emotion-tagged datasets, researchers cannot conduct repeatable affective computing research in EARS that generates personalized recommendations based on users' emotional preferences. Similarly, if we fail to fully protect users' emotional data privacy, users could resist engaging with EARS services. This paper introduces a method that detects affective features in subjective passages using Generative Pre-trained Transformer technology, forming the basis of the Affective Index and Affective Index Indicator (AII). This eliminates the need for users to build an affective feature detection mechanism. The paper advocates a separation-of-responsibility approach in which users protect their emotional profile data while EARS service providers refrain from retaining or storing it. Service providers can update users' Affective Indices in memory without saving their private data, providing Affective Aware recommendations without compromising user privacy. This paper offers a solution to the subjectivity and variability of emotions, data privacy concerns, and evaluation metrics and benchmarks, paving the way for future EARS research.

Citations: 1

Benchmarking Automated Machine Learning Methods for Price Forecasting Applications

IF 2.6 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

Data

Pub Date : 2023-04-28 DOI: 10.48550/arXiv.2304.14735

Horst Stühler, M. Zöller, Dennis Klau, A. B. Bedrikow, Christian Tutschku

Price forecasting for used construction equipment is a challenging task due to spatial and temporal price fluctuations. It is thus of high interest to automate the forecasting process based on current market data. Even though applying machine learning (ML) to these data represents a promising approach to predict the residual value of certain tools, it is hard to implement for small and medium-sized enterprises due to their insufficient ML expertise. To this end, we demonstrate the possibility of substituting manually created ML pipelines with automated machine learning (AutoML) solutions, which automatically generate the underlying pipelines. We combine AutoML methods with the domain knowledge of the companies. Based on the CRISP-DM process, we split the manual ML pipeline into a machine learning part and a non-machine learning part. To take all complex industrial requirements into account and to demonstrate the applicability of our new approach, we designed a novel metric named method evaluation score, which incorporates the most important technical and non-technical metrics for quality and usability. Based on this metric, we show in a case study for the industrial use case of price forecasting that domain knowledge combined with AutoML can weaken the dependence on ML experts for innovative small and medium-sized enterprises that are interested in adopting such solutions.

Citations: 1

QTrail-DB: A Query Processing Engine for Imperfect Databases with Evolving Qualities

IF 2.6 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

Data

Pub Date : 2023-03-12 DOI: 10.48550/arXiv.2303.06720

Maha Asiri, M. Eltabakh

Imperfect databases are very common in many applications for reasons ranging from data-entry errors, transmission or integration errors, and wrong instrument readings, to faulty experimental setups leading to incorrect results. The management and query processing of imperfect databases is a very challenging problem as it requires incorporating the data's qualities within the database engine. Even more challenging, the qualities are typically not static and may evolve over time. Unfortunately, most state-of-the-art techniques deal with the data quality problem as an offline task that is in total isolation from the query processing engine (carried out outside the DBMS). Hence, end-users receive the queries' results with no clue as to whether or not the results can be trusted for further analysis and decision making. In this paper, we propose the "QTrail-DB" system, which fundamentally extends standard DBMSs to support imperfect databases with evolving qualities. QTrail-DB introduces a new quality model based on the new concept of "Quality Trails", which captures the evolution of the data's qualities over time. QTrail-DB extends the relational data model to incorporate the quality trails within the database system. We propose a new query algebra, called "QTrail Algebra", that enables seamless and transparent propagation and derivation of the data's qualities within a query pipeline. As a result, a query's answer will be automatically annotated with quality-related information at the tuple level. The QTrail-DB propagation model leverages the thoroughly studied propagation semantics present in the DB provenance and lineage tracking literature, and thus there is no need to develop a new query optimizer. QTrail-DB is developed within PostgreSQL and experimentally evaluated using real-world datasets to demonstrate its efficiency and practicality.
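
To make the idea of tuple-level quality propagation more concrete, here is a minimal Python sketch. It is not the QTrail Algebra from the paper: the `AnnotatedTuple` class, the `select` and `join` helpers, and the rule that a joined tuple takes the minimum of its inputs' latest quality scores are all assumptions chosen purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedTuple:
    """A row plus a hypothetical quality trail: (timestamp, score) pairs, oldest first."""
    values: dict
    trail: list = field(default_factory=list)

    def current_quality(self):
        # The latest entry in the trail is treated as the tuple's current quality.
        return self.trail[-1][1] if self.trail else None


def select(rows, predicate):
    """Selection keeps matching rows and carries their trails along unchanged."""
    return [row for row in rows if predicate(row.values)]


def join(left, right, key):
    """Equi-join on `key`. Assumes every input tuple has at least one trail entry.

    Assumed propagation rule (illustrative only): the derived tuple is as
    trustworthy as its least trustworthy input, stamped with the newer timestamp.
    """
    out = []
    for l in left:
        for r in right:
            if l.values[key] == r.values[key]:
                merged = {**l.values, **r.values}
                quality = min(l.current_quality(), r.current_quality())
                stamp = max(l.trail[-1][0], r.trail[-1][0])
                out.append(AnnotatedTuple(merged, [(stamp, quality)]))
    return out
```

With such helpers, a query result carries a quality annotation on every output tuple, which is the behaviour the abstract describes; the actual derivation rules are defined in the paper itself.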

Citations: 0

Analysis of Government Policy Sentiment Regarding Vacation during the COVID-19 Pandemic Using the Bidirectional Encoder Representation from Transformers (BERT)

Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

Data

Pub Date : 2023-02-23 DOI: 10.3390/data8030046

Intan Nurma Yulita, Victor Wijaya, Rudi Rosadi, Indra Sarathan, Yusa Djuyandi, Anton Satria Prabuwono

To address the COVID-19 situation in Indonesia, the Indonesian government has adopted a number of policies. One of them is a vacation-related policy. Government measures with regard to this vacation policy have produced a wide range of viewpoints in society, which have been extensively shared on social media, including YouTube. However, no computerized system has been developed to date that can assess people’s social media reactions. Therefore, this paper applies sentiment analysis to this government policy by employing a bidirectional encoder representation from transformers (BERT) approach. The study method comprised data collection, data labeling, data preprocessing, BERT model training, and model evaluation. This study created a new dataset for this topic. The data were collected from the comments section of YouTube and were categorized into three categories: positive, neutral, and negative. This research yielded an F-score of 84.33%. Another contribution of this study is the methodology for processing sentiment analysis in Indonesian. In addition, the model was deployed as an application using the Python programming language and the Flask framework. By utilizing this research, the government can learn the extent to which the public accepts the policies that have been implemented.
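
For readers unfamiliar with how an F-score is obtained for a three-class sentiment task, the following sketch computes per-class F1 and their unweighted (macro) average from gold and predicted labels. The abstract does not state which averaging the authors used, so macro averaging is an assumption here, and the function names are hypothetical.

```python
from collections import Counter


def per_class_f1(y_true, y_pred, labels):
    """Return {label: F1} computed from true positives, false positives and false negatives."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = {}
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the class never occurs.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores (an assumed averaging choice)."""
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)
```

Macro averaging weights the positive, neutral, and negative classes equally regardless of how many comments fall into each; a weighted or micro average would give different numbers on an imbalanced dataset.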

Citations: 3

Enriching Relation Extraction with OpenIE

IF 2.6 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

Data

Pub Date : 2022-12-19 DOI: 10.48550/arXiv.2212.09376

Alessandro Temperoni, M. Biryukov, M. Theobald

Relation extraction (RE) is a sub-discipline of information extraction (IE) which focuses on the prediction of a relational predicate from a natural-language input unit (such as a sentence, a clause, or even a short paragraph consisting of multiple sentences and/or clauses). Together with named-entity recognition (NER) and disambiguation (NED), RE forms the basis for many advanced IE tasks such as knowledge-base (KB) population and verification. In this work, we explore how recent approaches for open information extraction (OpenIE) may help to improve the task of RE by encoding structured information about the sentences' principal units, such as subjects, objects, verbal phrases, and adverbials, into various forms of vectorized (and hence unstructured) representations of the sentences. Our main conjecture is that the decomposition of long and possibly convoluted sentences into multiple smaller clauses via OpenIE even helps to fine-tune context-sensitive language models such as BERT (and its plethora of variants) for RE. Our experiments over two annotated corpora, KnowledgeNet and FewRel, demonstrate the improved accuracy of our enriched models compared to existing RE approaches. Our best results reach F1 scores of 92% and 71% for KnowledgeNet and FewRel, respectively, demonstrating the effectiveness of our approach on competitive benchmarks.

Citations: 0

A Comparison of Automatic Labelling Approaches for Sentiment Analysis

IF 2.6 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

Data

Pub Date : 2022-11-05 DOI: 10.5220/0011265900003269

Sumana Biswas, Karen Young, J. Griffith

Labelling a large quantity of social media data for the task of supervised machine learning is not only time-consuming but also difficult and expensive. On the other hand, the accuracy of supervised machine learning models is strongly related to the quality of the labelled data on which they train, and automatic sentiment labelling techniques could reduce the time and cost of human labelling. We have compared three automatic sentiment labelling techniques (TextBlob, Vader, and Afinn) for assigning sentiments to tweets without any human assistance. We compare three scenarios: the first uses training and testing datasets with existing ground truth labels; the second uses automatic labels for both the training and testing datasets; and the third uses the three automatic labelling techniques to label the training dataset and the ground truth labels for testing. The experiments were evaluated on two Twitter datasets: SemEval-2013 (DS-1) and SemEval-2016 (DS-2). Results show that the Afinn labelling technique obtains the highest accuracy of 80.17% (DS-1) and 80.05% (DS-2) using a BiLSTM deep learning model. These findings imply that automatic text labelling could provide significant benefits, and suggest a feasible alternative to the time and cost of human labelling efforts.
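
The sketch below illustrates the general idea behind lexicon-based automatic labelling: sum word-level polarity scores and map the total to a sentiment class. It does not call TextBlob, Vader, or Afinn; `TOY_LEXICON` and its scores are invented for illustration, and the zero threshold separating the three classes is likewise an assumption.

```python
import re

# Hypothetical miniature lexicon; real word lists are far larger and carefully curated.
TOY_LEXICON = {
    "good": 3, "great": 3, "love": 3, "happy": 3,
    "bad": -3, "terrible": -3, "hate": -3, "sad": -2,
}


def label_text(text, lexicon=TOY_LEXICON):
    """Assign 'positive', 'negative' or 'neutral' from the summed word scores."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(lexicon.get(token, 0) for token in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Labels produced this way could stand in for human annotations when training a classifier, which corresponds to the second and third scenarios described in the abstract.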

Citations: 1

An open dataset of connected speech in aphasia with consensus ratings of auditory-perceptual features.

IF 2.6 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

Data

Pub Date : 2022-11-01 Epub Date: 2022-10-30 DOI: 10.3390/data7110148

Zoe Ezzes, Sarah M Schneck, Marianne Casilio, Davida Fromm, Antje Mefford, Michael R de Riesthal, Stephen M Wilson

Purpose: Auditory-perceptual rating of connected speech in aphasia (APROCSA) involves trained listeners rating a large number of perceptual features of speech samples, and has shown promise as an approach for quantifying expressive speech and language function in individuals with aphasia. The aim of this study was to obtain consensus ratings for a diverse set of speech samples, which can then be used as training materials for learning the APROCSA system.

Method: Connected speech samples were recorded from six individuals with chronic post-stroke aphasia. A segment containing the first five minutes of participant speech was excerpted from each sample, and 27 features were rated on a five-point scale by five researchers. The researchers then discussed each feature in turn to obtain consensus ratings.

Results: Six connected speech samples are made freely available for research, education, and clinical uses. Consensus ratings are reported for each of the 27 features, for each speech sample. Discrepancies between raters were resolved through discussion, yielding consensus ratings that can be expected to be more accurate than mean ratings.

Conclusions: The dataset will provide a useful resource for scientists, students, and clinicians to learn how to evaluate aphasic speech samples with an auditory-perceptual approach.

Citations: 0

Clustering Object-Centric Event Logs

IF 2.6 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

Data

Pub Date : 2022-07-26 DOI: 10.48550/arXiv.2207.12764

A. F. Ghahfarokhi, Fatemeh Akoochekian, F. Zandkarimi, Wil M.P. van der Aalst

Process mining provides various algorithms to analyze process executions based on event data. Process discovery, the most prominent category of process mining techniques, aims to discover process models from event logs; however, it tends to produce spaghetti models when applied to real-life data. Several clustering techniques have therefore been proposed on top of traditional event logs (i.e., event logs with a single case notion) to reduce the complexity of process models and discover homogeneous subsets of cases. In real-life processes, however, particularly Business-to-Business (B2B) processes, multiple objects are involved in a single process. Object-Centric Event Logs (OCELs) were recently introduced to capture the information of such processes, and several process discovery techniques have been developed on top of them. Yet on real OCELs these discovery techniques produce models that are more informative but also more complex. In this paper, we propose a clustering-based approach that groups similar objects in OCELs to simplify the resulting process models. Using a case study of a real B2B process, we demonstrate that our approach reduces the complexity of the process models and generates coherent subsets of objects that help end-users gain insights into the process.
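The idea of grouping similar objects can be illustrated with a small sketch. The example below is not the authors' method: it assumes each object is profiled by how often each activity touches it, and it swaps in a plain k-means (Lloyd's algorithm) as the clustering step. The toy events, the object identifiers, and the helper names `object_profiles` and `kmeans` are all hypothetical.

```python
from collections import Counter

import numpy as np

# Toy object-centric events: one event may reference several objects.
# All activities and object identifiers here are hypothetical.
events = [
    {"activity": "create order", "objects": ["o1"]},
    {"activity": "create order", "objects": ["o2"]},
    {"activity": "create order", "objects": ["o3"]},
    {"activity": "create order", "objects": ["o4"]},
    {"activity": "pay order", "objects": ["o1", "o3"]},
    {"activity": "pay order", "objects": ["o3"]},
    {"activity": "cancel order", "objects": ["o2", "o4"]},
]


def object_profiles(events):
    """Return (object_ids, activities, matrix) of per-object activity counts."""
    counts = {}
    for ev in events:
        for obj in ev["objects"]:
            counts.setdefault(obj, Counter())[ev["activity"]] += 1
    object_ids = sorted(counts)
    activities = sorted({ev["activity"] for ev in events})
    matrix = np.array(
        [[counts[o][a] for a in activities] for o in object_ids], dtype=float
    )
    return object_ids, activities, matrix


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means, seeded with the first k rows for determinism."""
    centers = points[:k].copy()
    for _ in range(iterations):
        # Distance of every point to every center, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = points[labels == c]
            if len(members):  # leave a center unchanged if its cluster is empty
                centers[c] = members.mean(axis=0)
    return labels


object_ids, activities, profiles = object_profiles(events)
labels = kmeans(profiles, k=2)
clusters = {}
for obj, label in zip(object_ids, labels):
    clusters.setdefault(int(label), []).append(obj)
print(clusters)
```

Each resulting cluster could then be used to filter the log before running process discovery, so that every discovered model covers only objects with similar behavior. Seeding with the first rows keeps the sketch deterministic; a real analysis would need a more careful initialization and feature choice.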



Citations: 3

Towards Programmable Memory Controller for Tensor Decomposition

IF 2.6 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

Data

Pub Date : 2022-07-17 DOI: 10.5220/0011301200003269

Sasindu Wijeratne, Ta-Yang Wang, R. Kannan, V. Prasanna

Tensor decomposition has become an essential tool in many data science applications. Sparse Matricized Tensor Times Khatri-Rao Product (MTTKRP) is the pivotal kernel in tensor decomposition algorithms that decompose large, higher-order real-world tensors into multiple matrices. Accelerating MTTKRP can speed up the tensor decomposition process immensely. Sparse MTTKRP is a challenging kernel to accelerate because of its irregular memory access characteristics. Implementing accelerators on Field Programmable Gate Arrays (FPGAs) for kernels such as MTTKRP is attractive because of the energy efficiency and inherent parallelism of FPGAs. This paper explores the opportunities, key challenges, and an approach for designing a custom memory controller on FPGA for MTTKRP, while exploring the parameter space of such a controller.
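To make the kernel concrete, the following is a minimal NumPy sketch of mode-0 sparse MTTKRP for a third-order tensor stored in coordinate (COO) form. It is a software illustration only and says nothing about the paper's FPGA memory-controller design; the function name `sparse_mttkrp_mode0`, the COO layout, and the toy data are assumptions made for the example. The row gathers from the factor matrices are data-dependent, which is one way to see the irregular memory access pattern mentioned above.

```python
import numpy as np


def sparse_mttkrp_mode0(coords, vals, B, C, num_rows):
    """Mode-0 MTTKRP for a sparse 3-way tensor in COO form.

    coords: (nnz, 3) integer array of (i, j, k) indices of the nonzeros.
    vals:   (nnz,) array of nonzero values.
    B, C:   factor matrices of shape (J, R) and (K, R).
    Returns M of shape (num_rows, R) with
        M[i, :] = sum over nonzeros (i, j, k) of x_ijk * B[j, :] * C[k, :].
    """
    rank = B.shape[1]
    out = np.zeros((num_rows, rank))
    # Gather one row of B and one row of C per nonzero (irregular accesses),
    # take their elementwise product, and scale by the nonzero value.
    contrib = vals[:, None] * B[coords[:, 1]] * C[coords[:, 2]]
    # Unbuffered scatter-add so repeated row indices accumulate correctly.
    np.add.at(out, coords[:, 0], contrib)
    return out


# Hypothetical toy tensor of shape (2, 2, 2) with three nonzeros.
coords = np.array([[0, 0, 0], [0, 1, 1], [1, 1, 0]])
vals = np.array([1.0, 2.0, 3.0])
B = np.array([[1.0, 2.0], [3.0, 4.0]])
C = np.array([[1.0, 0.0], [0.0, 1.0]])

result = sparse_mttkrp_mode0(coords, vals, B, C, num_rows=2)

# Cross-check against the dense definition of the same product.
dense = np.zeros((2, 2, 2))
dense[coords[:, 0], coords[:, 1], coords[:, 2]] = vals
reference = np.einsum("ijk,jr,kr->ir", dense, B, C)
print(np.allclose(result, reference))
```

The scatter-add uses `np.add.at` rather than `out[rows] += contrib` because the latter would silently drop contributions when the same output row appears more than once.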


Citations: 1