
Proceedings of the 22nd International Conference on Information Integration and Web-based Applications & Services: Latest Publications

A Comparison of Two Database Partitioning Approaches that Support Taxonomy-Based Query Answering
J. Schäfer, L. Wiese
In this paper we address the topic of identification of cohorts of similar patients in a database of electronic health records. We follow the conjecture that retrieval of similar patients can be supported by an underlying distributed database design. Hence we propose a fragmentation based on partitioning the health records and present a benchmark of two implementation variants in comparison to an off-the-shelf data distribution approach provided by Apache Ignite. While our main use case in this paper is cohort identification, our approach has advantages for taxonomy-based query answering in other (non-medical) domains.
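To illustrate the taxonomy-driven fragmentation idea (not the authors' implementation; the toy taxonomy, field names and the helper `fragment_of` are assumptions), a minimal sketch could assign each health record to a fragment keyed by the taxonomy ancestor it falls under, so a taxonomy-based query only scans the matching fragment:

```python
# Hypothetical sketch of taxonomy-based fragmentation for cohort queries.
# Not the paper's design; diagnoses, taxonomy and field names are assumptions.
from collections import defaultdict

# A toy disease taxonomy: child -> parent.
taxonomy_parent = {
    "viral pneumonia": "pneumonia",
    "bacterial pneumonia": "pneumonia",
    "pneumonia": "respiratory disease",
    "asthma": "respiratory disease",
}

def ancestors(term):
    """Yield the term and all of its ancestors in the taxonomy."""
    while term is not None:
        yield term
        term = taxonomy_parent.get(term)

def fragment_of(term, fragment_roots):
    """Assign a record's diagnosis to the fragment of its closest root ancestor."""
    for t in ancestors(term):
        if t in fragment_roots:
            return t
    return "other"

records = [
    {"patient": 1, "diagnosis": "viral pneumonia"},
    {"patient": 2, "diagnosis": "asthma"},
    {"patient": 3, "diagnosis": "bacterial pneumonia"},
]

# Partition health records into fragments, one per taxonomy root of interest.
fragments = defaultdict(list)
for r in records:
    fragments[fragment_of(r["diagnosis"], {"pneumonia", "asthma"})].append(r)

# A taxonomy-based query for "pneumonia" only has to scan one fragment.
print([r["patient"] for r in fragments["pneumonia"]])  # -> [1, 3]
```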
{"title":"A Comparison of Two Database Partitioning Approaches that Support Taxonomy-Based Query Answering","authors":"J. Schäfer, L. Wiese","doi":"10.1145/3428757.3429108","DOIUrl":"https://doi.org/10.1145/3428757.3429108","url":null,"abstract":"In this paper we address the topic of identification of cohorts of similar patients in a database of electronic health records. We follow the conjecture that retrieval of similar patients can be supported by an underlying distributed database design. Hence we propose a fragmentation based on partitioning the health records and present a benchmark of two implementation variants in comparison to an off-the-shelf data distribution approach provided by Apache Ignite. While our main use case in this paper is cohort identification, our approach has advantages for taxonomy-based query answering in other (non-medical) domains.","PeriodicalId":212557,"journal":{"name":"Proceedings of the 22nd International Conference on Information Integration and Web-based Applications & Services","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115027186","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Proceedings of the 22nd International Conference on Information Integration and Web-based Applications & Services
{"title":"Proceedings of the 22nd International Conference on Information Integration and Web-based Applications & Services","authors":"","doi":"10.1145/3428757","DOIUrl":"https://doi.org/10.1145/3428757","url":null,"abstract":"","PeriodicalId":212557,"journal":{"name":"Proceedings of the 22nd International Conference on Information Integration and Web-based Applications & Services","volume":"89 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128229139","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
KNNAC
Yao Zhang, Yifeng Lu, Thomas Seidl
Density-based clustering algorithms are commonly adopted when arbitrarily shaped clusters exist. Usually, they do not need to know the number of clusters in advance, which is a big advantage. Conventional density-based approaches such as DBSCAN utilize two parameters to define density. Recently, novel density-based clustering algorithms have been proposed to reduce the problem complexity to the use of a single parameter k by utilizing the concepts of k Nearest Neighbor (kNN) and Reverse k Nearest Neighbor (RkNN) to define density. However, those kNN-based approaches are either ineffective or inefficient. In this paper, we present a new clustering algorithm KNNAC, which only requires computing the densities for a chosen subset of points due to the use of active core detection. We empirically show that, compared to other nearest neighbor based clustering approaches (e.g., RECORD, IS-DBSCAN, etc.), KNNAC can provide competitive performance while taking a fraction of the runtime.
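A minimal sketch of the kNN / RkNN density notion the abstract refers to (purely illustrative; this is not the KNNAC algorithm, and the threshold for "core candidates" is an assumption) could look like this:

```python
# Sketch of kNN / reverse-kNN based density estimation (illustrative only,
# not the KNNAC algorithm itself). Requires numpy and scikit-learn.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own neighbor
dist, idx = nn.kneighbors(X)

# A common kNN density proxy: inverse of the distance to the k-th neighbor.
knn_density = 1.0 / dist[:, -1]

# Reverse-kNN count: how many points list x among their own k nearest neighbors.
rknn_count = np.zeros(len(X), dtype=int)
for neighbors in idx[:, 1:]:          # skip self (column 0)
    rknn_count[neighbors] += 1

# Points with high density and a high RkNN count are natural "core" candidates,
# which is the intuition single-parameter approaches build on.
core_candidates = np.where(rknn_count >= k)[0]
print(len(core_candidates), "core candidates out of", len(X), "points")
```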
{"title":"KNNAC","authors":"Yao Zhang, Yifeng Lu, Thomas Seidl","doi":"10.1145/3428757.3429135","DOIUrl":"https://doi.org/10.1145/3428757.3429135","url":null,"abstract":"Density-based clustering algorithms are commonly adopted when arbitrarily shaped clusters exist. Usually, they do not need to know the number of clusters in prior, which is a big advantage. Conventional density-based approaches such as DBSCAN, utilize two parameters to define density. Recently, novel density-based clustering algorithms are proposed to reduce the problem complexity to the use of a single parameter k by utilizing the concepts of k Nearest Neighbor (kNN) and Reverse k Nearest Neighbor (RkNN) to define density. However, those kNN-based approaches are either ineffective or inefficient. In this paper, we present a new clustering algorithm KNNAC, which only requires computing the densities for a chosen subset of points due to the use of active core detection. We empirically show that, compared to other nearest neighbor based clustering approaches (e.g., RECORD, IS-DBSCAN, etc.), KNNAC can provide competitive performance while taking a fraction of the runtime.","PeriodicalId":212557,"journal":{"name":"Proceedings of the 22nd International Conference on Information Integration and Web-based Applications & Services","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129404744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A new Multi-Agents System based on Blockchain for Prediction Anomaly from System Logs
Arwa Binlashram, Hajer Bouricha, L. Hsairi, Haneen Al Ahmadi
The execution traces generated by an application contain information that the developers believed would be useful in debugging or monitoring the application: application states and significant events at various critical points that help them gain insight into failures and identify and predict potential problems before they occur. Despite the ubiquity of these traces in almost all computer systems, they are rarely exploited because they are not readily machine-parsable. In this paper, we propose a Multi-Agents approach to the prediction process using Blockchain technology, which automatically analyses execution traces and detects early warning signals for system failure prediction during execution. The proposed prediction approach is constructed using a four-layer Multi-Agents system architecture. Its performance rests on data preprocessing and supervised learning algorithms for prediction. Blockchain was used to coordinate collaboration between agents, and to synchronize predictions between agents and the administrators. We validated our approach by applying it to real-world distributed systems, where we predicted problems before they occurred with high accuracy. In this paper we focus on the architecture of our prediction approach.
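The multi-agent and blockchain coordination layers are architectural, but the prediction core the abstract mentions (trace preprocessing plus a supervised learner) can be sketched in a minimal, assumption-laden way; the log lines, the normalization rules and the choice of logistic regression below are placeholders, not the paper's pipeline:

```python
# Minimal sketch: turn execution-trace lines into simple features and train a
# failure predictor used as an early-warning signal. Illustrative only.
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

raw_traces = [
    ("2020-11-30 10:01 INFO  service started", 0),
    ("2020-11-30 10:02 WARN  retry count exceeded for node-3", 1),
    ("2020-11-30 10:03 ERROR connection timeout on node-3", 1),
    ("2020-11-30 10:04 INFO  heartbeat ok", 0),
]

def normalize(line):
    """Strip timestamps and numbers so that log templates, not values, are learned."""
    line = re.sub(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}", "<TS>", line)
    return re.sub(r"\d+", "<NUM>", line.lower())

texts = [normalize(line) for line, _ in raw_traces]
labels = [label for _, label in raw_traces]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
clf = LogisticRegression().fit(X, labels)

# Score a new trace line as an early-warning signal.
new_line = normalize("2020-11-30 10:05 WARN retry count exceeded for node-7")
print(clf.predict_proba(vectorizer.transform([new_line]))[0, 1])
```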
{"title":"A new Multi-Agents System based on Blockchain for Prediction Anomaly from System Logs","authors":"Arwa Binlashram, Hajer Bouricha, L. Hsairi, Haneen Al Ahmadi","doi":"10.1145/3428757.3429149","DOIUrl":"https://doi.org/10.1145/3428757.3429149","url":null,"abstract":"The execution traces generated by an application contain information that the developers believed would be useful in debugging or monitoring the application, it contains application states and significant events at various critical points that help them gain insight into failures and identify and predict potential problems before they occur. Despite the ubiquity of these traces universally in almost all computer systems, they are rarely exploited because they are not readily machine-parsable. In this paper, we propose a Multi-Agents approach for prediction process using Blockchain technology, which allows automatically analysis of execution traces and detects early warning signals for system failure prediction during executing. The proposed prediction approach is constructed using a four-layer Multi-Agents system architecture. The proposed prediction approach performance is based on data prepossessing and supervised learning algorithms for prediction. Blockchain was used to coordinate collaboration between agents, and to synchronize prediction between agents and the administrators. We validated our approach by applying it to real-world distributed systems, where we predicted problems before they occurred with high accuracy. In this paper we will focus on the Architecture of our prediction approach.","PeriodicalId":212557,"journal":{"name":"Proceedings of the 22nd International Conference on Information Integration and Web-based Applications & Services","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132863325","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Analysis and Comparison of Block-Splitting-Based Load Balancing Strategies for Parallel Entity Resolution
Xiao Chen, Nishanth Entoor Venkatarathnam, Kirity Rapuru, David Broneske, Gabriel Campero Durand, Roman Zoun, G. Saake
Entity resolution (ER) is a process to identify records that refer to the same real-world entity. In recent years, facing the ever-increasing data volume, both blocking techniques and parallel computation have been proposed for ER to reduce its running time and improve efficiency. It is popular and convenient to apply the MapReduce programming model for parallel computation. With the default load balancing strategy, if the block sizes are skewed, an imbalanced reducer load will occur and significantly increase the runtime. One possible solution is block-splitting: breaking the overpopulated blocks into smaller sub-blocks to improve efficiency. In this paper we analyze the advantages and disadvantages of state-of-the-art block splitting methods (BlockSplit and BlockSlicer), and we propose two approaches, TLS and BOS, to overcome the identified drawbacks. We comprehensively evaluate and compare our proposed solutions, with Spark implementations, using real-world and synthetic datasets with different properties. The results show that all of them can balance the reducer load with the help of the greedy partition assignment strategy. When the memory of the cluster used is not abundant for a given dataset, a high number of reducers is required to reduce the GC time and improve efficiency. Particularly, our TLS and BOS have overwhelmingly lower overhead due to their block-wise composite key assignment.
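The general block-splitting and greedy-assignment idea can be illustrated with a small sketch. This is not BlockSplit, BlockSlicer, TLS or BOS, and it deliberately ignores the cross-sub-block comparisons the real methods must still generate; block sizes, the cap of 100 records and the number of reducers are assumptions:

```python
# Illustrative sketch: split oversized blocks, then greedily assign the
# resulting work units to the currently least-loaded reducer.
import heapq

def split_block(block_id, size, max_size):
    """Split an oversized block into sub-blocks of at most max_size records."""
    if size <= max_size:
        return [(block_id, size)]
    return [(f"{block_id}.{i}", min(max_size, size - i * max_size))
            for i in range((size + max_size - 1) // max_size)]

def pair_cost(size):
    """Number of record pairs compared within a block of the given size."""
    return size * (size - 1) // 2

blocks = {"A": 1000, "B": 40, "C": 35, "D": 900}   # skewed block sizes
work_units = [u for b, s in blocks.items() for u in split_block(b, s, max_size=100)]

# Greedy assignment: give the next-largest work unit to the least-loaded reducer.
num_reducers = 4
heap = [(0, r) for r in range(num_reducers)]       # (current load, reducer id)
heapq.heapify(heap)
assignment = {r: [] for r in range(num_reducers)}
for unit, size in sorted(work_units, key=lambda u: -pair_cost(u[1])):
    load, r = heapq.heappop(heap)
    assignment[r].append(unit)
    heapq.heappush(heap, (load + pair_cost(size), r))

print({r: len(units) for r, units in assignment.items()})
```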
{"title":"Analysis and Comparison of Block-Splitting-Based Load Balancing Strategies for Parallel Entity Resolution","authors":"Xiao Chen, Nishanth Entoor Venkatarathnam, Kirity Rapuru, David Broneske, Gabriel Campero Durand, Roman Zoun, G. Saake","doi":"10.1145/3428757.3429140","DOIUrl":"https://doi.org/10.1145/3428757.3429140","url":null,"abstract":"Entity resolution (ER) is a process to identify records that refer to the same real-world entity. In recent years, facing the ever-increasing data volume, both blocking techniques and parallel computation have been proposed for ER to reduce its running time and improve efficiency. It is popular and convenient to apply the MapReduce programming model for parallel computation. With the default load balancing strategy, if the block sizes are skewed, an imbalanced reducer load will occur and significantly increase the runtime. One possible solution is block-splitting: breaking the overpopulated blocks into smaller sub-blocks, to improve efficiency. In this paper we analyze the advantages and disadvantages of state-of-the-art block splitting methods (BlockSplit and BlockSlicer), and we propose two approaches: TLS and BOS to overcome the identified drawbacks. We comprehensively evaluate and compare our proposed solutions, with Spark implementations, using real-world and synthetic datasets with different properties. The results show that all of them can balance the reducer load with the help of the greedy partition assignment strategy. When memory of used cluster is not abundant given a dataset, a high number of reducers is required to reduce the GC time to improve efficiency. Partitcularly, our TLS and BOS have overwelmingly lower overhead due to the ability of block-wise composite key assignment.","PeriodicalId":212557,"journal":{"name":"Proceedings of the 22nd International Conference on Information Integration and Web-based Applications & Services","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126964513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A Patten Matcher for English Idioms on Web IndeX
Takumi Shinzato, Jun Nemoto, Motomichi Toyama
Web Index (WIX for short) is a system for joining information resources on the Web. WIX replaces keywords in Web documents with hyperlinks to other web pages, based on a WIX file that the user has chosen. A WIX file is a kind of dictionary that holds a set of WIX entries (keyword and target URL). Using WIX, users can join any Web content with arbitrary dictionaries. In conventional WIX, matching and linking are executed only for fixed character strings between the keyword set and the input text. However, when a user wants to search for phrases such as idioms, this matching scheme is not sufficient because of word declension, changes of verb tense, and so on. Therefore, we propose a phrasal pattern matching mechanism for WIX. This helps users easily find idiomatic expressions in text on the web and obtain more information.
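A crude sketch of why fixed-string matching fails for idioms, and how a phrasal pattern with simple inflection tolerance helps, could look as follows; this is not the WIX matcher, and the dictionary entries and target URLs are hypothetical:

```python
# Naive phrasal matching that tolerates simple verb inflection (illustrative only).
import re

def idiom_pattern(idiom):
    """Build a regex where each word also matches common English suffixes."""
    words = [re.escape(w) + r"(?:s|es|ed|ing)?" for w in idiom.split()]
    return re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)

entries = {
    "kick the bucket": "https://example.org/idioms/kick-the-bucket",  # hypothetical target URL
    "spill the beans": "https://example.org/idioms/spill-the-beans",
}
patterns = {idiom: idiom_pattern(idiom) for idiom in entries}

text = "He finally kicked the bucket, and she spilled the beans about it."
for idiom, pattern in patterns.items():
    for m in pattern.finditer(text):
        print(f"matched '{m.group(0)}' -> {entries[idiom]}")
```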
{"title":"A Patten Matcher for English Idioms on Web IndeX","authors":"Takumi Shinzato, Jun Nemoto, Motomichi Toyama","doi":"10.1145/3428757.3429136","DOIUrl":"https://doi.org/10.1145/3428757.3429136","url":null,"abstract":"Web Index (WIX in short) is a system that achieves joining information resources on the Web. WIX replaces keywords in Web documents hyperlinks to other web pages based on a WIX file that a user chose. WIX file is a kind of a dictionary that have a set of WIX entries (keyword and target URL). Using WIX, users can join any Web contents and arbitrary dictionaries. In conventional WIX, matching and linking are executed only for fixed character strings between the keyword set and the input text. However, when a user wants to search for phrases like idioms, this matching system is not sufficient because of the declension of words, change of the verb tense, and so on. Therefore, we propose a phrasal pattern matching mechanism on WIX. This helps users easily find idiom expressions in the text on the web and get more information.","PeriodicalId":212557,"journal":{"name":"Proceedings of the 22nd International Conference on Information Integration and Web-based Applications & Services","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125840140","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Music Discovery as Differentiation Strategy for Streaming Providers
Andreas Raff, Andreas Mladenow, C. Strauss
Music discovery presents itself in an instant and in a multitude of possible ways. When comparing the user personas of streaming services in the dimension of music discovery, two main differentiation criteria become apparent, namely the degree of intention and the degree of control one wants to exert when discovering new music. Against this background, this paper proposes a framework which, with the help of these two criteria, categorises the possible ways of music discovery in a streaming provider into active, semi-active, semi-passive and passive, in order to analyse them separately and to outline success factors and current research.
{"title":"Music Discovery as Differentiation Strategy for Streaming Providers","authors":"Andreas Raff, Andreas Mladenow, C. Strauss","doi":"10.1145/3428757.3429151","DOIUrl":"https://doi.org/10.1145/3428757.3429151","url":null,"abstract":"Music discovery presents itself in an instant and in a multitude of possible ways. When comparing the user personas of streaming services in the dimension of music discovery, two main differentiation criteria become apparent, namely the degree of intention and the control one wants to exert when discovering new music. Against this background, this paper proposes a framework which categorises the possible ways of music discovery in a streaming provider with the help of those two criteria into active, semi-active, semi-passive and passive ways in order to analyse them separately, outline success factors and current research.","PeriodicalId":212557,"journal":{"name":"Proceedings of the 22nd International Conference on Information Integration and Web-based Applications & Services","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121190627","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Rammed, or What RAM3S Taught Us
Ilaria Bartolini, M. Patella
RAM3S (Real-time Analysis of Massive MultiMedia Streams) is a framework that acts as a middleware software layer between multimedia stream analysis techniques and Big Data streaming platforms, so as to facilitate the implementation of the former on top of the latter. Indeed, the use of Big Data platforms paves the way for the efficient management and analysis of large amounts of data, but they require the user to concentrate on issues related to distributed computing, since their services are often too raw. The use of RAM3S greatly simplifies deploying non-parallel techniques to platforms like Apache Storm or Apache Flink, a fact that is demonstrated by the four different use cases we describe here. We detail the lessons we learned from exploiting RAM3S to implement these use cases.
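The middleware pattern the abstract describes (a sequential, per-item analysis technique plugged into a streaming platform through a thin layer) can be hinted at with a hypothetical interface; this is not the RAM3S API nor the actual Storm/Flink bindings, and all class and function names below are invented for illustration:

```python
# Hypothetical illustration of the middleware pattern: the analysis author only
# writes a per-item callback, and a thin driver (standing in for the platform
# binding) feeds it stream items.
from abc import ABC, abstractmethod
from typing import Iterable, Optional

class StreamAnalyzer(ABC):
    """What the analysis-technique author writes: purely sequential logic."""
    @abstractmethod
    def analyze(self, item: bytes) -> Optional[str]:
        """Return an alert string for interesting items, None otherwise."""

class SizeAnomalyAnalyzer(StreamAnalyzer):
    def analyze(self, item: bytes) -> Optional[str]:
        return f"large frame ({len(item)} bytes)" if len(item) > 1000 else None

def run_on_stream(analyzer: StreamAnalyzer, stream: Iterable[bytes]) -> None:
    """Stand-in for the platform binding that would parallelize this map step."""
    for item in stream:
        alert = analyzer.analyze(item)
        if alert:
            print("ALERT:", alert)

run_on_stream(SizeAnomalyAnalyzer(), [b"x" * 10, b"y" * 5000, b"z" * 20])
```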
{"title":"Rammed, or What RAM3S Taught Us","authors":"Ilaria Bartolini, M. Patella","doi":"10.1145/3428757.3429098","DOIUrl":"https://doi.org/10.1145/3428757.3429098","url":null,"abstract":"RAM3S (Real-time Analysis of Massive MultiMedia Streams) is a framework that acts as a middleware software layer between multimedia stream analysis techniques and Big Data streaming platforms, so as to facilitate the implementation of the former on top of the latter. Indeed, the use of Big Data platforms can give way to the efficient management and analysis of large data amounts, but they require the user to concentrate on issues related to distributed computing, since their services are often too raw. The use of RAM3S greatly simplifies deploying non-parallel techniques to platforms like Apache Storm or Apache Flink, a fact that is demonstrated by the four different use cases we describe here. We detail the lessons we learned from exploiting RAM3S to implement the detailed use cases.","PeriodicalId":212557,"journal":{"name":"Proceedings of the 22nd International Conference on Information Integration and Web-based Applications & Services","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125358993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Transfer Learning in Classifying Prescriptions and Keyword-based Medical Notes
Mir Moynuddin Ahmed Shibly, Tahmina Akter Tisha, K. Islam, Md. Mohsin Uddin
Medical text classification is one of the primary steps of health care automation. Diagnosing disease at the right time and going to the right doctor are important for patients. To that end, two types of medical texts were classified into medical specialties in this study: keyword-based medical notes and prescriptions. There are many methods and techniques to classify texts from any domain, but the textual resources of a specific domain can be inadequate for building a sustainable and accurate classifier. This problem can be solved by incorporating transfer learning. The objective of this study is to analyze the prospects of transfer learning in medical text classification. To do that, a transfer learning system has been created for the classification tasks by fine-tuning Bidirectional Encoder Representations from Transformers (BERT), and its performance has been compared with three deep learning models: multi-layer perceptron, long short-term memory, and convolutional neural network. The fine-tuned BERT model has shown the best performance among all the models in both classification tasks, with weighted f1-scores of 0.84 and 0.96 in classifying medical notes and prescriptions, respectively. This study demonstrates that transfer learning can be used in medical text classification and can yield a significant improvement in performance.
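A minimal fine-tuning sketch with the Hugging Face Transformers library (PyTorch backend) shows the general transfer-learning recipe; the example texts, the specialty labels, the model checkpoint and the hyperparameters are placeholders, not the paper's data or configuration:

```python
# Minimal BERT fine-tuning sketch for sequence classification (illustrative only).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

labels = ["cardiology", "dermatology"]          # hypothetical specialties
texts = ["chest pain shortness of breath", "itchy rash on forearm"]
y = torch.tensor([0, 1])

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels))

enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):                               # a few toy epochs on one tiny batch
    out = model(**enc, labels=y)                 # cross-entropy loss computed internally
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.eval()
with torch.no_grad():
    pred = model(**enc).logits.argmax(dim=-1)
print([labels[i] for i in pred])
```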
{"title":"Transfer Learning in Classifying Prescriptions and Keyword-based Medical Notes","authors":"Mir Moynuddin Ahmed Shibly, Tahmina Akter Tisha, K. Islam, Md. Mohsin Uddin","doi":"10.1145/3428757.3429139","DOIUrl":"https://doi.org/10.1145/3428757.3429139","url":null,"abstract":"Medical text classification is one of the primary steps of health care automation. Diagnosing disease at the right time, and going to the right doctor is important for patients. To do that, two types of medical texts were classified into some medical specialties in this study. The first one is the keywords-based medical notes and the second one is the prescriptions. There are many methods and techniques to classify texts from any domain. But, textual resources of a specific domain can be inadequate to build a sustainable and accurate classifier. This problem can be solved by incorporating transfer learning. The objective of this study is to analyze the prospects of transfer learning in medical text classification. To do that, a transfer learning system has been created for classification tasks by fine-tuning Bidirectional Encoder Representations from Transformers aka the BERT language model, and its performance has been compared with three deep learning models - multi-layer perceptron, long short-term memory, and convolutional neural network. The fine-tuned BERT model has shown the best performance among all the other models in both classification tasks. It has 0.84 and 0.96 weighted f1-score in classifying medical notes and prescriptions respectively. This study has proved that transfer learning can be used in medical text classification, and significant improvement in performance can be achieved through it.","PeriodicalId":212557,"journal":{"name":"Proceedings of the 22nd International Conference on Information Integration and Web-based Applications & Services","volume":"4647 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122696031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Mitigating Effect of Dictionary Matching Errors in Distantly Supervised Named Entity Recognition
Koga Kobayashi, Kei Wakabayashi
Named entity recognition (NER) is a fundamental technique that brings basic semantic awareness to natural language processing applications and services. Since we need a large amount of training data to train a custom NER model, distant supervision that leverages named entity dictionaries is expected to be a promising approach to train NER models quickly. However, dictionary matching causes a considerable number of errors that deteriorate both the precision and recall of the final NER models, and we need to mitigate its effect. In this study, we particularly aim at improving the precision of NER models by accounting for dictionary matching errors. Experimental results show that the proposed method can achieve an improvement in precision, especially under poor dictionary performance conditions.
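The dictionary-matching labeling step that distant supervision relies on, and the kind of error it introduces, can be sketched as a longest-match projection of a toy dictionary onto tokenized text; the dictionary entries are invented, and the paper's mitigation method itself is not shown:

```python
# Sketch of dictionary matching for distantly supervised NER: project an entity
# dictionary onto tokens as BIO labels, longest match first (illustrative only).
dictionary = {("new", "york"): "LOC", ("york",): "LOC", ("apple",): "ORG"}
max_len = max(len(k) for k in dictionary)

def distant_labels(tokens):
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):   # longest match first
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in dictionary:
                labels[i] = "B-" + dictionary[key]
                for j in range(i + 1, i + n):
                    labels[j] = "I-" + dictionary[key]
                i += n
                break
        else:
            i += 1
    return labels

tokens = "She moved from New York to eat an apple".split()
print(list(zip(tokens, distant_labels(tokens))))
# Typical error source: "apple" (the fruit) is wrongly tagged as ORG.
```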
{"title":"Mitigating Effect of Dictionary Matching Errors in Distantly Supervised Named Entity Recognition","authors":"Koga Kobayashi, Kei Wakabayashi","doi":"10.1145/3428757.3429142","DOIUrl":"https://doi.org/10.1145/3428757.3429142","url":null,"abstract":"Named entity recognition (NER) is a fundamental technique that brings basic semantic awareness to natural language processing applications and services. Since we need a large amount of training data to train a custom NER model, distant supervision that leverages named entity dictionaries is expected to be a promising approach to train NER models quickly. However, dictionary matching causes a considerable number of errors that deteriorates both precision and recall of the final NER models, and we need to mitigate its effect. In this study, we particularly aim at improving precision of NER models by accounting for dictionary matching errors. Experimental results show that the proposed method can achieve an improvement of precisions especially under poor dictionary performance conditions.","PeriodicalId":212557,"journal":{"name":"Proceedings of the 22nd International Conference on Information Integration and Web-based Applications & Services","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124776615","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0