The extensive growth of data quantity has posed many challenges to data analysis and retrieval. Noise and redundancy are typical examples of these challenges: they may reduce the reliability of analysis and retrieval results and increase storage and computing overhead. To address these problems, a two-stage data pre-processing framework for noise identification and data reduction, called ARIS, is proposed in this article. The first stage identifies and removes noise in three steps: first, the influence space (IS) is introduced to describe the data distribution; second, a ranking factor (RF) is defined to quantify how likely a point is to be noise, and noise is then defined in terms of RF; third, a clean dataset (CD) is obtained by removing the noise from the original dataset. The second stage learns representative data and realizes data reduction: CD is divided into multiple small regions by IS, and the reduced dataset is formed by collecting the representatives of each region. The performance of ARIS is verified by experiments on artificial and real datasets. Experimental results show that ARIS effectively weakens the impact of noise, reduces the amount of data, and significantly improves the accuracy of data analysis within a reasonable time cost.
{"title":"ARIS: A Noise Insensitive Data Pre-Processing Scheme for Data Reduction Using Influence Space","authors":"Jiang-hui Cai, Yuqing Yang, Haifeng Yang, Xu-jun Zhao, Jing Hao","doi":"10.1145/3522592","DOIUrl":"https://doi.org/10.1145/3522592","url":null,"abstract":"The extensive growth of data quantity has posed many challenges to data analysis and retrieval. Noise and redundancy are typical representatives of the above-mentioned challenges, which may reduce the reliability of analysis and retrieval results and increase storage and computing overhead. To solve the above problems, a two-stage data pre-processing framework for noise identification and data reduction, called ARIS, is proposed in this article. The first stage identifies and removes noises by the following steps: First, the influence space (IS) is introduced to elaborate data distribution. Second, a ranking factor (RF) is defined to describe the possibility that the points are regarded as noises, then, the definition of noise is given based on RF. Third, a clean dataset (CD) is obtained by removing noise from the original dataset. The second stage learns representative data and realizes data reduction. In this process, CD is divided into multiple small regions by IS. Then the reduced dataset is formed by collecting the representations of each region. The performance of ARIS is verified by experiments on artificial and real datasets. Experimental results show that ARIS effectively weakens the impact of noise and reduces the amount of data and significantly improves the accuracy of data analysis within a reasonable time cost range.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126486810","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the rapid development of text mining, many studies observe that text generally contains a variety of implicit information, and it is important to develop techniques for extracting such information. Named Entity Recognition (NER), the first step of information extraction, mainly identifies names of persons, locations, and organizations in text. Although existing neural-based NER approaches achieve great success in many language domains, most of them ignore the nested nature of named entities. Recently, diverse studies have focused on the nested NER problem and yielded state-of-the-art performance. This survey provides a comprehensive review of existing approaches for nested NER from the perspectives of model architecture and model properties, which may help readers better understand the current research status and ideas. We first introduce the background of nested NER, especially the differences between nested NER and traditional (i.e., flat) NER. We then review existing nested NER approaches from 2002 to 2020 and classify them into five categories according to model architecture: early rule-based, layered-based, region-based, hypergraph-based, and transition-based approaches. We also explore in greater depth the impact of key properties unique to nested NER approaches from the model-property perspective, namely entity dependency, stage framework, error propagation, and tag scheme. Finally, we summarize the open challenges and point out a few possible future directions in this area. This survey should be useful for three kinds of readers: (i) newcomers who want to learn about NER, especially nested NER; (ii) researchers who want to clarify the relationship and relative advantages of flat NER and nested NER; and (iii) practitioners who need to determine which NER technique (i.e., nested or not) works best in their applications.
{"title":"Nested Named Entity Recognition: A Survey","authors":"Yu Wang, H. Tong, Ziye Zhu, Yun Li","doi":"10.1145/3522593","DOIUrl":"https://doi.org/10.1145/3522593","url":null,"abstract":"With the rapid development of text mining, many studies observe that text generally contains a variety of implicit information, and it is important to develop techniques for extracting such information. Named Entity Recognition (NER), the first step of information extraction, mainly identifies names of persons, locations, and organizations in text. Although existing neural-based NER approaches achieve great success in many language domains, most of them normally ignore the nested nature of named entities. Recently, diverse studies focus on the nested NER problem and yield state-of-the-art performance. This survey attempts to provide a comprehensive review on existing approaches for nested NER from the perspectives of the model architecture and the model property, which may help readers have a better understanding of the current research status and ideas. In this survey, we first introduce the background of nested NER, especially the differences between nested NER and traditional (i.e., flat) NER. We then review the existing nested NER approaches from 2002 to 2020 and mainly classify them into five categories according to the model architecture, including early rule-based, layered-based, region-based, hypergraph-based, and transition-based approaches. We also explore in greater depth the impact of key properties unique to nested NER approaches from the model property perspective, namely entity dependency, stage framework, error propagation, and tag scheme. Finally, we summarize the open challenges and point out a few possible future directions in this area. This survey would be useful for three kinds of readers: (i) Newcomers in the field who want to learn about NER, especially for nested NER. (ii) Researchers who want to clarify the relationship and advantages between flat NER and nested NER. (iii) Practitioners who just need to determine which NER technique (i.e., nested or not) works best in their applications.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125606682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Experimental results are often plotted as two-dimensional graphical plots (aka graphs) in scientific domains, depicting dependent versus independent variables to aid visual analysis of processes. Repeatedly performing laboratory experiments consumes significant time and resources, motivating the need for computational estimation. The goals are to estimate the graph obtained in an experiment given its input conditions, and to estimate the conditions that would lead to a desired graph. Existing estimation approaches often do not meet the accuracy and efficiency needs of targeted applications. We develop a computational estimation approach called AutoDomainMine that integrates clustering and classification over complex scientific data in a single framework, so as to automate the classical learning methods of scientists. Knowledge discovered thereby from a database of existing experiments serves as the basis for estimation. Challenges include preserving domain semantics in clustering, finding matching strategies in classification, striking a good balance between elaboration and conciseness when displaying estimation results based on the needs of targeted users, and deriving objective measures to capture subjective user interests. These and other challenges are addressed in this work. The AutoDomainMine approach is used to build a computational estimation system, rigorously evaluated with real data in Materials Science. Our evaluation confirms that AutoDomainMine provides the desired accuracy and efficiency in computational estimation. It is extendable to other science and engineering domains, as demonstrated by the adaptation of its sub-processes in fields such as Bioinformatics and Nanotechnology.
{"title":"Computational Estimation by Scientific Data Mining with Classical Methods to Automate Learning Strategies of Scientists","authors":"A. Varde","doi":"10.1145/3502736","DOIUrl":"https://doi.org/10.1145/3502736","url":null,"abstract":"Experimental results are often plotted as 2-dimensional graphical plots (aka graphs) in scientific domains depicting dependent versus independent variables to aid visual analysis of processes. Repeatedly performing laboratory experiments consumes significant time and resources, motivating the need for computational estimation. The goals are to estimate the graph obtained in an experiment given its input conditions, and to estimate the conditions that would lead to a desired graph. Existing estimation approaches often do not meet accuracy and efficiency needs of targeted applications. We develop a computational estimation approach called AutoDomainMine that integrates clustering and classification over complex scientific data in a framework so as to automate classical learning methods of scientists. Knowledge discovered thereby from a database of existing experiments serves as the basis for estimation. Challenges include preserving domain semantics in clustering, finding matching strategies in classification, striking a good balance between elaboration and conciseness while displaying estimation results based on needs of targeted users, and deriving objective measures to capture subjective user interests. These and other challenges are addressed in this work. The AutoDomainMine approach is used to build a computational estimation system, rigorously evaluated with real data in Materials Science. Our evaluation confirms that AutoDomainMine provides desired accuracy and efficiency in computational estimation. It is extendable to other science and engineering domains as proved by adaptation of its sub-processes within fields such as Bioinformatics and Nanotechnology.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121509903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Name ambiguity is a prevalent problem in scholarly publications due to the unprecedented growth of digital libraries and the number of researchers. In the absence of a unique identifier, an author is identified by their name, so the documents of an author may be mistakenly assigned because of this underlying ambiguity, which can lead to an improper assessment of the author. Various efforts have been made in the literature to solve the name disambiguation problem with supervised and unsupervised approaches. Unsupervised approaches to author name disambiguation are preferred because of the availability of a large amount of unlabeled data. Bibliographic data contain heterogeneous features, so representation learning-based techniques have recently been used to embed these heterogeneous features into a common space. The documents of a scholar are connected by multiple relations, and research has recently shifted from a single homogeneous relation to multi-dimensional (heterogeneous) relations for the latent representation of documents. Connections in such graphs are sparse, and higher-order links between documents provide additional clues. Therefore, we use multiple neighborhoods of different orders in each relation type of the heterogeneous graph to represent documents. However, neighborhoods of different orders in each relation type have different importance, which we also validate empirically. To properly utilize the different neighborhoods in each relation type and the importance of each relation type in the heterogeneous graph, we propose an attention-based, multi-dimensional, multi-hop neighborhood graph convolutional network for embedding that uses two levels of attention, namely (i) relation-level and (ii) neighborhood-level attention within each relation. The proposed approach obtains a significant improvement over existing state-of-the-art methods in terms of various evaluation metrics.
{"title":"Exploiting Higher Order Multi-dimensional Relationships with Self-attention for Author Name Disambiguation","authors":"K. Pooja, S. Mondal, Joydeep Chandra","doi":"10.1145/3502730","DOIUrl":"https://doi.org/10.1145/3502730","url":null,"abstract":"Name ambiguity is a prevalent problem in scholarly publications due to the unprecedented growth of digital libraries and number of researchers. An author is identified by their name in the absence of a unique identifier. The documents of an author are mistakenly assigned due to underlying ambiguity, which may lead to an improper assessment of the author. Various efforts have been made in the literature to solve the name disambiguation problem with supervised and unsupervised approaches. The unsupervised approaches for author name disambiguation are preferred due to the availability of a large amount of unlabeled data. Bibliographic data contain heterogeneous features, thus recently, representation learning-based techniques have been used in literature to embed heterogeneous features in common space. Documents of a scholar are connected by multiple relations. Recently, research has shifted from a single homogeneous relation to multi-dimensional (heterogeneous) relations for the latent representation of document. Connections in graphs are sparse, and higher order links between documents give an additional clue. Therefore, we have used multiple neighborhoods in different relation types in heterogeneous graph for representation of documents. However, different order neighborhood in each relation type has different importance which we have empirically validated also. Therefore, to properly utilize the different neighborhoods in relation type and importance of each relation type in the heterogeneous graph, we propose attention-based multi-dimensional multi-hop neighborhood-based graph convolution network for embedding that uses the two levels of an attention, namely, (i) relation level and (ii) neighborhood level, in each relation. A significant improvement over existing state-of-the-art methods in terms of various evaluation matrices has been obtained by the proposed approach.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132207812","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Online bipartite matching has attracted wide interest since it can successfully model the popular online car-hailing problem and the sharing economy. Existing works consider this problem under either the adversarial setting or the i.i.d. setting. The former is too pessimistic to improve performance in the general case; the latter is too optimistic to deal with the varying distribution of vertices. In this article, we initiate the study of the non-stationary online bipartite matching problem, which allows the distribution of vertices to vary with time and is therefore more practical. We divide the non-stationary online bipartite matching problem into two subproblems, the matching problem and the selecting problem, and solve them individually. Combining Batch algorithms and deep Q-learning networks, we first construct a candidate algorithm set to solve the matching problem. For the selecting problem, we use a classical online learning algorithm, Exp3, as the selector algorithm and derive a theoretical bound. We further propose CDUCB as a selector algorithm by integrating distribution change detection into UCB. Rigorous theoretical analysis demonstrates that the performance of our proposed algorithms is no worse than that of any candidate algorithm in terms of competitive ratio. Finally, extensive experiments show that our proposed algorithms achieve much higher performance on the non-stationary online bipartite matching problem compared to the state-of-the-art.
{"title":"Online Learning Bipartite Matching with Non-stationary Distributions","authors":"Weirong Chen, Jiaqi Zheng, Haoyu Yu, Guihai Chen, Yixing Chen, Dongsheng Li","doi":"10.1145/3502734","DOIUrl":"https://doi.org/10.1145/3502734","url":null,"abstract":"Online bipartite matching has attracted wide interest since it can successfully model the popular online car-hailing problem and sharing economy. Existing works consider this problem under either adversary setting or i.i.d. setting. The former is too pessimistic to improve the performance in the general case; the latter is too optimistic to deal with the varying distribution of vertices. In this article, we initiate the study of the non-stationary online bipartite matching problem, which allows the distribution of vertices to vary with time and is more practical. We divide the non-stationary online bipartite matching problem into two subproblems, the matching problem and the selecting problem, and solve them individually. Combining Batch algorithms and deep Q-learning networks, we first construct a candidate algorithm set to solve the matching problem. For the selecting problem, we use a classical online learning algorithm, Exp3, as a selector algorithm and derive a theoretical bound. We further propose CDUCB as a selector algorithm by integrating distribution change detection into UCB. Rigorous theoretical analysis demonstrates that the performance of our proposed algorithms is no worse than that of any candidate algorithms in terms of competitive ratio. Finally, extensive experiments show that our proposed algorithms have much higher performance for the non-stationary online bipartite matching problem comparing to the state-of-the-art.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127400672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data analysis involves the deployment of sophisticated approaches from data mining, information theory, and artificial intelligence in various fields, such as tourism and hospitality, to extract knowledge from gathered and preprocessed data. In tourism, pattern analysis or data analysis using classification is significant for finding patterns that represent new and potentially useful information or knowledge about the destination and other data. Several data mining techniques have been introduced for the classification of data or patterns. However, overfitting, low accuracy, local minima, and sensitivity to noise are drawbacks of some existing data mining classification methods. To overcome these challenges, a data mining strategy based on a support vector machine with red deer optimization (SVM-RDO) is proposed in this article. An extended Kalman filter (EKF) is utilized in the first phase, data cleaning, to remove noise and missing values from the input data. A manta ray foraging algorithm (MaFA) is used in the data selection phase, in which the significant data are selected for further processing to reduce computational complexity. The final phase is classification, in which SVM-RDO is used to extract useful patterns from the selected data. Python is used to implement and evaluate the proposed model. An experimental analysis is performed to show the efficacy of the proposed work. The experimental results show that the proposed SVM-RDO achieves better accuracy, precision, recall, and F1 score than the existing methods on the tourism dataset, demonstrating its effectiveness for pattern analysis.
{"title":"Intelligent Data Analysis using Optimized Support Vector Machine Based Data Mining Approach for Tourism Industry","authors":"Ms Promila Sharma, Uma Meena, Girish Sharma","doi":"10.1145/3494566","DOIUrl":"https://doi.org/10.1145/3494566","url":null,"abstract":"Data analysis involves the deployment of sophisticated approaches from data mining methods, information theory, and artificial intelligence in various fields like tourism, hospitality, and so on for the extraction of knowledge from the gathered and preprocessed data. In tourism, pattern analysis or data analysis using classification is significant for finding the patterns that represent new and potentially useful information or knowledge about the destination and other data. Several data mining techniques are introduced for the classification of data or patterns. However, overfitting, less accuracy, local minima, sensitive to noise are the drawbacks in some existing data mining classification methods. To overcome these challenges, Support vector machine with Red deer optimization (SVM-RDO) based data mining strategy is proposed in this article. Extended Kalman filter (EKF) is utilized in the first phase, i.e., data cleaning to remove the noise and missing values from the input data. Mantaray foraging algorithm (MaFA) is used in the data selection phase, in which the significant data are selected for the further process to reduce the computational complexity. The final phase is the classification, in which SVM-RDO is proposed to access the useful pattern from the selected data. PYTHON is the implementation tool used for the experiment of the proposed model. The experimental analysis is done to show the efficacy of the proposed work. From the experimental results, the proposed SVM-RDO achieved better accuracy, precision, recall, and F1 score than the existing methods for the tourism dataset. Thus, it is showed the effectiveness of the proposed SVM-RDO for pattern analysis.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126270374","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In topic models, collections of documents are organized as mixtures over latent clusters called topics, where a topic is a distribution over the vocabulary. In large-scale applications, parametric or finite topic mixture models such as LDA (latent Dirichlet allocation) and its variants are very restrictive in performance due to their reduced hypothesis space. In this article, we address the problems of model selection and of sharing topics across multiple documents in standard parametric topic models. As an alternative, we propose a BNP (Bayesian nonparametric) topic model in which an HDP (hierarchical Dirichlet process) prior models document topic mixtures through their multinomials on the infinite simplex. We therefore propose an asymmetric BL (Beta-Liouville) distribution as a diffuse base measure for the corpus-level DP (Dirichlet process) over a measurable space. This step captures the highly heterogeneous structure in the set of all topics that describes the corpus probability measure. For consistency in posterior inference and predictive distributions, we efficiently characterize random probability measures whose limits are the global and local DPs, approximating the HDP via the stick-breaking formulation with GEM (Griffiths-Engen-McCloskey) random variables. Because the diffuse BL prior is conjugate to the count data distribution, we obtain an improved version of the standard HDP, which is usually based on a symmetric Dirichlet (Dir). In addition, to improve the coordinate ascent framework while taking advantage of its deterministic nature, our model implements an online optimization method based on stochastic, document-level variational inference with natural gradients, which accommodates fast topic learning when processing large collections of text documents. The high predictive likelihood per document, compared to the performance of its competitors, is also consistent with the robustness of our fully asymmetric BL-based HDP. While ensuring the predictive accuracy of the model using the probability of held-out documents, we also add a combination of metrics, such as topic coherence and topic diversity, to improve the quality and interpretability of the discovered topics, and we compare the performance of our model on these metrics against the standard symmetric LDA. We show that the performance of online HDP-LBLA (Latent BL Allocation) is the asymptote for parametric topic models. The accuracy of the results (improved predictive distributions of the held-out documents) is a product of the model's ability to efficiently characterize dependency between documents (topic correlation), as they can now easily share topics, resulting in a much more robust and realistic compression algorithm for information modeling.
{"title":"Stochastic Variational Optimization of a Hierarchical Dirichlet Process Latent Beta-Liouville Topic Model","authors":"Koffi Eddy Ihou, Manar Amayri, N. Bouguila","doi":"10.1145/3502727","DOIUrl":"https://doi.org/10.1145/3502727","url":null,"abstract":"In topic models, collections are organized as documents where they arise as mixtures over latent clusters called topics. A topic is a distribution over the vocabulary. In large-scale applications, parametric or finite topic mixture models such as LDA (latent Dirichlet allocation) and its variants are very restrictive in performance due to their reduced hypothesis space. In this article, we address the problem related to model selection and sharing ability of topics across multiple documents in standard parametric topic models. We propose as an alternative a BNP (Bayesian nonparametric) topic model where the HDP (hierarchical Dirichlet process) prior models documents topic mixtures through their multinomials on infinite simplex. We, therefore, propose asymmetric BL (Beta-Liouville) as a diffuse base measure at the corpus level DP (Dirichlet process) over a measurable space. This step illustrates the highly heterogeneous structure in the set of all topics that describes the corpus probability measure. For consistency in posterior inference and predictive distributions, we efficiently characterize random probability measures whose limits are the global and local DPs to approximate the HDP from the stick-breaking formulation with the GEM (Griffiths-Engen-McCloskey) random variables. Due to the diffuse measure with the BL prior as conjugate to the count data distribution, we obtain an improved version of the standard HDP that is usually based on symmetric Dirichlet (Dir). In addition, to improve coordinate ascent framework while taking advantage of its deterministic nature, our model implements an online optimization method based on stochastic, at document level, variational inference to accommodate fast topic learning when processing large collections of text documents with natural gradient. The high value in the predictive likelihood per document obtained when compared to the performance of its competitors is also consistent with the robustness of our fully asymmetric BL-based HDP. While insuring the predictive accuracy of the model using the probability of the held-out documents, we also added a combination of metrics such as the topic coherence and topic diversity to improve the quality and interpretability of the topics discovered. We also compared the performance of our model using these metrics against the standard symmetric LDA. We show that online HDP-LBLA (Latent BL Allocation)’s performance is the asymptote for parametric topic models. 
The accuracy in the results (improved predictive distributions of the held out) is a product of the model’s ability to efficiently characterize dependency between documents (topic correlation) as now they can easily share topics, resulting in a much robust and realistic compression algorithm for information modeling.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"144 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131734788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
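The stick-breaking (GEM) construction mentioned in the abstract, which builds the global DP weights, can be illustrated directly. In the sketch below, the concentration parameter, truncation level, and the Dirichlet-based document-level step are arbitrary choices for illustration; they are not the paper's asymmetric Beta-Liouville construction.

```python
import numpy as np

def gem_weights(alpha, truncation, rng):
    """Truncated GEM stick-breaking: beta_k ~ Beta(1, alpha), pi_k = beta_k * prod_{j<k}(1 - beta_j)."""
    betas = rng.beta(1.0, alpha, size=truncation)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    return betas * remaining

rng = np.random.default_rng(0)
global_weights = gem_weights(alpha=1.0, truncation=20, rng=rng)   # corpus-level topic weights (truncated)
# Document-level weights concentrate around the global ones: a rough sketch of the HDP's second level.
doc_weights = rng.dirichlet(5.0 * global_weights + 1e-6)
print(round(global_weights.sum(), 4), doc_weights[:5].round(3))
```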
A. Davvetas, I. Klampanos, Spiros Skiadopoulos, V. Karkaletsis
Unsupervised representation learning tends to produce generic and reusable latent representations. However, these representations can often miss high-level features or semantic information, since they only observe the implicit properties of the dataset. On the other hand, supervised learning frameworks learn task-oriented latent representations that may not generalise to other tasks or domains. In this article, we introduce evidence transfer, a deep learning method that incorporates the outcomes of external tasks into the unsupervised learning process of an autoencoder. External task outcomes, also referred to as categorical evidence, are represented by categorical variables and are either directly or indirectly related to the primary dataset; in the most straightforward case, they are the outcome of another task on the same dataset. Evidence transfer allows the manipulation of generic latent representations to include domain- or task-specific knowledge that will aid their effectiveness in downstream tasks. Evidence transfer is robust against evidence of low quality and effective when introduced with related, corresponding, or meaningful evidence.
{"title":"Evidence Transfer: Learning Improved Representations According to External Heterogeneous Task Outcomes","authors":"A. Davvetas, I. Klampanos, Spiros Skiadopoulos, V. Karkaletsis","doi":"10.1145/3502732","DOIUrl":"https://doi.org/10.1145/3502732","url":null,"abstract":"Unsupervised representation learning tends to produce generic and reusable latent representations. However, these representations can often miss high-level features or semantic information, since they only observe the implicit properties of the dataset. On the other hand, supervised learning frameworks learn task-oriented latent representations that may not generalise in other tasks or domains. In this article, we introduce evidence transfer, a deep learning method that incorporates the outcomes of external tasks in the unsupervised learning process of an autoencoder. External task outcomes also referred to as categorical evidence, are represented by categorical variables, and are either directly or indirectly related to the primary dataset—in the most straightforward case they are the outcome of another task on the same dataset. Evidence transfer allows the manipulation of generic latent representations in order to include domain or task-specific knowledge that will aid their effectiveness in downstream tasks. Evidence transfer is robust against evidence of low quality and effective when introduced with related, corresponding, or meaningful evidence.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126251878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhaolong Ling, Kui Yu, Lin Liu, Jiuyong Li, Yiwen Zhang, Xindong Wu
Learning a partial Bayesian network (BN) structure is an interesting and challenging problem. Global BN structure learning algorithms are computationally expensive when only one part of the BN structure is of interest, while local BN structure learning algorithms are not a favourable solution either, due to the issue of false edge orientation. To address this problem, this article first presents a detailed analysis of the false edge orientation issue in local BN structure learning algorithms and then proposes PSL, an efficient and accurate Partial BN Structure Learning algorithm. Specifically, PSL divides the V-structures in a Markov blanket (MB) into two types, Type-C V-structures and Type-NC V-structures; it then starts from the given node of interest and recursively finds both types of V-structures in the MB of the current node until all edges in the partial BN structure are oriented. To further improve the efficiency of PSL, the PSL-FS algorithm is designed by incorporating feature selection (FS) into PSL. Extensive experiments with six benchmark BNs validate the efficiency and accuracy of the proposed algorithms.
{"title":"PSL: An Algorithm for Partial Bayesian Network Structure Learning","authors":"Zhaolong Ling, Kui Yu, Lin Liu, Jiuyong Li, Yiwen Zhang, Xindong Wu","doi":"10.1145/3508071","DOIUrl":"https://doi.org/10.1145/3508071","url":null,"abstract":"Learning partial Bayesian network (BN) structure is an interesting and challenging problem. In this challenge, it is computationally expensive to use global BN structure learning algorithms, while only one part of a BN structure is interesting, local BN structure learning algorithms are not a favourable solution either due to the issue of false edge orientation. To address the problem, this article first presents a detailed analysis of the false edge orientation issue with local BN structure learning algorithms and then proposes PSL, an efficient and accurate Partial BN Structure Learning (PSL) algorithm. Specifically, PSL divides V-structures in a Markov blanket (MB) into two types: Type-C V-structures and Type-NC V-structures, then it starts from the given node of interest and recursively finds both types of V-structures in the MB of the current node until all edges in the partial BN structure are oriented. To further improve the efficiency of PSL, the PSL-FS algorithm is designed by incorporating Feature Selection (FS) into PSL. Extensive experiments with six benchmark BNs validate the efficiency and accuracy of the proposed algorithms.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122597226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Feature selection is one of the core concepts in machine learning and hugely impacts a model's performance. In some real-world applications, features arrive in a stream, one by one over time, and the exact number of features cannot be known before learning. Online streaming feature selection aims to select optimal stream features at each timestamp on the fly. Without global information about the entire feature space, most existing methods select stream features based on individual feature information or pairwise comparisons of features. This article proposes a new online scalable streaming feature selection framework from the dynamic decision perspective, which is scalable in running time and in the number of selected features through dynamic threshold adjustment. Following the philosophy of "Thinking-in-Threes", we classify each newly arriving feature as selected, discarded, or delayed, aiming to minimize the overall decision risk. With the dynamic updating of global statistical information, we add selected features to the candidate feature subset, ignore discarded features, and cache delayed features in the undetermined feature subset to wait for more information. Meanwhile, we perform redundancy analysis for the candidate features and uncertainty analysis for the undetermined features. Extensive experiments on eleven real-world datasets demonstrate the efficiency and scalability of our new framework compared with state-of-the-art algorithms.
{"title":"Online Scalable Streaming Feature Selection via Dynamic Decision","authors":"Peng Zhou, Shu Zhao, Yuan-Ting Yan, X. Wu","doi":"10.1145/3502737","DOIUrl":"https://doi.org/10.1145/3502737","url":null,"abstract":"Feature selection is one of the core concepts in machine learning, which hugely impacts the model’s performance. For some real-world applications, features may exist in a stream mode that arrives one by one over time, while we cannot know the exact number of features before learning. Online streaming feature selection aims at selecting optimal stream features at each timestamp on the fly. Without the global information of the entire feature space, most of the existing methods select stream features in terms of individual feature information or the comparison of features in pairs. This article proposes a new online scalable streaming feature selection framework from the dynamic decision perspective that is scalable on running time and selected features by dynamic threshold adjustment. Regarding the philosophy of “Thinking-in-Threes”, we classify each new arrival feature as selecting, discarding, or delaying, aiming at minimizing the overall decision risks. With the dynamic updating of global statistical information, we add the selecting features into the candidate feature subset, ignore the discarding features, cache the delaying features into the undetermined feature subset, and wait for more information. Meanwhile, we perform the redundancy analysis for the candidate features and uncertainty analysis for the undetermined features. Extensive experiments on eleven real-world datasets demonstrate the efficiency and scalability of our new framework compared with state-of-the-art algorithms.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"144 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124599057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}