In this paper, we propose a novel representation learning framework, namely HIN2Vec, for heterogeneous information networks (HINs). The core of the proposed framework is a neural network model, also called HIN2Vec, designed to capture the rich semantics embedded in HINs by exploiting the different types of relationships among nodes. Given a set of relationships specified in the form of meta-paths in an HIN, HIN2Vec carries out multiple prediction training tasks jointly over the target set of relationships to learn latent vectors of nodes and meta-paths in the HIN. In addition to model design, several issues unique to HIN2Vec, including regularization of meta-path vectors, node type selection in negative sampling, and cycles in random walks, are examined. To validate our ideas, we learn latent vectors of nodes from four large-scale real HIN datasets, including Blogcatalog, Yelp, DBLP and U.S. Patents, and use them as features for multi-label node classification and link prediction on those networks. Empirical results show that HIN2Vec soundly outperforms state-of-the-art representation learning models for network data, including DeepWalk, LINE, node2vec, PTE, HINE and ESim, by 6.6% to 23.8% in micro-$f_1$ for multi-label node classification and by 5% to 70.8% in $MAP$ for link prediction.
{"title":"HIN2Vec: Explore Meta-paths in Heterogeneous Information Networks for Representation Learning","authors":"Tao-yang Fu, Wang-Chien Lee, Zhen Lei","doi":"10.1145/3132847.3132953","DOIUrl":"https://doi.org/10.1145/3132847.3132953","url":null,"abstract":"In this paper, we propose a novel representation learning framework, namely HIN2Vec, for heterogeneous information networks (HINs). The core of the proposed framework is a neural network model, also called HIN2Vec, designed to capture the rich semantics embedded in HINs by exploiting different types of relationships among nodes. Given a set of relationships specified in forms of meta-paths in an HIN, HIN2Vec carries out multiple prediction training tasks jointly based on a target set of relationships to learn latent vectors of nodes and meta-paths in the HIN. In addition to model design, several issues unique to HIN2Vec, including regularization of meta-path vectors, node type selection in negative sampling, and cycles in random walks, are examined. To validate our ideas, we learn latent vectors of nodes using four large-scale real HIN datasets, including Blogcatalog, Yelp, DBLP and U.S. Patents, and use them as features for multi-label node classification and link prediction applications on those networks. Empirical results show that HIN2Vec soundly outperforms the state-of-the-art representation learning models for network data, including DeepWalk, LINE, node2vec, PTE, HINE and ESim, by 6.6% to 23.8% of $micro$-$f_1$ in multi-label node classification and 5% to 70.8% of $MAP$ in link prediction.","PeriodicalId":20449,"journal":{"name":"Proceedings of the 2017 ACM on Conference on Information and Knowledge Management","volume":"59 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2017-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88204773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mohamed Reda Bouadjenek, Karin M. Verspoor, J. Zobel
In this paper, we explore automatic biological sequence type classification for records in biological sequence databases. The sequence type attribute provides important information about the nature of the sequence represented in a record, and is often used in search to filter out irrelevant sequences. However, the sequence type attribute is generally a non-mandatory free-text field, and is thus subject to many errors, including typos, mis-assignment, and non-assignment. In GenBank, this problem concerns roughly 18% of records, an alarming number that should worry the biocuration community. To address this problem of automatic sequence type classification, we propose the use of the literature associated with sequence records as an external source of knowledge that can be leveraged for the classification task. We define a set of literature-based features and train a machine learning algorithm to classify a record into one of six primary sequence types. The main intuition behind using the literature for this task is that sequences appear to be discussed differently in scientific articles, depending on their type. The experiments we have conducted on the PubMed Central collection show that the literature is indeed an effective way to address this problem of sequence type classification. Our classification method reached an accuracy of 92.7%, and substantially outperformed the two baseline approaches used for comparison.
{"title":"Learning Biological Sequence Types Using the Literature","authors":"Mohamed Reda Bouadjenek, Karin M. Verspoor, J. Zobel","doi":"10.1145/3132847.3133051","DOIUrl":"https://doi.org/10.1145/3132847.3133051","url":null,"abstract":"We explore in this paper automatic biological sequence type classification for records in biological sequence databases. The sequence type attribute provides important information about the nature of a sequence represented in a record, and is often used in search to filter out irrelevant sequences. However, the sequence type attribute is generally a non-mandatory free-text field, and thus it is subject to many errors including typos, mis-assignment, and non-assignment. In GenBank, this problem concerns roughly 18% of records, an alarming number that should worry the biocuration community. To address this problem of automatic sequence type classification, we propose the use of literature associated to sequence records as an external source of knowledge that can be leveraged for the classification task. We define a set of literature-based features and train a machine learning algorithm to classify a record into one of six primary sequence types. The main intuition behind using the literature for this task is that sequences appear to be discussed differently in scientific articles, depending on their type. The experiments we have conducted on the PubMed Central collection show that the literature is indeed an effective way to address this problem of sequence type classification. Our classification method reached an accuracy of 92.7%, and substantially outperformed two baseline approaches used for comparison.","PeriodicalId":20449,"journal":{"name":"Proceedings of the 2017 ACM on Conference on Information and Knowledge Management","volume":"137 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2017-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86462155","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sandro Cavallari, V. Zheng, Hongyun Cai, K. Chang, E. Cambria
In this paper, we study an important yet largely under-explored setting of graph embedding, i.e., embedding communities instead of individual nodes. We find that community embedding is not only useful for community-level applications such as graph visualization, but also beneficial to both community detection and node classification. To learn such an embedding, our insight hinges upon a closed loop among community embedding, community detection and node embedding. On the one hand, node embedding can help improve community detection, which in turn outputs good communities for fitting a better community embedding. On the other hand, community embedding can be used to optimize the node embedding by introducing a community-aware high-order proximity. Guided by this insight, we propose a novel community embedding framework that jointly solves the three tasks. We evaluate the framework on multiple real-world datasets, and show that it improves graph visualization and outperforms state-of-the-art baselines in various application tasks, e.g., community detection and node classification.
{"title":"Learning Community Embedding with Community Detection and Node Embedding on Graphs","authors":"Sandro Cavallari, V. Zheng, Hongyun Cai, K. Chang, E. Cambria","doi":"10.1145/3132847.3132925","DOIUrl":"https://doi.org/10.1145/3132847.3132925","url":null,"abstract":"In this paper, we study an important yet largely under-explored setting of graph embedding, i.e., embedding communities instead of each individual nodes. We find that community embedding is not only useful for community-level applications such as graph visualization, but also beneficial to both community detection and node classification. To learn such embedding, our insight hinges upon a closed loop among community embedding, community detection and node embedding. On the one hand, node embedding can help improve community detection, which outputs good communities for fitting better community embedding. On the other hand, community embedding can be used to optimize the node embedding by introducing a community-aware high-order proximity. Guided by this insight, we propose a novel community embedding framework that jointly solves the three tasks together. We evaluate such a framework on multiple real-world datasets, and show that it improves graph visualization and outperforms state-of-the-art baselines in various application tasks, e.g., community detection and node classification.","PeriodicalId":20449,"journal":{"name":"Proceedings of the 2017 ACM on Conference on Information and Knowledge Management","volume":"21 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2017-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86086319","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Posts by users on microblogs such as Twitter provide diverse real-time updates on major events. Unfortunately, not all of this information is credible. Previous work on assessing the credibility of information on Twitter has focused on extracting features from the tweets themselves. In this work, we present an interactive framework called iFACT for assessing the credibility of claims from tweets. The proposed framework collects independent evidence from web search results (WSR) and identifies the dependencies between claims. It utilizes features from the search results to determine the probabilities that a claim is credible, not credible or inconclusive. Finally, the dependencies between claims are used to adjust the likelihood estimates of a claim being credible, not credible or inconclusive. iFACT allows users to be engaged in the credibility assessment process by providing feedback as to whether the web search results are relevant to, support or contradict a claim. Experimental results on multiple real-world datasets demonstrate the effectiveness of the WSR features and their ability to generalize to claims from new events. Case studies show the usefulness of claim dependencies and how the proposed approach can explain the credibility assessment process.
{"title":"iFACT: An Interactive Framework to Assess Claims from Tweets","authors":"Wee-Yong Lim, M. Lee, W. Hsu","doi":"10.1145/3132847.3132995","DOIUrl":"https://doi.org/10.1145/3132847.3132995","url":null,"abstract":"Posts by users on microblogs such as Twitter provide diverse real-time updates to major events. Unfortunately, not all the information are credible. Previous works that assess the credibility of information in Twitter have focused on extracting features from the Tweets. In this work, we present an interactive framework called iFACT for assessing the credibility of claims from tweets. The proposed framework collects independent evidence from web search results (WSR) and identify the dependencies between claims. It utilizes features from the search results to determine the probabilities that a claim is credible, not credible or inconclusive. Finally, the dependencies between claims are used to adjust the likelihood estimates of a claim being credible, not credible or inconclusive. iFACT allows users to be engaged in the credibility assessment process by providing feedback as to whether the web search results are relevant, support or contradict a claim. Experiment results on multiple real world datasets demonstrate the effectiveness of WSR features and its ability to generalize to claims of new events. Case studies show the usefulness of claim dependencies and how the proposed approach can give explanation to the credibility assessment process.","PeriodicalId":20449,"journal":{"name":"Proceedings of the 2017 ACM on Conference on Information and Knowledge Management","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2017-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79364467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Meng Hu, Zhixu Li, Yongxin Shen, An Liu, Guanfeng Liu, Kai Zheng, Lei Zhao
Information Extraction by Text Segmentation (IETS) aims at segmenting text inputs to extract implicit data values contained in them. The state-of-the-art IETS approaches mainly rely on machine learning techniques, either supervised or unsupervised. However, while the supervised approaches require large amounts of labelled training data, the performance of the unsupervised ones can be unstable across data sets. To overcome their weaknesses, this paper introduces CNN-IETS, a novel unsupervised probabilistic approach that takes advantage of pre-existing data and a Convolutional Neural Network (CNN)-based probabilistic classification model. While using the CNN model eases the burden of selecting high-quality features for associating text segments with attributes of a given domain, the pre-existing data, as a domain knowledge base, provides training data with a comprehensive list of features for building the CNN model. Given an input text, we perform an initial segmentation (according to the occurrences of its words in the knowledge base) to generate text segments for CNN classification with probabilities. Then, based on the probabilistic CNN classification results, we find the most probable labelling of the whole input text. As a complement, a bidirectional sequencing model learned on demand from test data is finally deployed to further adjust problematic labelled segments. Our experimental study conducted on several real data collections shows that CNN-IETS improves the extraction quality of state-of-the-art approaches by more than 10%.
{"title":"CNN-IETS: A CNN-based Probabilistic Approach for Information Extraction by Text Segmentation","authors":"Meng Hu, Zhixu Li, Yongxin Shen, An Liu, Guanfeng Liu, Kai Zheng, Lei Zhao","doi":"10.1145/3132847.3132962","DOIUrl":"https://doi.org/10.1145/3132847.3132962","url":null,"abstract":"Information Extraction by Text Segmentation (IETS) aims at segmenting text inputs to extract implicit data values contained in them.The state-of-art IETS approaches mainly rely on machine learning techniques, either supervised or unsupervised.However, while the supervised approaches require a large labelled training data, the performance of the unsupervised ones could be unstable on different data sets.To overcome their weaknesses, this paper introduces CNN-IETS, a novel unsupervised probabilistic approach that takes the advantages of pre-existing data and a Convolution Neural Network (CNN)-based probabilistic classification model. While using the CNN model can ease the burden of selecting high-quality features in associating text segments with attributes of a given domain, the pre-existing data as a domain knowledge base can provide training data with a comprehensive list of features for building the CNN model.Given an input text, we do initial segmentation (according to the occurrences of these words in the knowledge base) to generate text segments for CNN classification with probabilities. Then, based on the probabilistic CNN classification results, we work on finding the most probable labelling way to the whole input text.As a complementary, a bidirectional sequencing model learned on-demand from test data is finally deployed to do further adjustment to some problematic labelled segments.Our experimental study conducted on several real data collections shows that CNN-IETS improves the extraction quality of state-of-art approaches by more than 10%.","PeriodicalId":20449,"journal":{"name":"Proceedings of the 2017 ACM on Conference on Information and Knowledge Management","volume":"284 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2017-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83432220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we propose a new problem: covering the optimal time window over temporal data. Given a duration constraint d and a set of users, where each user has multiple time intervals, the goal is to find all time windows which (1) are greater than or equal to the duration d, and (2) can be covered by the intervals of as many users as possible. This problem applies to real scenarios where people need to determine the best time to maximize the number of people involved in an activity, e.g., meeting organization and online live video broadcasting. As far as we know, there is no existing algorithm that can solve the problem directly. In this paper, we propose two algorithms to solve it. The first is a baseline algorithm called sliding time window (STW), where we utilize the start and end points of all users' intervals to construct time windows satisfying the duration d, and then calculate the number of users whose intervals can cover each time window. The second method, named TLI, is designed based on the data structures of the Timeline Index in SAP HANA. The TLI algorithm consists of three consecutive phases aimed at improving efficiency: construction of the Timeline Index, calculation of the valid user set, and calculation of the time windows. Within the third phase, we prune the number of time windows by keeping track of the number of users in the current optimal time window, which helps shrink the search space. Through extensive experimental evaluations, we find that the TLI algorithm outperforms STW by two orders of magnitude in terms of query time.
{"title":"Covering the Optimal Time Window Over Temporal Data","authors":"Bin Cao, Chenyu Hou, Jing Fan","doi":"10.1145/3132847.3132935","DOIUrl":"https://doi.org/10.1145/3132847.3132935","url":null,"abstract":"In this paper, we propose a new problem: covering the optimal time window over temporal data. Given a duration constraint d and a set of users where each user has multiple time intervals, the goal is to find all time windows which (1) are greater than or equal to the duration d, and (2) can be covered by the intervals from as many as possible users. This problem can be applied to real scenarios where people need to determine the best time for maximizing the number of people to be involved in an activity, e.g., the meeting organization and the online live video broadcasting. As far as we know, there is no existing algorithm that can solve the problem directly. In this paper, we propose two algorithms to solve the problem, the first one is considered as a baseline algorithm called sliding time window (STW), where we utilize the start and end points of all users - intervals to construct time windows satisfying duration d. And then we calculate the number of users whose intervals can cover the current time window. The second method, named TLI, is designed based on the the data structures from the Timeline Index in SAP HANA. In TLI algorithm, we conduct three consecutive phases to achieve the purpose of efficiency improvement, namely construction of Timeline Index, calculation of valid user set and calculation of time windows. Within the third phase, we prune the number of time windows by keeping track of the number of users in current optimal time window, which can help shrink the search space. Through extensive experimental evaluations, we find TLI algorithm outperforms STW two orders of magnitude in terms of querying time.","PeriodicalId":20449,"journal":{"name":"Proceedings of the 2017 ACM on Conference on Information and Knowledge Management","volume":"SE-11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2017-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84638126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Extractive text summarization has been an extensively researched problem in the field of natural language understanding. While conventional approaches rely mostly on manually compiled features to generate the summary, few attempts have been made to develop data-driven systems for extractive summarization. To this end, we present a fully data-driven end-to-end deep network, which we call Hybrid MemNet, for the single-document summarization task. The network learns a continuous unified representation of a document before generating its summary. It jointly captures local and global sentential information along with the notion of summary-worthy sentences. Experimental results on two different corpora confirm that our model shows significant performance gains compared with state-of-the-art baselines.
{"title":"Hybrid MemNet for Extractive Summarization","authors":"A. Singh, Manish Gupta, Vasudeva Varma","doi":"10.1145/3132847.3133127","DOIUrl":"https://doi.org/10.1145/3132847.3133127","url":null,"abstract":"Extractive text summarization has been an extensive research problem in the field of natural language understanding. While the conventional approaches rely mostly on manually compiled features to generate the summary, few attempts have been made in developing data-driven systems for extractive summarization. To this end, we present a fully data-driven end-to-end deep network which we call as Hybrid MemNet for single document summarization task. The network learns the continuous unified representation of a document before generating its summary. It jointly captures local and global sentential information along with the notion of summary worthy sentences. Experimental results on two different corpora confirm that our model shows significant performance gains compared with the state-of-the-art baselines.","PeriodicalId":20449,"journal":{"name":"Proceedings of the 2017 ACM on Conference on Information and Knowledge Management","volume":"105 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2017-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79545684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LinkedIn, as a professional network, serves the career needs of 450 million plus members. The task of a job recommendation system is to find suitable jobs among a corpus of several million jobs and serve them in real time under tight latency constraints. Job search involves finding suitable job listings given a user, query and context. A typical scoring function for both search and recommendations involves evaluating a function that matches various fields in the job description with various fields in the member profile. This in turn translates to evaluating a function with several thousands of features to get the right ranking. In recommendations, evaluating all the jobs in the corpus for all members is not possible given the latency constraints. On the other hand, reducing the candidate set could potentially involve loss of relevant jobs. We present a way to model the underlying complex ranking function via decision trees. The branches within the decision trees are query clauses, and hence the decision trees can be mapped onto real-time queries. We developed an offline framework which evaluates the quality of the decision tree with respect to latency and recall. We tested the approach on job search and recommendations on LinkedIn, and A/B tests show significant improvements in member engagement and latency. Our techniques helped reduce job search latency by over 67% and our recommendations latency by over 55%. Our techniques show a 3.5% improvement in applications from job recommendations, primarily due to reduced timeouts from upstream services. As of writing, the approach powers all of job search and recommendations on LinkedIn.
{"title":"Latency Reduction via Decision Tree Based Query Construction","authors":"Aman Grover, Dhruv Arya, Ganesh Venkataraman","doi":"10.1145/3132847.3132865","DOIUrl":"https://doi.org/10.1145/3132847.3132865","url":null,"abstract":"LinkedIn as a professional network serves the career needs of 450 Million plus members. The task of job recommendation system is to nd the suitable job among a corpus of several million jobs and serve this in real time under tight latency constraints. Job search involves nding suitable job listings given a user, query and context. Typical scoring function for both search and recommendations involves evaluating a function that matches various elds in the job description with various elds in the member pro le. This in turn translates to evaluating a function with several thousands of features to get the right ranking. In recommendations, evaluating all the jobs in the corpus for all members is not possible given the latency constraints. On the other hand, reducing the candidate set could potentially involve loss of relevant jobs. We present a way to model the underlying complex ranking function via decision trees. The branches within the decision trees are query clauses and hence the decision trees can be mapped on to real time queries. We developed an o ine framework which evaluates the quality of the decision tree with respect to latency and recall. We tested the approach on job search and recommendations on LinkedIn and A/B tests show signi cant improvements in member engagement and latency. Our techniques helped reduce job search latency by over 67% and our recommendations latency by over 55%. Our techniques show 3.5% improvement in applications from job recommendations primarily due to reduced timeouts from upstream services. As of writing the approach powers all of job search and recommendations on LinkedIn.","PeriodicalId":20449,"journal":{"name":"Proceedings of the 2017 ACM on Conference on Information and Knowledge Management","volume":"39 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2017-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85309641","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Marco Cristo, Raíza Hanada, A. Carvalho, Fernando Anglada Lores, M. G. Pimentel
Word recognition is a challenging task faced by many applications, especially in very noisy scenarios. This problem is usually seen as the transmission of a word through a noisy channel, such that it is necessary to determine which known word of a lexicon corresponds to the received string. To make this feasible, only a reduced set of candidate words is selected. They are usually chosen if they can be transformed into the input string by applying up to k character edit operations. To rank the candidates, the most effective estimates use domain knowledge about noise sources and error distributions, extracted from real usage data. In scenarios with much noise, however, such estimates, and the index strategies normally required, do not scale well, as they grow exponentially with k and the lexicon size. In this work, we propose very efficient methods for word recognition in very noisy scenarios which support effective edit-based distance algorithms in a Mor-Fraenkel index, searchable using minimum perfect hashing. The method allows the early processing of the most promising candidates, such that fast pruned searches present negligible loss in word ranking quality. We also propose a linear heuristic for estimating edit-based distances which takes advantage of information already provided by the index. Our methods achieve precision similar to a state-of-the-art approach, while being about ten times faster.
{"title":"Fast Word Recognition for Noise channel-based Models in Scenarios with Noise Specific Domain Knowledge","authors":"Marco Cristo, Raíza Hanada, A. Carvalho, Fernando Anglada Lores, M. G. Pimentel","doi":"10.1145/3132847.3133028","DOIUrl":"https://doi.org/10.1145/3132847.3133028","url":null,"abstract":"Word recognition is a challenging task faced by many applications, specially in very noisy scenarios. This problem is usually seen as the transmission of a word through a noisy-channel, such that it is necessary to determine which known word of a lexicon is the received string. To be feasible, just a reduced set of candidate words are selected. They are usually chosen if they can be transformed into the input string by applying up to k character edit operations. To rank the candidates, the most effective estimates use domain knowledge about noise sources and error distributions, extracted from real use data. In scenarios with much noise, however, such estimates, and the index strategies normally required, do not scale well as they grow exponentially with k and the lexicon size. In this work, we propose very efficient methods for word recognition in very noisy scenarios which support effective edit-based distance algorithms in a Mor-Fraenkel index, searchable using a minimum perfect hashing. The method allows the early processing of most promising candidates, such that fast pruned searches present negligible loss in word ranking quality. We also propose a linear heuristic for estimating edit-based distances which take advantage of information already provided by the index. Our methods achieve precision similar to a state-of-the-art approach, being about ten times faster.","PeriodicalId":20449,"journal":{"name":"Proceedings of the 2017 ACM on Conference on Information and Knowledge Management","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2017-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90787331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multileaved comparison methods generalize interleaved comparison methods to provide a scalable approach for comparing ranking systems based on regular user interactions. Such methods enable the increasingly rapid research and development of search engines. However, existing multileaved comparison methods that provide reliable outcomes do so by degrading the user experience during evaluation. Conversely, current multileaved comparison methods that maintain the user experience cannot guarantee correctness. Our contribution is two-fold. First, we propose a theoretical framework for systematically comparing multileaved comparison methods using the notions of considerateness, which concerns maintaining the user experience, and fidelity, which concerns reliable correct outcomes. Second, we introduce a novel multileaved comparison method, Pairwise Preference Multileaving (PPM), that performs comparisons based on document-pair preferences, and prove that it is considerate and has fidelity. We show empirically that, compared to previous multileaved comparison methods, PPM is more sensitive to user preferences and scalable with the number of rankers being compared.
{"title":"Sensitive and Scalable Online Evaluation with Theoretical Guarantees","authors":"Harrie Oosterhuis, M. de Rijke","doi":"10.1145/3132847.3132895","DOIUrl":"https://doi.org/10.1145/3132847.3132895","url":null,"abstract":"Multileaved comparison methods generalize interleaved comparison methods to provide a scalable approach for comparing ranking systems based on regular user interactions. Such methods enable the increasingly rapid research and development of search engines. However, existing multileaved comparison methods that provide reliable outcomes do so by degrading the user experience during evaluation. Conversely, current multileaved comparison methods that maintain the user experience cannot guarantee correctness. Our contribution is two-fold. First, we propose a theoretical framework for systematically comparing multileaved comparison methods using the notions of considerateness, which concerns maintaining the user experience, and fidelity, which concerns reliable correct outcomes. Second, we introduce a novel multileaved comparison method, Pairwise Preference Multileaving (PPM), that performs comparisons based on document-pair preferences, and prove that it is considerate and has fidelity. We show empirically that, compared to previous multileaved comparison methods, PPM is more sensitive to user preferences and scalable with the number of rankers being compared.","PeriodicalId":20449,"journal":{"name":"Proceedings of the 2017 ACM on Conference on Information and Knowledge Management","volume":"10 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2017-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91270613","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}