In this paper, we apply three classification learning algorithms to the Telugu Named Entity Recognition (NER) task and present a comparative study of these algorithms on the Telugu dataset from the NER for South and South-East Asian Languages (NERSSEAL) competition. The empirical results show that the Support Vector Machine achieves the best F-measure of 54.78% on this dataset.
{"title":"A Comparative Study of Named Entity Recognition for Telugu","authors":"SaiKiranmai Gorla, N. B. Murthy, Aruna Malapati","doi":"10.1145/3158354.3158358","DOIUrl":"https://doi.org/10.1145/3158354.3158358","url":null,"abstract":"In this paper, we apply three classification learning algorithms to Telugu Named Entity Recognition (NER) task and we present a comparative study between these three learning algorithms on Telugu dataset (NER for South and South-East Asian Languages (NERSSEAL) Competition). The empirical results show that Support Vector Machine achieves the best F-measure of 54.78% on the dataset.","PeriodicalId":306212,"journal":{"name":"Proceedings of the 9th Annual Meeting of the Forum for Information Retrieval Evaluation","volume":"302 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133725832","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Built on a deep learning architecture, the Multilayer Extreme Learning Machine (ML-ELM) has many characteristics that make it a distinctive and widely used classifier in text mining. Its salient features include non-linear mapping of features into a high-dimensional space, a high level of data abstraction, the absence of backpropagation, and a fast learning rate. This paper studies the importance of the ML-ELM feature space and tests the performance of various traditional clustering techniques on it. Empirical results show that the ML-ELM feature space is more efficient and effective than the TF-IDF vector space, which supports the prominence of deep learning.
{"title":"Feature Space of Deep Learning and its Importance: Comparison of Clustering Techniques on the Extended Space of ML-ELM","authors":"R. Roul, Amit Agarwal","doi":"10.1145/3158354.3158359","DOIUrl":"https://doi.org/10.1145/3158354.3158359","url":null,"abstract":"Based on the architecture of deep learning, Multilayer Extreme Learning Machine (ML-ELM) has many good characteristics which make it distinct and widespread classifier in the domain of text mining. Some of its salient features include non-linear mapping of features into a high dimensional space, high level of data abstraction, no backpropagation, higher rate of learning etc. This paper studies the importance of ML-ELM feature space and tested the performance of various traditional clustering techniques on this feature space. Empirical results show the efficiency and effectiveness of the feature space of ML-ELM compared to TF-IDF vector space which justifies the prominence of deep learning.","PeriodicalId":306212,"journal":{"name":"Proceedings of the 9th Annual Meeting of the Forum for Information Retrieval Evaluation","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121101298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper describes the benchmarking and analysis of five Automatic Search Query Enhancement (ASQE) algorithms that utilise Wikipedia as the sole source of a priori knowledge. The contributions of this paper include: 1) a comprehensive review of current ASQE algorithms that utilise Wikipedia as the sole source of a priori knowledge; 2) benchmarking of five existing ASQE algorithms using the TREC-9 Web Topics on the ClueWeb12 data set; and 3) analysis of the results from the benchmarking process to identify the strengths and weaknesses of each algorithm. During the benchmarking process, 2,500 relevance assessments were performed. Results of these tests are analysed using Average Precision @10 per query and Mean Average Precision @10 per algorithm. From this analysis we show that the scope of a priori knowledge utilised during enhancement and the term weighting methods available from Wikipedia can further aid the ASQE process. Although the approaches taken by the algorithms are still relevant, an over-dependence on the weighting schemes and data sources used can easily impact the results of an ASQE algorithm.
{"title":"A Comparison of Automatic Search Query Enhancement Algorithms That Utilise Wikipedia as a Source of A Priori Knowledge","authors":"Kyle Goslin, M. Hofmann","doi":"10.1145/3158354.3158356","DOIUrl":"https://doi.org/10.1145/3158354.3158356","url":null,"abstract":"This paper describes the benchmarking and analysis of five Automatic Search Query Enhancement (ASQE) algorithms that utilise Wikipedia as the sole source for a priori knowledge. The contributions of this paper include: 1) A comprehensive review into current ASQE algorithms that utilise Wikipedia as the sole source for a priori knowledge; 2) benchmarking of five existing ASQE algorithms using the TREC-9 Web Topics on the ClueWeb12 data set and 3) analysis of the results from the benchmarking process to identify the strengths and weaknesses each algorithm. During the benchmarking process, 2,500 relevance assessments were performed. Results of these tests are analysed using the Average Precision @10 per query and Mean Average Precision @10 per algorithm. From this analysis we show that the scope of a priori knowledge utilised during enhancement and the available term weighting methods available from Wikipedia can further aid the ASQE process. Although approaches taken by the algorithms are still relevant, an over dependence on weighting schemes and data sources used can easily impact results of an ASQE algorithm.","PeriodicalId":306212,"journal":{"name":"Proceedings of the 9th Annual Meeting of the Forum for Information Retrieval Evaluation","volume":"143 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115631278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Text line segmentation is a challenging task in Optical Character Recognition due to variations in writing style and to characters or matras that touch across lines. In this paper, we propose an algorithm for dividing merged lines into individual lines in handwritten bilingual (Marathi-English) documents. The algorithm is tested on different images and obtains promising results. Afterward, the script is identified at the word level using a fusion of moment-based features and visual discriminating features. Two classifiers are evaluated on a dataset consisting of 242 Marathi-English words for training and 82 words for testing. We obtain an average identification accuracy of 67% with the K-NN classifier and 80.14% with the SVM classifier.
{"title":"Segmentation of Merged Lines and Script Identification in Handwritten Bilingual Documents","authors":"Ranjana S. Zinjore, R. Ramteke, Varsha M. Pathak","doi":"10.1145/3158354.3158360","DOIUrl":"https://doi.org/10.1145/3158354.3158360","url":null,"abstract":"Text line segmentation is a challenging task in Optical Character Recognition, due to writing style of writers and touching characters or Matra between lines. In this paper, we have proposed an algorithm for dividing the merged lines into individual multiple lines from Handwritten Bilingual (Marathi-English) documents. The algorithm is tested on different images; we have obtained promising results. Afterward, script is identifying at word level using fusion of moment based features and visual discriminating features. Two different classifiers are evaluated on a dataset consisting of 242 Marathi-English words for training and 82 words for testing. We have received average identification accuracy of 67% in K-NN classifier and 80.14% in SVM classifier.","PeriodicalId":306212,"journal":{"name":"Proceedings of the 9th Annual Meeting of the Forum for Information Retrieval Evaluation","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122923167","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Text exchanged in social media conversations is often noisy, with a mixture of stylistic and misspelt variations of the original words. Standard NLP techniques applied to such data, such as POS tagging and named entity recognition, suffer because of the noisy nature of the input. The use of mixed-script text is also prevalent among social media users. The current work addresses word-level language identification in mixed-script scenarios, where all the text is written in Roman script but the words used are transliterations of native-language words. The core of the problem is identifying the language of small fragments of text from among a set of candidate languages. We propose a two-stage approach for word-level language identification. In the first stage, the mixing language combination is identified using character n-grams of the sentence. The second stage uses the predicted mixing combination class to perform word-level language identification. We further apply Conditional Random Fields (CRF) in the second stage to improve the performance of word-level language identification. This simplification is essential; otherwise, the number of states in the model would be huge and the resulting predictions very noisy. Our methods improve the F-score of word-level language identification by over 10% compared to the baseline.
{"title":"Language Identification in Mixed Script","authors":"Nagesh Bhattu Sristy, N. S. Krishna, B. S. Krishna, V. Ravi","doi":"10.1145/3158354.3158357","DOIUrl":"https://doi.org/10.1145/3158354.3158357","url":null,"abstract":"The text exchanged in social media conversations is often noisy with a mixture of stylistic and misspelt variations of original words. Any standard NLP techniques applied on such data such as POS tagging, Named entity recognition suffer because of noisy nature of the input. Usage of mixed script text is also prevalent in social media users. The current work addresses the identification of language at word level in mixed script scenarios, where all the text is written in roman script but the words being used by the users are transliterations of original words in native language into english. The core part of the problem is identifying the language, looking at small fragments of text among a set of languages. We propose a two stage approach for word-level language identification. In the first stage a mixing language combination is identified by using character n-grams of the sentence. Second stage consists of using the previous mixing combination class to make the word level language identification. We apply Conditional Random Fields(CRF) further in second stage to improve the performance of the word level language identification. Such simplification is essential, otherwise the number of states of the model will be huge and resultant model predictions are very noisy. Our methods improve the F-score of word level language identification by over 10% compared to the base-line.","PeriodicalId":306212,"journal":{"name":"Proceedings of the 9th Annual Meeting of the Forum for Information Retrieval Evaluation","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124529960","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The collective intelligence of crowds is distilled in Community Question Answering (CQA) services such as Quora, Yahoo Answers, and Stack Overflow, wherein users share their knowledge, providing both informational and experiential support to other users. As users often search for similar information, there is a high probability that, for a new incoming question, a related question-answer pair already exists in the CQA dataset. Therefore, an efficient technique for similar question identification is the need of the hour. While data is not a bottleneck in this scenario, addressing the vocabulary diversity generated by a diverse pool of users certainly is. This paper proposes a novel tripartite neural network based approach to the similar question retrieval problem. The network takes as input a triplet consisting of an existing question, its answer, and a new question, and learns internal representations from the similarities among them. Our approach achieves classification performance of up to 77% on a real-world CQA dataset. We also compared our method with two other baselines and found that it performs significantly better in handling the problems of vocabulary diversity and 'zero lexical overlap' among questions.
{"title":"Improving Similar Question Retrieval using a Novel Tripartite Neural Network based Approach","authors":"Anirban Sen, Manjira Sinha, Sandya Mannarswamy","doi":"10.1145/3158354.3158355","DOIUrl":"https://doi.org/10.1145/3158354.3158355","url":null,"abstract":"Collective intelligence of the crowds is distilled together in various Community Question Answering (CQA) Services such as Quora, Yahoo Answers, Stack Overflow forums, wherein users share their knowledge, providing both informational and experiential support to other users. As users often search for similar information, probabilities are high that for a new incoming question, there is a related question-answer pair existing in the CQA dataset. Therefore, an efficient technique for similar question identification is need of the hour. While data is not a bottleneck in this scenario, addressing the vocabulary diversity generated by a variety pool of users certainly is. This paper proposes a novel tripartite neural network based approach towards the similar question retrieval problem. The network takes inputs in the form of question-answer and new question triplet and learns internal representations from similarities among them. Our approach achieves classification performances upto 77% on a real world CQA dataset.We have also compared our method with two other baselines and found that it performs significantly better in handling the problem of vocabulary diversity and 'zero-lexical overlap' among questions.","PeriodicalId":306212,"journal":{"name":"Proceedings of the 9th Annual Meeting of the Forum for Information Retrieval Evaluation","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121387328","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the 9th Annual Meeting of the Forum for Information Retrieval Evaluation","authors":"","doi":"10.1145/3158354","DOIUrl":"https://doi.org/10.1145/3158354","url":null,"abstract":"","PeriodicalId":306212,"journal":{"name":"Proceedings of the 9th Annual Meeting of the Forum for Information Retrieval Evaluation","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116363098","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}