Proceedings Eighth Symposium on String Processing and Information Retrieval: Latest Articles

A subquadratic algorithm for cluster and outlier detection in massive metric data

Proceedings Eighth Symposium on String Processing and Information Retrieval

Pub Date : 1900-01-01 DOI: 10.1109/SPIRE.2001.10018

Edgar Chávez

The problem of cluster and outlier detection is a classic problem of non-parametric statistics. In recent times, the need for cluster analysis in massive multimedia data sets (terabytes of data sampled from a metric space) has demonstrated the need for solutions that cluster metric data automatically and at reasonable speed. Since cluster properties involve the relationship between each pair of data set elements, a good clustering algorithm must examine (in principle) every pairwise distance and hence has quadratic complexity. An appealing trend to achieve subquadratic complexity is either a) to use an approximation of a classic clustering algorithm or b) to design a new algorithm for clustering. This paper presents a new clustering algorithm performing O(n^(1+α)) distance computations (the operation of leading complexity), with 0 ⩽ α ⩽ 1 a constant depending on the intrinsic dimension of the sample data. The algorithm can detect outliers in the sample data and, if desired, it can produce a hierarchical structure (a dendrogram) pointing to clusters at different resolutions.
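
For reference, the quadratic cost mentioned in the abstract can be made concrete with a brute-force baseline. The sketch below is not the paper's algorithm; it is a minimal illustration, under the assumption that an outlier can be scored by its nearest-neighbor distance, of why examining every pair costs n(n-1)/2 distance computations. The function name and the sample data are hypothetical.

```python
from itertools import combinations


def nearest_neighbor_distances(points, dist):
    """Brute-force baseline: evaluate dist on every unordered pair.

    Returns each point's nearest-neighbor distance and the number of
    distance computations performed, which is n * (n - 1) / 2.
    """
    nearest = [float("inf")] * len(points)
    calls = 0
    for i, j in combinations(range(len(points)), 2):
        d = dist(points[i], points[j])
        calls += 1
        nearest[i] = min(nearest[i], d)
        nearest[j] = min(nearest[j], d)
    return nearest, calls


# Hypothetical 1D sample: four close points and one far-away point.
points = [0.0, 0.1, 0.2, 0.3, 5.0]
nearest, calls = nearest_neighbor_distances(points, lambda x, y: abs(x - y))

# The point with the largest nearest-neighbor distance is the outlier candidate.
outlier_index = nearest.index(max(nearest))
```

With five points the baseline performs ten distance computations; any metric can be passed as `dist`, since the function only relies on pairwise distances.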

Citations: 0

Relating web characteristics with link based web page ranking

Proceedings Eighth Symposium on String Processing and Information Retrieval

Pub Date : 1900-01-01 DOI: 10.1109/SPIRE.2001.989734

Ricardo Baeza-Yates, C. Castillo

In recent years, several techniques based on link analysis have been proposed and used in search engines to rank Web pages. As links are generated by humans, link-based ranking seems to give better results than traditional automatic techniques such as word-based ranking. However, no studies have been done on their real impact. In this paper we extend global page ranking techniques to Web site ranking, and perform a first experimental analysis of link ranking with regard to the structure and dynamics of the Web.
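
As an illustration of what a global link-based ranking computes, the sketch below implements PageRank by power iteration. PageRank is used here only as a well-known example of the technique family; the abstract does not say which ranking methods the paper evaluates. The graph, the damping value of 0.85, and the iteration count are assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {node: [outgoing targets]}.

    Nodes without out-links (dangling nodes) spread their rank uniformly
    over all nodes, so the ranks always sum to 1.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in nodes}
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical four-node graph: "c" receives links from both "b" and "d".
graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
```

The same function could be applied at site level by first collapsing all pages of a site into one node, although that aggregation step is not shown here.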

Citations: 43

On-line construction of symmetric compact directed acyclic word graphs

Proceedings Eighth Symposium on String Processing and Information Retrieval

Pub Date : 1900-01-01 DOI: 10.1109/SPIRE.2001.989743

Shunsuke Inenaga, H. Hoshino, A. Shinohara, M. Takeda, S. Arikawa

The Compact Directed Acyclic Word Graph (CDAWG) is a space-efficient data structure that supports indices of a string. The Symmetric Compact Directed Acyclic Word Graph (SCDAWG) for a string w is a dual structure that supports indices of both w and the reverse of w simultaneously. Blumer et al. gave the first algorithm to construct an SCDAWG from a given string, which works in an off-line manner. In this paper, we show an on-line algorithm that constructs an SCDAWG from a given string directly.

Citations: 12

Using edit distance in point-pattern matching

Proceedings Eighth Symposium on String Processing and Information Retrieval

Pub Date : 1900-01-01 DOI: 10.1109/SPIRE.2001.989751

V. Makinen

Edit distance is a powerful measure of similarity in string matching, measuring the minimum number of insertions, deletions, and substitutions needed to convert one string into another. This measure is often contrasted with time warping in speech processing, which measures how close two trajectories are by allowing compression and expansion operations on the time scale. Time warping can be easily generalized to measure the similarity between 1D point-patterns (ascending lists of real values), as the difference between the ith and (i − 1)th points in a point-pattern can be considered as the value of a trajectory at time i. However, we show that edit distance is a more natural choice, and derive a measure by calculating the minimum amount of space that must be inserted and deleted between points to convert one point-pattern into another. We show that this measure defines a metric. We also define a substitution operation such that the distance calculation automatically separates the points into matching and mismatching points. The algorithms are based on dynamic programming. The main motivation for these methods is two- and higher-dimensional point-pattern matching, and therefore we generalize these methods to the 2D case, and show that this generalization leads to an NP-complete problem. There are also applications for the 1D case; we briefly discuss the matching of tree ring sequences in dendrochronology.
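
The string edit distance that the abstract starts from can be sketched with the classic dynamic-programming recurrence. This is the standard unit-cost string version only, given as background; it is not the point-pattern measure derived in the paper, whose cost model is not specified in the abstract. The function name is chosen for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance between a and b.

    Uses the standard dynamic-programming recurrence, keeping only the
    previous row of the table to save space.
    """
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca by cb, or match
            ))
        prev = curr
    return prev[-1]
```

For example, `edit_distance("kitten", "sitting")` evaluates to 3: two substitutions and one insertion.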

Citations: 1

On compression of parse trees

Proceedings Eighth Symposium on String Processing and Information Retrieval

Pub Date : 1900-01-01 DOI: 10.1109/SPIRE.2001.989759

J. Tarhio

We consider methods for compressing parse trees, especially techniques based on statistical modeling. We regard a sequence of productions corresponding to a suffix of the path from the root of a tree to a node x as the context of node x. The contexts are augmented with branching information of the nodes. By applying the text compression algorithm PPM on such contexts, we achieve good compression results. We compare the PPM approach experimentally with other methods.

Citations: 10

Of maps bigger than the empire

Proceedings Eighth Symposium on String Processing and Information Retrieval

Pub Date : 1900-01-01 DOI: 10.1109/SPIRE.2001.989732

A. Apostolico

In a passage by J.L. Borges on the "exactitude of Science," a fictitious author describes an Empire in which the art of Cartography "logro tal perfeccion que el mapa de una sola Provincia ocupaba toda la Ciudad, y el mapa del Imperio toda una Provincia." With time, these huge maps wouldn't be enough, and the Colleges of the Cartographers erected a map of the Empire that equalled in width the Empire itself... This paper concerns itself with increasing cases of pattern discovery and data mining in which synopses, indices and relationships thereof seem to grow faster and bigger than the phenomena they were meant to encapsulate. The paper then reviews specific examples of algorithmic and combinatorial constructs that proved capable of alleviating such paradoxes in the author's recent work experience.

Citations: 4

Semantic labeling - unveiling the main components of meaning of free-text

Proceedings Eighth Symposium on String Processing and Information Retrieval

Pub Date : 1900-01-01 DOI: 10.1109/SPIRE.2001.10027

Y. Zieman, R. Salas

An experimentally proven methodology for computing semantic labels for natural language and its use in semantic processing of text is described. A combinatorial model of the conceptual space is created where semantic labels result as combinations of primary or atomic concepts called Semantic Factors. A set of about 2,500 Semantic Factors is defined. The basic semantic element of a language is a morpheme-type element (s-morpheme), the minimal part of a language that bears its own meaning. All s-morphemes in the Knowledge Base (about 15,000 for English) are labeled. The label for a phrase (its "Concept Code") results as a combination of the labels for the s-morphemes constituting it. Algorithms are designed to identify the s-morphemes in a phrase and to generate the phrase's Concept Code. The matching procedure compares Concept Codes and identifies conceptually close ones: those sharing a maximal number of Semantic Factors. Similarity is identified here as a match between the Concept Codes of two Text objects. Since a Concept Code is essentially language independent, this technology is appropriate for implementation in cross-language applications. An example is described of an application in the bio-medical domain, where documents from a database of more than 12 million titles are being successfully retrieved in about 50% of the queries normally rejected by traditional search methods.
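
The matching idea in the abstract can be illustrated with a small sketch. Everything concrete here is an assumption: Concept Codes are modeled as sets of factor names, the "combination" of s-morpheme labels is modeled as a set union, closeness is measured by the number of shared factors, and the s-morphemes and Semantic Factors in the example are invented. The abstract does not describe the actual encoding.

```python
def concept_code(s_morphemes, labels):
    """Combine the labels of a phrase's s-morphemes into one code.

    Assumption: a label is a set of Semantic Factors and the
    combination is a plain set union.
    """
    code = set()
    for morpheme in s_morphemes:
        code |= labels[morpheme]
    return frozenset(code)


def closest(query_code, candidates):
    """Return the candidate name whose code shares the most factors with the query."""
    return max(candidates, key=lambda name: len(query_code & candidates[name]))


# Hypothetical knowledge base: s-morpheme -> set of Semantic Factors.
labels = {
    "cardi": {"HEART"},
    "heart": {"HEART"},
    "itis": {"INFLAMMATION"},
    "inflam": {"INFLAMMATION"},
    "liver": {"LIVER"},
}

query = concept_code(["cardi", "itis"], labels)
candidates = {
    "heart inflammation": concept_code(["heart", "inflam"], labels),
    "liver inflammation": concept_code(["liver", "inflam"], labels),
}
best = closest(query, candidates)
```

In this toy setup the query shares two factors with "heart inflammation" and one with "liver inflammation", so the former is selected even though no surface string is shared with the query.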

Citations: 1
