Pub Date : 1900-01-01 DOI: 10.1109/SPIRE.2001.10018
Edgar Chávez
The problem of cluster and outlier detection is a classic problem of non-parametric statistics. In recent times, the need for cluster analysis in massive multimedia data sets (terabytes of data sampled from a metric space) has demonstrated the need for solutions that cluster metric data automatically and at reasonable speed. Since cluster properties involve the relationship between each pair of data set elements, a good clustering algorithm must examine (in principle) every pairwise distance and hence has quadratic complexity. An appealing trend to achieve subquadratic complexity is either a) to use an approximation of a classic clustering algorithm or b) to design a new algorithm for clustering. This paper presents a new clustering algorithm performing O(n^(1+α)) distance computations (the operation of leading complexity), with 0 ≤ α ≤ 1 a constant depending on the intrinsic dimension of the sample data. The algorithm can detect outliers in the sample data and, if desired, it can produce a hierarchical structure (a dendrogram) pointing to clusters at different resolutions.
{"title":"A subquadratic algorithm for cluster and outlier detection in massive metric data","authors":"Edgar Chávez","doi":"10.1109/SPIRE.2001.10018","DOIUrl":"https://doi.org/10.1109/SPIRE.2001.10018","url":null,"PeriodicalId":107511,"journal":{"name":"Proceedings Eighth Symposium on String Processing and Information Retrieval","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130567924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1900-01-01 DOI: 10.1109/SPIRE.2001.989734
Ricardo Baeza-Yates, C. Castillo
In recent years, several techniques based on link analysis have been proposed and used in search engines to rank Web pages. As links are generated by humans, link-based ranking seems to give better results than traditional automatic techniques such as word-based ranking. However, no studies have been done on their real impact. In this paper we extend global page ranking techniques to Web site ranking, and present a first experimental analysis of link ranking with respect to the structure and dynamics of the Web.
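For readers unfamiliar with link-based ranking, the following is a minimal sketch of one well-known global technique, PageRank computed by power iteration. The abstract does not say which ranking techniques the paper uses or how they are extended to Web sites, so the function name `pagerank`, the damping value 0.85, and the toy graph are illustrative assumptions only.

```python
def pagerank(links: dict, damping: float = 0.85, iterations: int = 50) -> dict:
    """Approximate PageRank scores for a graph given as {page: [linked pages]}."""
    # Collect every page that appears as a source or as a link target.
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v, targets in links.items():
            for t in targets:
                # Each page splits its current rank equally among its out-links.
                new[t] += damping * rank[v] / len(targets)
        rank = new
    return rank


# Hypothetical four-page graph: page "c" receives the most links.
toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
toy_rank = pagerank(toy_links)
```

In this toy graph, page `c` ends up with the highest score because three of the four pages link to it.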
{"title":"Relating web characteristics with link based web page ranking","authors":"Ricardo Baeza-Yates, C. Castillo","doi":"10.1109/SPIRE.2001.989734","DOIUrl":"https://doi.org/10.1109/SPIRE.2001.989734","url":null,"PeriodicalId":107511,"journal":{"name":"Proceedings Eighth Symposium on String Processing and Information Retrieval","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133003392","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1900-01-01 DOI: 10.1109/SPIRE.2001.989743
Shunsuke Inenaga, H. Hoshino, A. Shinohara, M. Takeda, S. Arikawa
The Compact Directed Acyclic Word Graph (CDAWG) is a space-efficient data structure that supports indices of a string. The Symmetric Compact Directed Acyclic Word Graph (SCDAWG) for a string w is a dual structure that supports indices of both w and the reverse of w simultaneously. Blumer et al. gave the first algorithm to construct an SCDAWG from a given string, which works in an off-line manner. In this paper, we show an on-line algorithm that constructs an SCDAWG from a given string directly.
{"title":"On-line construction of symmetric compact directed acyclic word graphs","authors":"Shunsuke Inenaga, H. Hoshino, A. Shinohara, M. Takeda, S. Arikawa","doi":"10.1109/SPIRE.2001.989743","DOIUrl":"https://doi.org/10.1109/SPIRE.2001.989743","url":null,"PeriodicalId":107511,"journal":{"name":"Proceedings Eighth Symposium on String Processing and Information Retrieval","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127826724","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1900-01-01 DOI: 10.1109/SPIRE.2001.989732
A. Apostolico
In a passage by J.L. Borges on the "exactitude of Science," a fictitious author describes an Empire in which the art of Cartography "logró tal perfección que el mapa de una sola Provincia ocupaba toda la Ciudad, y el mapa del Imperio toda una Provincia." With time, these huge maps were no longer enough, and the Colleges of the Cartographers erected a map of the Empire that equalled in width the Empire itself... This paper concerns itself with the increasing number of cases in pattern discovery and data mining in which synopses, indices and relationships thereof seem to grow faster and bigger than the phenomena they were meant to encapsulate. The paper then reviews specific examples of algorithmic and combinatorial constructs that proved capable of alleviating such paradoxes in the author's recent work experience.
{"title":"Of maps bigger than the empire","authors":"A. Apostolico","doi":"10.1109/SPIRE.2001.989732","DOIUrl":"https://doi.org/10.1109/SPIRE.2001.989732","url":null,"PeriodicalId":107511,"journal":{"name":"Proceedings Eighth Symposium on String Processing and Information Retrieval","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114398253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1900-01-01 DOI: 10.1109/SPIRE.2001.989759
J. Tarhio
We consider methods for compressing parse trees, especially techniques based on statistical modeling. We regard a sequence of productions corresponding to a suffix of the path from the root of a tree to a node x as the context of the node x. The contexts are augmented with branching information of the nodes. By applying the text compression algorithm PPM on such contexts we achieve good compression results. We compare the PPM approach experimentally with other methods.
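As a rough illustration of path-based contexts, the sketch below walks a parse tree and pairs each node's production with the productions applied at its nearest ancestors. It rests on several assumptions not stated in the abstract: the tree is encoded as nested `(production, children)` tuples, the context is limited to the last `order` productions on the root-to-node path, and the branching information is left out. The name `contexts` and the toy grammar are hypothetical.

```python
def contexts(tree, order=2, path=()):
    """Yield (context, production) pairs in preorder.

    tree is a (production, children) pair. The context is the tuple of up to
    `order` productions applied at the nearest ancestors of the node.
    """
    production, children = tree
    # path[-0:] would return the whole path, so order 0 is handled separately.
    context = path[-order:] if order > 0 else ()
    yield context, production
    for child in children:
        yield from contexts(child, order, path + (production,))


# Hypothetical parse tree for "id + id" in a tiny expression grammar.
toy_tree = ("E->E+T", [("E->T", [("T->id", [])]), ("T->id", [])])
toy_pairs = list(contexts(toy_tree, order=1))
```

A PPM-style model would then estimate the probability of each production given its context. That modeling step is not shown here.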
{"title":"On compression of parse trees","authors":"J. Tarhio","doi":"10.1109/SPIRE.2001.989759","DOIUrl":"https://doi.org/10.1109/SPIRE.2001.989759","url":null,"PeriodicalId":107511,"journal":{"name":"Proceedings Eighth Symposium on String Processing and Information Retrieval","volume":"127 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123071128","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1900-01-01 DOI: 10.1109/SPIRE.2001.10027
Y. Zieman, R. Salas
An experimentally proven methodology for computing semantic labels for natural language, and its use in semantic processing of text, is described. A combinatorial model of the conceptual space is created in which semantic labels result as combinations of primary or atomic concepts called Semantic Factors. A set of about 2,500 Semantic Factors is defined. The basic semantic element of a language is a morpheme-type element (s-morpheme), the minimal part of a language that bears its own meaning. All s-morphemes in the Knowledge Base (about 15,000 for English) are labeled. The label for a phrase (its "Concept Code") results as a combination of the labels for the s-morphemes constituting it. Algorithms are designed to identify the s-morphemes in a phrase and to generate the phrase's Concept Code. The matching procedure compares Concept Codes and identifies conceptually close ones: those sharing a maximal number of Semantic Factors. Similarity is identified here as a match between the Concept Codes of two Text objects. Since a Concept Code is essentially language independent, this technology is appropriate for implementation in cross-language applications. An example is described of an application in the bio-medical domain, where documents of a database of more than 12 million titles are being successfully retrieved in about 50% of the queries normally rejected by traditional search methods.
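The matching step, in which conceptually close codes are those sharing a maximal number of Semantic Factors, can be pictured as set overlap. The sketch below assumes a Concept Code can be modeled as a plain set of factor names. The function `closest_codes`, the factor names, and the sample phrases are invented for illustration and are not taken from the paper's Knowledge Base.

```python
def closest_codes(query: frozenset, candidates: dict) -> list:
    """Return the names of candidates sharing the most factors with the query."""
    # Count shared factors between the query code and each candidate code.
    overlaps = {name: len(query & code) for name, code in candidates.items()}
    best = max(overlaps.values(), default=0)
    if best == 0:
        return []  # nothing shares any factor with the query
    return sorted(name for name, shared in overlaps.items() if shared == best)


# Hypothetical Concept Codes, each modeled as a set of Semantic Factors.
toy_codes = {
    "carditis": frozenset({"HEART", "INFLAMMATION"}),
    "hepatitis": frozenset({"LIVER", "INFLAMMATION"}),
    "fracture": frozenset({"BONE", "BREAK"}),
}
toy_match = closest_codes(frozenset({"HEART", "INFLAMMATION"}), toy_codes)
```

Because the comparison works on factors rather than surface words, the same lookup would apply to a query phrased in another language, provided its s-morphemes map to the same factors.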
{"title":"Semantic labeling - unveiling the main components of meaning of free-text","authors":"Y. Zieman, R. Salas","doi":"10.1109/SPIRE.2001.10027","DOIUrl":"https://doi.org/10.1109/SPIRE.2001.10027","url":null,"PeriodicalId":107511,"journal":{"name":"Proceedings Eighth Symposium on String Processing and Information Retrieval","volume":"190 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132747928","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1900-01-01 DOI: 10.1109/SPIRE.2001.989751
V. Makinen
Edit distance is a powerful measure of similarity in string matching, measuring the minimum number of insertions, deletions, and substitutions needed to convert one string into another. This measure is often contrasted with time warping in speech processing, which measures how close two trajectories are by allowing compression and expansion operations on the time scale. Time warping can be easily generalized to measure the similarity between 1D point-patterns (ascending lists of real values), as the difference between the ith and (i−1)th points in a point-pattern can be considered as the value of a trajectory at time i. However, we show that edit distance is a more natural choice, and derive a measure by calculating the minimum amount of space that must be inserted and deleted between points to convert one point-pattern into another. We show that this measure defines a metric. We also define a substitution operation such that the distance calculation automatically separates the points into matching and mismatching points. The algorithms are based on dynamic programming. The main motivation for these methods is two- and higher-dimensional point-pattern matching, and therefore we generalize these methods to the 2D case, and show that this generalization leads to an NP-complete problem. There are also applications for the 1D case; we briefly discuss the matching of tree ring sequences in dendrochronology.
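As background for the string case mentioned at the start of the abstract, here is a minimal sketch of the classic unit-cost edit distance computed by dynamic programming. It is the textbook string algorithm, not the point-pattern measure derived in the paper. The function name `edit_distance` is an illustrative choice.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]
```

Keeping only two rows of the table gives O(|a|·|b|) time and O(|b|) space. For example, `edit_distance("kitten", "sitting")` evaluates to 3.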
{"title":"Using edit distance in point-pattern matching","authors":"V. Makinen","doi":"10.1109/SPIRE.2001.989751","DOIUrl":"https://doi.org/10.1109/SPIRE.2001.989751","url":null,"PeriodicalId":107511,"journal":{"name":"Proceedings Eighth Symposium on String Processing and Information Retrieval","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127863073","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}