Longest k-tuple Common Sub-Strings
Tiantian Li, Daming Zhu, Haitao Jiang, Haodi Feng, Xuefeng Cui
2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), published 2022-12-06
DOI: 10.1109/BIBM55620.2022.9995199 (https://doi.org/10.1109/BIBM55620.2022.9995199)
We study a new problem: finding a longest k-tuple of common sub-strings (abbreviated k-CSS) of two or more strings. We present a suffix-tree-based algorithm that finds a longest k-CSS of m strings in $O(kmn^{k})$ time and $O(kmn)$ space, where n is the total length of the m strings. The algorithm can also approximate the longest k-CSS problem to within a performance ratio of $\frac{1}{\epsilon}$ in $O(kmn^{\lceil\epsilon k\rceil})$ time for any $\epsilon\in(0,1]$. Because its space complexity is linear in n, the algorithm is well suited to comparing particularly long strings. The algorithm also shows that finding a longest gapped pattern of a non-constant number of strings is solvable in polynomial time when the number of gaps is restricted to a constant, although the problem without any restriction on the number of gaps has been proved NP-hard. Using a C++ tool built on the algorithm, we ran experiments to find longest 2-CSSs, 3-CSSs and 5-CSSs of 2 to 14 COVID-19 S-proteins. With the help of the longest 2-CSSs and 3-CSSs of the COVID-19 S-proteins, we identified the mutation sites in the S-proteins of two COVID-19 variants, Delta and Omicron. The tool is available for download at https://github.com/lytt0/k-CSS.
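
To make the problem statement concrete, here is a minimal Python sketch for the special case of two strings. It is not the suffix-tree algorithm described above, and it does not reflect the interface of the authors' C++ tool. It is a plain dynamic program, swapped in purely for illustration. It assumes a reading of "k-CSS" that the abstract does not spell out: a tuple of at most k sub-strings that occur in the same left-to-right order, without overlap, in both strings, scored by total length (equivalently, a gapped pattern with at most k-1 gaps). The function name `longest_k_css_length` is hypothetical.

```python
def longest_k_css_length(a: str, b: str, k: int) -> int:
    """Total length of a longest tuple of at most k common blocks of a and b.

    Assumed definition (not taken verbatim from the paper): the blocks are
    sub-strings that appear in the same order, without overlapping, in both
    a and b. This is a simple O(k * len(a) * len(b)) dynamic program, used
    here as a stand-in for the suffix-tree algorithm.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    n, m = len(a), len(b)

    # best_prev[i][j]: best total length achievable in a[:i], b[:j]
    # using at most t-1 blocks (all zeros for t-1 = 0).
    best_prev = [[0] * (m + 1) for _ in range(n + 1)]

    for _ in range(k):
        # cont[i][j]: best total with at most t blocks where the last block
        # ends exactly at a[i-1] and b[j-1]; 0 if those characters differ.
        cont = [[0] * (m + 1) for _ in range(n + 1)]
        # best[i][j]: best total with at most t blocks in a[:i], b[:j].
        best = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                if a[i - 1] == b[j - 1]:
                    # Either extend the current block, or start a new block
                    # on top of a solution that uses one block fewer.
                    cont[i][j] = 1 + max(cont[i - 1][j - 1],
                                         best_prev[i - 1][j - 1])
                best[i][j] = max(best[i - 1][j], best[i][j - 1], cont[i][j])
        best_prev = best

    return best_prev[n][m]
```

For the strings `abcde` and `abxde`, this sketch returns 2 for k = 1 (a single block such as `ab`) and 4 for k = 2 (the blocks `ab` and `de`). Under the assumed definition, allowing a second block recovers the match on both sides of the single differing position, which is the kind of signal the abstract describes using to locate mutation sites.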