VNU Journal of Science: Computer Science and Communication Engineering: Latest Articles

Abbreviation Detection in Vietnamese Clinical Texts

Pub Date : 2018-12-13 DOI: 10.25073/2588-1086/VNUCSCE.211

C. Vo, T. Cao, Bao Ho

Abbreviations are widely used in clinical notes because the notes are often written under high pressure, with little time for writing and a push to simplify medical records. These abbreviations reduce the clarity and comprehensibility of the records and hamper computer-based data processing tasks. In this paper, we propose a solution to the abbreviation identification task on clinical notes in a practical setting where only a few clinical notes have been labeled while many more remain to be labeled. Our solution follows a semi-supervised learning approach that uses level-wise feature engineering to construct an abbreviation identifier from a small set of labeled clinical texts and a larger set of unlabeled clinical texts. We propose a semi-supervised learning algorithm, Semi-RF, and its advanced adaptive version, Weighted Semi-RF, in the self-training framework using random forest models and Tri-training. Weighted Semi-RF differs from Semi-RF in that it is equipped with a new weighting scheme that adapts to the current labeled data set. The proposed algorithms are practical, with parameter-free settings, for building an effective identifier that detects abbreviations automatically in clinical texts. Their effectiveness is confirmed by better Precision and F-measure values in various experiments on real Vietnamese clinical notes. Compared with existing solutions, our solution is novel for automatic abbreviation identification in clinical notes. Its results can lay the basis for determining the full form of each correctly identified abbreviation and thereby enhance the readability of the records.

Keywords: Electronic medical record, Clinical note, Abbreviation identification, Semi-supervised learning, Self-training, Random forest.
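
The self-training idea named in the abstract can be illustrated with a minimal sketch. It is not the authors' Semi-RF or Weighted Semi-RF: a nearest-centroid classifier stands in for the random forest, Tri-training and the weighting scheme are omitted, and the confidence `threshold`, the `max_rounds` cap, and the two toy features (token length, uppercase ratio) are assumptions made only for this example.

```python
def euclid(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def centroids(X, y):
    """Mean feature vector per class (stand-in for fitting a random forest)."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}

def predict_with_confidence(model, x):
    """Return (label, confidence); confidence is a distance-based share in [0, 1]."""
    d = {c: euclid(x, mu) for c, mu in model.items()}
    label = min(d, key=d.get)
    total = sum(d.values())
    conf = 1.0 if total == 0 else 1.0 - d[label] / total
    return label, conf

def self_train(X_lab, y_lab, X_unlab, threshold=0.8, max_rounds=10):
    """Repeatedly pseudo-label confident unlabeled samples and refit."""
    X, y = list(X_lab), list(y_lab)
    pool = list(X_unlab)
    for _ in range(max_rounds):
        model = centroids(X, y)
        remaining, added = [], 0
        for x in pool:
            label, conf = predict_with_confidence(model, x)
            if conf >= threshold:          # confident: move into the labeled set
                X.append(x)
                y.append(label)
                added += 1
            else:                          # uncertain: keep for a later round
                remaining.append(x)
        pool = remaining
        if added == 0:                     # nothing new was learned, stop
            break
    return centroids(X, y), pool

# Toy features (hypothetical): (token length, uppercase ratio)
X_lab = [(2.0, 1.0), (3.0, 0.9), (7.0, 0.0), (9.0, 0.1)]
y_lab = ["abbr", "abbr", "word", "word"]
X_unlab = [(2.5, 1.0), (8.0, 0.0), (5.0, 0.5)]
model, still_unlabeled = self_train(X_lab, y_lab, X_unlab)
```

In this toy run the two clear-cut samples are absorbed into the labeled set, while the ambiguous sample `(5.0, 0.5)` stays unlabeled because its confidence never reaches the threshold.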

Citations: 3

Score-based Fusion Schemes for Plant Identification from Multi-organ Images

Pub Date : 2018-12-13 DOI: 10.25073/2588-1086/VNUCSCE.201

Nguyen Thanh Nhan, Do Thanh Binh, Nguyen Hoang, Vu Hai, Tran Thi Thanh Hai, L. Lan

This paper describes several fusion techniques for achieving high-accuracy species identification from images of different plant organs. Given a set of images of different organs such as branch, entire, flower, or leaf, we first extract confidence scores for each single organ using a deep convolutional neural network. Then, various late fusion approaches, including conventional transformation-based approaches (sum rule, max rule, product rule), a classification-based approach (support vector machine), and our proposed hybrid fusion model, are deployed to determine the identity of the plant of interest. For single-organ identification, two schemes are proposed. The first scheme uses one convolutional neural network (CNN) for each organ, while the second trains one CNN for all organs. Two well-known CNNs (AlexNet and ResNet) are used in this paper. We evaluate the proposed method on a large number of images of 50 species collected from two primary sources: the PlantCLEF 2015 dataset and Internet resources. The experiments show that the fusion techniques clearly outperform identification from individual organs. At rank-1, the highest single-organ species identification accuracy is 75.6% for flower images, whereas fusing leaf and flower raises the accuracy to 92.6%. We also compare the fusion strategies with multi-column deep convolutional neural networks (MCDCNN) [1]. The proposed hybrid fusion scheme outperforms MCDCNN in all combinations, with rank-1 improvements ranging from +3.0% to +13.8%. The evaluation datasets as well as the source code are publicly available.

Keywords: Plant identification, Convolutional neural network, Deep learning, Fusion.
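
The three transformation-based rules mentioned in the abstract (sum, max, product) are simple enough to sketch directly. The snippet below is only an illustration: the function names, the species labels, and the score values are made up for the example, and the SVM-based and hybrid fusion models are not covered.

```python
from math import prod

def fuse(scores_per_organ, rule="sum"):
    """Combine per-organ confidence vectors (one score per species) into one vector."""
    rules = {"sum": sum, "max": max, "product": prod}
    combine = rules[rule]
    # zip(*...) groups the scores that belong to the same species across organs
    return [combine(column) for column in zip(*scores_per_organ)]

def rank1(scores, species):
    """Species with the highest score."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return species[best]

# Hypothetical CNN confidence scores for three species
species = ["species_A", "species_B", "species_C"]
leaf_scores = [0.5, 0.4, 0.1]     # leaf image alone favours species_A
flower_scores = [0.2, 0.7, 0.1]   # flower image alone favours species_B

for rule in ("sum", "max", "product"):
    fused = fuse([leaf_scores, flower_scores], rule)
    print(rule, rank1(fused, species))
```

With these made-up scores, the leaf image alone would pick `species_A`, while all three fusion rules pick `species_B`, which shows how a confident second organ can overturn a weak single-organ decision.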

Citations: 3

Impacts of Licensed Interference and Inaccurate Channel Information on Information Security in Spectrum Sharing Environment

Pub Date : 2018-09-27 DOI: 10.25073/2588-1086/VNUCSCE.199

Dominika Thiem, H. Khuong

A spectrum sharing environment creates cross-interference between the licensed network and the unlicensed network. Most existing works consider unlicensed interference (i.e., interference from the unlicensed network to the licensed network) while ignoring licensed interference (i.e., interference from the licensed network to the unlicensed network). Moreover, existing channel estimation algorithms cannot estimate channel information exactly. In this paper, the impacts of licensed interference and inaccurate channel information on information security in the spectrum sharing environment are analyzed under a peak transmit power bound, a peak interference power bound, and Rayleigh fading. To this end, a secrecy outage probability formula is proposed in exact form and validated by simulations. Various results illustrate that the secrecy outage probability is constant over a range of large peak interference powers or large peak transmit powers, and is severely affected by licensed interference and inaccurate channel information.
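
A secrecy outage probability of this kind is usually checked by Monte Carlo simulation, and a rough sketch of such a simulation is given below. Everything in it is an assumption made for illustration rather than the paper's system model: unit-mean channel gains, unit noise power, a single licensed transmitter with power `p_licensed` that interferes with both the destination and the eavesdropper, and perfect channel knowledge (the effect of inaccurate channel information is not modelled here).

```python
import math
import random

def secrecy_outage_probability(p_peak, i_peak, p_licensed, rate,
                               trials=20000, seed=0):
    """Estimate P(secrecy capacity < rate) by Monte Carlo simulation."""
    rng = random.Random(seed)
    outages = 0
    for _ in range(trials):
        # Rayleigh fading: channel power gains are exponentially distributed.
        g_sd, g_se, g_sp, g_pd, g_pe = (rng.expovariate(1.0) for _ in range(5))
        # Unlicensed transmit power obeys both the peak transmit power bound
        # and the peak interference power bound at the licensed receiver.
        p_tx = p_peak if g_sp == 0 else min(p_peak, i_peak / g_sp)
        # Licensed interference adds to the (unit) noise at both receivers.
        sinr_dest = p_tx * g_sd / (1.0 + p_licensed * g_pd)
        sinr_eave = p_tx * g_se / (1.0 + p_licensed * g_pe)
        capacity = max(0.0, math.log2(1 + sinr_dest) - math.log2(1 + sinr_eave))
        if capacity < rate:
            outages += 1
    return outages / trials

# Hypothetical operating points (linear scale, not dB)
sop_a = secrecy_outage_probability(p_peak=10, i_peak=1e9, p_licensed=1, rate=0.5)
sop_b = secrecy_outage_probability(p_peak=10, i_peak=1e12, p_licensed=1, rate=0.5)
```

Under these assumptions, once the peak interference power is large enough the transmit power is capped by `p_peak` alone, so `sop_a` and `sop_b` coincide. This mirrors the saturation behaviour the abstract reports for large peak interference powers.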

Citations: 0

A Low Area, Low Power 8-bit AES-CCM Authenticated Encryption Core in 180nm CMOS Process

Pub Date : 2018-09-27 DOI: 10.25073/2588-1086/vnucsce.202

Dao Van Lan, Nguyen Anh Thai, Hoang Van Phuc

This paper presents a low-area, low-power AES-CCM authenticated encryption IP core with silicon demonstration in a 180 nm standard CMOS process. The proposed AES-CCM core combines a low-area 8-bit single-S-box AES encryption core, an improved iterative structure, and other optimized circuits. The implementation results show that the proposed core achieves very high resource efficiency, with an area of 6.5 kgates (gate equivalents, GE) and a low power consumption of 11.6 µW/MHz, while meeting the operating-speed requirements of many applications, including IEEE 802.15.6 WBANs. The detailed implementation and optimization results are also presented and discussed.

Citations: 1

An Authorisation Policy Management Model in Federations

Pub Date : 2018-09-27 DOI: 10.25073/2588-1086/VNUCSCE.174

Vu Ngoc Cham, N. Anh

A federation is usually an alliance of organisations in which users from one organisation are trusted to access resources in another organisation. The membership of federations is diverse and continually changing, so federations require distributed and dynamic security policy management. We propose an authorisation policy management model, FABACD, which simplifies the management of collaborations between organisations. It allows distributed and trusted administrators to adjust the authorisation policies in a resource-holding organisation, whilst ensuring that the latter remains in ultimate control. The net result is that a resource's authorisation system can use user credentials built from pre-existing attributes issued by any participating organisation to determine a user's access rights to the various resources, without requiring credentials to be issued that are based on federation-specific attributes. The model significantly simplifies the authorisation management process for the resource-holding organisation.
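
The two properties stressed in the abstract, delegated administration and ultimate control by the resource holder, can be pictured with a small attribute-based check. This is a generic sketch and not the FABACD model itself: the class names, the rule format, and the organisation and administrator identifiers are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Credential:
    issuer: str      # organisation that issued the attribute
    attribute: str   # e.g. "role"
    value: str       # e.g. "researcher"

@dataclass
class ResourcePolicy:
    trusted_admins: set                        # admins the resource holder currently trusts
    rules: list = field(default_factory=list)  # (admin, issuer, attribute, value)

    def add_rule(self, admin, issuer, attribute, value):
        """A delegated administrator adds a rule; untrusted admins are rejected."""
        if admin not in self.trusted_admins:
            raise PermissionError(f"{admin} is not trusted by the resource holder")
        self.rules.append((admin, issuer, attribute, value))

    def revoke_admin(self, admin):
        """The resource holder stays in control: revoking an admin disables their rules."""
        self.trusted_admins.discard(admin)

    def permits(self, credentials):
        """Grant access if any rule from a still-trusted admin matches a held credential."""
        held = {(c.issuer, c.attribute, c.value) for c in credentials}
        return any(
            admin in self.trusted_admins and (issuer, attribute, value) in held
            for admin, issuer, attribute, value in self.rules
        )

# Hypothetical federation: orgA issues attributes, orgB holds the resource
policy = ResourcePolicy(trusted_admins={"admin@orgB-partner"})
policy.add_rule("admin@orgB-partner", "orgA", "role", "researcher")
alice = [Credential("orgA", "role", "researcher")]
print(policy.permits(alice))   # access decided from a pre-existing orgA attribute
```

The access decision uses an attribute the user's home organisation already issues, so no federation-specific credential is needed, and calling `revoke_admin` immediately withdraws every rule that administrator contributed.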

Citations: 0

Efficient and Low Complexity Surveillance Video Compression using Distributed Scalable Video Coding

Pub Date : 2018-09-27 DOI: 10.25073/2588-1086/VNUCSCE.198

Le Dao Thi Hue, Luong Pham Van, D. Trieu, Xiem HoangVan

Video surveillance has played an important role in public safety and privacy protection in recent years thanks to its capability for activity monitoring and content analysis. However, the volume of data produced by long hours of surveillance video is huge, which makes it less attractive for practical applications. In this paper, we propose a low-complexity yet efficient scalable video coding solution for video surveillance systems. The proposed surveillance video compression scheme provides quality scalability through a layered coding structure consisting of one or several enhancement layers on top of a base layer. In addition, to maintain backward compatibility with current video coding standards, the state-of-the-art standard, High Efficiency Video Coding (HEVC), is used to compress the base layer. To satisfy the low-complexity requirement of encoders in video surveillance systems, the distributed coding concept is employed at the enhancement layers. Experiments on a rich set of surveillance video data show that the proposed surveillance distributed scalable video coding (S-DSVC) solution significantly outperforms relevant video coding benchmarks, notably the SHVC standard and HEVC simulcasting, while requiring much lower computational complexity at the encoder, which is essential for practical video surveillance applications.

Citations: 1

An Adaptive and Wide-Range Output DC-DC Converter for Loading Circuit of Li-Ion Battery Charger

Pub Date : 2018-07-04 DOI: 10.25073/2588-1086/vnucsce.205

N. V. Hao, N. D. Minh, Pham Nguyen Thanh Loan

In this paper, an adaptive, wide-range output DC-DC converter designed for a lithium-ion (Li-Ion) battery charger circuit is proposed. The converter operates in continuous conduction mode (CCM) to provide an output voltage that tracks the battery voltage and a wide-range output current, ensuring that circuit requirements are met. The circuit is designed in Cadence using 0.35 µm BCD technology. Simulation results show that the circuit operates fully in CCM with a load current from 50 mA to 1000 mA and that the output voltage ripple factor is less than 1%. Furthermore, the current supplied to the load circuit responds to three types of Li-Ion charging currents. The output voltage of the converter varies from 2.8 V to 4.5 V, corresponding to a battery voltage range of 2.5 V to 4.2 V during charging. The average power efficiency of the converter in large-load-current mode (1000 mA) reaches 94%.
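
The figures quoted in the abstract can be turned into a quick back-of-the-envelope calculation. The sketch below rests on two assumptions that the abstract does not state: that the output voltage follows the battery voltage with a constant 0.3 V offset (inferred only from the two endpoints, 2.5 V to 2.8 V and 4.2 V to 4.5 V), and that the 94% average efficiency applies at every operating point at 1000 mA.

```python
def converter_operating_point(v_bat, i_load, efficiency=0.94, offset=0.3):
    """Estimate output voltage and power figures for a given battery voltage.

    offset and efficiency are illustrative assumptions, see the text above.
    """
    v_out = v_bat + offset          # assumed linear tracking of the battery voltage
    p_out = v_out * i_load          # power delivered to the load, in watts
    p_in = p_out / efficiency       # power drawn from the input
    return {"v_out": v_out, "p_out": p_out, "p_in": p_in, "loss": p_in - p_out}

# Fully charged battery (4.2 V) at the large load current (1000 mA = 1.0 A)
point = converter_operating_point(v_bat=4.2, i_load=1.0)
print(point)
```

Under these assumptions the converter delivers about 4.5 W at the top of the charging range and dissipates roughly 0.29 W, and a 1% ripple factor at 4.5 V corresponds to about 45 mV.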

Citations: 0

A new memetic algorithm for multiple graph alignment

Pub Date : 2018-06-10 DOI: 10.25073/2588-1086/VNUCSCE.194

Trần Ngọc Hà, Le Nhu Hien, H. X. Huan

One of the main tasks of structural biology is comparing the structures of proteins. Comparisons of protein structure can reveal functional similarities. Multiple graph alignment is a useful tool for identifying functional similarities based on structural analysis. This article proposes a new algorithm for aligning protein binding sites called ACOTS-MGA. The algorithm is based on the memetic scheme: it uses the ACO method to construct a set of solutions, then selects the best solution and applies Tabu Search to it to improve solution quality. Experimental results show that ACOTS-MGA outperforms state-of-the-art algorithms while producing alignments of better quality.

Keywords: Multiple Graph Alignment, Tabu Search, Ant Colony Optimization, local search, memetic algorithm, SMMAS pheromone update rule, protein active sites
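
The memetic scheme described in the abstract (construct a set of solutions, pick the best, refine it with Tabu Search) can be sketched on a toy problem. This is not ACOTS-MGA: the construction phase uses random permutations as a plain stand-in for ACO, the problem is a pairwise alignment of two small unlabeled graphs rather than a multiple alignment of protein binding sites, and the parameters `pop_size`, `iters`, and `tenure` are arbitrary choices for the example.

```python
import random

def score(perm, edges_a, edges_b):
    """Number of edges of graph A that land on an edge of graph B under perm."""
    return sum(1 for u, v in edges_a if frozenset((perm[u], perm[v])) in edges_b)

def tabu_search(perm, edges_a, edges_b, iters=50, tenure=5):
    """Swap-based local search; a used swap is tabu for `tenure` iterations."""
    current = list(perm)
    best, best_score = list(current), score(current, edges_a, edges_b)
    tabu = {}  # swap (i, j) -> last iteration at which it is still forbidden
    n = len(current)
    for it in range(iters):
        candidates = []
        for i in range(n):
            for j in range(i + 1, n):
                if tabu.get((i, j), -1) >= it:
                    continue
                current[i], current[j] = current[j], current[i]   # try the swap
                candidates.append((score(current, edges_a, edges_b), i, j))
                current[i], current[j] = current[j], current[i]   # undo it
        if not candidates:
            continue
        s, i, j = max(candidates)       # best non-tabu move, even if it is worse
        current[i], current[j] = current[j], current[i]
        tabu[(i, j)] = it + tenure
        if s > best_score:
            best, best_score = list(current), s
    return best, best_score

def memetic_align(edges_a, edges_b, n, pop_size=10, seed=0):
    rng = random.Random(seed)
    # Construction phase: random permutations stand in for ACO-built solutions.
    population = [rng.sample(range(n), n) for _ in range(pop_size)]
    start = max(population, key=lambda p: score(p, edges_a, edges_b))
    # Improvement phase: refine only the best constructed solution.
    return tabu_search(start, edges_a, edges_b)

# Toy graphs: A is the path 0-1-2-3, B is the same path with relabelled nodes
edges_a = [(0, 1), (1, 2), (2, 3)]
edges_b = {frozenset(e) for e in [(2, 0), (0, 3), (3, 1)]}
alignment, matched = memetic_align(edges_a, edges_b, n=4)
```

The split of work is the point of the sketch: a cheap, diverse construction step supplies a good starting point, and the local search spends its effort on that single candidate rather than on the whole population.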

Citations: 0

Automatic building of a large and complete dataset for image-based table structure recognition

Pub Date : 1900-01-01 DOI: 10.25073/2588-1086/vnucsce.293

Tran Quang Vinh, Nguyen Thi Ngoc Diep

Tables are one of the most common ways to represent structured data in documents. Existing research on image-based table structure recognition often relies on limited datasets, the largest being the ICDAR 19 Track B dataset with 3,789 human-labeled tables. The recent TableBank dataset for table structures contains 145K tables; however, the tables are labeled in an HTML tag sequence format, which impedes the development of image-based recognition methods. In this paper, we propose several processing methods that automatically convert an HTML tag sequence annotation into bounding box annotations for the table cells in a table image. By ensembling these methods, we could convert 42,028 tables with high correctness, which is 11 times larger than the largest existing dataset (ICDAR 19). We then demonstrate that with these bounding box annotations, a straightforward representation of objects in images, we can achieve much higher F1-scores for table structure recognition at many high IoU thresholds using only off-the-shelf deep learning models: an F1-score of 0.66 compared with the state-of-the-art 0.44 on the ICDAR 19 dataset. A further experiment shows that using explicit bounding box annotation for image-based table structure recognition yields higher accuracy (70.6%) than implicit text sequence annotation (only 33.8%). The experiments show the effectiveness of our dataset, the largest to date, and open up opportunities to generalize to real-world applications. Our dataset and experimental models are publicly available at shorturl.at/hwHY3
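
Since the results are reported as F1-scores at IoU thresholds, a short sketch of how such a score can be computed for cell bounding boxes may help. It is a simplified box-level version with greedy one-to-one matching, assumed here for illustration, and not the official ICDAR 19 evaluation protocol. The box coordinates are made up.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def f1_at_iou(predicted, truth, threshold):
    """F1-score where a prediction counts only if its IoU reaches the threshold."""
    unmatched = list(truth)
    true_positives = 0
    for p in predicted:
        best = max(unmatched, key=lambda t: iou(p, t), default=None)
        if best is not None and iou(p, best) >= threshold:
            unmatched.remove(best)      # each ground-truth cell is matched once
            true_positives += 1
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)

# Two ground-truth cells, three predicted cells (one slightly off, one spurious)
truth = [(0, 0, 10, 10), (10, 0, 20, 10)]
predicted = [(0, 0, 10, 10), (11, 0, 20, 10), (30, 30, 40, 40)]
print(f1_at_iou(predicted, truth, 0.8))
print(f1_at_iou(predicted, truth, 0.95))
```

In this example the slightly misplaced cell has an IoU of 0.9, so it counts as correct at a threshold of 0.8 but not at 0.95, and the F1-score drops accordingly. This is why scores at high IoU thresholds reward precise cell localisation.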

Citations: 0
