Nan Yin, Junheng Liang, Xi Guo, Xue Jiang, Jie He, Xiaotong Zhang
Semi-automatic construction of heterogeneous data schema based on structure and context-aware recommendation
Scientific Data, vol. 12, no. 1, article 190. Published 2025-02-01.
DOI: 10.1038/s41597-024-04196-x (https://doi.org/10.1038/s41597-024-04196-x)
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11787372/pdf/
Impact factor: 5.8. JCR: Q1 (Multidisciplinary Sciences).
Citation count: 0
Abstract
Customizing the structure and format of scientific data facilitates the publication of diverse and heterogeneous data. Many data publishing platforms empower users to create self-designed schemas, leading to schema proliferation and more intricate creation processes. To address these challenges, we present a semi-automatic method and system for constructing heterogeneous material data schemas based on structure and context-aware recommendation. We propose a schema fragment tree structure to represent data schemas with hierarchical relationships, transforming the recommendation into subtree matching. Fragment index and semantic search techniques are introduced to identify candidate fragments, and a tree editing distance algorithm calculates similarity scores. Evaluated on the Data Schema Construction System, the algorithm outperforms the baselines (TF-IDF and BM25 for schema matching) in precision, recall, and F1-score. The baseline for reduced workload refers to the effort required to create schemas without recommendation. Our recommendation improves schema creation efficiency by 50.5% and reduces schema proliferation by 16.5%.
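To illustrate scoring candidate fragments with a tree distance, here is a minimal sketch in Python. It is not the authors' implementation: the `Node` class, the example field names, the unit edit costs, and the normalization are assumptions made for illustration, and the distance is a simplified top-down (Selkow-style) ordered tree edit distance rather than whichever variant the paper uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical schema fragment node: a field label plus nested fields."""
    label: str
    children: tuple = ()


def size(t: Node) -> int:
    """Number of nodes in the tree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def tree_distance(a: Node, b: Node) -> int:
    """Top-down ordered tree edit distance with unit costs (assumed).

    Relabeling a node costs 1; inserting or deleting a whole subtree
    costs its node count.
    """
    root_cost = 0 if a.label == b.label else 1
    m, n = len(a.children), len(b.children)
    # dp[i][j]: cost of editing the first i children of a into the first j of b
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            ca, cb = a.children[i - 1], b.children[j - 1]
            dp[i][j] = min(
                dp[i - 1][j] + size(ca),                 # delete subtree ca
                dp[i][j - 1] + size(cb),                 # insert subtree cb
                dp[i - 1][j - 1] + tree_distance(ca, cb),  # edit ca into cb
            )
    return root_cost + dp[m][n]


def similarity(a: Node, b: Node) -> float:
    """Distance mapped to a score where 1.0 means identical (assumed normalization)."""
    return 1.0 - tree_distance(a, b) / (size(a) + size(b))


def recommend(query: Node, candidates: list) -> list:
    """Rank candidate fragments by similarity to the query, best first."""
    return sorted(candidates, key=lambda c: similarity(query, c), reverse=True)


# Hypothetical fragments: a partially built schema and two stored candidates.
query = Node("sample", (Node("composition"), Node("hardness")))
cand_close = Node("sample", (Node("composition"), Node("hardness"), Node("density")))
cand_far = Node("experiment", (Node("temperature"),))

ranked = recommend(query, [cand_far, cand_close])
```

In this toy setup the candidate that extends the query with one extra field ranks above the unrelated one. A real system would first narrow the candidates (the abstract mentions a fragment index and semantic search) before computing distances.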
About the journal:
Scientific Data is an open-access journal focused on data, publishing descriptions of research datasets and articles on data sharing across natural sciences, medicine, engineering, and social sciences. Its goal is to enhance the sharing and reuse of scientific data, encourage broader data sharing, and acknowledge those who share their data.
The journal primarily publishes Data Descriptors, which offer detailed descriptions of research datasets, including data collection methods and technical analyses validating data quality. These descriptors aim to facilitate data reuse rather than testing hypotheses or presenting new interpretations, methods, or in-depth analyses.