ABSTRACT To alleviate the under-utilization of features in sentence-level relation extraction, which leads to insufficient performance of the pre-trained language model and under-use of the feature vectors, a sentence-level relation extraction method based on added prompt information and feature reuse is proposed. First, in addition to the pair of nominals and the sentence information, a piece of prompt information is added, so that the overall feature information consists of sentence information, entity-pair information, and prompt information; the features are then encoded by the pre-trained language model RoBERTa. Moreover, BiGRU is introduced into the neural network that follows the pre-trained language model to extract information, and the feature information is passed through this network to form several sets of feature vectors. These feature vectors are then reused in different combinations to form multiple outputs, and the outputs are aggregated by ensemble-learning soft voting to perform relation classification. In addition, the sum of the cross-entropy, KL-divergence, and negative log-likelihood losses is used as the final loss function. In comparison experiments, the model based on added prompt information and feature reuse achieved higher results on the SemEval-2010 Task 8 relation dataset.
{"title":"Relation Extraction Based on Prompt Information and Feature Reuse","authors":"Ping Feng, Xin Zhang, Jian Zhao, Yingying Wang, Biao Huang","doi":"10.1162/dint_a_00192","DOIUrl":"https://doi.org/10.1162/dint_a_00192","url":null,"abstract":"ABSTRACT To alleviate the problem of under-utilization features of sentence-level relation extraction, which leads to insufficient performance of the pre-trained language model and underutilization of the feature vector, a sentence-level relation extraction method based on adding prompt information and feature reuse is proposed. At first, in addition to the pair of nominals and sentence information, a piece of prompt information is added, and the overall feature information consists of sentence information, entity pair information, and prompt information, and then the features are encoded by the pre-trained language model ROBERTA. Moreover, in the pre-trained language model, BIGRU is also introduced in the composition of the neural network to extract information, and the feature information is passed through the neural network to form several sets of feature vectors. After that, these feature vectors are reused in different combinations to form multiple outputs, and the outputs are aggregated using ensemble-learning soft voting to perform relation classification. In addition to this, the sum of cross-entropy, KL divergence, and negative log-likelihood loss is used as the final loss function in this paper. In the comparison experiments, the model based on adding prompt information and feature reuse achieved higher results of the SemEval-2010 task 8 relational dataset.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":"5 1","pages":"824-840"},"PeriodicalIF":3.9,"publicationDate":"2023-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47902217","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ABSTRACT Transparency is vital to realizing the promise of evidence-based policymaking, where "evidence-based" means including information as to what data mean and why they should be trusted. Transparency, in turn, requires that enough of this information is provided. Loosely speaking, then, transparency is achieved when sufficient documentation is provided. Sufficiency is situation-specific, both for the provider and the consumer of the documentation. These ideas are presented in two recent US commissioned reports: The Promise of Evidence-Based Policymaking, and Transparency in Statistical Information for the National Center for Science and Engineering Statistics and All Federal Statistical Agencies. Metadata are a more formalized kind of documentation, and in this paper we provide and demonstrate necessary, sufficient, and general conditions for achieving transparency from the metadata perspective: conforming to a specification, providing quality metadata, and creating a usable interface to the metadata. These conditions are important for any metadata system, but here the specification is tied to our framework for metadata quality based on the situation-specific needs for transparency. These ideas are described, and their interrelationships are explored.
{"title":"Achieving Transparency: A Metadata Perspective","authors":"Daniel W. Gillman","doi":"10.1162/dint_a_00188","DOIUrl":"https://doi.org/10.1162/dint_a_00188","url":null,"abstract":"ABSTRACT Transparency is vital to realizing the promise of evidenced-based policymaking, where “evidence-based” means including information as to what data mean and why they should be trusted. Transparency, in turn, requires that enough of this information is provided. Loosely speaking then, transparency is achieved when sufficient documentation is provided. Sufficiency is situation specific, both for the provider and consumer of the documentation. These ideas are presented in two recent US commissioned reports: The Promise of Evidence-Based Policymaking and Transparency in Statistical Information for the National Center for Science and Engineering Statistics and All Federal Statistical Agencies. Metadata are a more formalized kind of documentation, and in this paper, we provide and demonstrate necessary, sufficient, and general conditions for achieving transparency from the metadata perspective: conforming to a specification, providing quality metadata, and creating a usable interface to the metadata. These conditions are important for any metadata system, but here the specification is tied to our framework for metadata quality based on the situation-specific needs for transparency. These ideas are described, and their interrelationships are explored.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":"5 1","pages":"261-274"},"PeriodicalIF":3.9,"publicationDate":"2023-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41454480","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abdullahi Umar Ibrahim, Ayse Gunay Kibarer, Fadi M. Al-Turjman
Tuberculosis, caused by Mycobacterium tuberculosis, has been a major challenge for the medical and healthcare sectors in many underdeveloped countries with limited diagnostic tools. Tuberculosis can be detected from microscopic slides and chest X-rays, but given the high number of tuberculosis cases, this approach is tedious for both microbiologists and radiologists and can lead to misdiagnosis. These challenges can be addressed by employing Computer-Aided Detection (CAD) via AI-driven models, which learn features through convolution and produce outputs with high accuracy. In this paper, we describe the automated discrimination of X-ray and microscope slide images into tuberculosis and non-tuberculosis cases using pretrained AlexNet models. The study employed a chest X-ray dataset made available in the Kaggle repository and microscopic slide images from both Near East University Hospital and the Kaggle repository. For classification of tuberculosis using microscopic slide images, the model achieved 90.56% accuracy, 97.78% sensitivity, and 83.33% specificity with a 70:30 split. For classification of tuberculosis using X-ray images, the model achieved 93.89% accuracy, 96.67% sensitivity, and 91.11% specificity with a 70:30 split. Our results are in line with the notion that CNN models can be used for classifying medical images with high accuracy and precision.
{"title":"Computer-aided Detection of Tuberculosis from Microbiological and Radiographic Images","authors":"Abdullahi Umar Ibrahim, Ayse Gunay Kibarer, Fadi M. Al-Turjman","doi":"10.1162/dint_a_00198","DOIUrl":"https://doi.org/10.1162/dint_a_00198","url":null,"abstract":"\u0000 Tuberculosis caused by Mycobacterium tuberculosis have been a major challenge for medical and healthcare sectors in many underdeveloped countries with limited diagnosis tools. Tuberculosis can be detected from microscopic slides and chest X-ray but as a result of the high cases of tuberculosis, this method can be tedious for both Microbiologists and Radiologists and can lead to miss-diagnosis. These challenges can be solved by employing Computer-Aided Detection (CAD)via AI-driven models which learn features based on convolution and result in an output with high accuracy. In this paper, we described automated discrimination of X-ray and microscope slide images into tuberculosis and non-tuberculosis cases using pretrained AlexNet Models. The study employed Chest X-ray dataset made available on Kaggle repository and microscopic slide images from both Near East University Hospital and Kaggle repository. For classification of tuberculosis using microscopic slide images, the model achieved 90.56% accuracy, 97.78% sensitivity and 83.33% specificity for 70: 30 splits. For classification of tuberculosis using X-ray images, the model achieved 93.89% accuracy, 96.67% sensitivity and 91.11% specificity for 70:30 splits. Our result is in line with the notion that CNN models can be used for classifying medical images with higher accuracy and precision.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":" ","pages":""},"PeriodicalIF":3.9,"publicationDate":"2023-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43215088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ABSTRACT The urban drainage pipe network is the backbone of urban drainage, flood control, and water pollution prevention, and is also an essential indicator of the level of urban modernization. A large number of underground drainage pipe networks in older urban areas were laid long ago and have reached, or nearly reached, the end of their service life, so the repair of drainage pipe networks has attracted extensive attention from all walks of life. Since the Ministry of Ecology and Environment and the National Development and Reform Commission jointly issued the action plan for Yangtze River protection and restoration in 2019, provinces in the Yangtze River Basin such as Anhui, Jiangxi, and Hunan have extensively carried out PPP projects for urban pipeline restoration in order to improve the quality and efficiency of sewage treatment. Based on the management practice of an urban pipe network restoration project in Wuhu City, Anhui Province, this paper analyzes the lengthy construction periods and repeated operations caused by the mismatch between the design schedule of the restoration scheme and the construction schedule of the pipe network restoration under the existing project management mode, and proposes a model for selecting urban drainage pipe network restoration schemes based on an improved support vector machine. The validity and feasibility of the model are analyzed and verified using data collected in project practice. The results show that the model performs well in selecting urban drainage pipeline restoration schemes, with an accuracy of up to 90%. The research results can provide methodological guidance and technical support for rapid decision-making in urban drainage pipeline restoration projects.
{"title":"RS-SVM Machine Learning Approach Driven by Case Data for Selecting Urban Drainage Network Restoration Scheme","authors":"Li Jiang, Zheng Geng, Dong-Hwan Gu, Shuai Guo, Rongmin Huang, Haoke Cheng, Kaixuan Zhu","doi":"10.1162/dint_a_00208","DOIUrl":"https://doi.org/10.1162/dint_a_00208","url":null,"abstract":"ABSTRACT Urban drainage pipe network is the backbone of urban drainage, flood control and water pollution prevention, and is also an essential symbol to measure the level of urban modernization. A large number of underground drainage pipe networks in aged urban areas have been laid for a long time and have reached or practically reached the service age. The repair of drainage pipe networks has attracted extensive attention from all walks of life. Since the Ministry of ecological environment and the national development and Reform Commission jointly issued the action plan for the Yangtze River Protection and restoration in 2019, various provinces in the Yangtze River Basin, such as Anhui, Jiangxi and Hunan, have extensively carried out PPP projects for urban pipeline restoration, in order to improve the quality and efficiency of sewage treatment. Based on the management practice of urban pipe network restoration project in Wuhu City, Anhui Province, this paper analyzes the problems of lengthy construction period and repeated operation caused by the mismatch between the design schedule of the restoration scheme and the construction schedule of the pipe network restoration in the existing project management mode, and proposes a model of urban drainage pipe network restoration scheme selection based on the improved support vector machine. The validity and feasibility of the model are analyzed and verified by collecting the data in the project practice. The research results show that the model has a favorable effect on the selection of urban drainage pipeline restoration schemes, and its accuracy can reach 90%. The research results can provide method guidance and technical support for the rapid decision-making of urban drainage pipeline restoration projects.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":"5 1","pages":"413-437"},"PeriodicalIF":3.9,"publicationDate":"2023-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41451396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ABSTRACT In this paper, we study cross-domain relation extraction. Since new data mapped into the feature space always differs from previously seen data due to domain shift, few-shot relation extraction often performs poorly across domains. To address the problems caused by cross-domain transfer, we propose a method combining Pure Entity, Relation Labels and Adversarial training (PERLA). We first encode entities and complete sentences separately to obtain context-independent entity features. Then, we incorporate relation labels, which are informative for relation extraction, to mitigate context noise. Finally, we apply adversarial training to reduce the noise caused by the domain shift. We conducted experiments on the publicly available cross-domain relation extraction dataset FewRel 2.0 [1], and the results show that our approach improves accuracy and transfers better, adapting more readily to cross-domain tasks.
{"title":"Three Heads Better than One: Pure Entity, Relation Label and Adversarial Training for Cross-domain Few-shot Relation Extraction","authors":"Wenlong Fang, Chunping Ouyang, Qiang Lin, Yue Yuan","doi":"10.1162/dint_a_00190","DOIUrl":"https://doi.org/10.1162/dint_a_00190","url":null,"abstract":"ABSTRACT In this paper, we study cross-domain relation extraction. Since new data mapping to feature spaces always differs from the previously seen data due to a domain shift, few-shot relation extraction often perform poorly. To solve the problems caused by cross-domain, we propose a method for combining the pure entity, relation labels and adversarial (PERLA). We first use entities and complete sentences for separate encoding to obtain context-independent entity features. Then, we combine relation labels which are useful for relation extraction to mitigate context noise. We combine adversarial to reduce the noise caused by cross-domain. We conducted experiments on the publicly available cross-domain relation extraction dataset Fewrel 2.0[1]①, and the results show that our approach improves accuracy and has better transferability for better adaptation to cross-domain tasks.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":"5 1","pages":"807-823"},"PeriodicalIF":3.9,"publicationDate":"2023-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48827572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yi Gao, Jianxia Chen, Liang Xiao, Hongyang Wang, Liwei Pan, Xuan Wen, Zhiwei Ye, Xinyun Wu
ABSTRACT Recently, convolutional neural networks (CNNs) have achieved excellent performance in recommendation systems by extracting deep features and building collaborative filtering models. However, CNNs have been shown to be susceptible to adversarial examples: subtle, non-random perturbations that cause machine learning models to produce incorrect outputs. Therefore, we propose a novel model, Adversarial Neural Collaborative Filtering with Embedding Dimension Correlations (ANCF for short), to address the adversarial problem of CNN-based recommendation systems. In particular, the proposed ANCF model adopts matrix factorization to train adversarial personalized ranking in the prediction layer. Because matrix factorization assumes that a linear interaction of the latent factors captured between the user and the item can describe the observed feedback, the proposed ANCF model can learn a more complex representation of these latent factors to improve recommendation performance. In addition, the ANCF model utilizes the outer product, instead of the inner product or concatenation, to explicitly learn pairwise correlations between embedding dimensions and to obtain an interaction map from which CNNs can leverage their strengths to learn high-order correlations. As a result, the proposed ANCF model improves robustness through adversarial personalized ranking and captures more information by encoding correlations between different embedding layers. Experimental results on three public datasets demonstrate that the ANCF model outperforms existing recommendation models.
{"title":"Adversarial Neural Collaborative Filtering with Embedding Dimension Correlations","authors":"Yi Gao, Jianxia Chen, Liang Xiao, Hongyang Wang, Liwei Pan, Xuan Wen, Zhiwei Ye, Xinyun Wu","doi":"10.1162/dint_a_00151","DOIUrl":"https://doi.org/10.1162/dint_a_00151","url":null,"abstract":"ABSTRACT Recently, convolutional neural networks (CNNs) have achieved excellent performance for the recommendation system by extracting deep features and building collaborative filtering models. However, CNNs have been verified susceptible to adversarial examples. This is because adversarial samples are subtle non-random disturbances, which indicates that machine learning models produce incorrect outputs. Therefore, we propose a novel model of Adversarial Neural Collaborative Filtering with Embedding Dimension Correlations, named ANCF in short, to address the adversarial problem of CNN-based recommendation system. In particular, the proposed ANCF model adopts the matrix factorization to train the adversarial personalized ranking in the prediction layer. This is because matrix factorization supposes that the linear interaction of the latent factors, which are captured between the user and the item, can describe the observable feedback, thus the proposed ANCF model can learn more complicated representation of their latent factors to improve the performance of recommendation. In addition, the ANCF model utilizes the outer product instead of the inner product or concatenation to learn explicitly pairwise embedding dimensional correlations and obtain the interaction map from which CNNs can utilize its strengths to learn high-order correlations. As a result, the proposed ANCF model can improve the robustness performance by the adversarial personalized ranking, and obtain more information by encoding correlations between different embedding layers. Experimental results carried out on three public datasets demonstrate that the ANCF model outperforms other existing recommendation models.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":"5 1","pages":"786-806"},"PeriodicalIF":3.9,"publicationDate":"2023-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45481064","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ABSTRACT Implementations of metadata tend to favor centralized, static metadata. This depiction is at variance with the past decade's focus on big data, cloud-native architectures, and streaming platforms. Big data velocity can demand a correspondingly dynamic view of metadata. These trends, which include DevOps, CI/CD, DataOps, and data fabric, are surveyed. Several specific cloud-native tools are reviewed and weaknesses in their current metadata use are identified. Implementations are suggested which better exploit capabilities of streaming-platform paradigms, in which metadata is continuously collected in dynamic contexts. Future cloud-native software features are identified which could enable streamed metadata to power real-time data fusion or fine-tune automated reasoning through real-time ontology updates.
{"title":"Continuous Metadata in Continuous Integration, Stream Processing and Enterprise DataOps","authors":"M. Underwood","doi":"10.1162/dint_a_00193","DOIUrl":"https://doi.org/10.1162/dint_a_00193","url":null,"abstract":"ABSTRACT Implementations of metadata tend to favor centralized, static metadata. This depiction is at variance with the past decade of focus on big data, cloud native architectures and streaming platforms. Big data velocity can demand a correspondingly dynamic view of metadata. These trends, which include DevOps, CI/CD, DataOps and data fabric, are surveyed. Several specific cloud native tools are reviewed and weaknesses in their current metadata use are identified. Implementations are suggested which better exploit capabilities of streaming platform paradigms, in which metadata is continuously collected in dynamic contexts. Future cloud native software features are identified which could enable streamed metadata to power real time data fusion or fine tune automated reasoning through real time ontology updates.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":"5 1","pages":"275-288"},"PeriodicalIF":3.9,"publicationDate":"2023-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49258477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Z. Ali, Y. Huang, Irfan Ullah, Junlan Feng, Chao Deng, Nimbeshaho Thierry, Asad Khan, Asim Ullah Jan, Xiaoli Shen, Wu Rui, G. Qi
ABSTRACT Making medication prescriptions in response to a patient's diagnosis is a challenging task. The number of pharmaceutical companies, their inventories of medicines, and the recommended dosages confront a doctor with the well-known problem of information and cognitive overload. To assist medical practitioners in making informed decisions about prescriptions, researchers have exploited electronic health records (EHRs) to recommend medication automatically. In recent years, medication recommendation using EHRs has been a salient research direction, and researchers have applied various deep learning (DL) models to patients' EHRs to recommend prescriptions. Yet, in the absence of a holistic survey article, understanding the current state of research and identifying the best-performing models, along with the trends and challenges, requires considerable time and effort. To fill this research gap, this survey reports on state-of-the-art DL-based medication recommendation methods. It reviews the classification of DL-based medication recommendation (MR) models, compares their performance, and discusses the unavoidable issues they face. It also reports on the most common datasets and metrics used in evaluating MR models. The findings of this study have implications for researchers interested in MR models.
{"title":"Deep Learning for Medication Recommendation: A Systematic Survey","authors":"Z. Ali, Y. Huang, Irfan Ullah, Junlan Feng, Chao Deng, Nimbeshaho Thierry, Asad Khan, Asim Ullah Jan, Xiaoli Shen, Wu Rui, G. Qi","doi":"10.1162/dint_a_00197","DOIUrl":"https://doi.org/10.1162/dint_a_00197","url":null,"abstract":"ABSTRACT Making medication prescriptions in response to the patient's diagnosis is a challenging task. The number of pharmaceutical companies, their inventory of medicines, and the recommended dosage confront a doctor with the well-known problem of information and cognitive overload. To assist a medical practitioner in making informed decisions regarding a medical prescription to a patient, researchers have exploited electronic health records (EHRs) in automatically recommending medication. In recent years, medication recommendation using EHRs has been a salient research direction, which has attracted researchers to apply various deep learning (DL) models to the EHRs of patients in recommending prescriptions. Yet, in the absence of a holistic survey article, it needs a lot of effort and time to study these publications in order to understand the current state of research and identify the best-performing models along with the trends and challenges. To fill this research gap, this survey reports on state-of-the-art DL-based medication recommendation methods. It reviews the classification of DL-based medication recommendation (MR) models, compares their performance, and the unavoidable issues they face. It reports on the most common datasets and metrics used in evaluating MR models. The findings of this study have implications for researchers interested in MR models.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":"5 1","pages":"303-354"},"PeriodicalIF":3.9,"publicationDate":"2023-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46680129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xindong Wu, Hao Chen, Chenyang Bu, Shengwei Ji, Zan Zhang, Victor S. Sheng
Abstract Spreadsheets contain a large amount of valuable data and have many practical applications. The key technology behind these applications is enabling machines to understand the semantic structure of spreadsheets, e.g., identifying cell function types and discovering relationships between cell pairs. Most existing methods for understanding the semantic structure of spreadsheets do not make use of the semantic information of cells. A few studies do, but they ignore the layout structure information of spreadsheets, which affects the performance of cell function classification and the discovery of different relationship types of cell pairs. In this paper, we propose a Heuristic algorithm for Understanding the Semantic Structure of spreadsheets (HUSS). Specifically, to improve cell function classification, we propose an error correction mechanism (ECM) based on an existing cell function classification model [11] and the layout features of spreadsheets. To improve table structure analysis, we propose five types of heuristic rules to extract four different types of cell pairs, based on cell style and spatial location information. Our experimental results on five real-world datasets demonstrate that HUSS can effectively understand the semantic structure of spreadsheets and outperforms the corresponding baselines.
{"title":"HUSS: A Heuristic Method for Understanding the Semantic Structure of Spreadsheets","authors":"Xindong Wu, Hao Chen, Chenyang Bu, Shengwei Ji, Zan Zhang, Victor S. Sheng","doi":"10.1162/dint_a_00201","DOIUrl":"https://doi.org/10.1162/dint_a_00201","url":null,"abstract":"Abstract Spreadsheets contain a lot of valuable data and have many practical applications. The key technology of these practical applications is how to make machines understand the semantic structure of spreadsheets, e.g., identifying cell function types and discovering relationships between cell pairs. Most existing methods for understanding the semantic structure of spreadsheets do not make use of the semantic information of cells. A few studies do, but they ignore the layout structure information of spreadsheets, which affects the performance of cell function classification and the discovery of different relationship types of cell pairs. In this paper, we propose a Heuristic algorithm for Understanding the Semantic Structure of spreadsheets (HUSS). Specifically, for improving the cell function classification, we propose an error correction mechanism (ECM) based on an existing cell function classification model [11] and the layout features of spreadsheets. For improving the table structure analysis, we propose five types of heuristic rules to extract four different types of cell pairs, based on the cell style and spatial location information. Our experimental results on five real-world datasets demonstrate that HUSS can effectively understand the semantic structure of spreadsheets and outperforms corresponding baselines.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136006652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tsai Hor Chan, Chi Ho Wong, Jiajun Shen, Guosheng Yin
ABSTRACT Heterogeneous information networks (HINs) have been extensively applied to real-world tasks, such as recommendation systems, social networks, and citation networks. While existing HIN representation learning methods can effectively learn the semantic and structural features in the network, little attention has been paid to the distribution discrepancy of subgraphs within a single HIN. However, we find that ignoring such distribution discrepancy among subgraphs from multiple sources hinders the effectiveness of graph embedding learning algorithms. This motivates us to propose SUMSHINE (Scalable Unsupervised Multi-Source Heterogeneous Information Network Embedding), a scalable unsupervised framework to align the embedding distributions among multiple sources of an HIN. Experimental results on real-world datasets across a variety of downstream tasks validate the performance of our method over state-of-the-art heterogeneous information network embedding algorithms.
{"title":"Source-Aware Embedding Training on Heterogeneous Information Networks","authors":"Tsai Hor Chan, Chi Ho Wong, Jiajun Shen, Guosheng Yin","doi":"10.1162/dint_a_00200","DOIUrl":"https://doi.org/10.1162/dint_a_00200","url":null,"abstract":"ABSTRACT Heterogeneous information networks (HINs) have been extensively applied to real-world tasks, such as recommendation systems, social networks, and citation networks. While existing HIN representation learning methods can effectively learn the semantic and structural features in the network, little awareness was given to the distribution discrepancy of subgraphs within a single HIN. However, we find that ignoring such distribution discrepancy among subgraphs from multiple sources would hinder the effectiveness of graph embedding learning algorithms. This motivates us to propose SUMSHINE (Scalable Unsupervised Multi-Source Heterogeneous Information Network Embedding)—a scalable unsupervised framework to align the embedding distributions among multiple sources of an HIN. Experimental results on real-world datasets in a variety of downstream tasks validate the performance of our method over the state-of-the-art heterogeneous information network embedding algorithms.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":"5 1","pages":"611-635"},"PeriodicalIF":3.9,"publicationDate":"2023-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44509817","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}