How data heterogeneity affects innovating knowledge and information in gene identification: A statistical learning perspective
Authors: Jun Zhao, Fangyi Lao, Guan'ao Yan, Yi Zhang
Journal of Innovation & Knowledge, published 2024-07-01. DOI: 10.1016/j.jik.2024.100514
https://www.sciencedirect.com/science/article/pii/S2444569X24000532
Data heterogeneity, particularly noted in fields such as genetics, has been identified as a key feature of big data, posing significant challenges to innovation in knowledge and information. This paper focuses on characterizing and understanding the so-called "curse of heterogeneity" in gene identification for low infant birth weight from a statistical learning perspective. Owing to the computational and analytical advantages of expectile regression in handling heterogeneity, this paper proposes a flexible, regularized, partially linear additive expectile regression model for high-dimensional heterogeneous data. Unlike most existing works that assume Gaussian or sub-Gaussian error distributions, we adopt a more realistic, less stringent assumption that the errors have only finite moments. Additionally, we derive a two-step algorithm to address the reduced optimization problem and demonstrate that our method, with a probability approaching one, achieves optimal estimation accuracy. Furthermore, we demonstrate that the proposed algorithm converges at least linearly, ensuring the practical applicability of our method. Monte Carlo simulations reveal that our method's resulting estimator performs well in terms of estimation accuracy, model selection, and heterogeneity identification. Empirical analysis in gene trait expression further underscores the potential for guiding public health interventions.
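For readers unfamiliar with expectiles, the following is a minimal sketch of the standard asymmetric squared loss that underlies expectile regression, applied to a plain sample with no covariates. It is not the paper's regularized partially linear additive estimator or its two-step algorithm; the function names `expectile_loss` and `sample_expectile` and the toy data are assumptions made purely for illustration.

```python
def expectile_loss(residual: float, tau: float) -> float:
    """Asymmetric squared loss: weight tau on non-negative residuals, 1 - tau on negative ones."""
    weight = tau if residual >= 0 else 1.0 - tau
    return weight * residual ** 2


def sample_expectile(values, tau, tol=1e-10, max_iter=1000):
    """Tau-expectile of a sample via iteratively reweighted averaging.

    The minimizer m of sum(expectile_loss(y - m, tau)) satisfies
    m = sum(w_i * y_i) / sum(w_i), with w_i = tau if y_i > m else 1 - tau.
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    m = sum(values) / len(values)  # start from the mean (the 0.5-expectile)
    for _ in range(max_iter):
        weights = [tau if y > m else 1.0 - tau for y in values]
        new_m = sum(w * y for w, y in zip(weights, values)) / sum(weights)
        if abs(new_m - m) < tol:
            return new_m
        m = new_m
    return m


# Hypothetical toy sample with one large value in the upper tail.
data = [1.0, 2.0, 3.0, 4.0, 10.0]
low, mid, high = (sample_expectile(data, t) for t in (0.1, 0.5, 0.9))
```

At `tau = 0.5` the loss is symmetric and the expectile coincides with the sample mean; moving `tau` toward 0 or 1 shifts the fit toward the lower or upper tail. Comparing fits across several values of `tau` is the general way expectile-based methods probe heterogeneity.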
About the journal:
The Journal of Innovation and Knowledge (JIK) explores how innovation drives knowledge creation and vice versa, emphasizing that not all innovation leads to knowledge, but enduring innovation across diverse fields fosters theory and knowledge. JIK invites papers on innovations enhancing or generating knowledge, covering innovation processes, structures, outcomes, and behaviors at various levels. Articles in JIK examine knowledge-related changes promoting innovation for societal best practices.
JIK serves as a platform for high-quality studies undergoing double-blind peer review, ensuring global dissemination to scholars, practitioners, and policymakers who recognize innovation and knowledge as economic drivers. It publishes theoretical articles, empirical studies, case studies, reviews, and other content, addressing current trends and emerging topics in innovation and knowledge. The journal welcomes suggestions for special issues and encourages articles to showcase contextual differences and lessons for a broad audience.
In essence, JIK is an interdisciplinary journal dedicated to advancing theoretical and practical innovations and knowledge across multiple fields, including Economics, Business and Management, Engineering, Science, and Education.