基于网格边界域的大数据不平衡数据采样设计

IF 16.4 1区 化学 Q1 CHEMISTRY, MULTIDISCIPLINARY Accounts of Chemical Research Pub Date : 2024-03-08 DOI:10.1007/s00180-024-01471-8
{"title":"基于网格边界域的大数据不平衡数据采样设计","authors":"","doi":"10.1007/s00180-024-01471-8","DOIUrl":null,"url":null,"abstract":"<h3>Abstract</h3> <p>The data distribution is often associated with a <em>priori</em>-known probability, and the occurrence probability of interest events is small, so a large amount of imbalanced data appears in sociology, economics, engineering, and various other fields. The existing over- and under-sampling methods are widely used in imbalanced data classification problems, but over-sampling leads to overfitting, and under-sampling ignores the effective information. We propose a new sampling design algorithm called the neighbor grid of boundary mixed-sampling (NGBM), which focuses on the boundary information. This paper obtains the classification boundary information through grid boundary domain identification, thereby determining the importance of the samples. Based on this premise, the synthetic minority oversampling technique is applied to the boundary grid, and random under-sampling is applied to the other grids. With the help of this mixed sampling strategy, more important classification boundary information, especially for positive sample information identification is extracted. Numerical simulations and real data analysis are used to discuss the parameter-setting strategy of the NGBM and illustrate the advantages of the proposed NGBM in the imbalanced data, as well as practical applications.</p>","PeriodicalId":1,"journal":{"name":"Accounts of Chemical Research","volume":null,"pages":null},"PeriodicalIF":16.4000,"publicationDate":"2024-03-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Imbalanced data sampling design based on grid boundary domain for big data\",\"authors\":\"\",\"doi\":\"10.1007/s00180-024-01471-8\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<h3>Abstract</h3> <p>The data distribution is often associated with a <em>priori</em>-known probability, and the occurrence probability of interest events is small, so a large amount of imbalanced data appears in sociology, economics, engineering, and various other fields. The existing over- and under-sampling methods are widely used in imbalanced data classification problems, but over-sampling leads to overfitting, and under-sampling ignores the effective information. We propose a new sampling design algorithm called the neighbor grid of boundary mixed-sampling (NGBM), which focuses on the boundary information. This paper obtains the classification boundary information through grid boundary domain identification, thereby determining the importance of the samples. Based on this premise, the synthetic minority oversampling technique is applied to the boundary grid, and random under-sampling is applied to the other grids. With the help of this mixed sampling strategy, more important classification boundary information, especially for positive sample information identification is extracted. Numerical simulations and real data analysis are used to discuss the parameter-setting strategy of the NGBM and illustrate the advantages of the proposed NGBM in the imbalanced data, as well as practical applications.</p>\",\"PeriodicalId\":1,\"journal\":{\"name\":\"Accounts of Chemical Research\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":16.4000,\"publicationDate\":\"2024-03-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Accounts of Chemical Research\",\"FirstCategoryId\":\"100\",\"ListUrlMain\":\"https://doi.org/10.1007/s00180-024-01471-8\",\"RegionNum\":1,\"RegionCategory\":\"化学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"CHEMISTRY, MULTIDISCIPLINARY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Accounts of Chemical Research","FirstCategoryId":"100","ListUrlMain":"https://doi.org/10.1007/s00180-024-01471-8","RegionNum":1,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"CHEMISTRY, MULTIDISCIPLINARY","Score":null,"Total":0}
引用次数: 0

摘要

摘要 数据分布往往与事先已知的概率有关,而感兴趣事件的发生概率较小,因此在社会学、经济学、工程学等各个领域都会出现大量的不平衡数据。现有的过采样和欠采样方法被广泛应用于不平衡数据分类问题,但过采样会导致过拟合,而欠采样会忽略有效信息。我们提出了一种新的采样设计算法,称为边界混合采样的邻域网格(NGBM),它关注边界信息。本文通过网格边界域识别获得分类边界信息,从而确定样本的重要性。在此前提下,对边界网格采用合成少数超采样技术,对其他网格采用随机欠采样技术。在这种混合采样策略的帮助下,可以提取出更重要的分类边界信息,尤其是对正样本信息的识别。通过数值模拟和实际数据分析,讨论了 NGBM 的参数设置策略,并说明了所提出的 NGBM 在不平衡数据中的优势以及实际应用。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Imbalanced data sampling design based on grid boundary domain for big data

Abstract

The data distribution is often associated with a priori-known probability, and the occurrence probability of interest events is small, so a large amount of imbalanced data appears in sociology, economics, engineering, and various other fields. The existing over- and under-sampling methods are widely used in imbalanced data classification problems, but over-sampling leads to overfitting, and under-sampling ignores the effective information. We propose a new sampling design algorithm called the neighbor grid of boundary mixed-sampling (NGBM), which focuses on the boundary information. This paper obtains the classification boundary information through grid boundary domain identification, thereby determining the importance of the samples. Based on this premise, the synthetic minority oversampling technique is applied to the boundary grid, and random under-sampling is applied to the other grids. With the help of this mixed sampling strategy, more important classification boundary information, especially for positive sample information identification is extracted. Numerical simulations and real data analysis are used to discuss the parameter-setting strategy of the NGBM and illustrate the advantages of the proposed NGBM in the imbalanced data, as well as practical applications.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
Accounts of Chemical Research
Accounts of Chemical Research 化学-化学综合
CiteScore
31.40
自引率
1.10%
发文量
312
审稿时长
2 months
期刊介绍: Accounts of Chemical Research presents short, concise and critical articles offering easy-to-read overviews of basic research and applications in all areas of chemistry and biochemistry. These short reviews focus on research from the author’s own laboratory and are designed to teach the reader about a research project. In addition, Accounts of Chemical Research publishes commentaries that give an informed opinion on a current research problem. Special Issues online are devoted to a single topic of unusual activity and significance. Accounts of Chemical Research replaces the traditional article abstract with an article "Conspectus." These entries synopsize the research affording the reader a closer look at the content and significance of an article. Through this provision of a more detailed description of the article contents, the Conspectus enhances the article's discoverability by search engines and the exposure for the research.
期刊最新文献
Management of Cholesteatoma: Hearing Rehabilitation. Congenital Cholesteatoma. Evaluation of Cholesteatoma. Management of Cholesteatoma: Extension Beyond Middle Ear/Mastoid. Recidivism and Recurrence.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1