Mid-infrared spectra of dried and roasted cocoa (Theobroma cacao L.): A dataset for machine learning-based classification of cocoa varieties and prediction of theobromine and caffeine content

Data in Brief (IF 1.4, Q3, Multidisciplinary Sciences), vol. 58, Article 111243 | Pub Date: 2025-02-01 | Epub Date: 2024-12-19 | DOI: 10.1016/j.dib.2024.111243

Gentil A. Collazos-Escobar, Andrés F. Bahamón-Monje, Nelson Gutiérrez-Guzmán



Abstract

This paper presents a dataset of mid-infrared spectra for dried and roasted cocoa beans (Theobroma cacao L.), along with their corresponding theobromine and caffeine content. Infrared data were acquired using Attenuated Total Reflectance-Fourier Transform Infrared (ATR-FTIR) spectroscopy, while High-Performance Liquid Chromatography (HPLC) was used to quantify theobromine and caffeine in the dried cocoa beans. The theobromine/caffeine relationship served as a chemical marker for distinguishing between cocoa varieties. The dataset provides a basis for further research: mid-infrared spectra can be combined with the HPLC measurements (as the reference standard) to fine-tune machine learning and deep learning models that simultaneously predict theobromine content, caffeine content, and cocoa variety in both dried and roasted samples, using a non-destructive approach based on spectral data. Tools developed from this dataset could advance automated processes in the cocoa industry and support decision-making on an industrial scale by facilitating real-time quality control of cocoa-based products, improving cocoa variety classification, and optimizing bean selection, blending strategies, and product formulation, while reducing the need for labor-intensive and costly quantification methods. The dataset is organized into Excel sheets structured by experimental condition and replicate, providing a framework for further analysis, model development, and calibration of multivariate statistical models.
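As a rough illustration of how a theobromine/caffeine relationship could separate varieties, the following minimal sketch summarizes the relationship as a simple ratio and assigns a sample to the variety with the nearest mean ratio. This is an assumption made for illustration, not the authors' method: the variety labels "A" and "B", the field names, and all numeric values are made-up placeholders and do not come from the dataset.

```python
from statistics import mean

# Hypothetical HPLC reference values (arbitrary units).
# These numbers are placeholders, NOT values from the published dataset.
reference = [
    {"variety": "A", "theobromine": 12.0, "caffeine": 4.0},
    {"variety": "A", "theobromine": 11.0, "caffeine": 5.0},
    {"variety": "B", "theobromine": 14.0, "caffeine": 2.0},
    {"variety": "B", "theobromine": 15.0, "caffeine": 2.5},
]


def tb_cf_ratio(sample):
    """Return the theobromine/caffeine ratio of one sample."""
    return sample["theobromine"] / sample["caffeine"]


def fit_centroids(samples):
    """Compute the mean theobromine/caffeine ratio for each variety."""
    ratios_by_variety = {}
    for s in samples:
        ratios_by_variety.setdefault(s["variety"], []).append(tb_cf_ratio(s))
    return {v: mean(r) for v, r in ratios_by_variety.items()}


def classify(sample, centroids):
    """Assign the variety whose mean ratio is closest to the sample's ratio."""
    ratio = tb_cf_ratio(sample)
    return min(centroids, key=lambda v: abs(centroids[v] - ratio))


centroids = fit_centroids(reference)
unknown = {"theobromine": 13.0, "caffeine": 2.0}  # hypothetical new sample
predicted = classify(unknown, centroids)
print(centroids, predicted)
```

In a spectra-based workflow, the theobromine and caffeine values passed to `classify` would come from a model calibrated against the HPLC measurements rather than from HPLC itself; that calibration step is not shown here.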



