If Francis Bacon were born today, he might have said “data is power” instead of his original saying, “knowledge is power.” In modern society, data is everywhere. In memory of Deming (a guru in quality), this paper attempts to address the fundamental issue of data quality and how Deming would handle it. Specifically, we attempt to explain what data quality really means, and the critical impact that it has on data science. Statisticians, who understand how to collect high-quality data, have much more to contribute to both the intellectual vitality and the practical utility of data science. At the same time, data science challenges statisticians to move out of some familiar habits to engage less structured problems, to become more comfortable with ambiguity, and to engage more scientists in a fruitful discussion on what various parties can bring to this new mode of investigation. Some potential avenues for future research in the collection of high-quality data will be proposed.
Dennis K. J. Lin, Nicholas Rios. "Data Quality: What if Deming Were Born Today?" Applied Stochastic Models in Business and Industry 41(4), published 2025-06-29. DOI: 10.1002/asmb.70025, https://doi.org/10.1002/asmb.70025
<p>This study introduces a statistical methodology for document clustering that integrates multiple dimensions of textual similarity through network topology analysis. The proposed methodology, which we call Multi-dimensional Similarity Network Analysis (MSNA), extends traditional document-clustering approaches by combining semantic embeddings, topic probability distributions, and emotion probability distributions into a unified similarity measure. We formalize this through a weighted combination of Jensen-Shannon divergences across different probability spaces, creating a comprehensive similarity network. The clustering is achieved through a community detection algorithm that optimizes a multi-objective modularity function, accounting for the different similarity dimensions. We prove the statistical consistency of our approach and derive bounds for the clustering performance under mild regularity conditions. The methodology is validated on a large-scale data set of Airbnb reviews <span></span><math>