{"title":"Enhanced interpretation of novel datasets by summarizing clustering results using deep-learning based linguistic models","authors":"Natarajan K, Srikar Verma, Dheeraj Kumar","doi":"10.1007/s10489-025-06250-6","DOIUrl":null,"url":null,"abstract":"<div><p>In today’s technology-driven era, the proliferation of data is inevitable across various domains. Within engineering, sciences, and business domains, particularly in the context of big data, it can extract actionable insights that can revolutionize the field. Amid data management and analysis, patterns or groups of interconnected data points, commonly referred to as clusters, frequently emerge. These clusters represent distinct subsets containing closely related data points, showcasing unique characteristics compared to other clusters within the same dataset. Spanning across disciplines such as physics, biology, business, and sales, clustering is important in understanding these novel datasets’ essential characteristics, developing complex statistical models, and testing various hypotheses. However, interpreting the characteristics and physical implications of generated clusters by different clustering algorithms is challenging for researchers unfamiliar with these algorithms’ inner workings. This research addresses the intricacies of comprehending data clustering, cluster attributes, and evaluation metrics, especially for individuals lacking proficiency in clustering or related disciplines like statistics. The primary objective of this study is to simplify cluster analysis by furnishing users or analysts from diverse domains with succinct linguistic synopses of clustering results, circumventing the necessity for intricate numerical or mathematical terms. Deep learning techniques based on large language models, such as encoder-decoders (for example, the T5 model) and generative pre-trained transformers (GPTs), are employed to achieve this. This study aims to construct a summarization model capable of ingesting data clusters, producing a condensed overview of the contained insights in a simplified, easily understandable linguistic format. The evaluation process revealed a clear preference among evaluators for the summaries generated by GPT, with T5 summaries following closely behind. GPT and T5 summaries were good at fluency, demonstrating their ability to capture the original content in a human-like manner. In contrast, while providing a structured framework for summarization, the linguistic protoform-based approach is needed to match the quality and coherence of the GPT and T5 summaries.</p></div>","PeriodicalId":8041,"journal":{"name":"Applied Intelligence","volume":"55 5","pages":""},"PeriodicalIF":3.4000,"publicationDate":"2025-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Applied Intelligence","FirstCategoryId":"94","ListUrlMain":"https://link.springer.com/article/10.1007/s10489-025-06250-6","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
In today’s technology-driven era, the proliferation of data is inevitable across various domains. Within engineering, sciences, and business domains, particularly in the context of big data, it can extract actionable insights that can revolutionize the field. Amid data management and analysis, patterns or groups of interconnected data points, commonly referred to as clusters, frequently emerge. These clusters represent distinct subsets containing closely related data points, showcasing unique characteristics compared to other clusters within the same dataset. Spanning across disciplines such as physics, biology, business, and sales, clustering is important in understanding these novel datasets’ essential characteristics, developing complex statistical models, and testing various hypotheses. However, interpreting the characteristics and physical implications of generated clusters by different clustering algorithms is challenging for researchers unfamiliar with these algorithms’ inner workings. This research addresses the intricacies of comprehending data clustering, cluster attributes, and evaluation metrics, especially for individuals lacking proficiency in clustering or related disciplines like statistics. The primary objective of this study is to simplify cluster analysis by furnishing users or analysts from diverse domains with succinct linguistic synopses of clustering results, circumventing the necessity for intricate numerical or mathematical terms. Deep learning techniques based on large language models, such as encoder-decoders (for example, the T5 model) and generative pre-trained transformers (GPTs), are employed to achieve this. This study aims to construct a summarization model capable of ingesting data clusters, producing a condensed overview of the contained insights in a simplified, easily understandable linguistic format. The evaluation process revealed a clear preference among evaluators for the summaries generated by GPT, with T5 summaries following closely behind. GPT and T5 summaries were good at fluency, demonstrating their ability to capture the original content in a human-like manner. In contrast, while providing a structured framework for summarization, the linguistic protoform-based approach is needed to match the quality and coherence of the GPT and T5 summaries.
期刊介绍:
With a focus on research in artificial intelligence and neural networks, this journal addresses issues involving solutions of real-life manufacturing, defense, management, government and industrial problems which are too complex to be solved through conventional approaches and require the simulation of intelligent thought processes, heuristics, applications of knowledge, and distributed and parallel processing. The integration of these multiple approaches in solving complex problems is of particular importance.
The journal presents new and original research and technological developments, addressing real and complex issues applicable to difficult problems. It provides a medium for exchanging scientific research and technological achievements accomplished by the international community.