
International journal of information technology : an official journal of Bharati Vidyapeeth's Institute of Computer Applications and Management: latest articles

A novel centroid based sentence classification approach for extractive summarization of COVID-19 news reports.
Sumanta Banerjee, Shyamapada Mukherjee, Sivaji Bandyopadhyay

COVID-19 news covers subtopics such as infections, deaths, the economy, jobs, and more. The proposed method generates a news summary based on the subtopics that interest a reader. It extracts a centroid that captures the lexical pattern of the sentences on those subtopics from the words used frequently in them. The centroid is then used as a query in the vector space model (VSM) for sentence classification and extraction, producing a query-focused summarization (QFS) of the documents. Three approaches (TF-IDF, word vector averaging, and an auto-encoder) are tested for generating the sentence embeddings used in the VSM. These embeddings are ranked by their similarity to the query embedding. A novel approach is introduced that uses a supervised technique to find the value of the similarity parameter used to classify the sentences. Finally, the performance of the method is assessed in two ways: in the first assessment, all the sentences of the dataset are considered together; in the second, each document-wise group of sentences is considered separately using fivefold cross-validation. The proposed method achieves mean F1 scores ranging from 0.60 to 0.63 across the three sentence encoding approaches on the test dataset.
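The centroid-as-query idea in this abstract can be illustrated with a minimal sketch. It covers only the TF-IDF variant and makes several assumptions that are not taken from the paper: the helper names (`tokenize`, `centroid_query`, `summarize`), the smoothed IDF formula, the `top_k` cut-off for "frequently used words", and a fixed `threshold` in place of the similarity parameter that the authors fit with a supervised technique.

```python
import math
from collections import Counter

def tokenize(text):
    # Lower-case whitespace tokens with edge punctuation removed (no stop-word filtering).
    return [w.strip(".,;:!?").lower() for w in text.split() if w.strip(".,;:!?")]

def tfidf_vectors(sentences):
    # One sparse TF-IDF vector (dict) per sentence, plus the IDF table.
    docs = [Counter(tokenize(s)) for s in sentences]
    n = len(docs)
    df = Counter(term for d in docs for term in d)
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}  # smoothed IDF (assumption)
    return [{t: tf * idf[t] for t, tf in d.items()} for d in docs], idf

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid_query(seed_sentences, idf, top_k=5):
    # Hypothetical centroid: the top_k most frequent words of the seed sentences, IDF-weighted.
    counts = Counter(w for s in seed_sentences for w in tokenize(s))
    return {t: c * idf.get(t, 0.0) for t, c in counts.most_common(top_k)}

def summarize(sentences, seed_sentences, threshold=0.2):
    # Keep sentences whose similarity to the query reaches the threshold, best first.
    vectors, idf = tfidf_vectors(sentences)
    query = centroid_query(seed_sentences, idf)
    scored = [(cosine(vec, query), s) for vec, s in zip(vectors, sentences)]
    return [s for score, s in sorted(scored, key=lambda p: -p[0]) if score >= threshold]
```

Calling `summarize(sentences, seed_sentences)` returns the sentences that pass the threshold in ranked order; in the paper the threshold is learned rather than fixed by hand.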

DOI: 10.1007/s41870-023-01221-x. Vol. 15, no. 4, pp. 1789-1801. Publication date: 2023-01-01. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10036244/pdf/
Citations: 0
Class balancing framework for credit card fraud detection based on clustering and similarity-based selection (SBS).
Hadeel Ahmad, Bassam Kasasbeh, Balqees Aldabaybah, Enas Rawashdeh

Credit card fraud is a growing problem, and it escalated during COVID-19 because authorities in many countries required people to use cashless transactions. Every year, billions of euros are lost to fraudulent credit card transactions; fraud detection systems are therefore essential for financial institutions. Because the classes are not equally represented in the credit card dataset, machine learning fits the model mainly to the majority class, which leads to inaccurate fraud predictions. This research therefore focuses on processing unbalanced data with an under-sampling technique to obtain more accurate results from different machine learning algorithms. We propose a framework that clusters the dataset using fuzzy C-means and selects similar fraud and normal instances that have the same features, which guarantees the integrity between the data features.
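The selection half of this framework can be sketched as follows. This is a deliberately simplified illustration, not the authors' algorithm: it skips the fuzzy C-means clustering step entirely and assumes that "similar" means closest in Euclidean distance. The function name `undersample_by_similarity` is hypothetical.

```python
import math

def undersample_by_similarity(normal, fraud):
    # For each fraud row, keep the closest not-yet-used normal row (Euclidean distance).
    # The result has at most len(fraud) rows, so the two classes end up balanced.
    remaining = list(range(len(normal)))
    chosen = []
    for f in fraud:
        if not remaining:
            break
        best = min(remaining, key=lambda i: math.dist(normal[i], f))
        chosen.append(best)
        remaining.remove(best)
    return [normal[i] for i in chosen]
```

The balanced training set is then the fraud rows plus the returned normal rows. In the framework described above, the search would presumably be restricted to instances that share a cluster, which this sketch does not model.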

DOI: 10.1007/s41870-022-00987-w. Vol. 15, no. 1, pp. 325-333. Publication date: 2023-01-01. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9209320/pdf/
Citations: 15
Improved local descriptor (ILD): a novel fusion method in face recognition.
Shekhar Karanwal

The literature suggests that fusing multiple features yields a large improvement in recognition rates over a single descriptor. This motivates researchers to develop more fused descriptors by joining multiple features. Inspired by this literature, the proposed work introduces a novel local descriptor, the Improved Local Descriptor (ILD), which joins the features of four local descriptors: LBP, ELBP, MBP and LPQ. LBP captures local details. ELBP captures robust features in the horizontal and vertical directions (elliptically) by using 3 × 5 and 5 × 3 patches. MBP minimizes image noise by comparing all the pixels with the median, and LPQ quantizes the frequency components to obtain the feature size. These merits of the four descriptors are encapsulated in one framework in the form of a histogram feature. PCA is then used for compression, and SVMs and NN are used for classification. Results on ORL, GT and Faces94 confirm the strength of ILD, which beats the separately implemented descriptors and various benchmark methods.
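Of the four descriptors, the basic LBP is the simplest to show in code. The sketch below computes the standard 8-neighbour LBP code on 3 × 3 patches and collects the codes into a 256-bin histogram. The clockwise bit order starting at the top-left neighbour and the `>=` comparison are common conventions assumed here, not details taken from the paper, and ELBP, MBP, LPQ and the fusion step are not shown.

```python
def lbp_code(patch):
    # patch: 3x3 list of lists of grey values; returns the 8-bit code of the centre pixel.
    centre = patch[1][1]
    # Neighbours visited clockwise, starting at the top-left corner (assumed convention).
    coords = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for bit, (r, c) in enumerate(coords):
        if patch[r][c] >= centre:
            code |= 1 << bit
    return code

def lbp_histogram(image):
    # 256-bin histogram of LBP codes over all interior pixels of a 2-D grey image.
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            patch = [row[c - 1:c + 2] for row in image[r - 1:r + 2]]
            hist[lbp_code(patch)] += 1
    return hist
```

A fused descriptor in the spirit of the abstract would combine histograms like this one from several descriptors into a single feature vector before compression and classification.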

DOI: 10.1007/s41870-023-01245-3. Vol. 15, no. 4, pp. 1885-1894. Publication date: 2023-01-01. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10106113/pdf/
Citations: 0
A novel GCL hybrid classification model for paddy diseases.
Shweta Lamba, Anupam Baliyan, Vinay Kukreja

Demand for agricultural products has increased exponentially as the global population has grown. The rapid development of computer vision-based artificial intelligence and deep learning-related technologies has impacted a wide range of industries, including disease detection and classification. This paper introduces a novel neural network-based hybrid model (GCL). GCL is a dataset-augmentation fusion of long short-term memory (LSTM) and convolutional neural network (CNN) with generative adversarial network (GAN). The GAN augments the dataset, the CNN extracts the features, and the LSTM classifies the various paddy diseases. The GCL model is investigated with the aim of improving the classification model's accuracy and reliability. The dataset was compiled from secondary sources such as Mendeley, Kaggle, UCI, and GitHub and contains images of bacterial blight, leaf smut, and rice blast. The experiments show that GCL is suitable for disease classification, reaching 97% testing accuracy. GCL can be extended to classify more paddy diseases.

DOI: 10.1007/s41870-022-01094-6. Vol. 15, no. 2, pp. 1127-1136. Publication date: 2023-01-01. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9484355/pdf/
Citations: 18
Editorial.
M N Hoda
DOI: 10.1007/s41870-023-01182-1. Vol. 15, no. 2, pp. 545-548. Publication date: 2023-01-01. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9943031/pdf/
Citations: 0
Closed-set automatic speaker identification using multi-scale recurrent networks in non-native children.
Kodali Radha, Mohan Bansal

Children may benefit from automatic speaker identification in a variety of applications, including child security, safety, and education. The key focus of this study is to develop a closed-set child speaker identification system for non-native speakers of English in both text-dependent and text-independent speech tasks, in order to track how the speaker's fluency affects the system. The multi-scale wavelet scattering transform is used to compensate for concerns such as the loss of high-frequency information caused by the most widely used feature extractor, mel frequency cepstral coefficients. The proposed large-scale speaker identification system performs well by employing a wavelet-scattered Bi-LSTM. The procedure is used to identify non-native children across multiple classes, and the average accuracy, precision, recall, and F-measure are used to assess the model in text-independent and text-dependent tasks, where it outperforms the existing models.
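The averaged metrics named in this abstract can be computed for a multi-class speaker identification task as in the sketch below. It assumes macro-averaging (an unweighted mean over speaker classes); the abstract does not say which averaging scheme the authors used, and the function name `macro_scores` is hypothetical.

```python
def macro_scores(y_true, y_pred):
    # Accuracy plus macro-averaged precision, recall and F-measure.
    # Assumes y_true and y_pred are non-empty and of equal length.
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f_measures = [], [], []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f_measures.append(f)
    n = len(labels)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f_measure": sum(f_measures) / n,
    }
```

Here each label would be a speaker identity, so every child contributes equally to the averages regardless of how many utterances they have.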

DOI: 10.1007/s41870-023-01224-8. Vol. 15, no. 3, pp. 1375-1385. Publication date: 2023-01-01. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10023307/pdf/
Citations: 0
A novel stock counting system for detecting lot numbers using Tesseract OCR.
Parkpoom Lertsawatwicha, Phumidon Phathong, Napatsorn Tantasanee, Kotchakorn Sarawutthinun, Thitirat Siriborvornratanakul

Counting stock is one of a warehouse's methods for preventing stock shortages. It can also help a company forecast how many products it needs to store and predict the goods to replenish for customers. Stock counting deserves particular attention in the medical business, which sells specialized medical equipment, because that equipment is used to treat patients, so a lack of inventory should not happen. Even in a normal situation, counting stock at some hospitals is hard for salespeople, especially at distant upcountry hospitals. During the COVID-19 situation, many strict restrictions applied, which caused a shortage of goods in many hospitals. In this paper, we present how computer vision can help this process. When a hospital officer sends images of the stock to our system, the system recognizes the quantity and lot number of the goods that remain in the hospital, so salespeople can visit hospitals less often. The results show that, for text detection and text recognition in this specific use case, our prototype system achieves 84.17% accuracy.
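Once OCR has turned a stock image into text, counting items per lot number is a text-parsing step. The sketch below shows that step only and assumes the OCR output is already available as a string. The label format (the word "LOT", a separator, then 4 to 12 letters or digits) and the names `LOT_PATTERN` and `count_lots` are hypothetical; the abstract does not describe the actual label layout or the authors' post-processing.

```python
import re
from collections import Counter

# Hypothetical label format: "LOT", at least one separator, then 4-12 alphanumerics.
LOT_PATTERN = re.compile(r"\bLOT[\s:#-]+([A-Z0-9]{4,12})\b", re.IGNORECASE)

def count_lots(ocr_text):
    # Map each recognised lot number (upper-cased) to the number of times it appears.
    return Counter(match.upper() for match in LOT_PATTERN.findall(ocr_text))
```

Upper-casing the matches merges readings that differ only in letter case, which is one common kind of OCR inconsistency.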

DOI: 10.1007/s41870-022-01107-4. Vol. 15, no. 1, pp. 393-398. Publication date: 2023-01-01. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9540281/pdf/
Citations: 4
Editorial.
M N Hoda
DOI: 10.1007/s41870-023-01239-1. Vol. 15, no. 3, pp. 1201-1204. Publication date: 2023-01-01. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10068186/pdf/
Citations: 0
An integrated clustering and BERT framework for improved topic modeling.
Lijimol George, P Sumathy

Topic modeling is a machine learning technique that is extensively used in Natural Language Processing (NLP) applications to infer topics within unstructured textual data. Latent Dirichlet Allocation (LDA) is one of the most used topic modeling techniques and can automatically detect topics in a huge collection of text documents. However, LDA-based topic models alone do not always provide promising results. Clustering is an effective class of unsupervised machine learning algorithms that is extensively used in applications including information extraction from unstructured textual data and topic modeling. A hybrid model of Bidirectional Encoder Representations from Transformers (BERT) and LDA for topic modeling, with clustering based on dimensionality reduction, has been studied in detail. Because clustering algorithms are computationally complex and their complexity increases with the number of features, PCA-, t-SNE- and UMAP-based dimensionality reduction methods are also applied. Finally, a unified clustering-based framework using BERT and LDA is proposed for mining a set of meaningful topics from massive text corpora. Experiments that simulate user input on benchmark datasets demonstrate the effectiveness of the cluster-informed topic modeling framework. The results show that clustering with dimensionality reduction helps infer more coherent topics, so this unified clustering and BERT-LDA based approach can be used effectively for building topic modeling applications.

DOI: 10.1007/s41870-023-01268-w. Vol. 15, no. 4, pp. 2187-2195. Publication date: 2023-01-01. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10163298/pdf/
Citations: 4
Detecting influential nodes with topological structure via Graph Neural Network approach in social networks.
Riju Bhattacharya, Naresh Kumar Nagwani, Sarsij Tripathi

Detecting influential nodes in complex social networks is crucial because of the enormous amount of data and the constantly changing behavior of existing topologies. Centrality-based and machine-learning approaches focus mostly on node topologies or feature values when evaluating the relevance of nodes. However, both network topologies and node attributes should be taken into account when determining the influence of nodes. This research proposes using a deep learning model, Graph Convolutional Networks (GCN), to discover the significant nodes in large graph-based datasets. A deep learning framework called DeepInfNode has been developed for identifying influential nodes with structural centrality via GCN. The proposed approach uses contextual information from Susceptible-Infected-Recovered (SIR) model trials, which measure the rate of infection, to develop node representations. The experimental results indicate that the suggested model has higher F1 and area under the curve (AUC) values. The findings indicate that the strategy is both effective and precise in terms of suggesting new linkages. The proposed DeepInfNode model outperforms state-of-the-art approaches on a variety of publicly available standard graph datasets, achieving up to 99.1% accuracy.
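The SIR trials mentioned in this abstract can be illustrated with a small discrete-time simulation. This is a generic sketch of how SIR runs can produce an influence score per node; it is not the DeepInfNode pipeline, and the function names, the adjacency-dict graph format, and the default `beta` and `gamma` values are assumptions.

```python
import random

def sir_spread(graph, seed_node, beta=0.3, gamma=1.0, rng=None):
    # One SIR trial started at seed_node; returns how many nodes were ever infected.
    # graph: dict mapping each node to a list of its neighbours.
    rng = rng or random.Random()
    infected, recovered = {seed_node}, set()
    while infected:
        newly = set()
        for node in infected:
            for nb in graph[node]:
                # A susceptible neighbour becomes infected with probability beta.
                if nb not in infected and nb not in recovered and rng.random() < beta:
                    newly.add(nb)
        # Each infected node recovers with probability gamma at the end of the step.
        recovering = {n for n in infected if rng.random() < gamma}
        recovered |= recovering
        infected = (infected - recovering) | newly
    return len(recovered)

def influence(graph, node, trials=200, beta=0.3, seed=0):
    # Average outbreak size over repeated trials, usable as an influence label for a node.
    rng = random.Random(seed)
    return sum(sir_spread(graph, node, beta=beta, rng=rng) for _ in range(trials)) / trials
```

Scores like these are one plausible way to obtain training targets for a graph neural network, with nodes that trigger larger outbreaks treated as more influential.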

DOI: 10.1007/s41870-023-01271-1. Vol. 15, no. 4, pp. 2233-2246. Publication date: 2023-01-01. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10163927/pdf/
Citations: 0