Pub Date : 2020-10-01 DOI: 10.4018/ijdwm.2020100101
M. Azabou, Ameen Banjar, J. Feki
The data warehouse community has paid particular attention to the document warehouse (DocW) paradigm during the last two decades. However, some important issues related to semantics are still pending and therefore need in-depth research. Indeed, the semantic exploitation of the DocW is not yet mature, even though it is a main concern for decision-makers. This paper aims to enhance the multidimensional model called the Diamond Document Warehouse Model with semantic aspects; in particular, it suggests semantic OLAP (on-line analytical processing) operators for querying the DocW.
{"title":"Enhancing the Diamond Document Warehouse Model","authors":"M. Azabou, Ameen Banjar, J. Feki","doi":"10.4018/ijdwm.2020100101","DOIUrl":"https://doi.org/10.4018/ijdwm.2020100101","url":null,"abstract":"The data warehouse community has paid particular attention to the document warehouse (DocW) paradigm during the last two decades. However, some important issues related to the semantics are still pending and therefore need a deep research investigation. Indeed, the semantic exploitation of the DocW is not yet mature despite it representing a main concern for decision-makers. This paper aims to enhancing the multidimensional model called Diamond Document Warehouse Model with semantics aspects; in particular, it suggests semantic OLAP (on-line analytical processing) operators for querying the DocW.","PeriodicalId":54963,"journal":{"name":"International Journal of Data Warehousing and Mining","volume":null,"pages":null},"PeriodicalIF":1.2,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85634852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2020-10-01 DOI: 10.4018/ijdwm.2020100105
Xiaodi Huang, Minglun Ren, Zhongfeng Hu
The K-medoids algorithm first selects data points at random as initial centers to form initial clusters. Then, following the PAM (partitioning around medoids) algorithm, each center is sequentially replaced by every remaining data point to find the result with the best inherent convergence. Because PAM is an iterative, exhaustive (ergodic) strategy, its expensive computational overhead hinders its feasibility when the data size or the number of clusters is huge. The authors use fixed-point iteration to search for the optimal clustering centers and build an FPK-medoids (fixed point-based K-medoids) algorithm. By constructing a fixed-point equation for each cluster, the problem of searching for optimal centers is converted into solving a set of equations in parallel. The experiments are carried out on six standard datasets, and the results show that the clustering efficiency of the proposed algorithm is significantly improved compared with the conventional algorithm. In addition, the clustering quality is markedly enhanced when handling problems with large-scale datasets or a large number of clusters.
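For orientation, the conventional PAM swap search that the abstract uses as its baseline can be sketched as follows. This is a minimal illustration only, not the authors' FPK-medoids method; the function names `total_cost` and `pam`, the `seed` parameter, and the toy data in the usage note are assumptions introduced here.

```python
import random


def total_cost(points, medoids, dist):
    """Sum of distances from every point to its nearest medoid (medoids are indices)."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)


def pam(points, k, dist, seed=0):
    """Plain PAM-style swap search (hypothetical helper, for illustration).

    Starts from k randomly chosen medoids, then repeatedly tries to replace
    each medoid with each non-medoid point, keeping any swap that strictly
    lowers the total cost. Stops when a full pass finds no improving swap.
    """
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)
    best = total_cost(points, medoids, dist)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for cand in range(len(points)):
                if cand in medoids:
                    continue
                trial = medoids[:i] + [cand] + medoids[i + 1:]
                cost = total_cost(points, trial, dist)
                if cost < best:
                    medoids, best, improved = trial, cost, True
    return sorted(medoids), best
```

On a toy one-dimensional dataset such as `pam([0, 1, 2, 10, 11, 12], 2, lambda a, b: abs(a - b))`, the search settles on the middle point of each group. Every candidate swap recomputes the full cost, which is the overhead the abstract says becomes prohibitive as the data size or the number of clusters grows.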
{"title":"An Improvement of K-Medoids Clustering Algorithm Based on Fixed Point Iteration","authors":"Xiaodi Huang, Minglun Ren, Zhongfeng Hu","doi":"10.4018/ijdwm.2020100105","DOIUrl":"https://doi.org/10.4018/ijdwm.2020100105","url":null,"abstract":"The process of K-medoids algorithm is that it first selects data randomly as initial centers to form initial clusters. Then, based on PAM (partitioning around medoids) algorithm, centers will be sequential replaced by all the remaining data to find a result has the best inherent convergence. Since PAM algorithm is an iterative ergodic strategy, when the data size or the number of clusters are huge, its expensive computational overhead will hinder its feasibility. The authors use the fixed-point iteration to search the optimal clustering centers and build a FPK-medoids (fixed point-based K-medoids) algorithm. By constructing fixed point equations for each cluster, the problem of searching optimal centers is converted into the solving of equation set in parallel. The experiment is carried on six standard datasets, and the result shows that the clustering efficiency of proposed algorithm is significantly improved compared with the conventional algorithm. In addition, the clustering quality will be markedly enhanced in handling problems with large-scale datasets or a large number of clusters.","PeriodicalId":54963,"journal":{"name":"International Journal of Data Warehousing and Mining","volume":null,"pages":null},"PeriodicalIF":1.2,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85850552","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2020-10-01 DOI: 10.4018/ijdwm.2020100106
Wallace A. Pinheiro, G. Xexéo, J. Souza, A. B. Pinheiro
This work proposes a methodology, applied to repositories modeled using star schemas such as data marts, to discover relevant relations among time series. The paper applies a set of measures related to association, correlation, and causality to create connections among data. In this context, the research proposes a new causality function based on peaks and values that coherently relates time series. To evaluate the approach, the authors run a set of experiments exploring time series about a neglected disease called American Tegumentary Leishmaniasis, which affects several Brazilian cities, and time series about the climate of some cities in Brazil. The authors populate data marts with these data, and the proposed methodology generates a set of relations linking the notifications of this disease to variations in temperature and pluviometry.
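To make the correlation step concrete, one simple way to relate two time series is to look for the lag at which their Pearson correlation is strongest. The sketch below is a generic illustration under that assumption; it is not the peak-and-value causality function proposed in the paper, and the names `pearson`, `best_lag`, and `max_lag` are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def best_lag(driver, response, max_lag):
    """Return (lag, r) where response[t + lag] correlates most strongly with driver[t].

    Both series must have the same length and max_lag must leave at least
    two overlapping samples.
    """
    scores = {}
    for lag in range(max_lag + 1):
        x = driver[:len(driver) - lag]   # drop the last `lag` driver samples
        y = response[lag:]               # drop the first `lag` response samples
        scores[lag] = pearson(x, y)
    lag = max(scores, key=lambda k: abs(scores[k]))
    return lag, scores[lag]
```

A strong lagged correlation only suggests a candidate relation, for example between a climate series and a later notification series; it does not by itself establish causality.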
{"title":"Data Discovery Over Time Series From Star Schemas Based on Association, Correlation, and Causality","authors":"Wallace A. Pinheiro, G. Xexéo, J. Souza, A. B. Pinheiro","doi":"10.4018/ijdwm.2020100106","DOIUrl":"https://doi.org/10.4018/ijdwm.2020100106","url":null,"abstract":"This work proposes a methodology applied to repositories modeled using star schemas, such as data marts, to discover relevant time series relations. This paper applies a set of measures related to association, correlation, and causality to create connections among data. In this context, the research proposes a new causality function based on peaks and values that relate coherently time series. To evaluate the approach, the authors use a set of experiments exploring time series about a particular neglected disease that affects several Brazilian cities called American Tegumentary Leishmaniasis and time series about the climate of some cities in Brazil. The authors populate data marts with these data, and the proposed methodology has generated a set of relations linking the notifications of this disease to the variation of temperature and pluviometry.","PeriodicalId":54963,"journal":{"name":"International Journal of Data Warehousing and Mining","volume":null,"pages":null},"PeriodicalIF":1.2,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83654658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2020-07-01 DOI: 10.4018/ijdwm.2020070103
Cheng-Hsiung Weng, Cheng-Kui Huang
Formulating different marketing strategies to apply to various market segments is a noteworthy undertaking for marketing managers. Accordingly, marketing managers should identify sales patterns among different market segments. The study initially applies the concept of recency–frequency–monetary (RFM) scores to segment transaction datasets into several sub-datasets (market segments) and discovers RFM itemsets from these market segments. In addition, three sales features (unique, common, and particular sales patterns) are defined to identify various sales patterns in this study. In particular, a new criterion (contrast support) is also proposed to discover notable sales patterns among different market segments. This study develops an algorithm, called sales pattern mining (SPMING), for discovering RFM itemsets from several RFM-based market segments and then identifying unique, common, and particular sales patterns. The experimental results from two real datasets show that the SPMING algorithm can discover specific sales patterns in various market segments.
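As a rough illustration of the segmentation step, the sketch below derives recency, frequency, and monetary values per customer from a transaction list and assigns a coarse segment label. It is an assumption-laden toy, not the SPMING algorithm: the tuple layout `(customer, day, amount)`, the function names, and the threshold parameters `r_max`, `f_min`, and `m_min` are all hypothetical.

```python
from collections import defaultdict


def rfm_table(transactions, today):
    """Map each customer to (recency, frequency, monetary).

    transactions: iterable of (customer, day, amount), with day as an integer.
    recency is the number of days between `today` and the last purchase.
    """
    last = {}
    freq = defaultdict(int)
    money = defaultdict(float)
    for cust, day, amount in transactions:
        last[cust] = max(last.get(cust, day), day)
        freq[cust] += 1
        money[cust] += amount
    return {c: (today - last[c], freq[c], money[c]) for c in last}


def segment(rfm, r_max, f_min, m_min):
    """Label each customer with a 3-letter code; uppercase means 'high' on that axis.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    labels = {}
    for cust, (r, f, m) in rfm.items():
        labels[cust] = (
            ("R" if r <= r_max else "r")
            + ("F" if f >= f_min else "f")
            + ("M" if m >= m_min else "m")
        )
    return labels
```

Transactions belonging to customers with the same label can then be grouped into one sub-dataset, and itemsets mined separately within each group.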
{"title":"Discovering Specific Sales Patterns Among Different Market Segments","authors":"Cheng-Hsiung Weng, Cheng-Kui Huang","doi":"10.4018/ijdwm.2020070103","DOIUrl":"https://doi.org/10.4018/ijdwm.2020070103","url":null,"abstract":"Formulating different marketing strategies to apply to various market segments is a noteworthy undertaking for marketing managers. Accordingly, marketing managers should identify sales patterns among different market segments. The study initially applies the concept of recency–frequency–monetary (RFM) scores to segment transaction datasets into several sub-datasets (market segments) and discovers RFM itemsets from these market segments. In addition, three sales features (unique, common, and particular sales patterns) are defined to identify various sales patterns in this study. In particular, a new criterion (contrast support) is also proposed to discover notable sales patterns among different market segments. This study develops an algorithm, called sales pattern mining (SPMING), for discovering RFM itemsets from several RFM-based market segments and then identifying unique, common, and particular sales patterns. The experimental results from two real datasets show that the SPMING algorithm can discover specific sales patterns in various market segments.","PeriodicalId":54963,"journal":{"name":"International Journal of Data Warehousing and Mining","volume":null,"pages":null},"PeriodicalIF":1.2,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81898263","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2020-07-01 DOI: 10.4018/ijdwm.2020070102
Zheng Wang, Qiao Wang, Tanjie Zhu, Xiaojun Ye
Network embedding is a fundamental problem in network research. Semi-supervised network embedding, which benefits from labeled data, has recently attracted considerable interest. However, existing semi-supervised methods yield biased results in the completely-imbalanced label setting, where the labeled data cannot cover all classes. This article proposes a novel network embedding method that can benefit from completely-imbalanced labels by approximately guaranteeing both intra-class similarity and inter-class dissimilarity. In addition, the authors prove and adopt the matrix factorization form of LINE (a well-known network embedding method) as the network-structure-preserving model. Extensive experiments demonstrate the superiority and robustness of this method.
{"title":"Extending LINE for Network Embedding With Completely Imbalanced Labels","authors":"Zheng Wang, Qiao Wang, Tanjie Zhu, Xiaojun Ye","doi":"10.4018/ijdwm.2020070102","DOIUrl":"https://doi.org/10.4018/ijdwm.2020070102","url":null,"abstract":"Network embedding is a fundamental problem in network research. Semi-supervised network embedding, which benefits from labeled data, has recently attracted considerable interest. However, existing semi-supervised methods would get biased results in the completely-imbalanced label setting where labeled data cannot cover all classes. This article proposes a novel network embedding method which could benefit from completely-imbalanced labels by approximately guaranteeing both intra-class similarity and inter-class dissimilarity. In addition, the authors prove and adopt the matrix factorization form of LINE (a famous network embedding method) as the network structure preserving model. Extensive experiments demonstrate the superiority and robustness of this method.","PeriodicalId":54963,"journal":{"name":"International Journal of Data Warehousing and Mining","volume":null,"pages":null},"PeriodicalIF":1.2,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83912304","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Novel Method for Classifying Function of Spatial Regions Based on Two Sets of Characteristics Indicated by Trajectories","authors":"Haitao Zhang, Che Yu, Yan Jin","doi":"10.4018/ijdwm.2020070101","DOIUrl":"https://doi.org/10.4018/ijdwm.2020070101","url":null,"abstract":"Trajectory is a significant factor for classifying functions of spatial regions. Many spatial classification methods use trajectories to detect buildings and districts in urban settings. However, methods that only take into consideration the local spatiotemporal characteristics indicated by trajectories may generate inaccurate results. In this article, a novel method for classifying function of spatial regions based on two sets of characteristics indicated by trajectories is proposed, in which the local spatiotemporal characteristics as well as the global connection characteristics are obtained through two sets of calculations. The method was evaluated in two experiments: one that measured changes in the classification metric through a splits ratio factor, and one that compared the classification performance between the proposed method and methods based on a single set of characteristics. The results showed that the proposed method is more accurate than the two traditional methods, with a precision value of 0.93, a recall value of 0.77, and an F-Measure value of 0.84. Keywords: Function of Spatial Regions, Global Connection Characteristics, Local Spatiotemporal Characteristics, Spatial Classification, Trajectory","PeriodicalId":54963,"journal":{"name":"International Journal of Data Warehousing and Mining","volume":null,"pages":null},"PeriodicalIF":1.2,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77056513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
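The precision, recall, and F-Measure values reported in the abstract above can be cross-checked with the standard F-measure formula. The sketch assumes the reported F-Measure is the balanced F1 score (beta = 1), which the abstract does not state explicitly; the function name `f_measure` is introduced here for illustration.

```python
def f_measure(precision, recall, beta=1.0):
    """Standard F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + b2) * precision * recall / denom


# Values taken from the abstract: precision 0.93, recall 0.77.
print(round(f_measure(0.93, 0.77), 2))
```

Under the F1 assumption the computed value rounds to 0.84, which agrees with the figure reported in the abstract.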
Pub Date : 2020-07-01 DOI: 10.4018/ijdwm.2020070104
D. Devi, S. Namasudra, Seifedine Kadry
Class imbalance is a well-investigated topic that addresses the performance degradation of standard learning models caused by an uneven distribution of classes in a data space. Cluster-based undersampling is a popular solution in this domain; it eliminates majority class instances from a definite number of clusters to balance the training data. However, distance-based elimination of instances is often affected by the underlying data distribution. Recently, ensemble learning techniques have emerged as an effective solution owing to their weighted learning of rare instances. In this article, a boosting-aided adaptive cluster-based undersampling technique is proposed to facilitate the elimination of learning-insignificant majority class instances from the clusters, detected through the AdaBoost ensemble learning model. The proposed work is validated against seven existing cluster-based undersampling techniques on six binary datasets with three classification models. The experimental results establish the effectiveness of the proposed technique over the existing methods.
{"title":"A Boosting-Aided Adaptive Cluster-Based Undersampling Approach for Treatment of Class Imbalance Problem","authors":"D. Devi, S. Namasudra, Seifedine Kadry","doi":"10.4018/ijdwm.2020070104","DOIUrl":"https://doi.org/10.4018/ijdwm.2020070104","url":null,"abstract":"The subject of a class imbalance is a well-investigated topic which addresses performance degradation of standard learning models due to uneven distribution of classes in a dataspace. Cluster-based undersampling is a popular solution in the domain which offers to eliminate majority class instances from a definite number of clusters to balance the training data. However, distance-based elimination of instances often got affected by the underlying data distribution. Recently, ensemble learning techniques have emerged as effective solution due to its weighted learning principle of rare instances. In this article, a boosting aided adaptive cluster-based undersampling technique is proposed to facilitate elimination of learning- insignificant majority class instances from the clusters, detected through AdaBoost ensemble learning model. The proposed work is validated with seven existing cluster based undersampling techniques for six binary datasets and three classification models. The experimental results have established the effectives of the proposed technique than the existing methods.","PeriodicalId":54963,"journal":{"name":"International Journal of Data Warehousing and Mining","volume":null,"pages":null},"PeriodicalIF":1.2,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81780316","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2020-07-01 DOI: 10.4018/ijdwm.2020070110
Latha Banda, Karan Singh, Le Hoang Son, Mohamed Abdel-Basset, Pham Huy Thong, H. Huynh, D. Taniar
Collaborative tagging is a useful and effective way of classifying items with respect to search and information sharing, so that users can be tagged via online social networking. This article proposes a novel recommender system for collaborative tagging in which the genre interestingness measure and gradual decay are utilized with diffusion similarity. The comparison has been done on benchmark recommender system datasets, namely the MovieLens and Amazon datasets, against existing approaches such as collaborative filtering based on tagging using the E-FCM and E-GK clustering algorithms, hybrid recommender systems based on tagging using GA, and collaborative tagging using incremental clustering with trust. The experimental results show that the proposed approach achieves a maximum prediction accuracy ratio of 9.25%, averaged over various data splits of 100 users, which is higher than the 5.76% obtained by the existing approaches.
{"title":"Recommender Systems Using Collaborative Tagging","authors":"Latha Banda, Karan Singh, Le Hoang Son, Mohamed Abdel-Basset, Pham Huy Thong, H. Huynh, D. Taniar","doi":"10.4018/ijdwm.2020070110","DOIUrl":"https://doi.org/10.4018/ijdwm.2020070110","url":null,"abstract":"Collaborative tagging is a useful and effective way for classifying items with respect to search, sharing information so that users can be tagged via online social networking. This article proposes a novel recommender system for collaborative tagging in which the genre interestingness measure and gradual decay are utilized with diffusion similarity. The comparison has been done on the benchmark recommender system datasets namely MovieLens, Amazon datasets against the existing approaches such as collaborative filtering based on tagging using E-FCM, and E-GK clustering algorithms, hybrid recommender systems based on tagging using GA and collaborative tagging using incremental clustering with trust. The experimental results ensure that the proposed approach achieves maximum prediction accuracy ratio of 9.25% for average of various splits data of 100 users, which is higher than the existing approaches obtained only prediction accuracy of 5.76%.","PeriodicalId":54963,"journal":{"name":"International Journal of Data Warehousing and Mining","volume":null,"pages":null},"PeriodicalIF":1.2,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87852321","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Serialized Co-Training-Based Recognition of Medicine Names for Patent Mining and Retrieval","authors":"Na Deng, Caiquan Xiong","doi":"10.4018/ijdwm.2020070105","DOIUrl":"https://doi.org/10.4018/ijdwm.2020070105","url":null,"abstract":"In the retrieval and mining of traditional Chinese medicine (TCM) patents, a key step is Chinese word segmentation and named entity recognition. However, the alias phenomenon of traditional Chinese medicines causes great challenges to Chinese word segmentation and named entity recognition in TCM patents, which directly affects the effect of patent mining. Because of the lack of a comprehensive Chinese herbal medicine name thesaurus, traditional thesaurus-based Chinese word segmentation and named entity recognition are not suitable for medicine identification in TCM patents. In view of the present situation, using the language characteristics and structural characteristics of TCM patent texts, a modified and serialized co-training method to recognize medicine names from TCM patent abstract texts is proposed. Experiments show that this method can maintain high accuracy under relatively low time complexity. In addition, this method can also be expanded to the recognition of other named entities in TCM patents, such as disease names, preparation methods, and so on. Keywords: Annotation, Co-Training, Machine Learning, Medicine Name, Patent Mining, Patent Retrieval, Traditional Chinese Medicine","PeriodicalId":54963,"journal":{"name":"International Journal of Data Warehousing and Mining","volume":null,"pages":null},"PeriodicalIF":1.2,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73526987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}