2015 IEEE International Conference on Data Mining Workshop (ICDMW): Latest Papers

Citation Prediction Using Diverse Features
Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.131
H. Bhat, Li-Hsuan Huang, Sebastian Rodriguez, Rick Dale, E. Heit
Using a large database of nearly 8 million bibliographic entries spanning over 3 million unique authors, we build predictive models to classify a paper based on its citation count. Our approach considers a diverse array of features, including the interdisciplinarity of authors, which we quantify using Shannon entropy and Jensen-Shannon divergence. Rather than rely on subject codes, we model the disciplinary preferences of each author by estimating the author's journal distribution. We conduct an exploratory data analysis of the relationship between these interdisciplinarity variables and citation counts. In addition, we model the effects of (1) each author's influence in coauthorship graphs, and (2) words in the title of the paper. We then build classifiers for two- and three-class classification problems that correspond to predicting the interval in which a paper's citation count will lie. We use cross-validation and a true test set to tune model parameters and assess model performance. The best model we build, a classification tree, yields test set accuracies of 0.87 and 0.66, respectively. Using this model, we also provide rankings of attribute importance; for the three-class problem, these rankings indicate the importance of our interdisciplinarity metrics in predicting citation counts.
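The two interdisciplinarity measures named in this abstract can be illustrated with a minimal sketch. It assumes an author's journal distribution is estimated as the empirical frequency of journals across that author's papers; the function names, author variables, and journal names are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def journal_distribution(journals):
    """Empirical distribution over journals for one author's papers (assumed estimator)."""
    counts = Counter(journals)
    total = sum(counts.values())
    return {journal: count / total for journal, count in counts.items()}

def shannon_entropy(p):
    """Shannon entropy in bits; 0 for an author who publishes in a single journal."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits; lies between 0 and 1."""
    keys = set(p) | set(q)
    mixture = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mixture(a):
        # Kullback-Leibler divergence from a to the mixture distribution.
        return sum(a[k] * math.log2(a[k] / mixture[k]) for k in a if a[k] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)

# Hypothetical authors and journals, for illustration only.
author_a = journal_distribution(["J. Physics", "J. Physics", "J. Biology", "J. Economics"])
author_b = journal_distribution(["J. Biology", "J. Biology"])

entropy_a = shannon_entropy(author_a)            # spread of one author across journals
distance_ab = js_divergence(author_a, author_b)  # dissimilarity of two coauthors
```

Under this reading, a higher entropy indicates an author who spreads work across more journals, and a higher divergence indicates coauthors with less overlapping journal preferences.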
Citations: 15
Application of Applied KOTO-FRAME to the Five-Story Pagoda Aseismatic Mechanism
Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.173
Masahiko Teramoto, Jun Nakamura
We have created and are proposing KOTO-FRAME, previously called the dynamic quality function deployment (DQFD) technique, which evolved from quality function deployment (QFD). This method was applied to aseismatic mechanisms and is recognized as tacit knowledge for creating a logical structure from the architecture of a five-story pagoda by experimenting with a model of the steric balancing toy principle. Consequently, without complex calculations, we were able to define the corresponding data structure in the attribution table of experiment or evaluation, which is worth applying not only to transact past data but also future data via experiment or evaluation utilizing the idea of a "market of data."
Citations: 0
Analyzing the Transferability of Collective Inference Models Across Networks
Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.192
Ransen Niu, Sebastián Moreno, Jennifer Neville
Collective inference models have recently been used to significantly improve the predictive accuracy of node classifications in network domains. However, these methods have generally assumed a fully labeled network is available for learning. There has been relatively little work on transfer learning methods for collective classification, i.e., to exploit labeled data in one network domain to learn a collective classification model to apply in another network. While there has been some work on transfer learning for link prediction and node classification, the proposed methods focus on developing algorithms to adapt the models without a deep understanding of how the network structure impacts transferability. Here we make the key observation that collective classification models are generally composed of local model templates that are rolled out across a heterogeneous network to construct a larger model for inference. Thus, the transferability of a model could depend on similarity of the local model templates and/or the global structure of the data networks. In this work, we study the performance of basic relational models when learned on one network and transferred to another network to apply collective inference. We show, using both synthetic and real data experiments, that transferability of models depends on both the graph structure and local model parameters. Moreover, we show that a probability calibration process (that removes bias due to propagation errors in collective inference) improves transferability.
Citations: 3
Detecting Multipliers of Jihadism on Twitter
Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.9
Lisa Kaati, Enghin Omer, Nico Prucha, A. Shrestha
Detecting terrorist-related content on social media is a problem for law enforcement agencies due to the large amount of information that is available. This work aims at detecting tweeps that are involved in media mujahideen, the supporters of jihadist groups who disseminate propaganda content online. To do this we use a machine learning approach that makes use of two sets of features: data dependent features and data independent features. The data dependent features are heavily influenced by the specific dataset, while the data independent features are independent of the dataset and can be used on other datasets with similar results. By using this approach we hope that our method can be used as a baseline to classify violent extremist content from different kinds of sources, since data dependent features from various domains can be added. In our experiments we have used the AdaBoost classifier. The results show that our approach works very well for classifying English tweeps and English tweets, but it does not perform as well on Arabic data.
Citations: 59
Knowledge-Based Circulation Growth Model: Applying a Data Marketplace to Concept Design
Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.81
Jun Nakamura, Masahiko Teramoto
The authors introduce a growth model of the circulation of a stock and flow process that makes use of stock as an important factor in designing a data marketplace. The model highlights the necessity of considering how learning efficiency must be designed for the purpose of concept design. The model is applied to a business case in commercial industry to discuss the essentials of stock and learning efficiency with the aim of designing a data marketplace.
Citations: 0
Sentiment Polarity Classification Using Structural Features
Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.57
D. Ansari
This work investigates the role of contrasting discourse relations signaled by cue phrases, together with phrase positional information, in predicting sentiment at the phrase level. Two domains of online reviews were chosen. The first domain is of nutritional supplement reviews, which are often poorly structured yet also allow certain simplifying assumptions to be made. The second domain is of hotel reviews, which have somewhat different characteristics. A corpus is built from these reviews, and manually tagged for polarity. We propose and evaluate a few new features that are realized through a lightweight method of discourse analysis, and use these features in a hybrid lexicon and machine learning based classifier. Our results show that these features may be used to obtain an improvement in classification accuracy compared to other traditional machine learning approaches.
Citations: 10
Identifying Behavioral Characteristics in EGM Gambling Data Using Session Clustering
Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.211
Maria Gabriella Mosquera, Vlado Kešelj
The rising accessibility and popularity of gambling products has increased interest in the effects of gambling. Nonetheless, research on gambling measures is scarce. This paper presents the application of data mining techniques to 46,514 gambling sessions, to distinguish types of gambling and identify potential instances of problem gambling on electronic gaming machines (EGMs). Gambling sessions included measures of gambling involvement, out-of-pocket expense, winnings and cost of gambling. In this first exploratory study, sessions were clustered into four clusters, as a stability test determined four clusters to be the highest-quality and most stable solution within our clustering criteria. Based on the expressed gambling behavior within these sessions, our k-means cluster analysis results indicated sessions were classified as potential non-problem gambling sessions, potential low risk gambling sessions, potential moderate risk gambling sessions, and potential problem gambling sessions. While the complexity of EGM data prevents researchers from recognizing the incidence of problem gambling in a specific individual, our methods suggest that the lack of player identification does not prevent one from identifying the incidence of problem gambling behavior.
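The k-means step mentioned in this abstract can be sketched as plain Lloyd iterations over per-session feature vectors. This is a minimal illustration only: the feature choice, the toy data, and the use of k = 2 are assumptions made to keep the example small (the study reports four clusters chosen by a stability test, which is not reproduced here).

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: alternate nearest-center assignment and center updates."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialise centers from the data
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each session goes to its nearest center.
        labels = [min(range(k), key=lambda i: squared_distance(p, centers[i]))
                  for p in points]
        # Update step: each center becomes the mean of its assigned sessions.
        new_centers = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[i])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers, labels

# Hypothetical sessions as (amount wagered, out-of-pocket expense), illustration only.
sessions = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
            (100.0, 100.0), (101.0, 100.0), (100.0, 101.0)]
centers, labels = kmeans(sessions, k=2)
```

In practice the session measures would normally be put on a comparable scale before clustering, since k-means is sensitive to feature magnitudes.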
Citations: 0
AFFM: Auto feature engineering in field-aware factorization machines for predictive analytics
Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.245
Lars Ropeid Selsaas, B. Agrawal, Chunming Rong, T. Wiktorski
User identification and prediction is a typical problem in cross-device connection. User identification is useful for recommendation engines, online advertising, and user experience. Extremely sparse and large-scale data make user identification a challenging problem. The key to better identification performance and accuracy is a model that has a short turnaround time and can handle extremely sparse, large-scale data. In this paper, we propose a novel, efficient machine learning approach to deal with this problem. We have adapted the Field-aware Factorization Machine approach using auto feature engineering techniques. Our model has the capacity to handle multiple features within the same field. The model provides an efficient way to handle the fields in the matrix. It counts the unique fields in the matrix and divides the matrix by that value, which provides an efficient and scalable technique in terms of time complexity. The accuracy of the model is 0.864845 when tested with Drawbridge datasets released in the context of the ICDM 2015 Cross-Device Connections Challenge.
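For readers unfamiliar with field-aware factorization machines, the standard pairwise interaction term that such models build on can be sketched as follows. This shows the generic FFM score only, not the AFFM variant or its field-count normalization, which the abstract does not specify in enough detail to reproduce; the field names, feature names, and weights are hypothetical.

```python
def ffm_score(features, weights):
    """Generic FFM interaction score.

    features: list of (field, feature, value) triples for one example.
    weights:  maps (feature, other_field) to that feature's latent vector
              used when interacting with features of other_field.
    """
    total = 0.0
    for a in range(len(features)):
        field_1, feature_1, value_1 = features[a]
        for b in range(a + 1, len(features)):
            field_2, feature_2, value_2 = features[b]
            # Each feature uses the latent vector specific to the other feature's field.
            w_1 = weights[(feature_1, field_2)]
            w_2 = weights[(feature_2, field_1)]
            dot = sum(x * y for x, y in zip(w_1, w_2))
            total += dot * value_1 * value_2
    return total

# Hypothetical one-hot features for a device/cookie pair, illustration only.
example = [("device", "d1", 1.0), ("cookie", "c1", 1.0)]
latent = {
    ("d1", "cookie"): [1.0, 2.0],
    ("c1", "device"): [0.5, 0.25],
}
score = ffm_score(example, latent)
```

The field-specific latent vectors are what distinguish this from a plain factorization machine, where each feature has a single latent vector shared across all interactions.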
Citations: 15
Extensible Query Framework for Unstructured Medical Data -- A Big Data Approach
Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.67
Sarmad Istephan, Mohammad-Reza Siadat
With the ever-increasing number of medical image scans, it is critical to have an extensible framework that allows for mining such unstructured data. Such a framework would give a medical researcher flexibility in validating and testing hypotheses. Important characteristics of this type of framework include accuracy, efficiency and extensibility. The objective of this work is to build an initial implementation of such a framework within a big data paradigm. To this end, a clinical data warehouse was built for the structured data and a set of modules was created to analyze the unstructured content. The framework contains built-in modules but is flexible in allowing users to import their own, making it extensible. Furthermore, the framework runs the modules in a Hadoop cluster, making it efficient by utilizing the distributed computing capability of a big data approach. To test the framework, simulated data of 1,000 patients along with their hippocampi images were created. The results show that the framework, using a built-in module, accurately returned all 15 patients who had hippocampal resection with the hippocampus ipsilateral to surgery being less than 20% of the size of the hippocampus contralateral to surgery. In addition, the framework allowed the user to run a different module using the previous output to further analyze the unstructured data. Finally, the framework also enabled the user to import a new module. This study paves the way towards showing the feasibility of such a framework to handle unstructured medical data in an accurate, efficient and extensible manner.
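The selection criterion described in this abstract (ipsilateral hippocampus smaller than 20% of the contralateral one) can be illustrated with a small filter. This is only a sketch of the criterion on already-extracted volumes; the record layout, field names, and values are hypothetical, and it does not represent the framework's actual modules or its Hadoop execution.

```python
def ipsilateral_ratio(record):
    """Ratio of the hippocampus volume on the surgery side to the opposite side."""
    if record["surgery_side"] == "left":
        ipsilateral, contralateral = record["left_mm3"], record["right_mm3"]
    else:
        ipsilateral, contralateral = record["right_mm3"], record["left_mm3"]
    return ipsilateral / contralateral

def select_resected(records, threshold=0.20):
    """Return patient ids whose ipsilateral volume is below threshold x contralateral."""
    return [r["patient_id"] for r in records if ipsilateral_ratio(r) < threshold]

# Hypothetical simulated patients, illustration only.
patients = [
    {"patient_id": "P001", "surgery_side": "left", "left_mm3": 400.0, "right_mm3": 3200.0},
    {"patient_id": "P002", "surgery_side": "right", "left_mm3": 3100.0, "right_mm3": 2900.0},
    {"patient_id": "P003", "surgery_side": "right", "left_mm3": 3000.0, "right_mm3": 450.0},
]
matches = select_resected(patients)
```

The output of such a filter could then be passed to a further analysis step, mirroring the chaining of modules the abstract describes.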
Citations: 9
Machine Learning Approach to Identify Users Across Their Digital Devices
Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.243
Thakur Raj Anand, Oleksii Renov
This paper discusses methods to identify individual users across their digital devices as part of the ICDM 2015 competition hosted on Kaggle. The competition's data set and prize pool were provided by http://www.drawbrid.ge/ in sponsorship with the ICDM 2015 conference. The methods described in this paper focus on feature engineering and generic machine learning algorithms such as Extreme Gradient Boosting (xgboost) and Follow-the-Regularized-Leader Proximal (FTRL-Proximal). The machine learning algorithms discussed in this paper can help improve marketers' ability to identify individual users as they switch between devices and to show relevant content and recommendations to users wherever they go.
Citations: 9
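The abstract above names Follow-the-Regularized-Leader Proximal as one of the generic algorithms used, without giving details. As a rough illustration only, the following is a minimal sketch of the commonly described per-coordinate FTRL-Proximal update for logistic regression on sparse features. The class name `FTRLProximal`, the hyperparameter defaults, and the feature keys are assumptions made for this example; they are not taken from the paper, and the authors' actual features and settings are not stated in the abstract.

```python
import math


class FTRLProximal:
    """Per-coordinate FTRL-Proximal for binary logistic regression.

    Illustrative sketch: hyperparameter defaults are assumptions,
    not values reported by the paper.
    """

    def __init__(self, alpha=0.1, beta=1.0, l1=0.0, l2=0.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = {}  # per-feature accumulated adjusted gradients
        self.n = {}  # per-feature accumulated squared gradients

    def _weight(self, i):
        # Weights are derived lazily from z and n; L1 zeroes out small |z|.
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0
        n = self.n.get(i, 0.0)
        sign = 1.0 if z > 0 else -1.0
        denom = (self.beta + math.sqrt(n)) / self.alpha + self.l2
        return -(z - sign * self.l1) / denom

    def predict(self, x):
        # x is a sparse example: a dict mapping feature key -> value.
        s = sum(self._weight(i) * v for i, v in x.items())
        s = max(min(s, 35.0), -35.0)  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-s))

    def update(self, x, y):
        # One online step on example x with label y in {0, 1}.
        p = self.predict(x)
        for i, v in x.items():
            w = self._weight(i)
            g = (p - y) * v  # gradient of log loss for this coordinate
            n = self.n.get(i, 0.0)
            sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / self.alpha
            self.z[i] = self.z.get(i, 0.0) + g - sigma * w
            self.n[i] = n + g * g
        return p


# Toy usage with hypothetical feature keys (not the competition's data).
model = FTRLProximal()
for _ in range(20):
    model.update({"same_ip": 1.0}, 1)
    model.update({"different_country": 1.0}, 0)
```

Storing only `z` and `n` and deriving each weight on demand is what lets the L1 term produce exactly zero weights, which keeps the model sparse when the feature space is very large.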