Pub Date: 2023-12-09 | DOI: 10.1016/j.is.2023.102334
Dongsoo Jang , Qinglong Li , Chaeyoung Lee , Jaekyeong Kim
In e-commerce platforms, auxiliary information containing several attributes (e.g., price, quality, and brand) can improve recommendation performance. However, previous studies either used a simple combined embedding approach that did not consider the importance of each attribute embedded in the auxiliary information, or used only some of the attributes, even though user purchasing behavior can vary significantly depending on these attributes. Thus, we propose multi attribute-based matrix factorization (MAMF), which considers the importance of each attribute embedded in various types of auxiliary information. MAMF obtains more representative and specific attention features of users and items using a self-attention mechanism. By acquiring these attentive representations, MAMF precisely learns high-level interactions between users and items. To evaluate the performance of the proposed MAMF, we conducted extensive experiments on three real-world datasets from Amazon.com. The experimental results show that MAMF achieves excellent recommendation performance compared with various baseline models.
Title: "Attention-based multi attribute matrix factorization for enhanced recommendation performance" (Information Systems, vol. 121, Article 102334)
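The self-attention step described above can be sketched minimally. This is generic scaled dot-product attention over a few attribute embeddings, with random untrained projections; the names, shapes, and seed are illustrative assumptions, not MAMF's actual architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attribute_self_attention(E, d_k=None):
    """Self-attention over a set of attribute embeddings.

    E: (n_attributes, d) matrix, one row per attribute (e.g. price, brand).
    Returns the attended attribute representations and the weight matrix,
    whose rows express the relative importance of each attribute.
    """
    d = E.shape[1]
    d_k = d_k or d
    rng = np.random.default_rng(0)
    # Hypothetical projection matrices; in a trained model these are learned.
    Wq, Wk, Wv = (rng.standard_normal((d, d_k)) / np.sqrt(d) for _ in range(3))
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    A = softmax(Q @ K.T / np.sqrt(d_k))  # attention weights per attribute pair
    return A @ V, A

# Three toy attribute embeddings (say price, quality, brand), dimension 8.
E = np.random.default_rng(1).standard_normal((3, 8))
attended, weights = attribute_self_attention(E)
```

Each row of `weights` is a probability distribution over the attributes, which is what lets such a model weight attributes differently instead of summing their embeddings uniformly.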
Pub Date: 2023-12-09 | DOI: 10.1016/j.is.2023.102333
Zebang Liu , Luo Chen , Mengyu Ma , Anran Yang , Zhinong Zhong , Ning Jing
The visual exploration of geospatial vector data has become an increasingly important part of the management and analysis of geospatial vector big data (GVBD). With the rapid growth of data scale, current visualization technologies struggle to support efficient visual exploration of GVBD, even when parallel distributed computing is adopted. To fill this gap, this paper proposes a visual exploration approach for GVBD on the web map. In this approach, we propose a display-driven computing model and combine it with the traditional data-driven computing method to design an adaptive real-time visualization algorithm. We also design a pixel-quad-R tree spatial index structure. By constructing the index offline to support online computation for visualization, we realize multilevel real-time interactive visual exploration of GVBD on a single machine, and all visualization results can be computed in real time without occupying an external cache. The experimental results show that the approach outperforms current mainstream visualization methods and obtains visualization results at any zoom level within 0.5 s, so it can be well applied to multilevel real-time interactive visual exploration of billion-scale GVBD.
Title: "An efficient visual exploration approach of geospatial vector big data on the web map" (Information Systems, vol. 121, Article 102333)
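As a rough illustration of display-driven computation (not the paper's pixel-quad-R tree or adaptive algorithm), one can aggregate vector features per visible pixel of a web-map tile, so the cost of producing a view is bounded by the pixel grid rather than by drawing every feature. All names and the 256-pixel tile size are assumptions for illustration:

```python
from collections import defaultdict

TILE = 256  # pixels per web-map tile edge (common slippy-map convention)

def rasterize_points(points, bbox, tile=TILE):
    """Display-driven sketch: aggregate point features per visible pixel.

    points: iterable of (x, y) in map units; bbox: (minx, miny, maxx, maxy)
    of the current viewport. Returns {(px, py): count} for non-empty pixels;
    the output size is bounded by tile*tile regardless of the data volume.
    """
    minx, miny, maxx, maxy = bbox
    sx = tile / (maxx - minx)
    sy = tile / (maxy - miny)
    counts = defaultdict(int)
    for x, y in points:
        if minx <= x < maxx and miny <= y < maxy:
            px = int((x - minx) * sx)
            py = int((y - miny) * sy)
            counts[(px, py)] += 1
    return counts

# Two points fall inside the unit-square viewport, one outside.
pixels = rasterize_points([(0.5, 0.5), (0.51, 0.5), (9.0, 9.0)], (0, 0, 1, 1))
```

The point of such a model is that the result per zoom level is a fixed-size pixel aggregate, which is what makes real-time rendering of billion-scale data plausible when combined with a spatial index.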
Pub Date: 2023-12-07 | DOI: 10.1016/j.is.2023.102330
Jari Peeperkorn , Seppe vanden Broucke , Jochen De Weerdt
Previous studies investigating the efficacy of long short-term memory (LSTM) recurrent neural networks in predictive process monitoring and their ability to capture the underlying process structure have raised concerns about their limited ability to generalize to unseen behavior. Event logs often fail to capture the full spectrum of behavior permitted by the underlying processes. To overcome these challenges, this study introduces innovative validation set sampling strategies based on control-flow variant-based resampling. These strategies have undergone extensive evaluation to assess their impact on hyperparameter selection and early stopping, resulting in notable enhancements to the generalization capabilities of trained LSTM models. In addition, this study expands the experimental framework to enable accurate interpretation of underlying process models and provide valuable insights. By conducting experiments with event logs representing process models of varying complexities, this research elucidates the effectiveness of the proposed validation strategies. Furthermore, the extended framework facilitates investigations into the influence of event log completeness on the learning quality of predictive process models. The novel validation set sampling strategies proposed in this study facilitate the development of more effective and reliable predictive process models, ultimately bolstering generalization capabilities and improving the understanding of underlying process dynamics.
Title: "Validation set sampling strategies for predictive process monitoring" (Information Systems, vol. 121, Article 102330)
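The core idea of control-flow variant-based resampling, splitting an event log by variant so the validation set exhibits behavior absent from training, can be sketched as follows. This is a simplified illustration under assumed names, not the authors' exact strategies:

```python
from collections import defaultdict
import random

def variant_split(log, val_fraction=0.2, seed=42):
    """Split an event log into train/validation sets by control-flow variant.

    log: list of traces, each a sequence of activity labels. Whole variants
    (distinct activity sequences) go entirely to one side, so validation
    traces follow variants the model never saw during training -- the
    premise behind variant-based validation sampling.
    """
    variants = defaultdict(list)
    for trace in log:
        variants[tuple(trace)].append(trace)
    keys = sorted(variants)
    random.Random(seed).shuffle(keys)
    n_val = max(1, int(len(keys) * val_fraction))
    val_keys = set(keys[:n_val])
    train = [t for k in keys[n_val:] for t in variants[k]]
    val = [t for k in val_keys for t in variants[k]]
    return train, val

# Toy log: three variants with different frequencies.
log = [("a", "b", "c")] * 5 + [("a", "c", "b")] * 3 + [("a", "b", "b", "c")] * 2
train, val = variant_split(log)
```

Using such a split for early stopping and hyperparameter selection penalizes models that merely memorize seen variants, which is the generalization concern the abstract raises.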
Pub Date: 2023-12-04 | DOI: 10.1016/j.is.2023.102323
Witold Andrzejewski , Bartosz Bębel , Paweł Boiński , Robert Wrembel
Data stored in information systems are often erroneous, and duplicate data are one of the typical error types. To discover and handle duplicates, so-called deduplication methods are applied; these are complex and time-costly algorithms. In data deduplication, pairs of records are compared and their similarities are computed. For a given deduplication problem, the challenging tasks are: (1) deciding which similarity measures are the most adequate for the attributes being compared, (2) defining the importance of the attributes being compared, and (3) defining adequate similarity thresholds between similar and dissimilar pairs of records. In this paper, we summarize our experience gained from a real R&D project run for a large financial institution. In particular, we answer the following three research questions: (1) what are adequate similarity measures for comparing attributes of text data types, (2) what are adequate weights of attributes in the procedure of comparing pairs of records, and (3) what are the similarity thresholds between the classes: duplicates, probable duplicates, and non-duplicates? The answers are based on an experimental evaluation of 54 similarity measures for text values. The measures were compared on five real data sets with different data characteristics and were assessed based on: (1) the similarity values they produced for the values being compared and (2) their execution time. Furthermore, we present our method, based on mathematical programming, for computing the weights of attributes and the similarity thresholds for the records being compared. The experimental evaluation of the method and its assessment by experts from the financial institution proved that it is adequate for the deduplication problem at hand.
The whole data deduplication pipeline that we have developed has been deployed in the financial institution and runs in their production system, processing batches of over 20 million customer records.
Title: "On tuning parameters guiding similarity computations in a data deduplication pipeline for customers records" (Information Systems, vol. 121, Article 102323)
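To make the three tuning questions concrete, here is a toy weighted record comparison with two thresholds yielding the three classes. The measures shown (difflib's ratio, token Jaccard), the weights, and the thresholds are illustrative placeholders; the paper evaluates 54 measures and derives weights and thresholds via mathematical programming:

```python
from difflib import SequenceMatcher

def jaccard(a, b):
    """Token-set Jaccard similarity -- one of many candidate text measures."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def ratio(a, b):
    """Character-level similarity via difflib's SequenceMatcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def classify_pair(rec_a, rec_b, weights, t_dup=0.85, t_maybe=0.6):
    """Weighted record similarity with two thresholds, three classes.

    weights: {field: weight} summing to 1. Thresholds split the score into
    duplicate / probable duplicate / non-duplicate.
    """
    score = sum(w * ratio(rec_a[f], rec_b[f]) for f, w in weights.items())
    if score >= t_dup:
        return "duplicate", score
    if score >= t_maybe:
        return "probable duplicate", score
    return "non-duplicate", score

a = {"name": "John A. Smith", "city": "Boston"}
b = {"name": "Jon Smith", "city": "Boston"}
label, score = classify_pair(a, b, {"name": 0.7, "city": 0.3})
```

Even this toy version shows why the tuning matters: shifting a threshold or a field weight by a little moves borderline pairs between classes, which is exactly what the paper optimizes systematically.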
Pub Date: 2023-12-01 | DOI: 10.1016/j.is.2023.102339
Tijs Slaats, S. Debois, Christoffer Olling Back, Axel Kjeld Fjelrad Christfort
Title: "Foundations and practice of binary process discovery" (Information Systems, vol. 188)
Pub Date: 2023-11-30 | DOI: 10.1016/j.is.2023.102321
Yufei Hu , Liangxiao Jiang , Wenjun Zhang
Crowdsourcing offers a cost-effective way to obtain multiple noisy labels for each instance by employing multiple crowd workers; label integration is then used to infer its integrated label. Despite the effectiveness of label integration algorithms, a certain degree of noise always remains in the integrated labels, so noise correction algorithms have been proposed to reduce its impact. However, almost all existing noise correction algorithms focus only on individual workers and ignore the correlations among workers. In this paper, we argue that similar workers have similar annotating skills and tend to be consistent when annotating the same or similar instances. Based on this premise, we propose a novel noise correction algorithm called worker similarity-based noise correction (WSNC). First, WSNC exploits the annotating information of similar workers on similar instances to estimate the quality of each label annotated by each worker on each instance. Then, WSNC re-infers the integrated label of each instance based on the qualities of its multiple noisy labels. Finally, WSNC considers any instance whose re-inferred integrated label differs from its original integrated label to be a noise instance and corrects it. Extensive experiments on a large number of simulated datasets and three real-world crowdsourced datasets verify the effectiveness of WSNC.
Title: "Worker similarity-based noise correction for crowdsourcing" (Information Systems, vol. 121, Article 102321)
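A minimal stand-in for the worker-correlation idea: estimate pairwise worker similarity as the agreement rate on co-annotated instances. WSNC's actual quality estimation is more elaborate; the data and names below are hypothetical:

```python
from collections import Counter
from itertools import combinations

def worker_agreement(annotations):
    """Pairwise agreement rate between workers over co-annotated instances.

    annotations: {instance: {worker: label}}. Returns {(w1, w2): rate},
    where rate is the fraction of shared instances labeled identically.
    Such a matrix can then weight workers' votes when re-inferring labels.
    """
    agree = Counter()
    total = Counter()
    for labels in annotations.values():
        for w1, w2 in combinations(sorted(labels), 2):
            total[(w1, w2)] += 1
            agree[(w1, w2)] += labels[w1] == labels[w2]
    return {pair: agree[pair] / total[pair] for pair in total}

ann = {
    "i1": {"w1": "pos", "w2": "pos", "w3": "neg"},
    "i2": {"w1": "neg", "w2": "neg", "w3": "neg"},
    "i3": {"w1": "pos", "w2": "pos", "w3": "pos"},
}
sim = worker_agreement(ann)
```

Here w1 and w2 agree everywhere while w3 deviates once, so a similarity-aware corrector would trust the w1/w2 consensus more than w3's dissent.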
Pub Date: 2023-11-29 | DOI: 10.1016/j.is.2023.102322
Pu Ji, Minghui Yang, Rui Sun
Consumers’ needs present a trend of diversification, which has led to the emergence of diversified recommendation systems. However, existing diversified recommendation research mostly focuses on objective function construction rather than on the root cause that limits diversity, namely imbalanced data distribution. This study considers how to balance data distribution to improve recommendation diversity. We propose a novel self-supervised graph model based on counterfactual learning (SSG-CL) for diversified recommendation. SSG-CL first distinguishes the dominant and disadvantageous categories for each user based on long-tail theory. It then introduces counterfactual learning to construct an auxiliary view with a relatively balanced distribution between the dominant and disadvantageous categories. Next, we conduct contrastive learning between the user–item interaction graph and the auxiliary view as a self-supervised auxiliary task that aims to improve recommendation diversity. Finally, SSG-CL leverages a multitask training strategy to jointly optimize the main accuracy-oriented recommendation task and the self-supervised auxiliary task. We conduct experimental studies on real-world datasets, and the results indicate that SSG-CL performs well in terms of both accuracy and diversity.
Title: "A novel self-supervised graph model based on counterfactual learning for diversified recommendation" (Information Systems, vol. 121, Article 102322)
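The first step, a long-tail split of each user's interaction categories into dominant and disadvantageous ones, might look like the following sketch. The head fraction is an arbitrary illustrative choice, not a value from the paper:

```python
from collections import Counter

def split_categories(history, head_fraction=0.2):
    """Label a user's interaction categories as dominant vs disadvantageous.

    history: list of item categories the user interacted with. Following a
    long-tail cut, the most frequent head_fraction of distinct categories
    (at least one) are 'dominant'; the remaining tail categories are
    'disadvantageous' and underrepresented in the user's history.
    """
    freq = Counter(history)
    ranked = [c for c, _ in freq.most_common()]
    k = max(1, int(len(ranked) * head_fraction))
    return ranked[:k], ranked[k:]

dominant, disadvantaged = split_categories(
    ["books"] * 6 + ["music"] * 2 + ["garden", "toys"]
)
```

An auxiliary view that rebalances these two groups gives the contrastive objective something diversity-relevant to pull toward, which is the intuition behind the model's design.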
Pub Date: 2023-11-25 | DOI: 10.1016/j.is.2023.102320
Greg Van Houdt , Massimiliano de Leoni , Niels Martin , Benoît Depaire
These days, businesses keep track of more and more data in their information systems. Moreover, this data is more fine-grained than ever, tracking clicks and mutations in databases at the lowest possible level. Faced with such data, process discovery often struggles to produce comprehensible models, returning spaghetti-like models instead. Such finely granulated models do not fit the business user’s mental model of the process under investigation. To tackle this, event log abstraction (ELA) techniques can transform the underlying event log to a higher granularity level. However, insights into the performance of these techniques are lacking in the literature, as results are based only on small-scale experiments and are often inconclusive. Against this background, this paper evaluates state-of-the-art abstraction techniques on 400 event logs. Results show that ELA sacrifices fitness for precision, but complexity reductions depend heavily on the ELA technique used. This study also illustrates the importance of a larger-scale experiment, as sub-sampling of the results leads to contradictory conclusions.
Title: "An empirical evaluation of unsupervised event log abstraction techniques in process mining" (Information Systems, vol. 121, Article 102320)
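For intuition, a deliberately naive unsupervised abstraction baseline groups consecutive low-level events by time gap into one higher-level activity. This is not one of the surveyed ELA techniques, just a sketch of what raising the granularity level means:

```python
def abstract_by_gap(events, max_gap):
    """Group low-level events into higher-level segments by time gap.

    events: list of (timestamp, label) pairs sorted by timestamp.
    Consecutive events no further apart than max_gap are merged into one
    abstract activity, labelled by its first event and annotated with the
    number of low-level events it covers.
    """
    segments = []
    current = [events[0]]
    for ev in events[1:]:
        if ev[0] - current[-1][0] <= max_gap:
            current.append(ev)
        else:
            segments.append(current)
            current = [ev]
    segments.append(current)
    return [(seg[0][0], seg[0][1], len(seg)) for seg in segments]

# Three rapid UI events collapse into one abstract activity; the late
# "save" starts a new one.
segs = abstract_by_gap([(0, "click"), (1, "scroll"), (2, "click"), (10, "save")], max_gap=3)
```

Every abstraction of this kind discards detail, which is why the evaluated trade-off between fitness and precision arises: the abstracted log replays more cleanly but permits behavior the raw log never exhibited.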
Pub Date: 2023-11-21 | DOI: 10.1016/j.is.2023.102318
Hend A. Selmy , Hoda K. Mohamed , Walaa Medhat
Deep learning (DL), one of the most active machine learning research fields, has achieved great success in numerous scientific and technological disciplines, including speech recognition, image classification, language processing, and big data analytics. Big data analytics (BDA), where raw data is often unlabeled or uncategorized, can greatly benefit from DL because of its ability to analyze and learn from enormous amounts of unstructured data. This survey paper provides a comprehensive overview of state-of-the-art DL techniques applied in BDA. Its main aim is to illustrate the significance of DL, present its taxonomy, and detail the basic techniques used in BDA. It also explains the DL techniques used in big IoT data applications, along with their various complexities and challenges. The survey presents various real-world data-intensive applications where DL techniques can be applied, concentrating on the DL techniques appropriate to the BDA type of each application domain. Additionally, the survey examines benchmarked DL frameworks used in BDA, reviews the available benchmark datasets, and analyzes the strengths and limitations of each DL technique and its suitable applications. Further, a comparative analysis of existing approaches to the DL methods used in BDA is presented. Finally, the challenges of DL modeling and future directions are discussed.
Title: "Big data analytics deep learning techniques and applications: A survey" (Information Systems, vol. 120, Article 102318)
Pub Date: 2023-11-19 | DOI: 10.1016/j.is.2023.102315
Tuomas Ketola, Thomas Roelleke
Data-driven investigations increasingly deal with non-moderated, non-standard, and even manipulated information. Whether the field in question is journalism, law enforcement, or insurance fraud, it is becoming more and more difficult for investigators to verify the outcomes of various black-box systems. To address this need for discovery methods that can be used for verification, we introduce a methodology for document structure-driven investigative information retrieval (InvIR). InvIR is defined as a subtask of exploratory IR in which transparency and reasoning take centre stage. The aim of InvIR is to facilitate the verification and discovery of facts from data and the communication of those facts to others. From a technical perspective, the methodology applies recent work from structured document retrieval (SDR) concerned with formal retrieval constraints and information content-based field weighting (ICFW). Using ICFW, the paper establishes the concept of relevance structures to describe the document structure-based relevance of documents. These structures are then used to help the user navigate during the discovery process and to rank entities of interest. The proposed methodology is evaluated using a prototype search system called Relevance Structure-based Entity Ranker (RSER) in order to demonstrate its feasibility. This methodology represents an interesting and important research direction in a world where transparency is becoming more vital than ever.
Title: "Document structure-driven investigative information retrieval" (Information Systems, vol. 121, Article 102315, open access)
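A generic field-weighted scorer conveys the flavor of structure-driven ranking. This is not the ICFW formula from the paper; the field names, weights, and the length-damping choice are all assumptions for illustration:

```python
import math
from collections import Counter

def field_weighted_score(query, doc, field_weights):
    """Score a structured document by weighted per-field term overlap.

    doc: {field: text}; field_weights: {field: weight}. Each field
    contributes the fraction of query terms it contains, dampened by the
    log of its length so short, information-dense fields (e.g. titles)
    are not drowned out by long bodies.
    """
    q_terms = set(query.lower().split())
    score = 0.0
    for field, w in field_weights.items():
        terms = Counter(doc.get(field, "").lower().split())
        hits = sum(1 for t in q_terms if t in terms)
        if hits:
            score += w * hits / (len(q_terms) * math.log(2 + sum(terms.values())))
    return score

doc = {"title": "offshore accounts report", "body": "details of transfers between entities"}
s = field_weighted_score("offshore transfers", doc, {"title": 0.6, "body": 0.4})
```

Making the per-field contributions explicit like this, rather than collapsing them into one opaque score, is the kind of transparency an investigative retrieval setting demands.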