Identification of social scientifically relevant topics in an interview repository: a natural language processing experiment

Journal of Documentation (IF 2.0; Zone 3, Management; JCR Q2, INFORMATION SCIENCE & LIBRARY SCIENCE). Pub Date: 2023-10-13. DOI: 10.1108/jd-12-2022-0269

Judit Gárdos, Julia Egyed-Gergely, Anna Horváth, Balázs Pataki, Roza Vajda, András Micsik



Abstract

Purpose: The present study is about generating metadata to enhance thematic transparency and facilitate research on interview collections at the Research Documentation Centre, Centre for Social Sciences (TK KDK) in Budapest. It explores the use of artificial intelligence (AI) in producing, managing and processing social science data and its potential to generate useful metadata to describe the contents of such archives on a large scale.

Design/methodology/approach: The authors combined manual and automated/semi-automated methods of metadata development and curation. The authors developed a suitable domain-oriented taxonomy to classify a large text corpus of semi-structured interviews. To this end, the authors adapted the European Language Social Science Thesaurus (ELSST) to produce a concise, hierarchical structure of topics relevant in social sciences. The authors identified and tested the most promising natural language processing (NLP) tools supporting the Hungarian language. The results of manual and machine coding will be presented in a user interface.

Findings: The study describes how an international social scientific taxonomy can be adapted to a specific local setting and tailored to be used by automated NLP tools. The authors show the potential and limitations of existing and new NLP methods for thematic assignment. The current possibilities of multi-label classification in social scientific metadata assignment are discussed, i.e. the problem of automated selection of relevant labels from a large pool.

Originality/value: Interview materials have not yet been used for building manually annotated training datasets for automated indexing of scientifically relevant topics in a data repository. Comparing various automated-indexing methods, this study shows a possible implementation of a researcher tool supporting custom visualizations and the faceted search of interview collections.
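
To make the multi-label problem mentioned in the abstract concrete, the following is a minimal sketch of selecting several relevant labels from a hierarchical pool. It is not the authors' method: the abstract does not name the NLP tools that were tested, so this sketch substitutes a simple cue-word scorer. The taxonomy entries, cue words, function names and the threshold value are all hypothetical and are not taken from ELSST or from the study.

```python
import re
from collections import Counter

# Hypothetical two-level taxonomy: top-level topic -> subtopic -> cue words.
# Labels and cue words are invented for illustration; they are NOT ELSST terms.
TAXONOMY = {
    "LABOUR": {
        "employment": {"job", "work", "salary"},
        "unemployment": {"unemployed", "layoff"},
    },
    "EDUCATION": {"schooling": {"school", "teacher", "pupil"}},
    "HEALTH": {"illness": {"hospital", "doctor", "illness"}},
}


def score_labels(text):
    """Score each (topic, subtopic) label by the share of tokens matching its cue words."""
    tokens = Counter(re.findall(r"\w+", text.lower()))
    total = sum(tokens.values()) or 1  # avoid division by zero on empty text
    scores = {}
    for topic, subtopics in TAXONOMY.items():
        for subtopic, cues in subtopics.items():
            hits = sum(tokens[cue] for cue in cues)
            if hits:
                scores[(topic, subtopic)] = hits / total
    return scores


def assign_labels(text, threshold=0.08):
    """Multi-label assignment: keep every label whose score reaches the threshold."""
    scores = score_labels(text)
    return sorted(label for label, score in scores.items() if score >= threshold)


if __name__ == "__main__":
    passage = (
        "After the layoff I was unemployed so I went back to school "
        "and the teacher helped me find a job"
    )
    print(assign_labels(passage))
```

The point of the sketch is the shape of the task: a passage may receive zero, one or several labels, and the threshold controls how many labels from the pool are kept. A trained classifier would replace `score_labels` while leaving the selection step unchanged.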


Source journal

Journal of Documentation (INFORMATION SCIENCE & LIBRARY SCIENCE)

CiteScore: 4.20

Self-citation rate: 14.30%

Journal description: The scope of the Journal of Documentation is broadly information sciences, encompassing all of the academic and professional disciplines which deal with recorded information. These include, but are certainly not limited to:
■ Information science, librarianship and related disciplines
■ Information and knowledge management
■ Information and knowledge organisation
■ Information seeking and retrieval, and human information behaviour
■ Information and digital literacies