Carolina Pradier, Diego Kozlowski, Natsumi S. Shokida, Vincent Larivière
The Latin-American scientific community has achieved significant progress towards gender parity, with nearly equal representation of women and men scientists. Nevertheless, women continue to be underrepresented in scholarly communication. Throughout the 20th century, Latin America established its academic circuit, focusing on research topics of regional significance. However, the community has since reoriented its research towards the global academic circuit. Through an analysis of scientific publications, this article explores the relationship between gender inequalities in science and the integration of Latin-American researchers into the regional and global academic circuits between 1993 and 2022. We find that women are more likely to engage in the regional circuit, while men are more active within the global circuit. This trend is attributed to a thematic alignment between women's research interests and issues specific to Latin America. Furthermore, our results reveal that the mechanisms contributing to gender differences in symbolic capital accumulation vary between circuits. Women's work achieves equal or greater recognition compared to men's within the regional circuit, but generally garners less attention in the global circuit. Our findings suggest that policies aimed at strengthening the regional academic circuit would encourage scientists to address locally relevant topics while simultaneously fostering gender equality in science.
{"title":"Science for whom? The influence of the regional academic circuit on gender inequalities in Latin America","authors":"Carolina Pradier, Diego Kozlowski, Natsumi S. Shokida, Vincent Larivière","doi":"arxiv-2407.18783","DOIUrl":"https://doi.org/arxiv-2407.18783","url":null,"abstract":"The Latin-American scientific community has achieved significant progress\u0000towards gender parity, with nearly equal representation of women and men\u0000scientists. Nevertheless, women continue to be underrepresented in scholarly\u0000communication. Throughout the 20th century, Latin America established its\u0000academic circuit, focusing on research topics of regional significance.\u0000However, the community has since reoriented its research towards the global\u0000academic circuit. Through an analysis of scientific publications, this article\u0000explores the relationship between gender inequalities in science and the\u0000integration of Latin-American researchers into the regional and global academic\u0000circuits between 1993 and 2022. We find that women are more likely to engage in\u0000the regional circuit, while men are more active within the global circuit. This\u0000trend is attributed to a thematic alignment between women's research interests\u0000and issues specific to Latin America. Furthermore, our results reveal that the\u0000mechanisms contributing to gender differences in symbolic capital accumulation\u0000vary between circuits. Women's work achieves equal or greater recognition\u0000compared to men's within the regional circuit, but generally garners less\u0000attention in the global circuit. 
Our findings suggest that policies aimed at\u0000strengthening the regional academic circuit would encourage scientists to\u0000address locally relevant topics while simultaneously fostering gender equality\u0000in science.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"212 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141868191","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Most published Information Systems (IS) research falls into the behavioral science research (BSR) category rather than the design science research (DSR) category. This is due in part to the BSR orientation of many IS doctoral programs, which often involve few technical courses. This includes IS doctoral programs that train Information and Communication Technologies for Development (ICT4D) researchers. Without such technical knowledge, many doctoral and postdoctoral researchers will not feel confident engaging in DSR. Given the importance of designing artifacts that are appropriate for a given context, an important question is how ICT4D and other IS researchers can increase their IS technical content knowledge and familiarity with the DSR process. In this paper we present a process for reviewing DSR papers whose objectives are: enhancing technical content knowledge, increasing knowledge and understanding of approaches to designing and evaluating IS/IT artifacts, and facilitating the identification of new DSR opportunities. This process has been applied for more than a decade at a US research university.
{"title":"A Process for Reviewing Design Science Research Papers to Enhance Content Knowledge & Research Opportunities","authors":"Kweku-Muata Osei-Bryson","doi":"arxiv-2408.07230","DOIUrl":"https://doi.org/arxiv-2408.07230","url":null,"abstract":"Most published Information Systems research are of the behavioral science\u0000research (BSR) category rather than the design science research (DSR) category.\u0000This is due in part to the BSR orientation of many IS doctoral programs, which\u0000often do not involve much technical courses. This includes IS doctoral programs\u0000that train Information and Communication Technologies for Development (ICT4D)\u0000researchers. Without such technical knowledge many doctoral and postdoctoral\u0000researchers will not feel confident in engaging in DSR research. Given the\u0000importance of designing artifacts that are appropriate for a given context, an\u0000important question is how can ICT4D and other IS researchers increase their IS\u0000technical content knowledge and intimacy with the DSR process. In this paper we\u0000present, a process for reviewing DSR papers that has as its objectives:\u0000enhancing technical content knowledge, increasing knowledge and understanding\u0000of approaches to designing and evaluating IS/IT artifacts, and facilitating the\u0000identification of new DSR opportunities. 
This process has been applied for more\u0000than a decade at a USA research university.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"425 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142219122","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the exponential growth in the number of papers and the rising prominence of AI research, the use of Generative AI for information retrieval and question answering has become popular for conducting research surveys. However, novice researchers unfamiliar with a particular field may not significantly improve their efficiency when interacting with Generative AI, because they have not yet developed divergent thinking in that field. This study aims to develop an in-depth Survey Forest Diagram that guides novice researchers toward divergent thinking about a research topic by indicating citation clues among multiple papers, helping to expand their survey perspective.
{"title":"A Survey Forest Diagram : Gain a Divergent Insight View on a Specific Research Topic","authors":"Jinghong Li, Wen Gu, Koichi Ota, Shinobu Hasegawa","doi":"arxiv-2407.17081","DOIUrl":"https://doi.org/arxiv-2407.17081","url":null,"abstract":"With the exponential growth in the number of papers and the trend of AI\u0000research, the use of Generative AI for information retrieval and\u0000question-answering has become popular for conducting research surveys. However,\u0000novice researchers unfamiliar with a particular field may not significantly\u0000improve their efficiency in interacting with Generative AI because they have\u0000not developed divergent thinking in that field. This study aims to develop an\u0000in-depth Survey Forest Diagram that guides novice researchers in divergent\u0000thinking about the research topic by indicating the citation clues among\u0000multiple papers, to help expand the survey perspective for novice researchers.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"55 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141781333","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Stefanie Haustein, Eric Schares, Juan Pablo Alperin, Madelaine Hare, Leigh-Ann Butler, Nina Schönfelder
This study presents estimates of the global expenditure on article processing charges (APCs) paid to six publishers for open access between 2019 and 2023. APCs are fees charged for publishing in some fully open access journals (gold) and in subscription journals to make individual articles open access (hybrid). There is currently no way to systematically track institutional, national or global expenses for open access publishing due to a lack of transparency in APC prices, what articles they are paid for, or who pays them. We therefore curated and used an open dataset of annual APC list prices from Elsevier, Frontiers, MDPI, PLOS, Springer Nature, and Wiley in combination with the number of open access articles from these publishers indexed by OpenAlex to estimate that, globally, a total of $8.349 billion ($8.968 billion in 2023 US dollars) was spent on APCs between 2019 and 2023. We estimate that in 2023 MDPI ($681.6 million), Elsevier ($582.8 million) and Springer Nature ($546.6 million) generated the most revenue with APCs. After adjusting for inflation, we also show that annual spending almost tripled from $910.3 million in 2019 to $2.538 billion in 2023, that hybrid fees exceed gold fees, and that the median APCs paid are higher than the median listed fees for both gold and hybrid. Our approach addresses major limitations in previous efforts to estimate APCs paid and offers much needed insight into an otherwise opaque aspect of the business of scholarly publishing. We call upon publishers to be more transparent about OA fees.
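The estimation strategy the abstract describes (list prices combined with article counts, then converted to 2023 US dollars) can be sketched as follows. All figures, deflators, and publisher entries here are hypothetical placeholders, not the study's actual data or pipeline:

```python
# Hedged sketch of the APC estimation logic: spend = articles x list price,
# optionally converted to constant 2023 USD. All numbers are invented.

# Hypothetical APC list prices (nominal USD) per publisher and year.
list_prices = {
    ("MDPI", 2019): 1800, ("MDPI", 2023): 2300,
    ("Elsevier", 2019): 2900, ("Elsevier", 2023): 3200,
}

# Hypothetical gold/hybrid OA article counts (e.g., as indexed by OpenAlex).
article_counts = {
    ("MDPI", 2019): 100_000, ("MDPI", 2023): 290_000,
    ("Elsevier", 2019): 60_000, ("Elsevier", 2023): 180_000,
}

# Hypothetical deflators converting nominal USD to 2023 USD.
deflator_to_2023 = {2019: 1.19, 2023: 1.00}

def estimated_spend(publisher: str, year: int, real: bool = True) -> float:
    """Estimated APC spend for one publisher-year, optionally in 2023 USD."""
    nominal = article_counts[(publisher, year)] * list_prices[(publisher, year)]
    return nominal * deflator_to_2023[year] if real else nominal

# Aggregate across publishers for a single year.
total_2023 = sum(estimated_spend(p, 2023) for p in ("MDPI", "Elsevier"))
```

The inflation adjustment matters for the tripling claim: comparing 2019 and 2023 spending in nominal dollars would overstate real growth.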
{"title":"Estimating global article processing charges paid to six publishers for open access between 2019 and 2023","authors":"Stefanie Haustein, Eric Schares, Juan Pablo Alperin, Madelaine Hare, Leigh-Ann Butler, Nina Schönfelder","doi":"arxiv-2407.16551","DOIUrl":"https://doi.org/arxiv-2407.16551","url":null,"abstract":"This study presents estimates of the global expenditure on article processing\u0000charges (APCs) paid to six publishers for open access between 2019 and 2023.\u0000APCs are fees charged for publishing in some fully open access journals (gold)\u0000and in subscription journals to make individual articles open access (hybrid).\u0000There is currently no way to systematically track institutional, national or\u0000global expenses for open access publishing due to a lack of transparency in APC\u0000prices, what articles they are paid for, or who pays them. We therefore curated\u0000and used an open dataset of annual APC list prices from Elsevier, Frontiers,\u0000MDPI, PLOS, Springer Nature, and Wiley in combination with the number of open\u0000access articles from these publishers indexed by OpenAlex to estimate that,\u0000globally, a total of $8.349 billion ($8.968 billion in 2023 US dollars) were\u0000spent on APCs between 2019 and 2023. We estimate that in 2023 MDPI ($681.6\u0000million), Elsevier ($582.8 million) and Springer Nature ($546.6) generated\u0000the most revenue with APCs. After adjusting for inflation, we also show that\u0000annual spending almost tripled from $910.3 million in 2019 to $2.538 billion\u0000in 2023, that hybrid exceed gold fees, and that the median APCs paid are higher\u0000than the median listed fees for both gold and hybrid. Our approach addresses\u0000major limitations in previous efforts to estimate APCs paid and offers much\u0000needed insight into an otherwise opaque aspect of the business of scholarly\u0000publishing. 
We call upon publishers to be more transparent about OA fees.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141781338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Social media platforms have offered invaluable opportunities for linguistic research. The availability of up-to-date data, coming from all parts of the world and from natural contexts, has allowed researchers to study language in real time. One of the fields that has made great use of social media platforms is Corpus Linguistics. There is currently a wide range of projects that have successfully created corpora from social media. In this paper, we present the development and deployment of a linguistic corpus built from Twitter posts in English from 26 news agencies and 27 individuals. The main goal was to create a fully annotated English corpus for linguistic analysis. We include information on morphology and syntax, as well as NLP features such as tokenization, lemmas, and n-grams. The information is presented through a range of powerful visualisations that let users explore linguistic patterns in the corpus. With this tool, we aim to contribute to the area of language technologies applied to linguistic research.
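Two of the annotation layers mentioned, tokenization and n-grams, can be illustrated with a minimal sketch. This is not the ILiAD pipeline itself (which would use a full NLP toolkit for morphology, syntax, and lemmatisation); the naive regex tokenizer and the example tweet are assumptions for illustration:

```python
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    """Naive word tokenizer; a real corpus pipeline would use an NLP toolkit."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """All contiguous n-grams over the token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical post, standing in for a tweet from the corpus.
post = "Breaking news: the markets are rallying, the markets are up."
tokens = tokenize(post)
bigram_counts = Counter(ngrams(tokens, 2))
```

Frequency tables like `bigram_counts` are the raw material behind the kind of pattern visualisations the abstract describes.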
{"title":"ILiAD: An Interactive Corpus for Linguistic Annotated Data from Twitter Posts","authors":"Simon Gonzalez","doi":"arxiv-2407.15374","DOIUrl":"https://doi.org/arxiv-2407.15374","url":null,"abstract":"Social Media platforms have offered invaluable opportunities for linguistic\u0000research. The availability of up-to-date data, coming from any part in the\u0000world, and coming from natural contexts, has allowed researchers to study\u0000language in real time. One of the fields that has made great use of social\u0000media platforms is Corpus Linguistics. There is currently a wide range of\u0000projects which have been able to successfully create corpora from social media.\u0000In this paper, we present the development and deployment of a linguistic corpus\u0000from Twitter posts in English, coming from 26 news agencies and 27 individuals.\u0000The main goal was to create a fully annotated English corpus for linguistic\u0000analysis. We include information on morphology and syntax, as well as NLP\u0000features such as tokenization, lemmas, and n- grams. The information is\u0000presented through a range of powerful visualisations for users to explore\u0000linguistic patterns in the corpus. With this tool, we aim to contribute to the\u0000area of language technologies applied to linguistic research.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"430 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141781249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The field of constructed languages (conlangs) has seen important growth in recent decades, a product of wide interest in the use and study of conlangs for artistic purposes. An important open question, however, is what is happening with conlang research in the academic world. This paper aims to provide an overall understanding of the literature on conlang research, giving a realistic picture of the field today. We implemented a computational linguistic approach, combining bibliometrics and network analysis, to examine all publications available in the Scopus database. Analysing over 2,300 academic publications from 1927 to 2022, we found that Esperanto is by far the most documented conlang. Three main authors have contributed to this: Garvía R., Fiedler S., and Blanke D. The 1970s and 1980s were the decades in which the foundations of current research were built. In terms of methodology, language learning and experimental linguistics contribute most to the preferred approaches of study in the field. We present the results and discuss our limitations and future work.
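The combination of bibliometrics and network analysis described above can be sketched over toy publication records. The records below are invented (only the author names echo the abstract); the actual study works over Scopus exports and would likely use a dedicated graph library:

```python
from collections import Counter
from itertools import combinations

# Hypothetical publication records: (year, authors, conlang studied).
records = [
    (1976, ["Blanke D."], "Esperanto"),
    (1985, ["Fiedler S.", "Blanke D."], "Esperanto"),
    (2017, ["Garvía R."], "Esperanto"),
    (2020, ["Smith A."], "Klingon"),
]

# Bibliometrics: which conlang is most documented, and in which decades?
by_language = Counter(lang for _, _, lang in records)
by_decade = Counter((year // 10) * 10 for year, _, _ in records)

# Network analysis: undirected co-authorship edges weighted by paper count.
edges = Counter(
    tuple(sorted(pair))
    for _, authors, _ in records
    for pair in combinations(authors, 2)
)
```

`by_language.most_common()` gives the "most documented conlang" ranking, while `edges` is the weighted adjacency list a network layout or centrality measure would consume.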
{"title":"A Network Analysis Approach to Conlang Research Literature","authors":"Simon Gonzalez","doi":"arxiv-2407.15370","DOIUrl":"https://doi.org/arxiv-2407.15370","url":null,"abstract":"The field of conlang has evidenced an important growth in the last decades.\u0000This has been the product of a wide interest in the use and study of conlangs\u0000for artistic purposes. However, one important question is what it is happening\u0000with conlang in the academic world. This paper aims to have an overall\u0000understanding of the literature on conlang research. With this we aim to give a\u0000realistic picture of the field in present days. We have implemented a\u0000computational linguistic approach, combining bibliometrics and network analysis\u0000to examine all publications available in the Scopus database. Analysing over\u00002300 academic publications since 1927 until 2022, we have found that Esperanto\u0000is by far the most documented conlang. Three main authors have contributed to\u0000this: Garv'ia R., Fiedler S., and Blanke D. The 1970s and 1980s have been the\u0000decades where the foundations of current research have been built. In terms of\u0000methodologies, language learning and experimental linguistics are the ones\u0000contributing to most to the preferred approaches of study in the field. We\u0000present the results and discuss our limitations and future work.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"429 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141781335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A key challenge in citation text generation is that the length of generated text often differs from the length of the target, lowering the quality of the generation. While prior works have investigated length-controlled generation, their effectiveness depends on knowing the appropriate generation length. In this work, we present an in-depth study of the limitations of predicting scientific citation text length and explore the use of heuristic estimates of desired length.
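One family of heuristic length estimates the abstract alludes to can be sketched simply: rather than predicting each target's length, fall back on a corpus-level statistic such as the median length of observed citation sentences. The numbers below are invented, and this is only one plausible heuristic, not necessarily the one the paper evaluates:

```python
from statistics import median

# Hypothetical citation-sentence lengths (in tokens) from a training corpus.
observed_lengths = [18, 22, 25, 31, 19, 27, 24]

def heuristic_target_length(lengths: list[int]) -> int:
    """Corpus-median heuristic: use the median observed citation length
    as the length-control target for every generated citation."""
    return round(median(lengths))

target = heuristic_target_length(observed_lengths)
```

The target would then be fed to a length-controlled decoder as its desired generation length.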
{"title":"Improving Citation Text Generation: Overcoming Limitations in Length Control","authors":"Biswadip Mandal, Xiangci Li, Jessica Ouyang","doi":"arxiv-2407.14997","DOIUrl":"https://doi.org/arxiv-2407.14997","url":null,"abstract":"A key challenge in citation text generation is that the length of generated\u0000text often differs from the length of the target, lowering the quality of the\u0000generation. While prior works have investigated length-controlled generation,\u0000their effectiveness depends on knowing the appropriate generation length. In\u0000this work, we present an in-depth study of the limitations of predicting\u0000scientific citation text length and explore the use of heuristic estimates of\u0000desired length.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141781337","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper introduces LLAssist, an open-source tool designed to streamline literature reviews in academic research. In an era of exponential growth in scientific publications, researchers face mounting challenges in efficiently processing vast volumes of literature. LLAssist addresses this issue by leveraging Large Language Models (LLMs) and Natural Language Processing (NLP) techniques to automate key aspects of the review process. Specifically, it extracts important information from research articles and evaluates their relevance to user-defined research questions. The goal of LLAssist is to significantly reduce the time and effort required for comprehensive literature reviews, allowing researchers to focus more on analyzing and synthesizing information rather than on initial screening tasks. By automating parts of the literature review workflow, LLAssist aims to help researchers manage the growing volume of academic publications more efficiently.
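The screening loop described above, scoring each article's relevance against user-defined research questions, can be sketched as follows. The keyword-overlap scorer is a deliberately crude stand-in: the actual tool delegates this judgment to an LLM, and the papers, question, and threshold here are all invented:

```python
# Hedged sketch of automated relevance screening. A keyword-overlap score
# stands in for the LLM relevance judgment the real tool would make.

def relevance_score(abstract: str, research_question: str) -> float:
    """Fraction of research-question terms that appear in the abstract."""
    terms = set(research_question.lower().split())
    words = set(abstract.lower().split())
    return len(terms & words) / len(terms)

# Hypothetical candidate papers keyed by id.
papers = {
    "P1": "large language models for automated literature screening",
    "P2": "soil moisture sensing with low-power radios",
}

question = "automated literature screening"
shortlist = [pid for pid, text in papers.items()
             if relevance_score(text, question) >= 0.5]
```

Only the shortlist then needs a researcher's close reading, which is the time saving the abstract claims.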
{"title":"LLAssist: Simple Tools for Automating Literature Review Using Large Language Models","authors":"Christoforus Yoga Haryanto","doi":"arxiv-2407.13993","DOIUrl":"https://doi.org/arxiv-2407.13993","url":null,"abstract":"This paper introduces LLAssist, an open-source tool designed to streamline\u0000literature reviews in academic research. In an era of exponential growth in\u0000scientific publications, researchers face mounting challenges in efficiently\u0000processing vast volumes of literature. LLAssist addresses this issue by\u0000leveraging Large Language Models (LLMs) and Natural Language Processing (NLP)\u0000techniques to automate key aspects of the review process. Specifically, it\u0000extracts important information from research articles and evaluates their\u0000relevance to user-defined research questions. The goal of LLAssist is to\u0000significantly reduce the time and effort required for comprehensive literature\u0000reviews, allowing researchers to focus more on analyzing and synthesizing\u0000information rather than on initial screening tasks. By automating parts of the\u0000literature review workflow, LLAssist aims to help researchers manage the\u0000growing volume of academic publications more efficiently.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"80 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141737140","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Productivity in Research (PQ) is a scholarship granted by CNPq (the Brazilian National Council for Scientific and Technological Development). The scholarship recognizes a few selected faculty researchers for their scientific production and for outstanding technology and innovation in their respective areas of knowledge. In the present study, we evaluated the scientific production of the 185 researchers in the Computer Science area granted a PQ scholarship in the most recent PQ selection notice. To evaluate the productivity of each professor, we considered papers published in scientific journals and conferences (complete works) over a five-year period (2017 to 2021). We analyzed productivity in terms of both quantity and quality, and evaluated its distribution across the country, universities, and research facilities, as well as the resulting co-authorship network.
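The quantity side of such a productivity profile reduces to aggregations over publication records: papers per researcher, per institution, and per venue type. A minimal sketch with invented records (names, institutions, and counts are all hypothetical, not data from the study):

```python
from collections import Counter, defaultdict

# Hypothetical publication records: (researcher, institution, venue_type).
papers = [
    ("Ana", "UFMG", "journal"),
    ("Ana", "UFMG", "conference"),
    ("Bruno", "USP", "journal"),
    ("Ana", "UFMG", "journal"),
]

# Papers per researcher and per institution (quantity dimension).
per_researcher = Counter(r for r, _, _ in papers)
per_institution = Counter(inst for _, inst, _ in papers)

# Journal vs. conference breakdown per researcher.
by_venue = defaultdict(Counter)
for researcher, _, venue in papers:
    by_venue[researcher][venue] += 1
```

The quality dimension would layer venue rankings or citation counts on top of the same records, and the co-authorship network would be built from the author lists, as in co-occurrence analyses elsewhere in bibliometrics.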
{"title":"Productivity profile of CNPq scholarship researchers in computer science from 2017 to 2021","authors":"Marcelo Keese Albertini, André Ricardo Backes","doi":"arxiv-2407.14690","DOIUrl":"https://doi.org/arxiv-2407.14690","url":null,"abstract":"Productivity in Research (PQ) is a scholarship granted by CNPq (Brazilian\u0000National Council for Scientific and Technological Development). This\u0000scholarship aims to recognize a few selected faculty researchers for their\u0000scientific production, outstanding technology and innovation in their\u0000respective areas of knowledge. In the present study, we evaluated the\u0000scientific production of the 185 researchers in the Computer Science area\u0000granted with PQ scholarship in the last PQ selection notice. To evaluate the\u0000productivity of each professor, we considered papers published in scientific\u0000journals and conferences (complete works) in a five years period (from 2017 to\u00002021). We analyzed the productivity in terms of both quantity and quality. We\u0000also evaluated its distribution over the country, universities and research\u0000facilities, as well as, the co-authorship network produced.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"66 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141781339","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This research explores the nuanced differences between texts produced by AI and those written by humans, aiming to elucidate how language is expressed differently by each. Through comprehensive statistical data analysis, the study investigates various linguistic traits, patterns of creativity, and potential biases inherent in human-written and AI-generated texts. The significance of this research lies in its contribution to understanding AI's creative capabilities and its impact on literature, communication, and societal frameworks. By examining a meticulously curated dataset comprising 500K essays spanning diverse topics and genres, generated by LLMs or written by humans, the study uncovers the deeper layers of linguistic expression and provides insights into the cognitive processes underlying both AI- and human-driven textual compositions. The analysis revealed that human-authored essays tend to have a higher total word count on average than AI-generated essays but a shorter average word length, and that while both groups exhibit high levels of fluency, the vocabulary diversity of human-authored content is higher than that of AI-generated content. However, AI-generated essays show a slightly higher level of novelty, suggesting the potential for generating more original content through AI systems. The paper addresses challenges in assessing the language generation capabilities of AI models and emphasizes the importance of datasets that reflect the complexities of human-AI collaborative writing. Through systematic preprocessing and rigorous statistical analysis, this study offers valuable insights into the evolving landscape of AI-generated content and informs future developments in natural language processing (NLP).
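The three surface metrics compared above (total word count, average word length, and vocabulary diversity) are straightforward to compute. A minimal sketch, using a type-token ratio as the diversity measure and an invented example sentence; the study's own preprocessing is certainly more elaborate:

```python
def text_stats(text: str) -> dict[str, float]:
    """Word count, mean word length, and type-token ratio
    (distinct words / total words) for one text."""
    words = text.lower().split()
    return {
        "word_count": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "type_token_ratio": len(set(words)) / len(words),
    }

# Hypothetical human-authored snippet for illustration.
human = "the cat sat on the mat and the dog ran"
stats = text_stats(human)
```

Comparing these statistics across the human-authored and AI-generated groups (e.g., with a significance test over per-essay values) yields exactly the kind of contrasts the abstract reports.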
{"title":"Decoding AI and Human Authorship: Nuances Revealed Through NLP and Statistical Analysis","authors":"Mayowa Akinwande, Oluwaseyi Adeliyi, Toyyibat Yussuph","doi":"arxiv-2408.00769","DOIUrl":"https://doi.org/arxiv-2408.00769","url":null,"abstract":"This research explores the nuanced differences in texts produced by AI and\u0000those written by humans, aiming to elucidate how language is expressed\u0000differently by AI and humans. Through comprehensive statistical data analysis,\u0000the study investigates various linguistic traits, patterns of creativity, and\u0000potential biases inherent in human-written and AI- generated texts. The\u0000significance of this research lies in its contribution to understanding AI's\u0000creative capabilities and its impact on literature, communication, and societal\u0000frameworks. By examining a meticulously curated dataset comprising 500K essays\u0000spanning diverse topics and genres, generated by LLMs, or written by humans,\u0000the study uncovers the deeper layers of linguistic expression and provides\u0000insights into the cognitive processes underlying both AI and human-driven\u0000textual compositions. The analysis revealed that human-authored essays tend to\u0000have a higher total word count on average than AI-generated essays but have a\u0000shorter average word length compared to AI- generated essays, and while both\u0000groups exhibit high levels of fluency, the vocabulary diversity of Human\u0000authored content is higher than AI generated content. However, AI- generated\u0000essays show a slightly higher level of novelty, suggesting the potential for\u0000generating more original content through AI systems. The paper addresses\u0000challenges in assessing the language generation capabilities of AI models and\u0000emphasizes the importance of datasets that reflect the complexities of human-AI\u0000collaborative writing. 
Through systematic preprocessing and rigorous\u0000statistical analysis, this study offers valuable insights into the evolving\u0000landscape of AI-generated content and informs future developments in natural\u0000language processing (NLP).","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"56 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141930695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}