Automated Decision Support for Collaborative, Interactive Classification
Randolph M. Jones, Robert Bixler, Robert P. Marinier, Lilia V. Moshkina
Traditional classification approaches are straightforward: collect data, apply classification algorithms, then generate classification results. However, such approaches depend on data being amply available, which is not always the case. This paper describes an approach to maximizing the utility of collected data through intelligent guidance of the data collection process. We present the development and evaluation of a knowledge-based decision-support system, the Logical Reasoner (LR), which guides data collection by unmanned ground and air assets to improve behavior classification. The LR is a component of a Human Directed and Controlled AI system (or “Human-AI” system) aimed at semi-autonomous classification of potential threat and non-threat individuals in a complex urban setting. The setting provides little to no pre-existing data; thus, the system collects, analyzes, and evaluates real-time human behavior data to determine whether the observed behavior is indicative of threat intent. The LR’s purpose is to produce contextual knowledge that supports productive decisions about where, when, and how to guide the vehicles in the data collection process. It builds a situational-awareness picture from the observed spatial relationships, activities, and interim classifications, then uses heuristics to generate new information-gathering goals and to recommend which actions the vehicles should take to better achieve these goals. The system uses these recommendations to collaboratively help the operator direct the autonomous assets to individuals or places in the environment so as to maximize the effectiveness of evidence collection. The LR is based on the Soar Cognitive Architecture, which excels at supporting Human-AI collaboration. The described DoD-sponsored system has been developed and extensively tested for over three years, in simulation and in the field (with role-players). Results of these experiments have demonstrated that the LR's decision support contributes to automated data collection and to the overall classification accuracy of the Human-AI team. This paper describes the development and evaluation of the LR based on multiple test events.
The research reported in this document was performed under Defense Advanced Research Projects Agency (DARPA) contract #HR001120C0180, Urban Reconnaissance through Supervised Autonomy (URSA). The views, opinions, and/or findings expressed are those of the author(s) and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon. Many thanks to Robert Marinier and Kris Kearns for their assistance in the preparation of this manuscript, as well as to the entire ISOLATE R&D team.
Distribution Statement “A” (Approved for Public Release, Distribution Unlimited)
{"title":"Automated Decision Support for Collaborative, Interactive Classification","authors":"Randolph M. Jones, Robert Bixler, Robert P. Marinier, Lilia V. Moshkina","doi":"10.54941/ahfe1003269","DOIUrl":"https://doi.org/10.54941/ahfe1003269","url":null,"abstract":"Traditional classification approaches are straightforward: collect data, apply classification algorithms, then generate classification results. However, such approaches depend on data being amply available, which is not always the case. This paper describes an approach to maximize the utility of collected data through intelligent guidance of the data collection process. We present the development and evaluation of a knowledge-based decision-support system: the Logical Reasoner (LR), which guides data collection by unmanned ground and air assets to improve behavior classification. The LR is a component of a Human Directed and Controlled AI system (or “Human-AI” system) aimed at semi-autonomous classification of potential threat and non-threat individuals in a complex urban setting. The setting provides little to no pre-existing data; thus, the system collects, analyzes, and evaluates real-time human behavior data to determine whether the observed behavior is indicative of threat intent. The LR’s purpose is to produce contextual knowledge to help make productive decisions about where, when, and how to guide the vehicles in the data collection process. It builds a situational-awareness picture from the observed spatial relationships, activities, and interim classifications, then uses heuristics to generate new information-gathering goals, as well as to recommend which actions the vehicles should take to better achieve these goals. The system uses these recommendations to collaboratively help the operator direct the autonomous assets to individuals or places in the environment to maximize the effectiveness of evidence collection. LR is based on the Soar Cognitive Architecture which excels in supporting Human-AI collaboration. The described DoD-sponsored system has been developed and extensively tested for over three years, in simulation and in the field (with role-players). Results of these experiments have demonstrated that the LR decision support contributes to automated data collection and overall classification accuracy by the Human-AI team. This paper describes the development and evaluation of the LR based on multiple test events.The research reported in this document was performed under Defense Advanced Research Projects Agency (DARPA) contract #HR001120C0180, Urban Reconnaissance through Supervised Autonomy (URSA). The views, opinions, and/or findings expressed are those of the author(s) and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon. 
Many thanks to Robert Marinier and Kris Kearns for their assistance in the preparation of this manuscript, as well as the entire ISOLATE R&D team.Distribution Statement “A” (Approved for Public Release, Distribution Unlimited)","PeriodicalId":405313,"journal":{"name":"Artificial Intelligence and Social Computing","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114895713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
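The abstract above describes the LR's heuristic generation of information-gathering goals only at a high level; the actual rules are implemented in Soar and are not published here. As a purely hypothetical illustration of that idea, the following Python sketch derives prioritized goals from interim classifications; every name and threshold in it is an assumption, not the LR's method:

```python
# Illustrative sketch only: Track, generate_goals, and all thresholds are
# hypothetical, showing the general shape of heuristic goal generation from
# an evolving situational-awareness picture.
from dataclasses import dataclass

@dataclass
class Track:
    track_id: str
    threat_score: float           # interim classification, 0..1
    classifier_confidence: float  # confidence in that classification, 0..1
    seconds_since_observed: float

def generate_goals(tracks, low_conf=0.5, stale_after=120.0):
    """Propose information-gathering goals, most urgent first."""
    goals = []
    for t in tracks:
        # Heuristic 1: ambiguous classifications deserve more evidence.
        if t.classifier_confidence < low_conf:
            goals.append((1.0 - t.classifier_confidence,
                          f"observe {t.track_id} to raise confidence"))
        # Heuristic 2: re-acquire tracks that have gone stale.
        if t.seconds_since_observed > stale_after:
            goals.append((0.5, f"re-acquire {t.track_id} "
                               f"(last seen {t.seconds_since_observed:.0f}s ago)"))
        # Heuristic 3: keep eyes on high interim threat scores.
        if t.threat_score > 0.7:
            goals.append((t.threat_score,
                          f"maintain persistent watch on {t.track_id}"))
    return [g for _, g in sorted(goals, reverse=True)]

print(generate_goals([Track("p-12", 0.8, 0.4, 30.0),
                      Track("p-07", 0.2, 0.9, 300.0)]))
```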
Improving Common Ground in Human-Machine Teaming: Dimensions, Gaps, and Priorities
Robert Wray, James R. Kirk, J. Folsom-Kovarik
“Common ground” is the knowledge, facts, beliefs, etc. that are shared between participants in some joint activity. Much of human conversation concerns “grounding,” or ensuring that some assertion is actually shared between participants. Even for highly trained tasks, such as teammates executing a military mission, each participant devotes attention to contributing new assertions, making adjustments based on the statements of others, offering repairs to resolve potential discrepancies in the common ground, and so forth. In conversational interactions between humans and machines (or “agents”), this activity of building and maintaining common ground is typically one-sided and fixed. It is one-sided because the human must do almost all the work of creating substantive common ground in the interaction. It is fixed because the agent does not adapt its understanding to what the human knows, prefers, and expects; instead, the human must adapt to the agent. These limitations create burdensome cognitive demand, result in frustration and distrust of automation, and make the notion of an agent “teammate” seem an ambition far from reachable in today’s state of the art. We seek to enable agents to partner more fully in building and maintaining common ground, and to enable them to adapt their understanding of a joint activity. While “common ground” is often called out as a gap in human-machine teaming, there is no extant, detailed analysis of the components of common ground together with a mapping of these components to specific classes of functions (what specific agent capabilities are required to achieve common ground?) and deficits (what kinds of errors may arise when the functions are insufficient for a particular component of the common ground?). In this paper, we provide such an analysis, focusing on the requirements for human-machine teaming in a military context where interactions are task-oriented and generally well-trained. Drawing on the literature of human communication, we identify the components of information included in common ground along three main axes: the temporal dimension of common ground, personal common ground, and communal common ground. The analysis further subdivides these distinctions, differentiating between aspects of the common ground such as personal history between participants, norms and the expectations based on those norms, and the extent to which actions taken by participants in a human-machine interaction context are “public” events or not. Within each dimension, we also provide examples of specific issues that may arise from a lack of common ground along that dimension. The analysis thus defines, at a more granular level than existing analyses, how specific categories of deficits in shared knowledge or processing differences manifest as misalignment in shared understanding. The paper both identifies specific challenges and prioritizes them according to acuteness of need; in other words, not all of the identified gaps are equally pressing.
{"title":"Improving Common Ground in Human-Machine Teaming: Dimensions, Gaps, and Priorities","authors":"Robert Wray, James R. Kirk, J. Folsom-Kovarik","doi":"10.54941/ahfe1001463","DOIUrl":"https://doi.org/10.54941/ahfe1001463","url":null,"abstract":"“Common ground” is the knowledge, facts, beliefs, etc. that are shared between participants in some joint activity. Much of human conversation concerns “grounding,” or ensuring that some assertion is actually shared between participants. Even for highly trained tasks, such teammates executing a military mission, each participant devotes attention to contributing new assertions, making adjustments based on the statements of others, offering potential repairs to resolve potential discrepancies in the common ground and so forth.In conversational interactions between humans and machines (or “agents”), this activity to build and to maintain a common ground is typically one-sided and fixed. It is one-sided because the human must do almost all the work of creating substantive common ground in the interaction. It is fixed because the agent does not adapt its understanding to what the human knows, prefers, and expects. Instead, the human must adapt to the agent. These limitations create burdensome cognitive demand, result in frustration and distrust in automation, and make the notion of an agent “teammate” seem an ambition far from reachable in today’s state-of-art. We are seeking to enable agents to more fully partner in building and maintaining common ground as well as to enable them to adapt their understanding of a joint activity. While “common ground” is often called out as a gap in human-machine teaming, there is not an extant, detailed analysis of the components of common ground and a mapping of these components to specific classes of functions (what specific agent capabilities is required to achieve common ground?) and deficits (what kinds of errors may arise when the functions are insufficient for a particular component of the common ground?). In this paper, we provide such an analysis, focusing on the requirements for human-machine teaming in a military context where interactions are task-oriented and generally well-trained.Drawing on the literature of human communication, we identify the components of information included in common ground. We identify three main axes: the temporal dimension of common ground and personal and communal common ground. The analysis further subdivides these distinctions, differentiating between aspects of the common ground such as personal history between participants, norms and expectations based on those norms, and the extent to which actions taken by participants in a human-machine interaction context are “public” events or not. Within each dimension, we also provide examples of specific issues that may arise due to problems due to lack of common ground related to a specific dimension. The analysis thus defines, at a more granular level than existing analyses, how specific categories of deficits in shared knowledge or processing differences manifests in misalignment in shared understanding. The paper both identifies specific challenges and prioritizes them according to acuteness of need. 
In other words, not all of","PeriodicalId":405313,"journal":{"name":"Artificial Intelligence and Social Computing","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129558305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Artificial Intelligence in the aviation decision-making process: The transition from extended Minimum Crew Operations to Single Pilot Operations (SiPO)
Dimitrios Ziakkas, Anastasios Plioutsias, K. Pechlivanis
Innovation, management of change, and the implementation of human factors in flight operations characterize the aviation industry. The International Air Transport Association (IATA) Technology Roadmap (IATA, 2019) and the European Aviation Safety Agency (EASA) Artificial Intelligence (A.I.) roadmap propose an outline and assessment of ongoing technology prospects that are changing the aviation environment through the implementation of A.I. and the introduction of extended Minimum Crew Operations (eMCO) and Single Pilot Operations (SiPO). Changes in workload will affect human performance and the decision-making process. The research adopted the widely accepted definition of A.I. as “any technology that appears to emulate the performance of a human” (EASA, 2020). A review of the existing literature on Direct Voice Input (DVI) applications structured the A.I. aviation decision-making research themes in cockpit design and users' perception and experience. Interviews with Subject Matter Experts (human factors analysts, A.I. analysts, airline managers, examiners, instructors, qualified pilots, and pilots under training) and questionnaires (disseminated to a group of professional pilots and pilots under training) examined A.I. implementation in cockpit design and operations. The results were analyzed to evaluate the suitability of, and significant differences between, eMCO and SiPO from a decision-making perspective.
Keywords: Artificial Intelligence (A.I.), Extended Minimum Crew Operations (eMCO), Single Pilot Operations (SiPO), cockpit design, ergonomics, decision making
{"title":"Artificial Intelligence in aviation decision making process.The transition from extended Minimum Crew Operations to Single Pilot Operations (SiPO)","authors":"Dimitrios Ziakkas, Anastasios Plioutsias, K. Pechlivanis","doi":"10.54941/ahfe1001452","DOIUrl":"https://doi.org/10.54941/ahfe1001452","url":null,"abstract":"Innovation, management of change, and human factors implementation in-flight operations portray the aviation industry. The International Air Transportation Authority (IATA) Technology Roadmap (IATA, 2019) and European Aviation Safety Agency (EASA) Artificial Intelligence (A.I.) roadmap propose an outline and assessment of ongoing technology prospects, which change the aviation environment with the implementation of A.I. and introduction of extended Minimum Crew Operations (eMCO) and Single Pilot Operations (SiPO). Changes in the workload will affect human performance and the decision-making process. The research accepted the universally established definition in the A.I. approach of “any technology that appears to emulate the performance of a human” (EASA, 2020). A review of the existing literature on Direct Voice Inputs (DVI) applications structured A.I. aviation decision-making research themes in cockpit design and users’ perception - experience. Interviews with Subject Matter Experts (Human Factors analysts, A.I. analysts, airline managers, examiners, instructors, qualified pilots, pilots under training) and questionnaires (disseminated to a group of professional pilots and pilots under training) examined A.I. implementation in cockpit design and operations. Results were analyzed and evaluated the suitability and significant differences of e-MCO and SiPO under the decision-making aspect.Keywords: Artificial Intelligence (A.I.), Extended Minimum Crew Operations (e-MCO), Single Pilot Operations (SiPO), cockpit design, ergonomics, decision making.","PeriodicalId":405313,"journal":{"name":"Artificial Intelligence and Social Computing","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128770263","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Analysis of citizens' sentiment towards the Philippine administration's interventions against COVID-19
Matthew John Sino Cruz, M. D. De Leon
The COVID-19 pandemic affected the world. The World Health Organization (WHO) issued guidelines the public must follow to prevent the spread of the disease, including social distancing, the wearing of face masks, and regular washing of hands. These guidelines served as the basis for the policies formulated by countries affected by the pandemic. In the Philippines, the government implemented different initiatives, following the WHO guidelines, aimed at mitigating the effects of the pandemic in the country. The initiatives formulated by the administration include international and domestic travel restrictions, community quarantine, suspension of face-to-face classes and work arrangements, and a phased reopening of the Philippine economy, to name a few. These initiatives elicited varying reactions from citizens, who expressed them on social media platforms such as Twitter and Facebook. The reactions expressed on these platforms were used to analyze citizens' sentiment towards the initiatives implemented by the government during the pandemic. In this study, a hybrid Bidirectional Recurrent Neural Network-Long Short-Term Memory-Support Vector Machine (BRNN-LSTM-SVM) sentiment classifier was used to determine the sentiments of the Philippine public towards the initiatives of the Philippine government to mitigate the effects of the COVID-19 pandemic. The dataset was collected and extracted from Facebook and Twitter using APIs and www.exportcomments.com, covering March 2020 to August 2020. Twenty-five percent of the dataset was manually annotated by two human annotators, and the annotated portion was used to build a COVID-19 context-based sentiment lexicon, which was later used to determine the polarity of each document. Since the dataset contained unstructured and noisy data, preprocessing steps such as conversion to lowercase, removal of stopwords, removal of usernames and pure-digit tokens, and translation into English were performed. The preprocessed dataset was vectorized using GloVe word embeddings and used to train and test the proposed model. The performance of the hybrid BRNN-LSTM-SVM model was compared to BRNN-LSTM and SVM in experiments on the preprocessed dataset. The results show that the hybrid BRNN-LSTM-SVM model, which attained 95% accuracy on the Facebook dataset and 93% on the Twitter dataset, outperformed the Support Vector Machine (SVM) sentiment model, whose accuracy ranged only from 89% to 91% on both datasets. The results indicate that citizens harbor negative sentiments towards the government's initiatives to mitigate the effects of the COVID-19 pandemic. The results of the study may be used in reviewing the initiatives imposed during the pandemic to determine the issues that most concern the citizens.
{"title":"Analysis of citizen's sentiment towards Philippine administration's intervention against COVID-19","authors":"Matthew John Sino Cruz, M. D. De Leon","doi":"10.54941/ahfe1001446","DOIUrl":"https://doi.org/10.54941/ahfe1001446","url":null,"abstract":"The COVID-19 pandemic affected the world. The World Health Organization or WHO issued guidelines the public must follow to prevent the spread of the disease. This includes social distancing, the wearing of facemasks, and regular washing of hands. These guidelines served as the basis for formulating policies by countries affected by the pandemic. In the Philippines, the government implemented different initiatives, following the guidelines of WHO, that aimed to mitigate the effect of the pandemic in the country. Some of the initiatives formulated by the administration include international and domestic travel restrictions, community quarantine, suspension of face-to-face classes and work arrangements, and phased reopening of the Philippine economy to name a few. The initiatives implemented by the government during the surge of COVID-19 disease have resulted in varying reactions from the citizens. The citizens expressed their reactions to these initiatives using different social media platforms such as Twitter and Facebook. The reactions expressed using these social media platforms were used to analyze the sentiment of the citizens towards the initiatives implemented by the government during the pandemic. In this study, a Bidirectional Recurrent Neural Network-Long Short-term memory - Support Vector Machine (BRNN-LSTM-SVM) hybrid sentiment classifier model was used to determine the sentiments of the Philippine public toward the initiatives of the Philippine government to mitigate the effects of the COVID-19 pandemic. The dataset used was collected and extracted from Facebook and Twitter using API and www.exportcomments.com from March 2020 to August 2020. 25% of the dataset was manually annotated by two human annotators. The manually annotated dataset was used to build the COVID-19 context-based sentiment lexicon, which was later used to determine the polarity of each document. Since the dataset contained unstructured and noisy data, preprocessing activities such as conversion to lowercase characters, removal of stopwords, removal of usernames and pure digit texts, and translation to the English language were performed. The preprocessed dataset was vectorized using Glove word embedding and was used to train and test the performance of the proposed model. The performance of the Hybrid BRNN-LSTM-SVM model was compared to BRNN-LSTM and SVM by performing experiments using the preprocessed dataset. The results show that the Hybrid BRNN-LSTM-SVM model, which gained 95% accuracy for the Facebook dataset and 93% accuracy for the Twitter dataset, outperformed the Support Vector Machine (SVM) sentiment model whose accuracy only ranges from 89% to 91% for both datasets. The results indicate that the citizens harbor negative sentiments towards the initiatives of the government in mitigating the effect of the COVID-19 pandemic. 
The results of the study may be used in reviewing the initiatives imposed during the pandemic to determine the issues which concern the ","PeriodicalId":405313,"journal":{"name":"Artificial Intelligence and Social Computing","volume":"241 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124653833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
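A minimal sketch of the hybrid classifier idea described above, assuming a bidirectional LSTM encoder whose learned representation is handed to an SVM for the final decision; the layer sizes, the 100-dimensional GloVe choice, and all hyperparameters are assumptions rather than the paper's settings:

```python
# Hybrid BRNN-LSTM-SVM sketch: train a BiLSTM end-to-end, then reuse its
# penultimate representation as features for an SVM. Data here is random
# stand-in for tokenized, lexicon-labeled posts.
import numpy as np
from tensorflow import keras
from sklearn.svm import SVC

VOCAB, MAXLEN, EMB_DIM = 20000, 50, 100

inputs = keras.Input(shape=(MAXLEN,), dtype="int32")
x = keras.layers.Embedding(VOCAB, EMB_DIM)(inputs)        # GloVe weights would go here
x = keras.layers.Bidirectional(keras.layers.LSTM(64))(x)  # BRNN-LSTM encoder
outputs = keras.layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")

X = np.random.randint(0, VOCAB, size=(256, MAXLEN))  # stand-in tokenized posts
y = np.random.randint(0, 2, size=256)                # stand-in lexicon polarities
model.fit(X, y, epochs=1, batch_size=32, verbose=0)

# Hybrid step: feed the BiLSTM's learned representation to an SVM.
encoder = keras.Model(inputs, model.layers[-2].output)
svm = SVC(kernel="rbf").fit(encoder.predict(X, verbose=0), y)
```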
Detecting Potential Depressed Users in Twitter Using a Fine-tuned DistilBERT Model
Miguel Antonio Adarlo, M. D. De Leon
With the spread of Major Depressive Disorder, otherwise known simply as depression, around the world, various efforts have been made to combat it and to reach out to those suffering from it. Part of those efforts includes the use of technology, such as machine learning models, to screen a person for potential depression through various means, including social media narratives such as tweets from Twitter. Hence, this study aims to evaluate how well a pre-trained DistilBERT (a transformer model for natural language processing), fine-tuned on a set of tweets from depressed and non-depressed users, can detect Twitter users as potentially having depression. Two models were built using the same procedure of preprocessing, splitting, tokenizing, training, fine-tuning, and optimizing. Both the Base Model (trained on the CLPsych 2015 dataset) and the Mixed Model (trained on the CLPsych 2015 dataset plus half of a dataset of scraped tweets) could detect potential depression among Twitter users more than half of the time, achieving Area Under the Receiver Operating Characteristic curve (AUC) scores of 65% and 63%, respectively, when evaluated on the test dataset. The models performed comparably in identifying potentially depressed users: there was no significant difference in their AUC scores under a z-test at the 0.05 level of significance (p = 0.21). These results suggest that DistilBERT, when fine-tuned, may be used to detect potential depression among Twitter users.
{"title":"Detecting Potential Depressed Users in Twitter Using a Fine-tuned DistilBERT Model","authors":"Miguel Antonio Adarlo, M. D. De Leon","doi":"10.54941/ahfe1001458","DOIUrl":"https://doi.org/10.54941/ahfe1001458","url":null,"abstract":"With the spread of Major Depressive Disorder, otherwise known simply as depression, around the world, various efforts have been made to combat it and to potentially reach out to those suffering from it. Part of those efforts includes the use of technology, such as machine learning models, to screen a potential person for depression through various means, including social media narratives, such as tweets from Twitter. Hence, this study aims to evaluate how well a pre-trained DistilBERT, a transformer model for natural language processing that was fine-tuned on a set of tweets coming from depressed and non-depressed users, can detect potential users in Twitter as having depression. Two models were built using the same procedure of preprocessing, splitting, tokenizing, training, fine-tuning, and optimizing. Both the Base Model (trained on CLPsych 2015 Dataset) and the Mixed Model (trained on the CLPsych 2015 Dataset and a half of the dataset of scraped tweets) could detect potential users in Twitter for depression more than half of the time by demonstrating an Area under the Receiver Operating Curve (AUC) score of 65% and 63%, respectively, when evaluated using the test dataset. These models performed comparably in identifying potential depressed users in Twitter given that there was no significant difference in their AUC scores when subjected to a z-test at 95% confidence interval and 0.05 level of significance (p = 0.21). These results suggest DistilBERT, when fine-tuned, may be used to detect potential users in Twitter for depression.","PeriodicalId":405313,"journal":{"name":"Artificial Intelligence and Social Computing","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133532843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dynamically monitoring crowd-workers' reliability with interval-valued labels
Chenyi Hu, Makenzie Spurling
Crowdsourcing has rapidly become a computing paradigm in machine learning and artificial intelligence. In crowdsourcing, multiple labels are collected from crowd-workers on an instance, usually through the Internet. These labels are then aggregated into a single label intended to match the ground truth of the instance. Due to its open nature, human workers in crowdsourcing come with various levels of knowledge and socio-economic backgrounds, and effectively handling such human factors has been a focus in the study and applications of crowdsourcing. For example, Bi et al. studied the impacts of workers' dedication, expertise, judgment, and task difficulty (Bi et al., 2014). Qiu et al. offered methods for selecting workers based on behavior prediction (Qiu et al., 2016). Barbosa and Chen suggested rehumanizing crowdsourcing to deal with human biases (Barbosa and Chen, 2019). Checco et al. studied adversarial attacks on crowdsourcing for quality control (Checco et al., 2020). Many more related works are available in the literature. In contrast to commonly used binary-valued labels, interval-valued labels (IVLs) have been introduced very recently (Hu et al., 2021). Applying statistical and probabilistic properties of interval-valued datasets, Spurling et al. quantitatively defined a worker's reliability in four measures: correctness, confidence, stability, and predictability (Spurling et al., 2021). Calculating these measures, except correctness, does not require the ground truth of each instance but only the worker's IVLs. Applying these quantified reliability measures has significantly improved the overall quality of crowdsourcing (Spurling et al., 2022). However, in real-world applications, the reliability of a worker may vary from time to time rather than remaining constant, so it is necessary to monitor a worker's reliability dynamically. Because a worker j labels instances sequentially, we treat j's IVLs as an interval-valued time series. Assuming j's reliability depends only on the IVLs within a time window, we calculate j's reliability measures from the IVLs in the current window; moving the window forward with our proposed practical strategies, we can monitor j's reliability dynamically. Furthermore, the four reliability measures derived from IVLs are time-varying too; with regression analysis, we can separate each reliability measure into an explainable trend and possible errors. To validate our approaches, we use four real-world benchmark datasets in our computational experiments. The main findings are as follows. The reliability-weighted interval majority voting (WIMV) and weighted preferred matching probability (WPMP) schemes consistently outperform the base schemes, majority voting (MV), interval majority voting (IMV), and preferred matching probability (PMP), in terms of much higher accuracy, precision, recall, and F1-score. Through monitoring workers' reliability, our computational experiments have successfully identified possible attackers; by removing the identified attackers, we ensure the overall quality. We also studied the impact of the choice of window size. Dynamically monitoring workers' reliability is necessary, and the computational results demonstrate the potential of the proposed approaches. This research is partially supported by the National Science Foundation under grant NSF/OIA-1946391.
{"title":"Dynamically monitoring crowd-worker's reliability with interval-valued labels","authors":"Chenyi Hu, Makenzie Spurling","doi":"10.54941/ahfe1003270","DOIUrl":"https://doi.org/10.54941/ahfe1003270","url":null,"abstract":"Crowdsourcing has rapidly become a computing paradigm in machine learning and artificial intelligence. In crowdsourcing, multiple labels are collected from crowd-workers on an instance usually through the Internet. These labels are then aggregated as a single label to match the ground truth of the instance. Due to its open nature, human workers in crowdsourcing usually come with various levels of knowledge and socio-economic backgrounds. Effectively handling such human factors has been a focus in the study and applications of crowdsourcing. For example, Bi et al studied the impacts of worker's dedication, expertise, judgment, and task difficulty (Bi et al 2014). Qiu et al offered methods for selecting workers based on behavior prediction (Qiu et al 2016). Barbosa and Chen suggested rehumanizing crowdsourcing to deal with human biases (Barbosa 2019). Checco et al studied adversarial attacks on crowdsourcing for quality control (Checco et al 2020). There are many more related works available in literature. In contrast to commonly used binary-valued labels, interval-valued labels (IVLs) have been introduced very recently (Hu et al 2021). Applying statistical and probabilistic properties of interval-valued datasets, Spurling et al quantitatively defined worker's reliability in four measures: correctness, confidence, stability, and predictability (Spurling et al 2021). Calculating these measures, except correctness, does not require the ground truth of each instance but only worker’s IVLs. Applying these quantified reliability measures, people have significantly improved the overall quality of crowdsourcing (Spurling et al 2022). However, in real world applications, the reliability of a worker may vary from time to time rather than a constant. It is necessary to monitor worker’s reliability dynamically. Because a worker j labels instances sequentially, we treat j’s IVLs as an interval-valued time series in our approach. Assuming j’s reliability relies on the IVLs within a time window only, we calculate j’s reliability measures with the IVLs in the current time window. Moving the time window forward with our proposed practical strategies, we can monitor j’s reliability dynamically. Furthermore, the four reliability measures derived from IVLs are time varying too. With regression analysis, we can separate each reliability measure as an explainable trend and possible errors. To validate our approaches, we use four real world benchmark datasets in our computational experiments. Here are the main findings. The reliability weighted interval majority voting (WIMV) and weighted preferred matching probability (WPMP) schemes consistently overperform the base schemes in terms of much higher accuracy, precision, recall, and F1-score. Note: the base schemes are majority voting (MV), interval majority voting (IMV), and preferred matching probability (PMP). 
Through monitoring worker’s reliability, our computational experiments have successfully identified possible a","PeriodicalId":405313,"journal":{"name":"Artificial Intelligence and Social Computing","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115704897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
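The windowing mechanism described above can be sketched as follows; note that the two reliability measures computed here (confidence as interval narrowness, stability as steadiness of interval midpoints) are simplified stand-ins, not the formal definitions of Spurling et al. (2021):

```python
# Sliding-window monitoring over a worker's interval-valued labels (IVLs).
from collections import deque

class ReliabilityMonitor:
    def __init__(self, window_size=50):
        self.window = deque(maxlen=window_size)  # old IVLs drop out automatically

    def add_label(self, low, high):
        """Record worker j's next IVL [low, high] in sequence order."""
        self.window.append((low, high))

    def measures(self):
        n = len(self.window)
        widths = [h - l for l, h in self.window]
        mids = [(h + l) / 2 for l, h in self.window]
        mean_mid = sum(mids) / n
        return {
            # Narrow intervals read as confident labeling (stand-in formula).
            "confidence": 1.0 - sum(widths) / n,
            # Low dispersion of midpoints reads as stable labeling (stand-in).
            "stability": 1.0 - (sum((m - mean_mid) ** 2 for m in mids) / n) ** 0.5,
        }

mon = ReliabilityMonitor(window_size=3)
for ivl in [(0.6, 0.9), (0.7, 0.8), (0.1, 1.0), (0.65, 0.85)]:
    mon.add_label(*ivl)
print(mon.measures())  # computed over the 3 most recent IVLs only
```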
A machine learning approach for optimizing waiting times in a hand surgery operation center
A. Schuller, M. Braun, Peter Hahn
For patients scheduled for surgery, long waiting times are unpleasant. However, scheduling that is too patient-oriented can lead to friction losses in the operating room and waiting times for the medical personnel. We analyzed historical hand surgery data to improve forecasts of hand surgery durations, optimize operating room scheduling for physicians and patients, and reduce overall waiting times. Several models were evaluated for forecasting surgery durations. A quantile-based approach built on the distribution of surgery durations was tested in a scheduling simulation; it indicated possibilities for gradually balancing waiting times between patients and medical personnel. Within a field trial, a trained regression model was successfully deployed in a hand surgery operation center.
{"title":"A machine learning approach for optimizing waiting times in a hand surgery operation center","authors":"A. Schuller, M. Braun, Peter Hahn","doi":"10.54941/ahfe1003268","DOIUrl":"https://doi.org/10.54941/ahfe1003268","url":null,"abstract":"For patients scheduled for surgery, long waiting times are unpleasant. However, scheduling that is too patient-oriented can lead to friction losses in the operating room and waiting times for the medical personnel. We have conducted an analysis of historical hand surgery data to improve forecasting of hand surgery durations, optimize operation room scheduling for physicians and patients and reduce overall waiting times. Several models have been evaluated to forecast surgery durations. A quantile-based approach based on the distribution of surgery durations has been tested in a scheduling simulation. This approach has indicated possibilities to gradually balance waiting times between patients and medical personnel. Within a field trial, a trained regression model has been successfully deployed in a hand surgery operation center.","PeriodicalId":405313,"journal":{"name":"Artificial Intelligence and Social Computing","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114648080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Supradyadic Trust in Artificial Intelligence
Stephen L. Dorton
There is a considerable body of research on trust in Artificial Intelligence (AI). Trust has been viewed almost exclusively as a dyadic construct, where it is a function of various factors between the user and the agent, mediated by the context of the environment. A recent study found several cases of supradyadic trust interactions, where a user's trust in the AI is affected by how other people interact with the agent, above and beyond endorsements or reputation. An analysis of these supradyadic interactions is presented, along with a discussion of practical considerations for AI developers and an argument for more complex representations of trust in AI.
{"title":"Supradyadic Trust in Artificial Intelligence","authors":"Stephen L. Dorton","doi":"10.54941/ahfe1001451","DOIUrl":"https://doi.org/10.54941/ahfe1001451","url":null,"abstract":"There is a considerable body of research on trust in Artificial Intelligence (AI). Trust has been viewed almost exclusively as a dyadic construct, where it is a function of various factors between the user and the agent, mediated by the context of the environment. A recent study has found several cases of supradyadic trust interactions, where a user’s trust in the AI is affected by how other people interact with the agent, above and beyond endorsements or reputation. An analysis of these surpradyadic interactions is presented, along with a discussion of practical considerations for AI developers, and an argument for more complex representations of trust in AI.","PeriodicalId":405313,"journal":{"name":"Artificial Intelligence and Social Computing","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130715313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatic Labeling of Human Actions by Skeleton Clustering and Fuzzy Similarity
Chao-Lung Yang, Shang-Che Hsu, Simi Wang, Jing-Feng Nian
Human action recognition (HAR) is now applied in multiple fields, owing to the rapid growth of artificial intelligence and machine learning. Applying HAR to industrial production lines can help visualize and analyze the correlation between human operators and machine utilization to improve overall productivity. However, training a HAR model requires manually labeling certain actions in a large amount of collected video data, which is very costly. How to label a large amount of video automatically is an emerging practical problem in the HAR research domain. This research proposes an automatic labeling framework that integrates Dynamic Time Warping (DTW), human skeleton clustering, and fuzzy similarity to assign labels based on pre-defined human actions. First, a skeleton estimation method such as OpenPose is used to detect the joint key points of the human operator's skeleton. Then, the skeleton data is converted to spatial-temporal data for calculating the DTW distance between skeletons. Groups of human skeletons can be clustered based on the DTW distances among them. Within a group, undefined skeletons are compared with the pre-defined skeletons, which serve as references, and labels are assigned according to similarity against those references. The experimental dataset was created by simulating the human actions of manual drilling operations. Comparison with manually labeled data shows that the proposed labeling model achieves up to 95% in accuracy, precision, recall, and F1 while reducing labeling time by 40%.
{"title":"Automatic Labeling of Human Actions by Skeleton Clustering and Fuzzy Similarity","authors":"Chao-Lung Yang, Shang-Che Hsu, Simi Wang, Jing-Feng Nian","doi":"10.54941/ahfe1001457","DOIUrl":"https://doi.org/10.54941/ahfe1001457","url":null,"abstract":"Nowadays, human action recognition (HAR) has been applied in multiple fields with the rapid growth of artificial intelligence and machine learning. Applying HAR onto industrial production lines can help on visualizing and analyzing the correlation between human operators and machine utilization to improve overall productivity. However, to train HAR model, the manual labeling of certain actions in a large amount of the collected video data is required and very costly. How to label a large amount of video automatically is an emerging practical problem in HAR research domain. This research proposed an automatic labeling framework by integrating Dynamic Time Warping (DTW), human skeleton clustering, and Fuzzy similarity to assign the labels based on the pre-defined human actions. First, the skeleton estimation method such as OpenPose was used to jointly detect key points of the human operator’s skeleton. Then, the skeleton data was converted to spatial-temporal data for calculating the DTW distance between skeletons. The groups of human skeletons can be clustered based on DTW distance among skeletons. Within a group of skeletons, the undefined skeletons will be compared with the pre-defined skeletons, considered as the references, and the labels are assigned according to the similarity against the references. The experimental dataset was created by simulating the human actions of manual drilling operations. By comparing with the manual labeled data, the results show that all of accuracy, precision, recall, and F1 of the proposed labeling model can achieve up to 95% with 40% saving time.","PeriodicalId":405313,"journal":{"name":"Artificial Intelligence and Social Computing","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130820055","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Generating a Multimodal Dataset Using a Feature Extraction Toolkit for Wearable and Machine Learning: A pilot study
Edwin Marte Zorrilla, I. Villanueva, J. Husman, Matthew C. Graham
Studies of stress and student performance with multimodal sensor measurements have been a recent topic of discussion among education researchers. With the advances in computational hardware and the use of machine learning strategies, scholars can now deal with data of high dimensionality and predict new estimates for future research designs. In this paper, the process of generating and obtaining a multimodal dataset that includes physiological measurements (e.g., electrodermal activity, EDA) from wearable devices is presented. Through the use of a feature generation toolkit for wearable data, the time to extract, clean, and generate the data was reduced. A machine learning model was developed from an openly available multimodal dataset, and the results were compared against previous studies to evaluate the utility of these approaches and tools.
Keywords: Engineering Education, Physiological Sensing, Student Performance, Machine Learning, Multimodal, FLIRT, WESAD
{"title":"Generating a Multimodal Dataset Using a Feature Extraction Toolkit for Wearable and Machine Learning: A pilot study","authors":"Edwin Marte Zorrilla, I. Villanueva, J. Husman, Matthew C. Graham","doi":"10.54941/ahfe1001448","DOIUrl":"https://doi.org/10.54941/ahfe1001448","url":null,"abstract":"Studies for stress and student performance with multimodal sensor measurements have been a recent topic of discussion among research educators. With the advances in computational hardware and the use of Machine learning strategies, scholars can now deal with data of high dimensionality and provide a way to predict new estimates for future research designs. In this paper, the process to generate and obtain a multimodal dataset including physiological measurements (e.g., electrodermal activity- EDA) from wearable devices is presented. Through the use of a Feature Generation Toolkit for Wearable Data, the time to extract clean and generate the data was reduced. A machine learning model from an openly available multimodal dataset was developed and results were compared against previous studies to evaluate the utility of these approaches and tools. Keywords: Engineering Education, Physiological Sensing, Student Performance, Machine Learning, Multimodal, FLIRT, WESAD","PeriodicalId":405313,"journal":{"name":"Artificial Intelligence and Social Computing","volume":"164 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120913185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}