Jiawen Zhang , Dongliang Han , Shuai Han , Heng Li , Wing-Kai Lam , Mingyu Zhang
{"title":"ChatMatch:探索视觉-语言混合深度学习方法在球拍类运动智能分析和推理中的潜力","authors":"Jiawen Zhang , Dongliang Han , Shuai Han , Heng Li , Wing-Kai Lam , Mingyu Zhang","doi":"10.1016/j.csl.2024.101694","DOIUrl":null,"url":null,"abstract":"<div><p>Video understanding technology has become increasingly important in various disciplines, yet current approaches have primarily focused on lower comprehension level of video content, posing challenges for providing comprehensive and professional insights at a higher comprehension level. Video analysis plays a crucial role in athlete training and strategy development in racket sports. This study aims to demonstrate an innovative and higher-level video comprehension framework (ChatMatch), which integrates computer vision technologies with the cutting-edge large language models (LLM) to enable intelligent analysis and inference of racket sports videos. To examine the feasibility of this framework, we deployed a prototype of ChatMatch in the badminton in this study. A vision-based encoder was first proposed to extract the meta-features included the locations, actions, gestures, and action results of players in each frame of racket match videos, followed by a rule-based decoding method to transform the extracted information in both structured knowledge and unstructured knowledge. A set of LLM-based agents included namely task identifier, coach agent, statistician agent, and video manager, was developed through a prompt engineering and driven by an automated mechanism. The automatic collaborative interaction among the agents enabled the provision of a comprehensive response to professional inquiries from users. The validation findings showed that our vision models had excellent performances in meta-feature extraction, achieving a location identification accuracy of 0.991, an action recognition accuracy of 0.902, and a gesture recognition accuracy of 0.950. Additionally, a total of 100 questions were gathered from four proficient badminton players and one coach to evaluate the performance of the LLM-based agents, and the outcomes obtained from ChatMatch exhibited commendable results across general inquiries, statistical queries, and video retrieval tasks. These findings highlight the potential of using this approach that can offer valuable insights for athletes and coaches while significantly improve the efficiency of sports video analysis.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"89 ","pages":"Article 101694"},"PeriodicalIF":3.1000,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000779/pdfft?md5=2c72701b559ac872232548320e08722b&pid=1-s2.0-S0885230824000779-main.pdf","citationCount":"0","resultStr":"{\"title\":\"ChatMatch: Exploring the potential of hybrid vision–language deep learning approach for the intelligent analysis and inference of racket sports\",\"authors\":\"Jiawen Zhang , Dongliang Han , Shuai Han , Heng Li , Wing-Kai Lam , Mingyu Zhang\",\"doi\":\"10.1016/j.csl.2024.101694\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Video understanding technology has become increasingly important in various disciplines, yet current approaches have primarily focused on lower comprehension level of video content, posing challenges for providing comprehensive and professional insights at a higher comprehension level. Video analysis plays a crucial role in athlete training and strategy development in racket sports. This study aims to demonstrate an innovative and higher-level video comprehension framework (ChatMatch), which integrates computer vision technologies with the cutting-edge large language models (LLM) to enable intelligent analysis and inference of racket sports videos. To examine the feasibility of this framework, we deployed a prototype of ChatMatch in the badminton in this study. A vision-based encoder was first proposed to extract the meta-features included the locations, actions, gestures, and action results of players in each frame of racket match videos, followed by a rule-based decoding method to transform the extracted information in both structured knowledge and unstructured knowledge. A set of LLM-based agents included namely task identifier, coach agent, statistician agent, and video manager, was developed through a prompt engineering and driven by an automated mechanism. The automatic collaborative interaction among the agents enabled the provision of a comprehensive response to professional inquiries from users. The validation findings showed that our vision models had excellent performances in meta-feature extraction, achieving a location identification accuracy of 0.991, an action recognition accuracy of 0.902, and a gesture recognition accuracy of 0.950. Additionally, a total of 100 questions were gathered from four proficient badminton players and one coach to evaluate the performance of the LLM-based agents, and the outcomes obtained from ChatMatch exhibited commendable results across general inquiries, statistical queries, and video retrieval tasks. These findings highlight the potential of using this approach that can offer valuable insights for athletes and coaches while significantly improve the efficiency of sports video analysis.</p></div>\",\"PeriodicalId\":50638,\"journal\":{\"name\":\"Computer Speech and Language\",\"volume\":\"89 \",\"pages\":\"Article 101694\"},\"PeriodicalIF\":3.1000,\"publicationDate\":\"2024-07-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.sciencedirect.com/science/article/pii/S0885230824000779/pdfft?md5=2c72701b559ac872232548320e08722b&pid=1-s2.0-S0885230824000779-main.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computer Speech and Language\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0885230824000779\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Speech and Language","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0885230824000779","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
ChatMatch: Exploring the potential of hybrid vision–language deep learning approach for the intelligent analysis and inference of racket sports
Video understanding technology has become increasingly important in various disciplines, yet current approaches have primarily focused on lower comprehension level of video content, posing challenges for providing comprehensive and professional insights at a higher comprehension level. Video analysis plays a crucial role in athlete training and strategy development in racket sports. This study aims to demonstrate an innovative and higher-level video comprehension framework (ChatMatch), which integrates computer vision technologies with the cutting-edge large language models (LLM) to enable intelligent analysis and inference of racket sports videos. To examine the feasibility of this framework, we deployed a prototype of ChatMatch in the badminton in this study. A vision-based encoder was first proposed to extract the meta-features included the locations, actions, gestures, and action results of players in each frame of racket match videos, followed by a rule-based decoding method to transform the extracted information in both structured knowledge and unstructured knowledge. A set of LLM-based agents included namely task identifier, coach agent, statistician agent, and video manager, was developed through a prompt engineering and driven by an automated mechanism. The automatic collaborative interaction among the agents enabled the provision of a comprehensive response to professional inquiries from users. The validation findings showed that our vision models had excellent performances in meta-feature extraction, achieving a location identification accuracy of 0.991, an action recognition accuracy of 0.902, and a gesture recognition accuracy of 0.950. Additionally, a total of 100 questions were gathered from four proficient badminton players and one coach to evaluate the performance of the LLM-based agents, and the outcomes obtained from ChatMatch exhibited commendable results across general inquiries, statistical queries, and video retrieval tasks. These findings highlight the potential of using this approach that can offer valuable insights for athletes and coaches while significantly improve the efficiency of sports video analysis.
期刊介绍:
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.