AI: Can You Help Address This Issue?

Educational Measurement: Issues and Practice | Pub Date: 2024-11-10 | DOI: 10.1111/emip.12655
Deborah J. Harris
{"title":"AI: Can You Help Address This Issue?","authors":"Deborah J. Harris","doi":"10.1111/emip.12655","DOIUrl":null,"url":null,"abstract":"<p>Linking across test forms or pools of items is necessary to ensure scores that are reported across different administrations are comparable and lead to consistent decisions for examinees whose abilities are the same, but who were administered different items. Most of these linkages consist of equating test forms or scaling calibrated items or pools to be on the same theta scale. The typical methodology to accomplish this linking makes use of common examinees or common items, where common examinees are understood to be groups of examinees of comparable ability, whether obtained through a single group (where the same examinees are administered multiple assessments) or a random groups design, where random assignment or pseudo random assignment is done (such as spiraling the test forms, say 1, 2, 3, 4, 5, and distributing them such that every 5th examinee receives the same form). Common item methodology is usually implemented by having identical items in multiple forms and using those items to link across forms or pools. These common items may be scored or unscored in terms of whether they are treated as internal or external anchors (i.e., whether they are contributing to the examinee's score).</p><p>There are situations where it is not practical to have either common examinees nor common items. Typically, these are high-stakes settings, where the security of the assessment questions would likely be at risk if any were repeated. This would include scenarios where the entire assessment is released after administration to promote transparency. In some countries, a single form of a national test may be administered to all examinees during a single administration time. While in some cases a student who does not do as well as they had hoped may retest the following year, this may be a small sample and these students would not be considered representative of the entire body of test-takers. In addition, it is presumed they would have spent the intervening year studying for the exam, and so they could not really be considered common examinees across years and assessment forms.</p><p>Although the decisions (such as university admissions) based on the assessment scores are comparable within the year, because all examinees are administered the same set of items on the same date, it is difficult to monitor trends over time as there is no linkage between forms across years. Although the general populations may be similar (e.g., 2024 secondary school graduates versus 2023 secondary school graduates), there is no evidence that the groups are strictly equivalent across years. Similarly, comparing how examinees perform across years (e.g., highest scores, average raw score, and so on) is challenging as there is no adjustment for yearly fluctuations in form difficulty across years.</p><p>There have been variations of both common item and common examinee linking, such as using similar items, rather than identical items, including where perhaps these similar items are clones of each other, and using various types of matching techniques in an attempt to achieve common examinees by creating equivalent subgroups across forms. Cloning items or generating items from a template has had some success in terms of creating items of identical-ish difficulty. 
However, whether items that are clones of released items would still maintain their integrity and properties sufficiently to serve as linking items would need to be studied.</p><p>I, and many others, have been involved in several studies trying to accomplish linking for comparability where neither common items nor common examinees have been available. This short section provides a glimpse of some of that research.</p><p>Harris and Fang (<span>2015</span>) considered multiple options to address comparing assessment scores across years where there were no common items and no common examinees. Two of these options involved making an assumption, and the others involved making an adjustment. In the first instance, the assumption was made to treat the groups of examinees in different years as equivalent. This was solely an assumption, with no confirming evidence that the assumption was reasonable. Once the assumption was made, a random groups equating across years was conducted. The second method was to assume the test forms were built to be equivalent in difficulty, again with no evidence this was indeed the case. Because it was assumed the test forms were equivalent in difficulty, equating was not necessary (recall equating across test forms only adjusts for small differences in form difficulty; if there is no difference in difficulty, equating is unnecessary), and scores from the different forms in different years could be directly compared. The third option was to create subgroups of examinees to hopefully imitate equivalent groups of examinees being administered each of the forms, and then to conduct random groups equating. One of these subgroup options was created by using the middle 80% of the distributions of examinee scores for each year, and another used self-reported information the examinees provided, such as courses they had taken and course grades received, to create comparable subgroups across years.</p><p>Huh et al. (<span>2016, 2017</span>) expanded on Harris and Fang (<span>2015</span>), again examining alternative ways to utilize equating methodology in a context where items cannot be readministered, and the examinee groups being administered different test forms cannot be assumed to be equivalent groups. The authors referred to the methods they studied as “pseudo-equating,” as equating methodology was used, but the data assumptions associated with actual equating, such as truly randomly equivalent groups of examinees or common items, were absent. They included the two assumption-based methods Harris and Fang looked at, as well as the two ways of adjusting the examinee groups. Assuming the two forms were built to the same difficulty specifications and assuming the two samples of examinees were equivalent did not work as well as making an adjustment to try to form comparable subgroups. The two adjustments used in Harris and Fang were replicated: the middle 80% of each examinee distribution were used, the basis again being that perhaps the examinee groups would differ more in the tails than in the center, and matching group distributions based on additional information. When classical test theory equipercentile with post smoothing equating methodology was implemented, the score distribution of the subsequent year's group was adjusted based on weighting to match the initial group distribution based on variables thought to be important such as self-reported subject grades and extracurricular activities. 
When IRT true score equating methodology was used, comparable samples for the two groups were created for calibrations by matching the proportions within each stratum as defined by the related variables. In general, the attempts to match the groups performed better than simply making assumptions about the groups or form difficulty equivalences. Wu et al. (<span>2017</span>) also looked at trying to create matched sample distributions across two dispirit examinee groups, with similar results.</p><p>Kim and Walker (<span>2021</span>) investigated creating groups of similar ability using subgroup weighting, augmented by a small number of common items. Propensity score weighting, including variants such as coarsened exact matching, has also been used in an attempt to create equivalent groups of examinees (see, for example, Cho et al., <span>2024</span>; Kapoor et al., <span>2024</span>; Li et al., <span>2024</span>; Woods et al., <span>2024</span> who all looked at creating equivalent samples of test takers in the context of mode studies, where one group of examinees tested on a device and the other group tested on paper).</p><p>Propensity scores are “the conditional probability of assignment to a particular treatment given a vector of observed covariates” (Rosenbaum &amp; Rubin, <span>1983</span>, p. 41). In our scenario that translates to an examinee testing on one assessment form instead of the other. One key step in implementing propensity score matching is identifying the covariates to include. In our scenario that would involve those variables which would be available for both populations and be appropriately related to the variable of interest, and to end up with equivalent samples of examinees being administered the different test forms, to allow us to appropriately conduct a random groups equating across the two forms. For example, number of math classes taken, names of the specific math classes taken, grades in the individual math classes, overall grade point average for math classes, and so on are all possible covariates that could be included. What data is collected from examinees as well as at what level of granularity are decisions that need to be made. Whether the data on variables is self-reported or provided from a trusted source (e.g., self-reported course grades versus data from a transcript) also are considerations. How many covariates to include, whether exact matching is required, how to deal with missing data, deciding what matching algorithm to use, and so on are further decisions that need to be made.</p><p>Cloning items, generating items from a template, or other ways of finding “matching items” from a subsequent form to substitute as common items with the form one wants to link to has also been studied in a variety of settings. In our scenario the issue is whether the characteristics of the items that impact the assumptions of the common item equating method being used “match” the item it is being substituted for. Item content and item position should be fairly easy to assess. However, item statistics such as IRT parameters and classical difficulty and discrimination would be computed using the responses by the test takers administered the subsequent form, and therefore not directly comparable. That is, trying to match an item with a particular p-value and point-biserial from a form administered to one group using an item administered on a different form to a different group brings us full circle. 
The item statistics would be comparable across the two forms if they were computed on equivalent groups of test-takers (in which case we would not need common items, we could rely on common examinees to link). If cloned items or auto-generated items were shown to have comparable statistics, they could be considered common items when placed in different forms and administered to different groups, at least in theory.</p><p>Common items and common examinees are the two vehicles used in obtaining comparable scores across different forms of an assessment. Common items can be mimicked by having items of equivalent characteristics or by having unique items that have their item parameter estimates on a common theta scale. Research on using artificial intelligence to assist in estimating item parameters already exists, and one assumes is continuing to expand (Hao, et al., <span>2024</span>). Some of these initiatives have been around augmenting small sample sizes to reduce the sampling requirements for pretesting and calibrating new items as they are introduced. Examples include McCarthy et al. (<span>2021</span>) who used a “multi-task generalized linear model with BERT features” to provide initial estimates that are then refined as empirical data are collected; their methodology can also be used to calculate “new item difficulty estimates without piloting them first” (p. 883). Belov et al. (<span>2024</span>) used item pool information and trained a neural network to interpolate the relationship between item parameter estimates and response patterns. Zhang and Chen (<span>2024</span>) presented a “cross estimation network” consisting of a person network and an item network, finding that their approach produced accurate parameter estimates for items or persons, providing the other parameters were given (known). In a different approach, Maeda (<span>2024</span>) used “examinees” generated with AI to calibrate items; he determined the methodology had promise, but was not as accurate as having actual human responses to the items. AI has also been applied in the attempt to form equivalent subgroups in various settings. Monlezun et al. (<span>2022</span>) integrated AI and propensity scoring by using a propensity score adjustment, augmented by machine learning. Collier et al. (<span>2022</span>) demonstrated the application of artificial neural networks in a multilevel setting in propensity score estimation, stating “AI can be helpful in propensity score estimation because it can identify underlying patterns between treatments and confounding variables using machine learning without being explicitly programmed. Many classification algorithms in machine learning can outperform the classical methods for propensity score estimation, mainly when processing the data with many covariates…” (p. 3). The authors note that not all neural networks are created equal and that using different training methods and hyperparameters on the same data can yield different results. They also mention several computer packages and programming languages that are available to implement neural networks for propensity score estimation, including Python, R, and SAS.</p><p>What I would like to suggest is a concerted effort to use AI to assist linking test forms with no common examinees and no common items to allow comparing scores, trends, form difficulty, examinee ability, and so on over time in these situations. 
There are a variety of ways this could proceed, and obviously there would need to be multiple settings studied if there were to be any generalizations about what may and may not work in practice. However, I am going to focus on one scenario. I would like an assessment to be identified fulfilling these conditions: there are many items, the items have been or can be publicly released, and a large sample of examinees have been administered the assessment and their scored item responses and some relevant demographic/auxiliary information is available for the majority of examinees. These later variables might be related to previous achievement, such as earlier test scores or course grades, or demographics such as zip code and age. This assessment data would be used to assess how accurately an AI linkage might be, where we have a real, if contrived, result to serve as the “true” linkage.</p><p>The original assessment is then divided into two sub forms that should be similar in terms of content specifications and difficulty but should differ to the degree alternate forms comprised of items that cannot be pretested or tried out in advance typically are. For simplicity in this paper, I am going to assume odd numbered questions comprise the Odd form and even numbered questions make up the Even form. The examinee group is also divided such that the two subgroups show what would be considered a reasonable difference in composition in terms of ability, demographics, sample size, and so on, when two populations are tested in different years (e.g., 2024 secondary school graduates versus 2023 secondary school graduates).</p><p>The task for AI would be to adjust scores on the Odd form to be comparable to scores on the Even form. This could be done by estimating item characteristics for the Odd form items on the Even form scale and conducting an IRT equating to obtain comparable scores. Or creating equivalent samples from the subgroups administered the separate forms and running a random groups equating, or some other adjustment on items, test takers, or a combination of both. Because all examinees actually were “administered” both the Odd and Even forms, and because all items were administered together and could be calibrated together, there are multiple ways a criterion to evaluate the AI solutions could be created. (Personally, I would compare the AI results to many of them, as any of them could be considered reasonable operationally and if AI results were somewhere in the mix of these other results, it would seem a more reasonable evaluation than requiring the AI results to match any single criterion.)</p><p>AI would be trained using secure items and secure data, and secure equating results to learn the features of items and response patterns that correspond to different item parameter estimates, what different covariates look like in random subgroups of examinees, and so on in this particular context. That is, the training could occur on secure items and data that would not be released. AI could then incorporate item characteristics, examinee responses, examinee demographic variables in arriving at one (or multiple) adjustments to put the scores from the Odd and Even forms on the same scale. AI would likely be able to uncover patterns researchers have not been able to because of the way machine learning works. And there could be multiple ways to divide the original assessment and original sample into subgroups. 
If the items and the data were able to be made publicly available, I think this exercise has the potential to move us forward in trying to address this, and other, linking issues, as one could observe what characteristics of the items, as well as the response data and demographics, turned out to be important in this particular context. Plus, it would just be really cool to see how well AI might be able to address the issues of comparable scores without common items or common examinees.</p>","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"43 4","pages":"9-12"},"PeriodicalIF":2.7000,"publicationDate":"2024-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/emip.12655","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Educational Measurement-Issues and Practice","FirstCategoryId":"95","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/emip.12655","RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION & EDUCATIONAL RESEARCH","Score":null,"Total":0}
引用次数: 0

Abstract

Linking across test forms or pools of items is necessary to ensure that scores reported across different administrations are comparable and lead to consistent decisions for examinees whose abilities are the same but who were administered different items. Most of these linkages consist of equating test forms or scaling calibrated items or pools to be on the same theta scale. The typical methodology for accomplishing this linking makes use of common examinees or common items. Common examinees are understood to be groups of examinees of comparable ability, whether obtained through a single group design (where the same examinees are administered multiple assessments) or a random groups design, where random or pseudo-random assignment is done (such as spiraling the test forms, say 1, 2, 3, 4, 5, and distributing them so that every 5th examinee receives the same form). Common item methodology is usually implemented by placing identical items in multiple forms and using those items to link across forms or pools. These common items may be scored (internal anchors) or unscored (external anchors); that is, they may or may not contribute to the examinee's score.
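
As a concrete illustration of the spiraling described above, here is a minimal sketch (mine, not from the article; the roster size and form count are hypothetical) of how cycling through five forms in packaging order yields randomly equivalent groups:

```python
# A minimal sketch of spiraling: forms 1-5 are interleaved in packaging order,
# so every 5th examinee (in seating/arrival order) receives the same form.
n_forms = 5
examinee_ids = list(range(1, 101))  # hypothetical roster of 100 examinees in seating order

assignments = {form: [] for form in range(1, n_forms + 1)}
for position, examinee in enumerate(examinee_ids):
    form = (position % n_forms) + 1  # cycle through forms 1, 2, 3, 4, 5, 1, 2, ...
    assignments[form].append(examinee)

for form, group in assignments.items():
    print(f"Form {form}: {len(group)} examinees, e.g., {group[:3]}")
```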

There are situations where it is not practical to have either common examinees or common items. Typically, these are high-stakes settings, where the security of the assessment questions would likely be at risk if any were repeated. This would include scenarios where the entire assessment is released after administration to promote transparency. In some countries, a single form of a national test may be administered to all examinees during a single administration window. While in some cases a student who does not do as well as they had hoped may retest the following year, such retesters may be a small sample and would not be considered representative of the entire body of test-takers. In addition, it is presumed they would have spent the intervening year studying for the exam, so they could not really be considered common examinees across years and assessment forms.

Although the decisions (such as university admissions) based on the assessment scores are comparable within a year, because all examinees are administered the same set of items on the same date, it is difficult to monitor trends over time as there is no linkage between forms across years. Although the general populations may be similar (e.g., 2024 secondary school graduates versus 2023 secondary school graduates), there is no evidence that the groups are strictly equivalent across years. Similarly, comparing how examinees perform across years (e.g., highest scores, average raw score, and so on) is challenging, as there is no adjustment for yearly fluctuations in form difficulty.

There have been variations of both common item and common examinee linking, such as using similar items rather than identical items (including cases where the similar items are clones of each other), and using various types of matching techniques in an attempt to achieve common examinees by creating equivalent subgroups across forms. Cloning items or generating items from a template has had some success in creating items of nearly identical difficulty. However, whether items that are clones of released items would still maintain their integrity and properties sufficiently to serve as linking items would need to be studied.

I, and many others, have been involved in several studies trying to accomplish linking for comparability where neither common items nor common examinees have been available. This short section provides a glimpse of some of that research.

Harris and Fang (2015) considered multiple options to address comparing assessment scores across years where there were no common items and no common examinees. Two of these options involved making an assumption, and the others involved making an adjustment. In the first instance, the assumption was made to treat the groups of examinees in different years as equivalent. This was solely an assumption, with no confirming evidence that the assumption was reasonable. Once the assumption was made, a random groups equating across years was conducted. The second method was to assume the test forms were built to be equivalent in difficulty, again with no evidence this was indeed the case. Because it was assumed the test forms were equivalent in difficulty, equating was not necessary (recall equating across test forms only adjusts for small differences in form difficulty; if there is no difference in difficulty, equating is unnecessary), and scores from the different forms in different years could be directly compared. The third option was to create subgroups of examinees to hopefully imitate equivalent groups of examinees being administered each of the forms, and then to conduct random groups equating. One of these subgroup options was created by using the middle 80% of the distributions of examinee scores for each year, and another used self-reported information the examinees provided, such as courses they had taken and course grades received, to create comparable subgroups across years.
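
To make the "middle 80%" adjustment and the subsequent random groups equating concrete, the following sketch (my own illustration on simulated scores, not Harris and Fang's code) trims each year's score distribution to its 10th through 90th percentiles and then maps one year's scores onto the other's scale by matching percentile ranks:

```python
import numpy as np

rng = np.random.default_rng(0)
scores_y1 = rng.normal(30, 8, 5000).clip(0, 60).round()  # hypothetical Year 1 raw scores
scores_y2 = rng.normal(28, 9, 4000).clip(0, 60).round()  # hypothetical Year 2 raw scores

def middle_80(x):
    # keep only the middle 80% of the distribution (10th-90th percentiles)
    lo, hi = np.percentile(x, [10, 90])
    return x[(x >= lo) & (x <= hi)]

trim_y1, trim_y2 = middle_80(scores_y1), middle_80(scores_y2)

def equipercentile(score, from_group, to_group):
    # percentile rank of `score` in the from-group, mapped to the to-group's quantile
    pr = np.mean(from_group <= score) * 100
    return np.percentile(to_group, pr)

print(equipercentile(25, trim_y2, trim_y1))  # a Year 2 raw 25 expressed on the Year 1 scale
```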

Huh et al. (2016, 2017) expanded on Harris and Fang (2015), again examining alternative ways to utilize equating methodology in a context where items cannot be readministered and the examinee groups administered different test forms cannot be assumed to be equivalent. The authors referred to the methods they studied as “pseudo-equating,” as equating methodology was used, but the data assumptions associated with actual equating, such as truly randomly equivalent groups of examinees or common items, were absent. They included the two assumption-based methods Harris and Fang looked at, as well as the two ways of adjusting the examinee groups. Assuming the two forms were built to the same difficulty specifications and assuming the two samples of examinees were equivalent did not work as well as making an adjustment to try to form comparable subgroups. The two adjustments used in Harris and Fang were replicated: using the middle 80% of each examinee distribution (the rationale again being that the examinee groups would perhaps differ more in the tails than in the center), and matching group distributions based on additional information. When classical test theory equipercentile equating with postsmoothing was implemented, the score distribution of the subsequent year's group was adjusted by weighting to match the initial group's distribution on variables thought to be important, such as self-reported subject grades and extracurricular activities. When IRT true score equating methodology was used, comparable samples for the two groups were created for calibration by matching the proportions within each stratum as defined by the related variables. In general, the attempts to match the groups performed better than simply making assumptions about group or form difficulty equivalence. Wu et al. (2017) also looked at creating matched sample distributions across two disparate examinee groups, with similar results.
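
The weighting adjustment can be sketched as simple post-stratification. The variables, proportions, and scores below are hypothetical, and the code is only an illustration of the idea, not the authors' implementation:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
year1 = pd.DataFrame({
    "grade": rng.choice(["A", "B", "C"], 5000, p=[0.3, 0.5, 0.2]),  # e.g., self-reported subject grade
    "score": rng.normal(32, 7, 5000),
})
year2 = pd.DataFrame({
    "grade": rng.choice(["A", "B", "C"], 4000, p=[0.2, 0.5, 0.3]),
    "score": rng.normal(30, 7, 4000),
})

# weight each Year 2 stratum by (Year 1 proportion) / (Year 2 proportion)
p1 = year1["grade"].value_counts(normalize=True)
p2 = year2["grade"].value_counts(normalize=True)
year2["weight"] = year2["grade"].map(p1 / p2)

weighted_mean = np.average(year2["score"], weights=year2["weight"])
print(round(year2["score"].mean(), 2), round(weighted_mean, 2))  # unweighted vs. weighted Year 2 mean
```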

Kim and Walker (2021) investigated creating groups of similar ability using subgroup weighting, augmented by a small number of common items. Propensity score weighting, including variants such as coarsened exact matching, has also been used in an attempt to create equivalent groups of examinees (see, for example, Cho et al., 2024; Kapoor et al., 2024; Li et al., 2024; Woods et al., 2024, all of whom looked at creating equivalent samples of test takers in the context of mode studies, where one group of examinees tested on a device and the other group tested on paper).

Propensity scores are “the conditional probability of assignment to a particular treatment given a vector of observed covariates” (Rosenbaum & Rubin, 1983, p. 41). In our scenario, the “treatment” translates to an examinee testing on one assessment form instead of the other. One key step in implementing propensity score matching is identifying the covariates to include. In our scenario, those would be variables that are available for both populations and appropriately related to the variable of interest, with the aim of ending up with equivalent samples of examinees administered the different test forms so that a random groups equating across the two forms can appropriately be conducted. For example, the number of math classes taken, the names of the specific math classes taken, grades in the individual math classes, overall grade point average for math classes, and so on are all possible covariates that could be included. What data are collected from examinees, and at what level of granularity, are decisions that need to be made. Whether the data are self-reported or provided by a trusted source (e.g., self-reported course grades versus grades from a transcript) is also a consideration. How many covariates to include, whether exact matching is required, how to deal with missing data, what matching algorithm to use, and so on are further decisions that need to be made.
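
A minimal sketch of propensity score weighting in this setting, assuming hypothetical covariates (number of math classes and math GPA) and using a plain logistic regression from scikit-learn; this illustrates the general technique, not any particular study's procedure:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 3000
form = rng.integers(0, 2, n)                     # 0 = one form/year, 1 = the other
math_classes = rng.poisson(3 + 0.5 * form)       # hypothetical covariate 1
math_gpa = rng.normal(3.0 + 0.1 * form, 0.5, n)  # hypothetical covariate 2
X = np.column_stack([math_classes, math_gpa])

# propensity score: estimated probability of being in the form-1 group given the covariates
ps = LogisticRegression().fit(X, form).predict_proba(X)[:, 1]

# inverse-probability weights that make the two groups comparable on the covariates
weights = np.where(form == 1, 1 / ps, 1 / (1 - ps))
print(weights[:5].round(2))
```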

Cloning items, generating items from a template, or other ways of finding “matching items” in a subsequent form to substitute as common items with the form one wants to link to have also been studied in a variety of settings. In our scenario, the issue is whether a substitute item “matches” the item it replaces on the characteristics that affect the assumptions of the common item equating method being used. Item content and item position should be fairly easy to assess. However, item statistics such as IRT parameters and classical difficulty and discrimination would be computed using the responses of the test takers administered the subsequent form, and are therefore not directly comparable. That is, trying to match an item with a particular p-value and point-biserial from a form administered to one group using an item administered on a different form to a different group brings us full circle. The item statistics would be comparable across the two forms only if they were computed on equivalent groups of test-takers (in which case we would not need common items; we could rely on common examinees to link). If cloned items or auto-generated items were shown to have comparable statistics, they could be considered common items when placed in different forms and administered to different groups, at least in theory.
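
For reference, the classical statistics one would try to "match" on can be computed as below. The responses are simulated under a simple Rasch-like model of my choosing; the code is only a sketch of the computation:

```python
import numpy as np

rng = np.random.default_rng(3)
n_examinees, n_items = 2000, 40
ability = rng.normal(0, 1, n_examinees)
difficulty = rng.uniform(-1.5, 1.5, n_items)
# probability correct under a Rasch-like model, then simulated 0/1 responses
prob = 1 / (1 + np.exp(-(ability[:, None] - difficulty[None, :])))
responses = (rng.random((n_examinees, n_items)) < prob).astype(int)
total = responses.sum(axis=1)

p_values = responses.mean(axis=0)  # classical difficulty: proportion correct per item
point_biserials = np.array([
    np.corrcoef(responses[:, j], total - responses[:, j])[0, 1]  # corrected for item overlap
    for j in range(n_items)
])
print(p_values[:3].round(2), point_biserials[:3].round(2))
```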

Common items and common examinees are the two vehicles used in obtaining comparable scores across different forms of an assessment. Common items can be mimicked by having items of equivalent characteristics or by having unique items whose item parameter estimates are on a common theta scale. Research on using artificial intelligence to assist in estimating item parameters already exists, and one assumes it is continuing to expand (Hao et al., 2024). Some of these initiatives have been around augmenting small sample sizes to reduce the sampling requirements for pretesting and calibrating new items as they are introduced. Examples include McCarthy et al. (2021), who used a “multi-task generalized linear model with BERT features” to provide initial estimates that are then refined as empirical data are collected; their methodology can also be used to calculate “new item difficulty estimates without piloting them first” (p. 883). Belov et al. (2024) used item pool information and trained a neural network to interpolate the relationship between item parameter estimates and response patterns. Zhang and Chen (2024) presented a “cross estimation network” consisting of a person network and an item network, finding that their approach produced accurate parameter estimates for items or persons, provided the other parameters were given (known). In a different approach, Maeda (2024) used “examinees” generated with AI to calibrate items; he determined the methodology had promise but was not as accurate as having actual human responses to the items. AI has also been applied in attempts to form equivalent subgroups in various settings. Monlezun et al. (2022) integrated AI and propensity scoring by using a propensity score adjustment augmented by machine learning. Collier et al. (2022) demonstrated the application of artificial neural networks to propensity score estimation in a multilevel setting, stating “AI can be helpful in propensity score estimation because it can identify underlying patterns between treatments and confounding variables using machine learning without being explicitly programmed. Many classification algorithms in machine learning can outperform the classical methods for propensity score estimation, mainly when processing the data with many covariates…” (p. 3). The authors note that not all neural networks are created equal and that using different training methods and hyperparameters on the same data can yield different results. They also mention several computer packages and programming languages that are available to implement neural networks for propensity score estimation, including Python, R, and SAS.
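
As a rough illustration of the neural-network flavor of propensity score estimation Collier et al. describe, the sketch below swaps the logistic regression from the earlier example for a small multilayer perceptron. The architecture, hyperparameters, and covariates are arbitrary choices of mine, which is consistent with the authors' caution that different training settings can yield different results:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(4)
n = 3000
group = rng.integers(0, 2, n)                  # which form/year the examinee took
X = np.column_stack([
    rng.normal(3.0 + 0.1 * group, 0.5, n),     # e.g., math GPA (hypothetical covariate)
    rng.poisson(3 + 0.5 * group),              # e.g., number of math classes (hypothetical covariate)
])

# a small neural network in place of logistic regression for the propensity model
net = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=500, random_state=0)
ps_nn = net.fit(X, group).predict_proba(X)[:, 1]
print(ps_nn[:5].round(2))
```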

What I would like to suggest is a concerted effort to use AI to assist in linking test forms with no common examinees and no common items, to allow comparing scores, trends, form difficulty, examinee ability, and so on over time in these situations. There are a variety of ways this could proceed, and obviously multiple settings would need to be studied if there were to be any generalizations about what may and may not work in practice. However, I am going to focus on one scenario. I would like an assessment to be identified fulfilling these conditions: there are many items; the items have been or can be publicly released; and a large sample of examinees has been administered the assessment, with scored item responses and some relevant demographic/auxiliary information available for the majority of examinees. These latter variables might be related to previous achievement, such as earlier test scores or course grades, or demographics such as zip code and age. This assessment data would be used to evaluate how accurate an AI linkage might be, where we have a real, if contrived, result to serve as the “true” linkage.

The original assessment is then divided into two subforms that should be similar in terms of content specifications and difficulty, but should differ to the degree that alternate forms composed of items that cannot be pretested or tried out in advance typically do. For simplicity in this paper, I am going to assume odd-numbered questions comprise the Odd form and even-numbered questions make up the Even form. The examinee group is also divided such that the two subgroups show what would be considered a reasonable difference in composition in terms of ability, demographics, sample size, and so on, when two populations are tested in different years (e.g., 2024 secondary school graduates versus 2023 secondary school graduates).
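
A sketch of how the Odd/Even split and the deliberately non-equivalent cohorts might be constructed. The simulation model, sample sizes, and the ability-based split rule are my assumptions for illustration, not part of the proposal:

```python
import numpy as np

rng = np.random.default_rng(5)
n_examinees, n_items = 10000, 60
ability = rng.normal(0, 1, n_examinees)
difficulty = rng.uniform(-1.5, 1.5, n_items)
prob_correct = 1 / (1 + np.exp(-(ability[:, None] - difficulty[None, :])))
responses = (rng.random((n_examinees, n_items)) < prob_correct).astype(int)  # everyone took all items

odd_form = responses[:, 0::2]    # items 1, 3, 5, ... -> the Odd form
even_form = responses[:, 1::2]   # items 2, 4, 6, ... -> the Even form

# split examinees into two "cohorts" that differ modestly in ability:
# higher-ability examinees are a bit more likely to land in cohort A
prob_a = np.where(ability > 0, 0.6, 0.4)
in_cohort_a = rng.random(n_examinees) < prob_a

cohort_a_odd = odd_form[in_cohort_a]      # cohort A is treated as having taken only the Odd form
cohort_b_even = even_form[~in_cohort_a]   # cohort B is treated as having taken only the Even form
print(cohort_a_odd.shape, cohort_b_even.shape)
```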

The task for AI would be to adjust scores on the Odd form to be comparable to scores on the Even form. This could be done by estimating item characteristics for the Odd form items on the Even form scale and conducting an IRT equating to obtain comparable scores. Alternatively, it could be done by creating equivalent samples from the subgroups administered the separate forms and running a random groups equating, or by some other adjustment of items, test takers, or a combination of both. Because all examinees actually were “administered” both the Odd and Even forms, and because all items were administered together and could be calibrated together, there are multiple ways a criterion for evaluating the AI solutions could be created. (Personally, I would compare the AI results to many of them, as any of them could be considered operationally reasonable, and if the AI results were somewhere in the mix of these other results, that would seem a more reasonable evaluation than requiring the AI results to match any single criterion.)
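
One possible criterion, sketched below under the same kind of simulated setup as above (again my own assumptions): because every examinee has a score on both subforms, a single-group equipercentile function from Odd to Even can be computed on the full sample and used as one benchmark against which an AI-produced linking could be compared.

```python
import numpy as np

rng = np.random.default_rng(5)
ability = rng.normal(0, 1, 10000)
difficulty = rng.uniform(-1.5, 1.5, 60)
prob_correct = 1 / (1 + np.exp(-(ability[:, None] - difficulty[None, :])))
responses = (rng.random((10000, 60)) < prob_correct).astype(int)

odd_totals = responses[:, 0::2].sum(axis=1)   # every examinee's score on the Odd form
even_totals = responses[:, 1::2].sum(axis=1)  # and on the Even form

def equipercentile_table(from_scores, to_scores, max_score):
    """Map each possible from-form score to the to-form score with the same percentile rank."""
    return {s: float(np.percentile(to_scores, np.mean(from_scores <= s) * 100))
            for s in range(max_score + 1)}

criterion = equipercentile_table(odd_totals, even_totals, 30)  # 30 items per subform
print({s: round(v, 1) for s, v in list(criterion.items())[:6]})
```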

AI would be trained using secure items, secure data, and secure equating results to learn the features of items and response patterns that correspond to different item parameter estimates, what different covariates look like in random subgroups of examinees, and so on in this particular context. That is, the training could occur on secure items and data that would not be released. AI could then incorporate item characteristics, examinee responses, and examinee demographic variables in arriving at one (or multiple) adjustments to put the scores from the Odd and Even forms on the same scale. AI would likely be able to uncover patterns researchers have not been able to detect, because of the way machine learning works. And there could be multiple ways to divide the original assessment and original sample into subgroups. If the items and the data could be made publicly available, I think this exercise has the potential to move us forward in trying to address this, and other, linking issues, as one could observe what characteristics of the items, as well as the response data and demographics, turned out to be important in this particular context. Plus, it would just be really cool to see how well AI might be able to address the issue of comparable scores without common items or common examinees.
