Heather Buzick (2021). The Relationship Between Praxis® Core Academic Skills for Educators Test and Praxis® Subject Assessment Scores: Validity Coefficients and Differential Prediction Analysis by Race/Ethnicity. ETS Research Report Series, 2021(1), 1–17. https://doi.org/10.1002/ets2.12336

The Praxis® Core Academic Skills for Educators (Core) tests are used in the teacher preparation program admissions process and as part of initial teacher licensure. The purpose of this study was to estimate the relationship between scores on Praxis Core tests and Praxis Subject Assessments and to test for differential prediction by race and ethnicity. Data were drawn from operational test taker records over a period of nearly 5 years. The analysis suggests that Praxis Core tests of reading, writing, and mathematics are moderate predictors of scores from 11 high-volume Praxis Subject Assessments. There was little evidence of differential prediction across White, Black or African American, and Hispanic or Latinx test takers.
Dakota W. Cintron (2021). Methods for Measuring Speededness: Chronology, Classification, and Ensuing Research and Development. ETS Research Report Series, 2021(1), 1–36. https://doi.org/10.1002/ets2.12337

The extent to which a test's time limit alters a test taker's performance is known as speededness. Speeded behavior on a test can take the form of random guessing, leaving a substantial proportion of items unanswered, or generally rushed test taking. Speeded responses do not depend solely on a test taker's ability and are therefore not appropriately modeled by traditional item response theory. The literature on measuring the extent of speededness on a test is extensive and dates back over half a century. Yet simple rules of thumb for measuring speededness, dating back to at least Swineford in 1949, are still in operation, for example, the criterion that 80% of the candidates reach the last item. The purpose of this research report is to provide a chronology and classification of methods for measuring speededness and to discuss ensuing research and development in this area.
Michael T. Kane (2021). Symmetric Least Squares Estimates of Functional Relationships. ETS Research Report Series, 2021(1), 1–14. https://doi.org/10.1002/ets2.12331

Ordinary least squares (OLS) regression provides optimal linear predictions of a dependent variable, y, given an independent variable, x, but OLS regressions are not symmetric or reversible. To obtain optimal linear predictions of x given y, a separate OLS regression in that direction would be needed. This report provides a least squares derivation of the geometric mean (GM) regression line, which is symmetric and reversible, as the line that minimizes a weighted sum of the mean squared errors for y given x and for x given y. It is shown that the GM regression line is symmetric and predicts equally well (or poorly, depending on the absolute value of r_xy) in both directions. The errors of prediction for the GM line are, naturally, larger for the predictions of both x and y than those for the two OLS equations, each of which is optimized specifically for prediction in one direction, but for high values of |r_xy| the difference is not large. The GM line has previously been derived as a special case of principal components analysis and gets its name from the fact that its slope is equal to the geometric mean of the slopes of the OLS regressions of y on x and x on y.
Jonathan Schmidgall, Jaime Cid, Elizabeth Carter Grissom, Lucy Li (2021). Making the Case for the Quality and Use of a New Language Proficiency Assessment: Validity Argument for the Redesigned TOEIC Bridge® Tests. ETS Research Report Series, 2021(1), 1–22. https://doi.org/10.1002/ets2.12335

The redesigned TOEIC Bridge® tests were designed to evaluate test takers' English listening, reading, speaking, and writing skills in the context of everyday adult life. In this paper, we summarize the initial validity argument that supports the use of test scores for the purpose of selection, placement, and evaluation of a test taker's English skills. The validity argument consists of four major claims that provide a coherent narrative about the measurement quality and intended uses of test scores. Each major claim in the validity argument is supported by more specific claims and a summary of supporting evidence. By considering the claims and supporting evidence presented in the validity argument, readers should be able to better evaluate whether the TOEIC Bridge tests are appropriate for their situation.
Ryan Whorton, Debby Almonte, Darby Steiger, Cynthia Robins, Christopher Gentile, Jonas Bertling (2021). Beyond Nuclear Families: Development of Inclusive Student Socioeconomic Status Survey Questions. ETS Research Report Series, 2021(1), 1–25. https://doi.org/10.1002/ets2.12332

Social changes have resulted in an increase of students living in households that do not include both a mother and a father, reducing the efficacy of common survey questionnaire approaches to measuring student socioeconomic status (SES). This paper presents two studies conducted to develop and test a new, more inclusive set of student SES items appropriate for students from a range of household types. In the first study, we held group interviews with 57 students in Grades 4, 8, and 12 who lived in four nontraditional household types. The study goal was, first, to understand how students thought about their household members and learn what they knew about the educational background and employment status of their caregivers and, second, to develop draft items based on these findings. In the second study, we held 51 individual cognitive interviews with a similar sample to evaluate draft item clarity and function. We found that although students may live with a broad range of family members and other adults, they understood the term caregiver to refer to a person who provides resources and support. Students found it easier to answer items when the items included the titles of their caregivers. Our results demonstrate that a customizable approach to measuring student SES allows more students to report information about their caregivers than the current standard of asking about mothers and fathers. We provide recommendations for student SES measurement and potential next steps for research on this topic.
Irshat Madyarov, Vahe Movsisyan, Habet Madoyan, Irena Galikyan, Rubina Gasparyan (2021). New Validity Evidence on the TOEFL Junior® Standard Test as a Measure of Progress. ETS Research Report Series, 2021(1), 1–13. https://doi.org/10.1002/ets2.12334

The TOEFL Junior® Standard test is a tool for measuring the English language skills of students ages 11+ who learn English as an additional language. It is a paper-based multiple-choice test and measures proficiency in three sections: listening, form and meaning, and reading. To date, empirical evidence provides some support for the construct validity of the TOEFL Junior Standard test as a measure of progress. Although this evidence is based on test scores from multiple countries with diverse instructional environments, it does not account for students' instructional experiences. The present paper aims to provide additional evidence by examining the TOEFL Junior Standard test as a progress measure within the same instructional setting. The study took place in an after-school English program in Armenia, a non-English-speaking country. A total of 154 adolescents took the TOEFL Junior Standard test three times with different test forms at intervals of 10 and then 20 instructional weeks (30 weeks in total). The difference in differences (DID) analysis shows that TOEFL Junior is sensitive to learning gains within 20 instructional hours per 10 weeks among A1–A2 level learners on the Common European Framework of Reference (CEFR) scale. However, the data did not provide support for this sensitivity among B1–B2 level learners even though their instructional time was twice as long. Although this methodology offers improved control over students' instructional experiences, it also delimits the results to a specific after-school program and comes with a set of other limitations.
Steven Holtzman, Tamara Minott, Nimmi Devasia, Dessi Kirova, David Klieger (2021). Exploring Diversity in Graduate and Professional School Applications. ETS Research Report Series, 2021(1), 1–19. https://doi.org/10.1002/ets2.12330

Gathering a diverse student body is important for institutions of higher education (IHEs) at the graduate/professional level. However, it is impossible to select a diverse student body from a homogeneous group of candidates. Thus, the aim of this study is to discover the extent to which diversity goals in admissions are precluded by the lack of diversity in the applicant pool. To explore this, the proportions of score reports sent to the 150 largest graduate/professional schools and sent for each major, broken down by gender, race, and socioeconomic status (SES) group, were compared to the corresponding proportions in the overall applicant pool of graduate/professional students. Additionally, differences by gender, race/ethnicity, and SES in the distance graduate/professional school applicants are willing to consider traveling were investigated. Results show that the gender, race, and SES distributions differ across score reports sent to different schools and for different majors, as well as in the distance an applicant is willing to consider traveling for graduate/professional school. The patterns found for gender, racial, and socioeconomic diversity provide possibilities for researchers to work further together with graduate/professional schools to tackle the important challenge of increasing diversity in graduate/professional education.
Kristopher Kyle, Ann Tai Choe, Masaki Eguchi, Geoff LaFlair, Nicole Ziegler (2021). A Comparison of Spoken and Written Language Use in Traditional and Technology-Mediated Learning Environments. ETS Research Report Series, 2021(1), 1–29. https://doi.org/10.1002/ets2.12329

A key piece of a validity argument for a language assessment tool is clear overlap between assessment tasks and the target language use (TLU) domain (i.e., the domain description inference). The TOEFL 2000 Spoken and Written Academic Language (T2K-SWAL) corpus, which represents a variety of academic registers and disciplines in traditional learning environments (e.g., lectures, office hours, textbooks, course packs), has served as an important foundation for the TOEFL iBT® test's domain description inference for more than 15 years. There are, however, signs that the characteristics of the registers that students encounter may be changing. Increasingly, typical university courses include technology-mediated learning environments (TMLEs), such as those represented by course management software and other online educational tools. To ensure that the characteristics of TOEFL iBT test tasks continue to align with the TLU domain, it is important to analyze the registers that are typically encountered in TMLEs. In this study, we address this issue by collecting a relatively large (4.5 million words) corpus of spoken and written TMLE registers across the six primary disciplines represented in T2K-SWAL. This corpus was subsequently tagged for a wide variety of linguistic features, and a multidimensional analysis was conducted to compare and contrast written and spoken language in TMLE and T2K-SWAL. The results indicate that although some similarities exist across spoken and written texts in traditional learning environments and TMLEs, language use also differs across learning environments (and modes) with regard to key linguistic dimensions.
Sooyeon Kim, Michael E. Walker (2021). Comparisons Among Approaches to Link Tests Using Random Samples Selected Under Suboptimal Conditions. ETS Research Report Series, 2021(1), 1–20. https://doi.org/10.1002/ets2.12328

Equating the scores from different forms of a test requires collecting data that link the forms. Problems arise when the test forms to be linked are given to groups that are not equivalent and the forms share no common items by which to measure or adjust for this group nonequivalence. We compared three approaches to adjusting for group nonequivalence in a situation where not only is randomization questionable, but the number of common items is small. Group adjustment through subgroup weighting, a weak anchor, or a mix of both was evaluated in terms of linking accuracy using a resampling approach. We used data from a single test form to create two research forms for which the equating relationship was known. The results showed that the subgroup weighting and weak anchor approaches produced nearly equivalent linking results when group equivalence was not met. Direct (random groups) linking methods produced the least accurate results due to nontrivial bias. Combining subgroup weighting with anchor-based linking improved linking accuracy only marginally over using the weak anchor alone when the degree of group nonequivalence was small.
Ikkyu Choi, Jiangang Hao, Paul Deane, Mo Zhang (2021). Benchmark Keystroke Biometrics Accuracy From High-Stakes Writing Tasks. ETS Research Report Series, 2021(1), 1–13. https://doi.org/10.1002/ets2.12326

Biometrics are physical or behavioral human characteristics that can be used to identify a person. It is widely known that keystroke or typing dynamics for short, fixed texts (e.g., passwords) can serve as a behavioral biometric. In this study, we investigate whether keystroke data from essay responses can lead to a reliable biometric measure, with implications for test security and the monitoring of writing fluency and style changes. Based on keystroke data collected in a high-stakes writing assessment setting, we established a preliminary biometric benchmark for detecting test-taker identity by using features extracted from writing process logs. We report a benchmark keystroke biometric equal error rate of 4.7% for distinguishing same versus different individuals on an essay task. In particular, we show that including writing process features (e.g., features designed to describe the writing process) in addition to the widely used typing-timing features (e.g., features based on the time intervals between two-letter key sequences) improves the accuracy of the keystroke biometrics. The proposed keystroke biometrics can have important implications for writing assessments administered through remotely proctored tests, which have been widely adopted during the COVID-19 pandemic.