In the beginning, there was an item…
Deborah J. Harris, Catherine J. Welch, Stephen B. Dunbar
{"title":"In the beginning, there was an item…","authors":"Deborah J. Harris, Catherine J. Welch, Stephen B. Dunbar","doi":"10.1111/emip.12647","DOIUrl":null,"url":null,"abstract":"<p>As educational researchers, we take scored item responses, create data sets to analyze, draw inferences from those analyses, and make decisions, about students’ educational knowledge and future success, judge how successful educational programs are, determine what to teach tomorrow, and so on. It is good to remind ourselves that the basis for all our analyses, from simple means to complex multilevel, multidimensional modeling, interpretations of those analyses, and decisions we make based on the analyses are at the core based on a test taker responding to an item. With all the emphasis on modeling, analyses, big data, machine learning, etc., we need to remember it all starts with the items we collect information on. If we get those wrong, then the results of subsequent analyses are unlikely to provide the information we are seeking.</p><p>It is true that how students and educators interact with items has changed, and continues to change. More and more of the student-item interactions are happening online, and the days when an educator had relatively easy access to the actual test items, often after test administration, are in the past. This lack of access is also true for the researchers analyzing the response data: instead of a single test booklet aligned to a data file of test taker responses, there are large pools of items, and while the researcher may know a test taker was administered, say, item #SK-65243-0273A and what the response was, they do not know what the text of the item actually was, which can make it challenging to interpret analysis results at times.</p><p>From having a test author write the items for an assessment, to contracting with content specialists to draft items, to cloning items from a template, to having large language models/artificial intelligence produce items, item development has morphed over the past and present, and will continue to morph into the future. Item tryouts for pretesting the quality and functioning of an item, including gathering data for generating item statistics to aid in forms construction and in some instances scoring, now attempt to develop algorithms that can accurately predict item characteristics, including item statistics, without gathering item data in advance of operational use (or at all). We are developing more innovative item types, and collecting more data, such as latencies, click streams, and other process data on student responses to those items.</p><p>Sometimes we are so enamored of what we can do with the data, the analyses seem distant from the actual experience: a test taker responding to an item. And this makes it challenging at times to interpret analysis results in terms of actionable steps. Our aim here is to examine the evolution of how items are developed and considered, concentrating on large-scale, K–12 educational assessments.</p><p>The <i>Standards for Educational and Psychological Testing</i> (<i>Standards</i>; American Educational Research Association [AERA], the American Psychological Association [APA], and the National Council on Measurement in Education [NCME], <span>1966, 1974</span>, <span>1985, 1999</span>, <span>2014</span>) have been the guiding principles for test developers and educational measurement specialists for decades. The <i>Standards</i> have evolved over time. 
They require consensus from three major organizations: APA, AERA, and NCME, which incorporate considerations of multiple viewpoints. The earliest editions seem to somewhat neglect treatment of individual items, concentrating instead on the collection of items or test forms. In keeping with <i>The Past, Present, and Future of Educational Measurement</i> theme, we use the five editions of the <i>Standards</i> to examine how the focus on items has morphed over the years, and to look ahead to the future edition of the <i>Standards</i> currently under development, and how they conceptualize issues related to ‘items’.</p><p>Our intent is to focus attention clearly on items, in all their formats, as the basis for our measurement decisions. The items test takers respond to are at the center of the data we collect, analyze, and interpret. And yet, at times the items seem very far from our focus. Graduate students in educational measurement typically have a general class on measurement early in their training, which covers basic foundational concepts such as reliability and validity, and typically, some treatment of item writing is often included, usually dealing with multiple choice items and good and bad item writing tips. Constructed response items and rubrics may also be covered. However, as students enter more advanced courses, it seems item statistics, <i>p</i>-values, point-biserials, IRT parameter estimates, are where the focus is. The actual text of items may not even be presented as “good” statistical properties, item bias and item fit statistics are discussed, and items in an assessment are retained or discarded based solely on statistical properties, ignoring the item text. There may be even more distance from the actual items as students generate data for simulation studies and get used to dealing with zeros and ones, removed from the actual items test takers respond to.</p><p>The <i>Standards</i> have evolved to reflect and address changes in the field of testing. With respect to the item development process, an expansion of the issues and content can be seen in the evolution along with a restructuring and repositioning of the areas of importance. This paper explores the evolution within the context of item development.</p><p>In the 1966 edition, topics pertaining to validity and reliability were emphasized and considered essential. Within the chapter devoted to validity, issues pertaining to content validity included standards that addressed the representativeness of items in constructing a test, the role of experts in the selection and review of items, and the match between items and test specifications were covered. For large-scale achievement tests, item writing was assumed to be completed by subject-matter specialists that “devise” and select items judged to cover the topics and processes related to the assessment being produced. Agreement among independent judgments when individual items are being selected by experts was also emphasized (Standard C3.11). However, details defining a process to follow in the item writing process were not emphasized. Similar issues pertaining to item writing were included in the 1974 edition. However, this edition also emphasized the importance of test fairness, bias and sensitivity (see Standard E12.1.2).</p><p>With respect to item writers, the documentation of the qualifications of the item writers and editors were described as desirable for achievement tests. 
In addition, the concept of construct validity and the alignment of test content with theoretical constructs are addressed. Standards address the practice of using experts to judge the appropriateness of items as they relate to the “universe of tasks” represented by the test. The documentation of the qualifications of the experts and the agreement of the experts in selecting items can be viewed as a precursor to many item writing and alignment activities that are in place today.</p><p>A new organizational structure was introduced in the 1985 document, grouping the technical standards together under “Technical Standards for Test Construction and Evaluation” and devoting a chapter to test development and revisions. The chapter included 25 standards that addressed test specifications, test development, fairness, item analysis and design associated with intended use and categorized each as primary, secondary, or conditional. The 1999 edition retained the 1985 structure by grouping technical measurement issues together under “Test Construction, Evaluation and Documentation” and dedicating a chapter to test development and revisions. The 27 standards included in 1999 paralleled the earlier version, but also included a more detailed introduction on test development. The introduction explored different item formats (selected-response, extended-response, portfolios, and performance items) and discussed implications for design. The importance of federal education law related to assessment and accountability was also discussed in greater detail in this edition.</p><p>To reflect changes in education between 1999 and 2014 such as the passage of the No Child Left Behind Act in 2001, the most recent version of the <i>Standards</i> include a stronger focus on accountability issues associated with educational testing. This edition highlighted fairness as one of three “Foundations” chapters, elevating it to the same level as validity and reliability. The 2014 chapter on test design and development was moved to one of six chapters under “Operations.” The introduction to this chapter was expanded and specifically addressed item development and review through the articulation of content quality, clarity, construct-irrelevance, sensitivity, and appropriateness.</p><p>The 2014 <i>Standards</i> continue to be used by testing organizations to identify and shape the processes and procedures followed to develop items. Current procedures used by testing organizations operationalize a process of item writing that begins with the design of the assessment and continues through operational testing. Standard 3.1 states that “<i>those responsible for test development, revision, and administration should design all steps of the testing process to promote valid score interpretations for intended score uses for the widest possible range of individuals and relevant subgroups in the intended population</i>.” In this way, the 2014 <i>Standards</i> incorporate the basic concepts of evidence-centered design (ECD) that have been the basis for sound development practices for decades.</p><p>The 2014 <i>Standards</i> emphasize that item development is critical for providing quality and consistency for K–12 large-scale assessments. Most states and local education agencies that are responsible for large-scale assessment programs use the current <i>Standards</i> to structure the processes and provide the necessary documentation. 
Although there are variations on the process, most education agencies articulate procedures for item development, item writers, and review processes.</p><p>With respect to core concepts in the <i>Standards</i> over the years of their evolution, item quality broadly defined can be thought of as the foundation of any argument for validity, reliability, and fairness. In K–12 assessment, the human element has loomed large in terms of its contribution to and evaluation of item quality. Teachers and other subject-matter experts (SMEs) write and edit items, and they serve on panels that review items drafted for large-scale assessments for appropriateness, accessibility, sensitivity, and bias, among other attributes. Alignment studies engage a wide range of stakeholders in judgmental reviews of items and their consistency with established content standards and other substantive descriptors. As we look ahead in item development, it is important to recognize the continued critical role of human judgment in applying the <i>Standards</i> as well as the qualitative evidence it provides to the profession and the public.</p><p>There seems to be little doubt after the recent explosion of interest in artificial intelligence (AI) and large-language models (LLMs) in education that the future of item development will make every attempt to leverage AI to expand the supply of materials available for large-scale assessments in K–12 (Hao et al., <span>2024</span>). Generations of SMEs have been producing independently countless test items psychometricians hope to be essentially interchangeable in terms of alignment to content specifications and construct-relevant contributions to score variance. Proponents of AI are rightfully optimistic about its potential to contribute to item development generally and in K–12 applications specifically. Although this statement is truer of some content areas than others, AI has the potential to provide important efficiencies of scale.</p><p>The idea that computers would someday contribute to item development is not new (cf. Richards, <span>1967</span>). Proponents of Automated Item Generation (AIG) for years have advanced methods to identify distinctive features of items to a degree of specificity that algorithms and/or item templates could be developed to write items on the fly, and successful small-scale applications have been developed in cognition as well as achievement (Embretson & Kingston, <span>2018</span>; Gierl & Lai, 2015; but see Attali, <span>2018</span>). As we write this in mid-2024, however, few large-scale assessments, particularly those used in K–12 accountability programs, have implemented AIG on an operational basis. What has occurred at a lightning pace in the last several years is the infusion of AI into current thinking about AIG and item development. Major providers of assessments in multiple areas of application are producing white papers and establishing advisory panels to help manage the role of AI in test and item development (Association of Test Publishers, <span>2021</span>; Pearson VUE, <span>2024</span>; Smarter Balanced Assessment Consortium, <span>2024</span>). The AI infusion is likely to continue in the testing industry at a pace which the published literature in educational measurement will be challenged to keep up with.</p><p>AI, technology, federal and state laws, and the emphasis on fairness and equity continue to have a major influence of the development of items for large scale K–12 assessments. 
The <i>Standards</i> have guided best practices in the development and interpretations of assessment items since the first edition in 1966 and will continue to do so through future editions. What we, as test developers, educational researchers, and practitioners need to keep front of mind for the appropriateness of our interpretations of assessment data, whether through simple raw scores or complex multilevel modeling analyses, is that the content and design of the items are the basis for all score interpretations and all decisions made from those scores. Graduate students in our field need to be, in many cases, better trained in the art and science of item development. Researchers conducting analyses on which high level decisions will be made need to remain cognizant that a test taker responding to a particular set of items is the basis on which everything they are doing rests. And as we leverage cloning, AIG, AI, and technology to boost our item pools, we need to remember that the educational achievement we are trying to assess all starts with a test taker responding to an item.</p><p>Consistent with Russell et al. (<span>2019</span>), we believe that as the field continues to develop, it also risks splintering into different camps of test developers and psychometricians. To minimize this risk, we support the design and development of graduate programs that “embrace the full life cycle of instrument development, analyses, use, and refinement holds potential to develop rounded professionals who have a fuller appreciation of the challenges encountered during each stage of instrument development and use” (Russell et al., <span>2019</span>, p. 86). This idea is consistent with the principles of ECD and is reflected in the most recent editions of the <i>Standards</i>, however it is not clear that graduate training programs in the field embrace it to the extent needed to ensure that future item development maintains the level of rigor called for by either ECD or the <i>Standards</i>. Perhaps it is time to elevate deep understanding of item development into the pantheon of critical skills necessary for defining what it means to be a psychometrician to ensure every interpretation of analyses and results takes into consideration the items on which the data were collected.</p>","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"43 4","pages":"40-45"},"PeriodicalIF":2.7000,"publicationDate":"2024-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/emip.12647","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Educational Measurement-Issues and Practice","FirstCategoryId":"95","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/emip.12647","RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION & EDUCATIONAL RESEARCH","Score":null,"Total":0}
引用次数: 0
Abstract
As educational researchers, we take scored item responses, create data sets to analyze, draw inferences from those analyses, and make decisions: about students’ educational knowledge and future success, about how successful educational programs are, about what to teach tomorrow, and so on. It is good to remind ourselves that all of our analyses, from simple means to complex multilevel, multidimensional modeling, the interpretations of those analyses, and the decisions we make based on them rest, at their core, on a test taker responding to an item. With all the emphasis on modeling, analyses, big data, machine learning, and the like, we need to remember that it all starts with the items on which we collect information. If we get those wrong, the results of subsequent analyses are unlikely to provide the information we are seeking.
It is true that how students and educators interact with items has changed, and continues to change. More and more of the student-item interactions are happening online, and the days when an educator had relatively easy access to the actual test items, often after test administration, are in the past. This lack of access is also true for the researchers analyzing the response data: instead of a single test booklet aligned to a data file of test taker responses, there are large pools of items. The researcher may know that a test taker was administered, say, item #SK-65243-0273A and what the response was, yet not know what the text of the item actually was, which can at times make it challenging to interpret analysis results.
From having a single test author write the items for an assessment, to contracting with content specialists to draft items, to cloning items from a template, to having large language models and other artificial intelligence produce items, item development has morphed over time and will continue to morph into the future. Item tryouts, which pretest the quality and functioning of an item and gather data for generating item statistics to aid in forms construction and, in some instances, scoring, are now being supplemented by attempts to develop algorithms that can accurately predict item characteristics, including item statistics, without gathering item data in advance of operational use (or at all). We are developing more innovative item types and collecting more data, such as latencies, click streams, and other process data on student responses to those items.
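To make the prediction idea concrete, here is a minimal sketch, with entirely fabricated numbers and invented item features, of how one might estimate a classical item difficulty (p-value) for a never-administered item from features of the item itself rather than from tryout data; operational approaches use far richer features and models than this illustration.

```python
import numpy as np

# Sketch: predict classical item difficulty (p-value) from simple item features,
# so a new item can receive a provisional difficulty estimate before (or instead
# of) a field-test administration. All numbers below are fabricated.

# Features for previously pretested items: [word_count, reading_level, n_options]
X_pretested = np.array([
    [25, 4.0, 4],
    [40, 6.5, 4],
    [18, 3.2, 3],
    [55, 8.1, 5],
    [33, 5.0, 4],
], dtype=float)
# Observed p-values from those items' tryout data.
p_observed = np.array([0.82, 0.61, 0.88, 0.43, 0.70])

# Fit a simple linear model p ≈ Xb + b0 via least squares.
X_design = np.column_stack([X_pretested, np.ones(len(X_pretested))])
coef, *_ = np.linalg.lstsq(X_design, p_observed, rcond=None)

# Predict the difficulty of a new, never-administered item from its features.
new_item = np.array([30, 5.5, 4, 1.0])  # trailing 1.0 is the intercept term
predicted_p = float(new_item @ coef)
print(f"Predicted p-value for the new item: {predicted_p:.2f}")
```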
Sometimes we are so enamored of what we can do with the data that the analyses seem distant from the actual experience: a test taker responding to an item. This makes it challenging, at times, to interpret analysis results in terms of actionable steps. Our aim here is to examine the evolution of how items are developed and considered, concentrating on large-scale, K–12 educational assessments.
The Standards for Educational and Psychological Testing (Standards; American Educational Research Association [AERA], the American Psychological Association [APA], and the National Council on Measurement in Education [NCME], 1966, 1974, 1985, 1999, 2014) have been the guiding principles for test developers and educational measurement specialists for decades. The Standards have evolved over time. Each edition requires consensus among the three sponsoring organizations, AERA, APA, and NCME, and thereby incorporates multiple viewpoints. The earliest editions somewhat neglect the treatment of individual items, concentrating instead on collections of items or test forms. In keeping with The Past, Present, and Future of Educational Measurement theme, we use the five editions of the Standards to examine how the focus on items has morphed over the years, and we look ahead to the edition currently under development and how it conceptualizes issues related to ‘items’.
Our intent is to focus attention clearly on items, in all their formats, as the basis for our measurement decisions. The items test takers respond to are at the center of the data we collect, analyze, and interpret. And yet, at times the items seem very far from our focus. Graduate students in educational measurement typically take a general class on measurement early in their training, which covers foundational concepts such as reliability and validity and usually includes some treatment of item writing, often limited to multiple-choice items and tips on good and bad item writing. Constructed-response items and rubrics may also be covered. As students enter more advanced courses, however, the focus shifts to item statistics: p-values, point-biserials, and IRT parameter estimates. The actual text of items may not even be presented when “good” statistical properties, item bias, and item fit statistics are discussed, and items in an assessment are retained or discarded based solely on statistical properties, ignoring the item text. The distance from the actual items may grow further as students generate data for simulation studies and grow accustomed to dealing with zeros and ones, removed from the items test takers respond to.
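For readers newer to these statistics, the short sketch below computes the two most common classical item statistics named above, the item p-value (proportion correct) and a corrected point-biserial (the correlation between an item score and the total score on the remaining items), from a small, fabricated 0/1 response matrix.

```python
import numpy as np

def classical_item_stats(responses):
    """Classical item statistics from a scored 0/1 response matrix.

    responses: shape (n_test_takers, n_items), 1 = correct, 0 = incorrect.
    Returns (p_values, corrected_point_biserials).
    """
    responses = np.asarray(responses, dtype=float)
    n_takers, n_items = responses.shape

    # Item difficulty (p-value): proportion of test takers answering correctly.
    p_values = responses.mean(axis=0)

    # Corrected point-biserial: correlation between the item score and the
    # total score on the remaining items (excluding the item itself).
    total = responses.sum(axis=1)
    point_biserials = np.empty(n_items)
    for j in range(n_items):
        rest = total - responses[:, j]
        point_biserials[j] = np.corrcoef(responses[:, j], rest)[0, 1]

    return p_values, point_biserials

# Toy example: 6 test takers, 4 items (fabricated for illustration only).
scored = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
])
p, rpb = classical_item_stats(scored)
print("p-values:", np.round(p, 2))
print("corrected point-biserials:", np.round(rpb, 2))
```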
The Standards have evolved to reflect and address changes in the field of testing. With respect to the item development process, that evolution shows an expansion of the issues and content addressed, along with a restructuring and repositioning of the areas of emphasis. This paper explores the evolution within the context of item development.
In the 1966 edition, topics pertaining to validity and reliability were emphasized and considered essential. Within the chapter devoted to validity, the treatment of content validity included standards addressing the representativeness of items in constructing a test, the role of experts in the selection and review of items, and the match between items and test specifications. For large-scale achievement tests, item writing was assumed to be completed by subject-matter specialists who “devise” and select items judged to cover the topics and processes related to the assessment being produced. Agreement among independent judgments when individual items are selected by experts was also emphasized (Standard C3.11). However, details defining a process to follow in item writing were not emphasized. Similar issues pertaining to item writing were included in the 1974 edition. However, that edition also emphasized the importance of test fairness, bias, and sensitivity (see Standard E12.1.2).
With respect to item writers, documentation of the qualifications of the item writers and editors was described as desirable for achievement tests. In addition, the concept of construct validity and the alignment of test content with theoretical constructs were addressed. Standards addressed the practice of using experts to judge the appropriateness of items as they relate to the “universe of tasks” represented by the test. The documentation of the experts’ qualifications and of their agreement in selecting items can be viewed as a precursor to many item writing and alignment activities that are in place today.
A new organizational structure was introduced in the 1985 document, grouping the technical standards together under “Technical Standards for Test Construction and Evaluation” and devoting a chapter to test development and revision. That chapter included 25 standards addressing test specifications, test development, fairness, item analysis, and design associated with intended use, and categorized each standard as primary, secondary, or conditional. The 1999 edition retained the 1985 structure, grouping technical measurement issues under “Test Construction, Evaluation and Documentation” and dedicating a chapter to test development and revision. The 27 standards included in 1999 paralleled the earlier version, but the chapter also included a more detailed introduction on test development. The introduction explored different item formats (selected-response, extended-response, portfolios, and performance items) and discussed implications for design. The importance of federal education law related to assessment and accountability was also discussed in greater detail in this edition.
To reflect changes in education between 1999 and 2014, such as the passage of the No Child Left Behind Act in 2001, the most recent version of the Standards includes a stronger focus on accountability issues associated with educational testing. This edition highlighted fairness in one of three “Foundations” chapters, elevating it to the same level as validity and reliability. The 2014 chapter on test design and development became one of six chapters under “Operations.” The introduction to this chapter was expanded and specifically addressed item development and review through the articulation of content quality, clarity, construct-irrelevance, sensitivity, and appropriateness.
The 2014 Standards continue to be used by testing organizations to identify and shape the processes and procedures followed to develop items. Current procedures used by testing organizations operationalize a process of item writing that begins with the design of the assessment and continues through operational testing. Standard 3.1 states that “those responsible for test development, revision, and administration should design all steps of the testing process to promote valid score interpretations for intended score uses for the widest possible range of individuals and relevant subgroups in the intended population.” In this way, the 2014 Standards incorporate the basic concepts of evidence-centered design (ECD) that have been the basis for sound development practices for decades.
The 2014 Standards emphasize that item development is critical for providing quality and consistency in K–12 large-scale assessments. Most states and local education agencies responsible for large-scale assessment programs use the current Standards to structure the processes and provide the necessary documentation. Although there are variations in the process, most education agencies articulate procedures for item development, item writers, and review processes.
With respect to core concepts in the Standards over the years of their evolution, item quality, broadly defined, can be thought of as the foundation of any argument for validity, reliability, and fairness. In K–12 assessment, the human element has loomed large in its contribution to, and evaluation of, item quality. Teachers and other subject-matter experts (SMEs) write and edit items, and they serve on panels that review items drafted for large-scale assessments for appropriateness, accessibility, sensitivity, and bias, among other attributes. Alignment studies engage a wide range of stakeholders in judgmental reviews of items and their consistency with established content standards and other substantive descriptors. As we look ahead in item development, it is important to recognize the continued critical role of human judgment in applying the Standards, as well as the qualitative evidence that judgment provides to the profession and the public.
After the recent explosion of interest in artificial intelligence (AI) and large language models (LLMs) in education, there seems little doubt that the future of item development will make every attempt to leverage AI to expand the supply of materials available for large-scale assessments in K–12 (Hao et al., 2024). Generations of SMEs have independently produced countless test items that psychometricians hope are essentially interchangeable in terms of alignment to content specifications and construct-relevant contributions to score variance. Proponents of AI are rightfully optimistic about its potential to contribute to item development generally and to K–12 applications specifically. Although this is truer of some content areas than others, AI has the potential to provide important efficiencies of scale.
The idea that computers would someday contribute to item development is not new (cf. Richards, 1967). Proponents of Automated Item Generation (AIG) have for years advanced methods to identify distinctive features of items with enough specificity that algorithms and/or item templates could be developed to write items on the fly, and successful small-scale applications have been developed in cognition as well as achievement (Embretson & Kingston, 2018; Gierl & Lai, 2015; but see Attali, 2018). As we write this in mid-2024, however, few large-scale assessments, particularly those used in K–12 accountability programs, have implemented AIG on an operational basis. What has occurred at a lightning pace in the last several years is the infusion of AI into current thinking about AIG and item development. Major providers of assessments in multiple areas of application are producing white papers and establishing advisory panels to help manage the role of AI in test and item development (Association of Test Publishers, 2021; Pearson VUE, 2024; Smarter Balanced Assessment Consortium, 2024). The AI infusion is likely to continue in the testing industry at a pace that the published literature in educational measurement will be challenged to keep up with.
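As a deliberately simplified illustration of the template idea underlying AIG, the sketch below fills a single arithmetic item model with randomly drawn values and builds distractors from common error patterns; the item model, error rules, and numbers are invented here and are far simpler than operational AIG systems.

```python
import random

# Toy AIG sketch: one arithmetic item model ("template") with variable slots,
# a rule for the key, and distractors built from plausible student errors.
TEMPLATE = ("A class collects {a} cans of food each week for {b} weeks. "
            "How many cans does the class collect in all?")

def generate_item(rng):
    a = rng.randint(12, 48)   # cans per week
    b = rng.randint(3, 9)     # number of weeks
    key = a * b
    distractors = {
        a + b,        # adds instead of multiplies
        a * b + a,    # counts one extra week
        a * (b - 1),  # drops a week
    } - {key}
    options = list(distractors) + [key]
    rng.shuffle(options)
    return {"stem": TEMPLATE.format(a=a, b=b), "options": options, "key": key}

rng = random.Random(20240601)  # fixed seed so the generated "pool" is reproducible
item_pool = [generate_item(rng) for _ in range(3)]
for item in item_pool:
    print(item["stem"])
    print("  options:", item["options"], "| key:", item["key"])
```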
AI, technology, federal and state laws, and the emphasis on fairness and equity continue to have a major influence on the development of items for large-scale K–12 assessments. The Standards have guided best practices in the development and interpretation of assessment items since the first edition in 1966 and will continue to do so through future editions. What we, as test developers, educational researchers, and practitioners, need to keep front of mind for the appropriateness of our interpretations of assessment data, whether through simple raw scores or complex multilevel modeling analyses, is that the content and design of the items are the basis for all score interpretations and all decisions made from those scores. Graduate students in our field need, in many cases, to be better trained in the art and science of item development. Researchers conducting analyses on which high-level decisions will be made need to remain cognizant that a test taker responding to a particular set of items is the basis on which everything they are doing rests. And as we leverage cloning, AIG, AI, and technology to boost our item pools, we need to remember that the educational achievement we are trying to assess all starts with a test taker responding to an item.
Consistent with Russell et al. (2019), we believe that as the field continues to develop, it also risks splintering into different camps of test developers and psychometricians. To minimize this risk, we support the design and development of graduate programs that “embrace the full life cycle of instrument development, analyses, use, and refinement,” an approach that “holds potential to develop rounded professionals who have a fuller appreciation of the challenges encountered during each stage of instrument development and use” (Russell et al., 2019, p. 86). This idea is consistent with the principles of ECD and is reflected in the most recent editions of the Standards; however, it is not clear that graduate training programs in the field embrace it to the extent needed to ensure that future item development maintains the level of rigor called for by either ECD or the Standards. Perhaps it is time to elevate a deep understanding of item development into the pantheon of critical skills that define what it means to be a psychometrician, so that every interpretation of analyses and results takes into consideration the items on which the data were collected.