When it comes to test validity, invalid or unreliable methods of assessment can reduce the chances of reaching predetermined academic or curricular goals. Poorly written assessments can even be detrimental to the overall success of a program. It is essential that exam designers use every available resource—specifically data analysis and psychometrics—to ensure the validity of their assessment outcomes.
According to Stuart Shaw and Victoria Crisp, “Measuring the traits or attributes that a student has learnt during a course is not like measuring an objective property such as length or weight; measuring educational achievement is less direct. Yet, educational outcomes can have high stakes in terms of consequences (e.g. affecting access to further education), thus the validity of assessments are highly important” (2011).
In this blog, we dive into the differences between test validity and reliability and how they can affect student learning outcomes. We also take a look at the value of data analysis, psychometrics, and the ways in which an exam designer can ensure that their test is both reliable and valid for their situation.
These terms are closely related, but distinct in meaningful ways when referring to exam efficacy. As an exam designer, it is crucial to understand the differences between reliability and validity.
First, reliability refers to how dependably or consistently a test measures a certain characteristic. For an exam or an assessment to be considered reliable, it must produce consistent results: within reason, a test taker should earn the same score no matter how, where, or when they take it. Deviations from data patterns and anomalous results or responses could be a sign that specific items on the exam are misleading or unreliable.
Here are three types of reliability, according to The Graide Network, that can help determine if the results of an assessment are valid:
Using these three types of reliability measures can help teachers and administrators ensure that their assessments are as consistent and accurate as possible.
There are a few ways that an exam designer can help to improve and ensure test reliability, based on Fiona Middleton’s work:
Test-retest reliability, which measures the consistency of the same test over time:
Interrater reliability, which measures the consistency of the same test conducted by different people:
Parallel forms reliability, which measures the consistency of different versions of a test which are designed to be equivalent:
Internal consistency, which measures the consistency of the individual items of a test:
If an exam can be considered reliable, then an instructor assessing students can rest assured that the data gleaned from the exam is a trustworthy measure of competency. The aforementioned elements, in addition to many other practical tips for increasing reliability, are helpful as exam designers work to create a meaningful, worthwhile assessment.
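To make the reliability types above more concrete, here is a minimal Python sketch of one statistic that is commonly paired with each: a Pearson correlation for test-retest reliability, Cohen's kappa for interrater reliability, and Cronbach's alpha for internal consistency. These particular statistics are our choice for illustration and are not prescribed by the sources cited in this post. All function names and data below are hypothetical.

```python
from statistics import mean, pvariance


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters' categorical grades."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    categories = set(rater_a) | set(rater_b)
    # Agreement expected if both raters graded independently at their own rates.
    expected = sum(
        (rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)


def cronbach_alpha(item_scores):
    """Cronbach's alpha; item_scores holds one row of item scores per student."""
    k = len(item_scores[0])
    item_vars = [pvariance([row[i] for row in item_scores]) for i in range(k)]
    total_var = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical data, invented for illustration only.
first_sitting = [70, 82, 90, 65, 78]    # same students, first administration
second_sitting = [72, 80, 91, 68, 75]   # same students, later administration
grader_a = ["A", "B", "B", "C", "A", "B"]
grader_b = ["A", "B", "C", "C", "A", "B"]
item_scores = [[1, 1, 1], [1, 1, 0], [0, 0, 0], [1, 0, 0]]  # 1 = correct

print(f"test-retest r:    {pearson_r(first_sitting, second_sitting):.2f}")
print(f"Cohen's kappa:    {cohen_kappa(grader_a, grader_b):.2f}")
print(f"Cronbach's alpha: {cronbach_alpha(item_scores):.2f}")
```

For all three statistics, values closer to 1 suggest more consistent results. What counts as "high enough" depends on the stakes of the exam and the program's own standards.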
Conversely, test validity refers to what characteristic the test measures and how well the test measures that characteristic. Formulated by Truman Lee Kelley, Ph.D., in 1927, the concept of test validity centers on the idea that a test is valid if it measures what it claims to measure. For example, a test of physical strength should measure strength and not measure something else (like intelligence or memory). Likewise, a test that measures a medical student’s technical proficiency may not be valid for predicting their bedside manner.
Thus, validity and reliability need to go hand-in-hand for exams that not only consistently measure a specific characteristic, but also provide trustworthy data on what the exam is supposed to measure.
The validity of an assessment refers to how accurately or effectively it measures what it was designed to measure, notes the University of Northern Iowa Office of Academic Assessment. If test designers or instructors don’t consider all aspects of assessment creation — beyond the content — the validity of their exams may be compromised.
For instance, a political science test with exam items composed using complex wording or phrasing could unintentionally shift to an assessment of reading comprehension. Similarly, an art history exam that slips into a pattern of asking questions about the historical period in question without referencing art or artistic movements may not be accurately measuring course objectives. Inadvertent errors such as these can have a devastating effect on the validity of an examination.
Sean Gyll and Shelley Ragland in The Journal of Competency-Based Education suggest following these best-practice design principles to help preserve test validity:
Let’s look at each of the five steps in more depth to understand how each operates to ensure test validity.
“Taking time at the beginning to establish a clear purpose, helps to ensure that goals and priorities are more effectively met” (Gyll & Ragland, 2018). This is the first, and perhaps most important, step in designing an exam. When building an exam, it is important to consider the intended use for the assessment scores. Is the exam supposed to measure content mastery or predict success?
A job/task analysis (JTA) is conducted in order to identify the knowledge, skills, abilities, attitudes, dispositions, and experiences that a professional in a particular field ought to have. For example, consider the physical or mental requirements needed to carry out the tasks of a nurse practitioner in the emergency room. “The JTA contributes to assessment validity by ensuring that the critical aspects of the field become the domains of content that the assessment measures” (Gyll & Ragland, 2018). This essential step in exam creation is conducted to accurately determine what job-related attributes an individual should possess before entering a profession.
“Typically, a panel of subject matter experts (SMEs) is assembled to write a set of assessment items. The panel is assigned to write items according to the content areas and cognitive levels specified in the test blueprint” (Gyll & Ragland, 2018). Once the intended focus of the exam, as well as the specific knowledge and skills it should assess, has been determined, it’s time to start generating exam items or questions. An item pool is the collection of test items that are used to construct individual adaptive tests for each examinee. It should meet the content specifications of the test and provide sufficient information at all levels of the ability distribution of the target population (van der Linden et al., 2006).
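As a rough illustration of checking an item pool against a test blueprint, the Python sketch below counts items by content area and cognitive level and reports where the pool falls short. The blueprint cells, item records, and the `blueprint_gaps` helper are all hypothetical. A real blueprint would come out of the job/task analysis and would likely carry more detail.

```python
from collections import Counter

# Hypothetical blueprint: (content area, cognitive level) -> items required.
blueprint = {
    ("Pharmacology", "recall"): 2,
    ("Pharmacology", "application"): 1,
    ("Triage", "application"): 2,
}

# Hypothetical item pool written by the SME panel.
item_pool = [
    {"id": "Q1", "area": "Pharmacology", "level": "recall"},
    {"id": "Q2", "area": "Pharmacology", "level": "recall"},
    {"id": "Q3", "area": "Pharmacology", "level": "application"},
    {"id": "Q4", "area": "Triage", "level": "application"},
]


def blueprint_gaps(blueprint, item_pool):
    """Return how many more items each under-filled blueprint cell needs."""
    have = Counter((item["area"], item["level"]) for item in item_pool)
    return {
        cell: needed - have[cell]
        for cell, needed in blueprint.items()
        if have[cell] < needed
    }


gaps = blueprint_gaps(blueprint, item_pool)
for (area, level), missing in gaps.items():
    print(f"Need {missing} more {level} item(s) for {area}")
```

A check like this only confirms that the pool covers the blueprint by count. It says nothing about item quality, which is what the review and item analysis steps are for.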
“Additionally, items are reviewed for sensitivity and language in order to be appropriate for a diverse student population” (Gyll & Ragland, 2018). Once the exam questions have been created, they are reviewed by a team of experts to ensure there are no design flaws. For standardized testing, review by one or several additional exam designers may be necessary. For an individual classroom instructor, an administrator or even simply a peer can offer support in reviewing. Exam items are checked for grammatical errors, technical flaws, and accuracy.
“If an item is too easy, too difficult, failing to show a difference between skilled and unskilled examinees, or even scored incorrectly, an item analysis will reveal it” (Gyll & Ragland, 2018). This essential stage of exam-building involves using data and statistical methods, such as item analysis, to check the validity of an assessment. Item analysis refers to the process of statistically analyzing assessment data to evaluate the quality and performance of your test items.
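The Python sketch below shows what a basic item analysis might compute: a difficulty index (the proportion of examinees who answered correctly) and an upper-minus-lower discrimination index for each item. The flagging thresholds, the 27% group size, and the response data are assumptions for illustration, not values taken from Gyll and Ragland or from any particular assessment platform.

```python
def item_difficulty(responses, item):
    """Proportion of examinees who answered the item correctly."""
    return sum(row[item] for row in responses) / len(responses)


def item_discrimination(responses, item, fraction=0.27):
    """Difference in proportion correct between top and bottom scorers."""
    ranked = sorted(responses, key=sum, reverse=True)  # rank by total score
    group_size = max(1, round(len(ranked) * fraction))
    upper = ranked[:group_size]
    lower = ranked[-group_size:]
    p_upper = sum(row[item] for row in upper) / group_size
    p_lower = sum(row[item] for row in lower) / group_size
    return p_upper - p_lower


def flag_items(responses, easy=0.9, hard=0.2, min_disc=0.2):
    """Flag items for review; the default thresholds are assumptions."""
    flags = {}
    for item in range(len(responses[0])):
        p = item_difficulty(responses, item)
        d = item_discrimination(responses, item)
        reasons = []
        if p > easy:
            reasons.append("too easy")
        if p < hard:
            reasons.append("too difficult")
        if d < min_disc:
            reasons.append("weak discrimination")
        if reasons:
            flags[item] = reasons
    return flags


# Hypothetical scored responses: one row per examinee, 1 = correct.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 1],
]

print(flag_items(responses))
```

In this made-up data set, every examinee answers the first item correctly, so it is flagged as too easy and as failing to separate stronger from weaker examinees. A flag is a prompt for human review rather than an automatic reason to drop the item.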
In education, an example of test validity could be a mathematics exam that does not require reading comprehension. Taking it further, one of the most effective ways to improve the quality of an assessment is examining validity through the use of data and psychometrics. ExamSoft defines psychometrics as: “Literally meaning mental measurement or analysis, psychometrics are essential statistical measures that provide exam writers and administrators with an industry-standard set of data to validate exam reliability, consistency, and quality.” Psychometrics is different from item analysis because item analysis is a process within the overall space of psychometrics that helps to develop sound examinations.
Here are the psychometrics endorsed by the assessment community for evaluating exam quality:
It is essential to note that psychometric data points are not intended to stand alone as indicators of exam validity. These statistics should be used together for context and in conjunction with the program’s goals for holistic insight into the exam and its questions. When used properly, psychometric data points can help administrators and test designers improve their assessments in the following ways:
Fiona Middleton describes the four main types of validity:
When crafting exams, instructors and exam designers should consider the above elements of validity and how to create questions that align with the course’s objectives. Instructors may also want to use item analysis alongside these considerations, since it helps to inform course content and curriculum.
Ensuring that exams are both valid and reliable is the most important job of test designers. Using the most reliable and valid assessments has benefits for everyone. If an examination is not reliable, valid, or both, then it will not consistently or accurately measure the competency of the test takers for the tasks the exam was designed to measure. Not only does that potentially harm the integrity of that program or profession, but it could also negatively affect the confidence and learning outcomes of students in the long term. Taking meaningful steps to confirm test reliability and validity can make the difference between a flawed examination that requires review and an assessment that provides an accurate picture of whether students have mastered course content and are ready to perform in their careers.