Blog · 1 Jun 2023

How can you measure test validity and reliability?

Audrey Campbell

M.A. in Teaching; Senior Marketing Writer

Invalid or unreliable methods of assessment can reduce the chances of reaching predetermined academic or curricular goals. Poorly written assessments can even be detrimental to the overall success of a program. It is essential that exam designers use every available resource, specifically data analysis and psychometrics, to ensure the validity of their assessment outcomes.

According to Stuart Shaw and Victoria Crisp, “Measuring the traits or attributes that a student has learnt during a course is not like measuring an objective property such as length or weight; measuring educational achievement is less direct. Yet, educational outcomes can have high stakes in terms of consequences (e.g. affecting access to further education), thus the validity of assessments are highly important” (2011).

In this blog, we dive into the differences between test validity and reliability and how they can affect student learning outcomes. We also take a look at the value of data analysis, psychometrics, and the ways in which an exam designer can ensure that their test is both reliable and valid for their situation.

What is test validity and reliability?

These terms are closely related, but distinct in meaningful ways when referring to exam efficacy. As an exam designer, it is crucial to understand the differences between reliability and validity.

What is test reliability?

First, reliability refers to how dependably or consistently a test measures a certain characteristic. For an exam or an assessment to be considered reliable, it must produce consistent results: a test taker should get roughly the same score no matter how, where, or when they take it, within reason. Deviations from data patterns and anomalous results or responses could be a sign that specific items on the exam are misleading or unreliable.

Here are three types of reliability, according to The Graide Network, that can help determine if the results of an assessment are reliable:

Test-Retest Reliability measures “the replicability of results.”
- Example: A student who takes the same test twice, but at different times, should have similar results each time.
Alternate Form Reliability measures “how test scores compare across two similar assessments given in a short time frame.”
- Example: A student who takes two different versions of the same test should produce similar results each time.
Internal Consistency Reliability measures “how the actual content of an assessment works together to evaluate understanding of a concept.”
- Example: A student who is asked multiple questions that measure the same thing should perform consistently across those questions.

Using these three types of reliability measures can help teachers and administrators ensure that their assessments are as consistent and accurate as possible.
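
Test-retest and alternate form reliability are often summarized as the correlation between two sets of scores from the same students. The sketch below is one minimal way to compute that, assuming a Pearson correlation is an acceptable summary; the function name and the score lists are hypothetical and not taken from The Graide Network.

```python
from statistics import mean, pstdev


def pearson_r(scores_a, scores_b):
    """Pearson correlation between two equal-length lists of scores."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_a, mean_b = mean(scores_a), mean(scores_b)
    # Population covariance, paired by student
    cov = sum((a - mean_a) * (b - mean_b)
              for a, b in zip(scores_a, scores_b)) / len(scores_a)
    sd_a, sd_b = pstdev(scores_a), pstdev(scores_b)
    if sd_a == 0 or sd_b == 0:
        raise ValueError("correlation is undefined when all scores are equal")
    return cov / (sd_a * sd_b)


# Hypothetical scores: the same five students sitting the same test twice
first_sitting = [72, 85, 90, 64, 78]
second_sitting = [70, 88, 91, 60, 80]

test_retest_r = pearson_r(first_sitting, second_sitting)
print(f"test-retest correlation: {test_retest_r:.2f}")
```

A value close to 1 suggests that students kept roughly the same rank order across sittings. What counts as "high enough" depends on the stakes of the exam and is a judgment call rather than something this sketch decides.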

Test reliability: How do you ensure test reliability?

There are a few ways that an exam designer can help to improve and ensure test reliability, based on Fiona Middleton’s work:

Test-retest reliability, which measures the consistency of the same test over time:

When designing tests or questionnaires, try to formulate questions, statements, and tasks in a way that won’t be influenced by the mood or concentration of participants.
When planning your methods of data collection, try to minimize the influence of external factors, and make sure all samples are tested under the same conditions.
Remember that changes or recall bias can be expected to occur in the participants over time, and take these into account.

Interrater reliability, which measures the consistency of the same test conducted by different people:

Clearly define your variables and the methods that will be used to measure them.
Develop detailed, objective criteria for how the variables will be rated, counted or categorized.
If multiple researchers are involved, ensure that they all have exactly the same information and training.
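
One common way to check whether two graders are applying the same criteria is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. This statistic is not named in Middleton's guidance; it is offered here as an illustrative sketch, and the function name and grade lists are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty, equal-length lists of ratings")
    n = len(rater_a)
    # Share of items on which the two raters gave the same rating
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected if each rater assigned categories independently
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical pass/fail decisions from two graders on six essays
grades_a = ["pass", "pass", "fail", "pass", "fail", "pass"]
grades_b = ["pass", "fail", "fail", "pass", "fail", "pass"]

kappa = cohens_kappa(grades_a, grades_b)
print(f"Cohen's kappa: {kappa:.2f}")
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance. Low values are a prompt to revisit the rating criteria or the rater training described above.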

Parallel forms reliability, which measures the consistency of different versions of a test which are designed to be equivalent:

Ensure that all questions or test items are based on the same theory and formulated to measure the same thing.

Improving internal consistency, which measures the consistency of the individual items of a test:

Take care when devising questions or measures: those intended to reflect the same concept should be based on the same theory and carefully formulated.

If an exam can be considered reliable, then an instructor assessing students can rest assured that the data gleaned from the exam is a trustworthy measure of competency. The aforementioned elements, in addition to many other practical tips for increasing reliability, are helpful as exam designers work to create a meaningful, worthwhile assessment.


What is test validity?

Test validity, by contrast, refers to what characteristic the test measures and how well the test measures that characteristic. Formulated by Truman Lee Kelley, Ph.D., in 1927, the concept of test validity holds that a test is valid if it measures what it claims to measure. For example, a test of physical strength should measure strength and not measure something else (like intelligence or memory). Likewise, a test that measures a medical student’s technical proficiency may not be valid for predicting their bedside manner.

Thus, validity and reliability need to go hand in hand so that exams not only consistently measure a specific characteristic, but also provide trustworthy data on what the exam is supposed to measure.

Test validity: How do you ensure test validity?

The validity of an assessment refers to how accurately or effectively it measures what it was designed to measure, notes the University of Northern Iowa Office of Academic Assessment. If test designers or instructors don’t consider all aspects of assessment creation — beyond the content — the validity of their exams may be compromised.

For instance, a political science test with exam items composed using complex wording or phrasing could unintentionally shift to an assessment of reading comprehension. Similarly, an art history exam that slips into a pattern of asking questions about the historical period in question without referencing art or artistic movements may not be accurately measuring course objectives. Inadvertent errors such as these can have a devastating effect on the validity of an examination.

Sean Gyll and Shelley Ragland in The Journal of Competency-Based Education suggest following these best-practice design principles to help preserve test validity:

Establish the test purpose
Perform a job/task analysis (JTA)
Create the item pool
Review the exam items
Conduct the item analysis

Let’s look at each of the five steps more in depth to understand how each operates to ensure test validity.

1. Establish the test purpose.

“Taking time at the beginning to establish a clear purpose, helps to ensure that goals and priorities are more effectively met” (Gyll & Ragland, 2018). This is the first, and perhaps most important, step in designing an exam. When building an exam, it is important to consider the intended use for the assessment scores. Is the exam supposed to measure content mastery or predict success?

2. Perform a job/task analysis (JTA).

A job/task analysis (JTA) is conducted in order to identify the knowledge, skills, abilities, attitudes, dispositions, and experiences that a professional in a particular field ought to have. For example, consider the physical or mental requirements needed to carry out the tasks of a nurse practitioner in the emergency room. “The JTA contributes to assessment validity by ensuring that the critical aspects of the field become the domains of content that the assessment measures” (Gyll & Ragland, 2018). This essential step in exam creation is conducted to accurately determine what job-related attributes an individual should possess before entering a profession.

3. Create the item pool.

“Typically, a panel of subject matter experts (SMEs) is assembled to write a set of assessment items. The panel is assigned to write items according to the content areas and cognitive levels specified in the test blueprint” (Gyll & Ragland, 2018). Once the intended focus of the exam, as well as the specific knowledge and skills it should assess, has been determined, it’s time to start generating exam items or questions. An item pool is the collection of test items that are used to construct individual adaptive tests for each examinee. It should meet the content specifications of the test and provide sufficient information at all levels of the ability distribution of the target population (van der Linden et al., 2006).

4. Review the exam items.

“Additionally, items are reviewed for sensitivity and language in order to be appropriate for a diverse student population” (Gyll & Ragland, 2018). Once the exam questions have been created, they are reviewed by a team of experts to ensure there are no design flaws. For standardized testing, review by one or several additional exam designers may be necessary. For an individual classroom instructor, an administrator or even simply a peer can offer support in reviewing. Exam items are checked for grammatical errors, technical flaws, and accuracy.

5. Conduct the item analysis.

“If an item is too easy, too difficult, failing to show a difference between skilled and unskilled examinees, or even scored incorrectly, an item analysis will reveal it” (Gyll & Ragland, 2018). This essential stage of exam-building involves using data and statistical methods, such as item analysis, to check the validity of an assessment. Item analysis refers to the process of statistically analyzing assessment data to evaluate the quality and performance of your test items.

What is an example of test validity?

In education, an example of test validity could be a mathematics exam that does not require reading comprehension. Taking it further, one of the most effective ways to improve the quality of an assessment is examining validity through the use of data and psychometrics. ExamSoft defines psychometrics as: “Literally meaning mental measurement or analysis, psychometrics are essential statistical measures that provide exam writers and administrators with an industry-standard set of data to validate exam reliability, consistency, and quality.” Psychometrics is different from item analysis because item analysis is a process within the overall space of psychometrics that helps to develop sound examinations.

Here are the psychometrics endorsed by the assessment community for evaluating exam quality:

Item Difficulty Index (p-value): Determines the overall difficulty of an exam item.
Upper Difficulty Index (Upper 27%): Determines how difficult exam items were for the top scorers on a test.
Lower Difficulty Index (Lower 27%): Determines how difficult exam items were for the lowest scorers on a test.
Discrimination Index: Provides a comparative analysis of the upper and lower 27% of examinees.
Point Bi-serial Correlation Coefficient: Measures correlation between an examinee’s answer on a specific item and their performance on the overall exam.
Kuder-Richardson Formula 20 (KR-20): Rates the overall exam based on the consistency, performance, and difficulty of all exam items.
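
To make the item-level statistics in this list concrete, here is a minimal sketch that computes them from a table of right/wrong responses. It assumes each item is scored 1 (correct) or 0 (incorrect) and that the discrimination index is the upper-group difficulty minus the lower-group difficulty. The function name, dictionary keys, and sample data are hypothetical, and the point-biserial here uses the total score including the item itself, which some tools adjust for.

```python
from math import sqrt
from statistics import mean, pstdev


def item_statistics(responses, group_fraction=0.27):
    """Per-item stats from rows of 0/1 scores (one row per examinee)."""
    n_examinees = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]

    # Rank examinees by total score, best first, then take the top and
    # bottom 27% as the upper and lower groups.
    ranked = sorted(range(n_examinees), key=lambda i: totals[i], reverse=True)
    group_size = max(1, round(group_fraction * n_examinees))
    upper, lower = ranked[:group_size], ranked[-group_size:]

    sd_total = pstdev(totals)
    results = []
    for j in range(n_items):
        # Difficulty index: share of examinees who answered correctly
        # (a higher value means an easier item).
        p = mean([row[j] for row in responses])
        p_upper = mean([responses[i][j] for i in upper])
        p_lower = mean([responses[i][j] for i in lower])

        # Point-biserial: correlation between getting this item right and
        # the total score. Undefined when everyone (or no one) answered
        # correctly, or when all total scores are identical.
        if 0 < p < 1 and sd_total > 0:
            right = mean([t for t, row in zip(totals, responses) if row[j] == 1])
            wrong = mean([t for t, row in zip(totals, responses) if row[j] == 0])
            r_pb = (right - wrong) / sd_total * sqrt(p * (1 - p))
        else:
            r_pb = None

        results.append({
            "difficulty": p,
            "upper_difficulty": p_upper,
            "lower_difficulty": p_lower,
            "discrimination": p_upper - p_lower,
            "point_biserial": r_pb,
        })
    return results


# Hypothetical responses: 10 examinees (rows) by 4 items (columns)
sample_responses = [
    [1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 1], [1, 1, 0, 0], [1, 0, 1, 0],
    [1, 0, 0, 0], [1, 0, 0, 1], [0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0],
]

stats = item_statistics(sample_responses)
for number, item in enumerate(stats, start=1):
    print(number, item)
```

With real exam data, ties at the 27% cut-off can change which examinees land in each group, so production tools may handle group boundaries differently than this sketch does.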

It is essential to note that psychometric data points are not intended to stand alone as indicators of exam validity. These statistics should be used together for context and in conjunction with the program’s goals for holistic insight into the exam and its questions. When used properly, psychometric data points can help administrators and test designers improve their assessments in the following ways:

Identify questions that may be too difficult.
Identify questions that may not be difficult enough.
Avoid instances of more than one correct answer choice.
Eliminate exam items that measure the wrong learning outcomes.
Increase reliability (Test-Retest, Alternate Form, and Internal Consistency) across the board.
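
For the internal consistency side of reliability, the KR-20 statistic mentioned earlier can be sketched as follows. This is a minimal illustration, assuming every item is scored 1 or 0 and using the population variance of total scores; some references use the sample variance instead, which shifts the result slightly. The function name and sample data are hypothetical.

```python
from statistics import pvariance


def kr20(responses):
    """KR-20 for rows of 0/1 item scores (one row per examinee)."""
    n_examinees = len(responses)
    n_items = len(responses[0])
    if n_items < 2:
        raise ValueError("KR-20 needs at least two items")
    totals = [sum(row) for row in responses]
    total_variance = pvariance(totals)
    if total_variance == 0:
        raise ValueError("KR-20 is undefined when all total scores are equal")
    # Sum of p * (1 - p) over items, where p is the share answering correctly
    pq_sum = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_examinees
        pq_sum += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - pq_sum / total_variance)


# Hypothetical responses: 10 examinees (rows) by 4 items (columns)
exam_responses = [
    [1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 1], [1, 1, 0, 0], [1, 0, 1, 0],
    [1, 0, 0, 0], [1, 0, 0, 1], [0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0],
]

reliability = kr20(exam_responses)
print(f"KR-20: {reliability:.2f}")
```

Values closer to 1 indicate that the items tend to rank examinees in the same way. A four-item example like this one will usually score low simply because short tests give the statistic little to work with.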

What are the four types of validity?

Fiona Middleton describes the four main types of validity:

Construct validity: Does the test measure the concept that it’s intended to measure?
Content validity: Is the test fully representative of what it aims to measure?
Face validity: Does the content of the test appear to be suitable to its aims?
Criterion validity: Do the results accurately measure the concrete outcome they are designed to measure?

When crafting exams, instructors and exam designers should consider the above elements of validity and how to create questions that align with the course’s objectives. Moreover, instructors may want to consider item analysis in concert, which helps to inform course content and curriculum.

In sum: how you can measure test validity and reliability

Ensuring that exams are both valid and reliable is the most important job of test designers. Using the most reliable and valid assessments has benefits for everyone. If an examination is not reliable, valid, or both, then it will not consistently or accurately measure the competency of the test takers on the tasks the exam was designed to assess. Not only does that potentially harm the integrity of that program or profession, but it could also negatively affect the confidence and learning outcomes of students in the long term. Utilizing tools like ExamSoft means that instructors and institutions can be confident that their assessments are of the highest quality and meet the highest standard of accuracy.

Taking meaningful steps to confirm test reliability and validity can make the difference between a flawed examination that requires review and an assessment that provides an accurate picture of whether students have mastered course content and are ready to perform in their careers.
