The Length of Certification and Registration Exams
An anxious candidate will occasionally call and wonder why his or her ARRT examination
has so many questions. And then, before you know it, a genuinely concerned educator
calls to ask why the exams have so few questions. Both questions are legitimate,
and both can be answered with the same explanation.
Certification and registration exams are often developed according to the domain sampling model.
A domain can be thought of as a set of skills. Domains can be large and cover a
multitude of loosely related skills or be small and include a few tightly related
skills. For example, skill in internal medicine would represent an extensive domain,
while knowledge of cardiac anatomy seems to be a fairly compact domain.
The domain sampling model recognizes that no single exam can reasonably include
questions about every possible knowledge, skill, and ability (KSA) in a particular
domain. Instead, any one exam comprises a sample of the domain. So, rather
than assembling exams that consist of hundreds or even thousands of test questions,
the ARRT assembles exams that typically range from 100 to 200 questions sampled
from a larger domain.
Once the domain is defined, a strategy for sampling from the domain must be implemented.
The figure below illustrates the relationships among the different levels of a domain.
The more specificity built into the domain, the greater level of control one can
have when sampling the domain. One important feature of a sample is its size; an
exam needs to be long enough to reliably sample the domain. Another important
feature is representation; a representative sample of test questions leads to a score that generalizes to the domain as a whole.
At the ARRT, the length of an exam is determined primarily by the Practice Analysis
Advisory Committee. The Advisory Committee considers several factors including the
breadth of the domain, the number of content categories, the importance of the tasks
within a content category, the relatedness of content categories, and other data
obtained from the job analysis. In addition, ARRT staff draws on common psychometric principles when recommending how long an exam should be.
We know from experience that a 500-question exam is overkill for all but the broadest
of domains and that a 50-question exam is too short for all but the narrowest domains.
Although there are statistical procedures for verifying the adequacy of exam length
after an exam has been administered, the process of initially establishing the length
of an exam is primarily a judgment call.
The next activity is to ensure that the domain is well represented. The results
of this activity are reflected in the content specifications for an exam. The content
specifications indicate, among other things, the number of questions for each of
the content categories, subcategories, and topics. When establishing weights for
a content category, the Advisory Committee considers the information from a variety
of sources including curriculum guidelines, recommendations from the professional
community, and the results of the practice analysis. The number of questions allocated
to a content category is strongly influenced by the breadth of the category as well
as its importance to practice.
The domain sampling model generally works well for the same reasons that opinion
polling works — a carefully specified sample really can tell you quite a bit about
the entire population. Another fact that works in favor of domain sampling is that
test-takers' behavior is reasonably consistent from one topic to the next.
If, for example, a candidate performs well on the radiation protection section of
the Radiography Exam, chances are that he or she will also do well on image production
and evaluation. Scores on those two sections of the Radiography Exam are highly correlated.
Meanwhile, scores for two seemingly unrelated content categories (such as image
production & evaluation and patient care) still exhibit a moderate correlation.
This consistency in performance on different parts of a domain bodes well for domain sampling.
Yeah, Right...But Does it Really Work?
When a person takes a test and earns a particular score, we want to be confident
that the score is a reliable indication of that person's knowledge for the entire
domain. Let's say Joe gets a score of 80 percent on a 25-question radiography exam
one week. A week goes by and Joe takes another exam covering the same domain.
Unless Joe did some serious studying or was for some reason not himself the week
before, we would expect him to get a score near 80 percent on the second exam. If
Joe scored 60 percent on a second exam and 90 percent on a third we would become
suspicious because the scores don't seem reliable.
At this point, we could blame Joe or we could blame the exam. If there were a lot
of Joes taking a lot of exams, and their scores bounced around by more than a few
points, then we should get really suspicious of the exam. Although many factors
could contribute to score instability, an incomplete sampling of the domain is one
of the more obvious ones. If you don't ask enough questions, you're not going to
get reliable scores.
There are several ways to evaluate score reliability. The most common methods are
based on the extent to which scores on tests from the same domain correlate with
one another. Calculating reliability is like giving many tests from the same domain
to many Joes. Reliability indices can fall between 0 and 1, and an index of .90
can be considered a good target for certification and registration exams.
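The article does not name a specific index, but for exams scored right/wrong the most common correlation-based index is KR-20 (a special case of Cronbach's alpha). A minimal sketch, using invented toy data rather than any ARRT results:

```python
from statistics import pvariance

def kr20(responses):
    """KR-20 reliability for dichotomous (0/1) item responses.

    responses: list of candidate response lists, one 0/1 entry per item.
    """
    k = len(responses[0])                       # number of items
    n = len(responses)                          # number of candidates
    p = [sum(row[i] for row in responses) / n for i in range(k)]  # item difficulties
    item_var = sum(pi * (1 - pi) for pi in p)   # sum of item variances (p * q)
    total_var = pvariance([sum(row) for row in responses])  # variance of total scores
    return (k / (k - 1)) * (1 - item_var / total_var)

# Toy check: 6 candidates x 4 items (hypothetical data)
data = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
print(round(kr20(data), 3))  # about 0.833 for this toy matrix
```

As the text says, the index falls between 0 and 1; a longer test with consistent items pushes it toward 1.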
To evaluate the impact of test length on score reliability, we can do a short study
of the ARRT Examination in Radiography. The data for this particular investigation
were obtained from the October 1998 exam, which was taken by 2,564 first-time candidates.
The Examination in Radiography consists of 200 questions. The total test reliability
index for this group was 0.926.
For this study we will see what happens when the exam is artificially shortened.
The study involves four steps: (1) randomly discard 20
questions from the exam; (2) recalculate the reliability index to see how much it
goes down; (3) see if the pass/fail decisions for any candidates have changed from
fail to pass or from pass to fail; and (4) repeat the process for successively shorter
exams by discarding 40, 60, 80, and 100 items.
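The four steps above can be sketched in code. Everything here is an assumption for illustration: the responses are simulated from a simple logistic model rather than drawn from real candidates, and the 65 percent cut score is hypothetical. Only the general pattern (reliability dropping and some pass/fail decisions flipping as items are discarded) should be compared with the real study, not the numbers:

```python
import math
import random

def kr20(responses):
    """KR-20 reliability for a list of 0/1 response rows (candidates x items)."""
    k = len(responses[0])
    n = len(responses)
    p = [sum(r[i] for r in responses) / n for i in range(k)]
    item_var = sum(pi * (1 - pi) for pi in p)
    totals = [sum(r) for r in responses]
    mean = sum(totals) / n
    total_var = sum((t - mean) ** 2 for t in totals) / n
    return (k / (k - 1)) * (1 - item_var / total_var)

random.seed(42)

# Simulated exam: 2000 candidates x 200 items from a simple logistic model
# (candidate ability and item difficulty both vary) -- invented stand-in data.
N_CAND, N_ITEMS = 2000, 200
ability = [random.gauss(0, 1) for _ in range(N_CAND)]
difficulty = [random.gauss(0, 1) for _ in range(N_ITEMS)]
responses = [
    [1 if random.random() < 1 / (1 + math.exp(-(a - b))) else 0 for b in difficulty]
    for a in ability
]

PASS_MARK = 0.65  # hypothetical cut score, as a proportion correct
full_pass = [sum(r) / N_ITEMS >= PASS_MARK for r in responses]

order = list(range(N_ITEMS))
random.shuffle(order)  # random order in which items will be discarded

for n_keep in (200, 180, 160, 140, 120, 100):
    keep = order[:n_keep]
    short = [[r[i] for i in keep] for r in responses]          # step 1: discard items
    rel = kr20(short)                                          # step 2: recalc reliability
    short_pass = [sum(r) / n_keep >= PASS_MARK for r in short]  # step 3: redo pass/fail
    changed = sum(f != s for f, s in zip(full_pass, short_pass)) / N_CAND
    print(f"{n_keep:4d} items  reliability={rel:.3f}  decisions changed={changed:.1%}")
```

The loop is step 4: each pass repeats the discard-and-recompute cycle for a successively shorter exam.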
Show Me The Money
The table below shows the results of our investigation. It looks like tossing out
20 to 40 questions does not have a dramatic impact on test reliability. It is not
until exam length falls below 140 questions that notable changes in reliability start to appear.
[Table: Number of Questions · reliability · percent of pass/fail decisions changed (P to F, F to P) at each exam length]
However, the reliability index does not tell the complete story, particularly for
individual candidates. The table indicates that pass/fail decisions for some candidates
would have changed if a shorter radiography exam had been given. An exam consisting
of 180 items would change the decisions for 1 percent of the group (0.3% + 0.7%
= 1.0%), while an exam consisting of 100 items would affect over 3 percent of the
group. (Note: if this study were repeated by discarding different sets of items,
the results would have been similar but not identical.)
The changes in pass/fail decisions can be regarded as the errors that would have
resulted from shortening the radiography exam. Nonetheless, it is interesting that
the exam could be cut in half (to 100 questions) and still produce the same pass/fail
decision for almost 97 percent of the candidates.
In general a longer test is a better test. However, for most tests, there is a point
of rapidly diminishing returns after exam length reaches about 150 items. The advantages
of adding questions to an exam need to be weighed against costs such as test development
expenses and testing time, both of which are related to examination fees. Candidate
fatigue also becomes a factor for very lengthy exams. Finally, longer exams mean
that more questions are exposed to test-takers, and for Boards like the ARRT that
reuse test questions, overexposure of questions is a security concern.
But What If...?
The previous discussion may have prompted more questions than it answered, so let's
address a few of them now.
Question: What would happen if the Radiography Exam were cut
to 20 questions or increased to 300 questions?
Answer: Although the study reported in the table above would have
allowed us to drop as many questions as we chose, there was no way that we could
actually add questions. That's OK because there is a formula for estimating the
impact of adding or deleting questions. The following graph shows the relationship
between test length and reliability for exams. The curve illustrates an important
point: the effects of adding or deleting items depend on how many items you start
with. Taking 100 items away from a 150-item exam wreaks havoc on measurement precision,
while adding 100 items to a 150-item exam makes only a minimal contribution.
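The formula alluded to here is presumably the Spearman-Brown prophecy formula, the standard tool for projecting reliability to a new test length: if a test with reliability r is made k times as long, the predicted reliability is kr / (1 + (k - 1)r). A quick sketch using the 0.926 figure reported earlier:

```python
def spearman_brown(reliability: float, length_ratio: float) -> float:
    """Predicted reliability when test length is multiplied by length_ratio."""
    return (length_ratio * reliability) / (1 + (length_ratio - 1) * reliability)

r_full = 0.926  # reported reliability of the 200-question Radiography Exam

# Cutting to 100 items (ratio 0.5) vs. growing to 300 items (ratio 1.5)
print(round(spearman_brown(r_full, 0.5), 3))  # drops to about 0.862
print(round(spearman_brown(r_full, 1.5), 3))  # rises only to about 0.949
```

The asymmetry is the curve's whole point: halving the exam costs far more reliability than adding the same number of items would buy. Applied to a 30-question section (ratio 30/200 = 0.15), the same formula predicts roughly 0.65, in the neighborhood of the 0.70 the article later reports for the radiation protection section.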
Question: Is the optimal number of questions for an exam usually
around 200? Does the point of diminishing returns occur at about 150 items?
Answer: No. The ideal exam length varies according to the breadth
of the domain, the similarity of the categories within the domain, the quality of
the test questions, and other factors. For more compact domains the ideal length
would be much lower than 200, while for broader domains the ideal length may be considerably higher.
Question: The ARRT reports scores for individual sections of
the exams, and yet these sections may have only 30 or 40 test questions. How reliable
are these section scores?
Answer: The short answer is that section scores based on content
categories are not very reliable. For example, on the Radiography Exam
the radiation protection section (30 questions) has a reliability of around 0.70.
Although this is not too bad, it certainly is not adequate for making pass/fail
decisions. This is why the ARRT uses only total scores for determining pass/fail
status, and provides section scores only as a general guide to a candidate's strengths
and weaknesses in specific content categories. The key is to not overinterpret small
differences in section scores.
Question: Just why are some exams so long?
Answer: Sometimes for good reason, especially if the domain is
very broad. However, exams for some professions are unnecessarily long. One reason
is that a longer exam may appear to the casual observer to be more valid than shorter
exams. Certain traditions seem to endure — even when such traditions may be contradicted
by empirical evidence. The bottom line is that although longer exams can provide
better content coverage and score reliability, there is a point of rapidly diminishing
returns. And it is important to strike a balance between the costs and benefits
of greater length.