
The Length of Certification and Registration Exams

An anxious candidate will occasionally call and wonder why his or her ARRT examination has so many questions. And then, before you know it, a genuinely concerned educator calls to ask why the exams have so few questions. Both questions are legitimate, and both can be answered with the same explanation.

Certification and registration exams are often developed according to the domain sampling model. A domain can be thought of as a set of skills. Domains can be large and cover a multitude of loosely related skills or be small and include a few tightly related skills. For example, skill in internal medicine would represent an extensive domain, while knowledge of cardiac anatomy seems to be a fairly compact domain.

The Domain...What?

The domain sampling model recognizes that no single exam can reasonably include questions about every possible knowledge, skill, and ability (KSA) in a particular domain. Instead, any one exam comprises a sample of the domain. So, rather than assembling exams that consist of hundreds or even thousands of test questions, the ARRT assembles exams that typically range from 100 to 200 questions sampled from a larger domain.

Once the domain is defined, a strategy for sampling from the domain must be implemented. The figure below illustrates the relationships among the different levels of a domain. The more specificity built into the domain, the greater level of control one can have when sampling the domain. One important feature of a sample is its size; an exam needs to be long enough to reliably sample the domain. Another important feature is representation; a representative sample of test questions leads to a balanced exam.

Representation

At the ARRT, the length of an exam is determined primarily by the Practice Analysis Advisory Committee. The Advisory Committee considers several factors including the breadth of the domain, the number of content categories, the importance of the tasks within a content category, the relatedness of content categories, and other data obtained from the job analysis. In addition, ARRT staff draws on common psychometric practices.

We know from experience that a 500-question exam is overkill for all but the broadest of domains and that a 50-question exam is too short for all but the narrowest domains. Although there are statistical procedures for verifying the adequacy of exam length after an exam has been administered, the process of initially establishing the length of an exam is primarily a judgment call.

The next activity is to ensure that the domain is well represented. The results of this activity are reflected in the content specifications for an exam. The content specifications indicate, among other things, the number of questions for each of the content categories, subcategories, and topics. When establishing weights for a content category, the Advisory Committee considers the information from a variety of sources including curriculum guidelines, recommendations from the professional community, and the results of the practice analysis. The number of questions allocated to a content category is strongly influenced by the breadth of the category as well as its importance to practice.
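The allocation step described above can be sketched as a simple apportionment. The category names and weights below are hypothetical, and largest-remainder rounding is just one reasonable way to make the counts sum to the exam length; the source does not describe ARRT's actual procedure.

```python
# Sketch only: apportion a fixed number of exam questions across content
# categories in proportion to judged category weights. Category names and
# weights are hypothetical, not ARRT's actual content specifications.

def allocate_items(weights, exam_length):
    """Largest-remainder apportionment of exam_length items across categories."""
    total = sum(weights.values())
    raw = {c: exam_length * w / total for c, w in weights.items()}
    counts = {c: int(r) for c, r in raw.items()}
    # Hand out any leftover items to the largest fractional remainders.
    leftover = exam_length - sum(counts.values())
    for c in sorted(raw, key=lambda c: raw[c] - counts[c], reverse=True)[:leftover]:
        counts[c] += 1
    return counts

weights = {"Radiation Protection": 0.20, "Equipment Operation": 0.12,
           "Image Production & Evaluation": 0.45, "Patient Care": 0.23}
print(allocate_items(weights, 200))
```

The rounding rule matters only at the margins; the dominant influence on each count is the category's weight, mirroring how breadth and importance drive the allocation.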

The domain sampling model generally works well for the same reasons that opinion polling works — a carefully specified sample really can tell you quite a bit about the entire population. Another fact that works in favor of domain sampling is that test-takers' behavior is reasonably consistent from one topic to the next.

If, for example, a candidate performs well on the radiation protection section of the Radiography Exam, chances are that he or she will also do well on image production and evaluation. Scores on those two sections of the Radiography Exam are highly correlated.

Meanwhile, scores for two seemingly unrelated content categories (such as image production & evaluation and patient care) still exhibit a moderate correlation. This consistency in performance on different parts of a domain bodes well for domain sampling.
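The consistency described above can be illustrated with a toy simulation: candidates share a common ability factor, and each section score adds independent section-specific noise. The model and all numbers are invented for illustration; they are not ARRT data.

```python
# Illustrative simulation: two section scores that share a common ability
# factor end up substantially correlated, even though each has its own noise.
import random
import statistics

random.seed(0)

def pearson(x, y):
    """Population Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (statistics.pstdev(x) * statistics.pstdev(y) * len(x))

ability = [random.gauss(0, 1) for _ in range(2000)]
# Each section score = common ability plus independent section-specific noise.
sec_a = [a + random.gauss(0, 0.7) for a in ability]
sec_b = [a + random.gauss(0, 0.7) for a in ability]
print(round(pearson(sec_a, sec_b), 2))
```

The larger the section-specific noise relative to the shared ability, the lower the correlation; the moderate-to-high correlations the article reports suggest a strong common factor across content categories.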

Yeah, Right...But Does it Really Work?

When a person takes a test and earns a particular score, we want to be confident that the score is a reliable indication of that person's knowledge for the entire domain. Let's say Joe gets a score of 80 percent on a 25-question radiography exam one week. A week goes by and Joe takes another exam covering the same domain.

Unless Joe did some serious studying or was for some reason not himself the week before, we would expect him to get a score near 80 percent on the second exam. If Joe scored 60 percent on a second exam and 90 percent on a third we would become suspicious because the scores don't seem reliable.

At this point, we could blame Joe or we could blame the exam. If there were a lot of Joes taking a lot of exams, and their scores bounced around by more than a few points, then we should get really suspicious of the exam. Although many factors could contribute to score instability, an incomplete sampling of the domain is one of the more obvious ones. If you don't ask enough questions, you're not going to get reliable scores.

There are several ways to evaluate score reliability. The most common methods are based on the extent to which scores on tests from the same domain correlate with one another. Calculating reliability is like giving many tests from the same domain to many Joes. Reliability indices can fall between 0 and 1, and an index of 0.90 can be considered a good target for certification and registration exams.
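One widely used index of this kind is Cronbach's alpha, an internal-consistency coefficient computed from item-level scores (the source does not say which index ARRT uses, so this is an assumption). A minimal sketch on simulated 0/1 item responses:

```python
# Cronbach's alpha on simulated right/wrong (1/0) item responses.
# The response model is made up purely to show the calculation.
import random
import statistics

random.seed(1)
n_candidates, n_items = 500, 100
# Higher-ability candidates answer more items correctly.
abilities = [random.random() for _ in range(n_candidates)]
responses = [[1 if random.random() < 0.3 + 0.6 * a else 0 for _ in range(n_items)]
             for a in abilities]

# alpha = n/(n-1) * (1 - sum of item variances / variance of total scores)
item_vars = [statistics.pvariance([r[i] for r in responses]) for i in range(n_items)]
total_var = statistics.pvariance([sum(r) for r in responses])
alpha = n_items / (n_items - 1) * (1 - sum(item_vars) / total_var)
print(round(alpha, 2))
```

Intuitively, alpha is high when candidates' total scores vary much more than the item-by-item noise would predict, i.e., when items hang together as measures of the same domain.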

To evaluate the impact of test length on score reliability, we can do a short study of the ARRT Examination in Radiography. The data for this particular investigation were obtained from the October 1998 exam, which was taken by 2,564 first-time candidates. The Examination in Radiography consists of 200 questions. The total test reliability index for this group was 0.926.

For this study we will see what happens when the exam is artificially shortened. The study involves four steps: (1) randomly discard 20 questions from the exam; (2) recalculate the reliability index to see how much it goes down; (3) see whether the pass/fail decisions for any candidates have changed from fail to pass or from pass to fail; and (4) repeat the process for successively shorter exams by discarding 40, 60, 80, and 100 items.
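The four steps can be sketched with simulated responses standing in for the 1998 data. The response model and the pass mark below are hypothetical; only the design (discard items, recompute reliability, count flipped decisions) follows the study described above.

```python
# Rough sketch of the shortening study on simulated data (not the 1998 exam):
# discard random items, recompute Cronbach's alpha, and count pass/fail flips.
import random
import statistics

random.seed(2)

def alpha(responses):
    """Cronbach's alpha for a list of per-candidate 0/1 response vectors."""
    n = len(responses[0])
    item_vars = [statistics.pvariance([r[i] for r in responses]) for i in range(n)]
    total_var = statistics.pvariance([sum(r) for r in responses])
    return n / (n - 1) * (1 - sum(item_vars) / total_var)

abilities = [random.random() for _ in range(1000)]
full = [[1 if random.random() < 0.3 + 0.6 * a else 0 for _ in range(200)]
        for a in abilities]
cut = 0.65  # hypothetical pass mark: 65% correct

passed_full = [sum(r) / 200 >= cut for r in full]
for n_keep in (200, 160, 120, 100):
    keep = random.sample(range(200), n_keep)
    short = [[r[i] for i in keep] for r in full]
    passed_short = [sum(r) / n_keep >= cut for r in short]
    flips = sum(a != b for a, b in zip(passed_full, passed_short))
    print(n_keep, round(alpha(short), 3), f"{100 * flips / len(full):.1f}% flipped")
```

As in the real study, reliability declines and decision flips accumulate as items are discarded, though the exact figures depend on which items are dropped.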

Show Me The Money

The table below shows the results of our investigation. It looks like tossing out 20 to 40 questions does not have a dramatic impact on test reliability. It is not until after exam length falls below 140 that notable changes in reliability start to occur.

Number of    Reliability    Decision Changes
Questions    Index          P to F    F to P
200          0.926          --        --
180          0.920          0.3%      0.7%
160          0.914          0.5%      0.9%
140          0.903          0.8%      1.3%
120          0.888          1.1%      1.3%
100          0.864          1.1%      2.3%

However, the reliability index does not tell the complete story, particularly for individual candidates. The table indicates that pass/fail decisions for some candidates would have changed if a shorter radiography exam had been given. An exam consisting of 180 items would change the decisions for 1 percent of the group (0.3% + 0.7% = 1.0%), while an exam consisting of 100 items would affect over 3 percent of the group. (Note: if this study were repeated by discarding different sets of items, the results would have been similar but not identical.)

The changes in pass/fail decisions can be regarded as the errors that would have resulted from shortening the radiography exam. Nonetheless, it is interesting that the exam could be cut in half (to 100 questions) and still produce the same pass/fail decision for almost 97 percent of the candidates.

In general a longer test is a better test. However, for most tests, there is a point of rapidly diminishing returns after exam length reaches about 150 items. The advantages of adding questions to an exam need to be weighed against costs such as test development expenses and testing time, both of which are related to examination fees. Candidate fatigue also becomes a factor for very lengthy exams. Finally, longer exams mean that more questions are exposed to test-takers, and for Boards like the ARRT that reuse test questions, overexposure of questions is a security concern.

But What If...?

The previous discussion may have prompted more questions than it answered, so let's address a few of them now.

Question: What would happen if the Radiography Exam were cut to 20 questions or increased to 300 questions?

Answer: Although the study reported in the table above would have allowed us to drop as many questions as we chose, there was no way that we could actually add questions. That's OK because there is a formula for estimating the impact of adding or deleting questions. The following graph shows the relationship between test length and reliability for exams. The curve illustrates an important point: the effects of adding or deleting items depend on how many items you start with. Taking 100 items away from a 150-item exam wreaks havoc on measurement precision, while adding 100 items to a 150-item exam makes only a minimal contribution.
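A standard formula for this purpose is the Spearman-Brown prophecy formula, which predicts reliability when a test's length is multiplied by a factor k (the source does not name the formula it uses, so identifying it as Spearman-Brown is an assumption). Starting from the 200-item exam with reliability 0.926:

```python
# Spearman-Brown prophecy formula: if a test's length is multiplied by k,
# the predicted reliability is k*r / (1 + (k - 1)*r). The starting values
# match the article's 200-item exam with reliability 0.926.

def spearman_brown(r, k):
    """Predicted reliability when test length is multiplied by factor k."""
    return k * r / (1 + (k - 1) * r)

r200 = 0.926
for n in (20, 50, 100, 150, 200, 300):
    print(n, round(spearman_brown(r200, n / 200), 3))
```

The predictions show the asymmetry described above: cutting to 20 items drops the predicted reliability below 0.6, while growing to 300 items gains only about two points over the 200-item figure.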

Score Reliability vs. Number of Questions

Question: Is the optimal number of questions for an exam usually around 200? Does the point of diminishing returns occur at about 150 items?

Answer: No. The ideal exam length varies according to the breadth of the domain, the similarity of the categories within the domain, the quality of the test questions, and other factors. For more compact domains the ideal length would be much lower than 200, while for broader domains the ideal length may be greater.

Question: The ARRT reports scores for individual sections of the exams, and yet these sections may have only 30 or 40 test questions. How reliable are these section scores?

Answer: The short answer is that section scores based on content categories are not very reliable. For example, on the Radiography Exam the radiation protection section (30 questions) has a reliability of around 0.70. Although this is not too bad, it certainly is not adequate for making pass/fail decisions. This is why the ARRT uses only total scores for determining pass/fail status, and provides section scores only as a general guide to a candidate's strengths and weaknesses in specific content categories. The key is to not overinterpret small differences in section scores.
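As a rough cross-check (assuming, perhaps optimistically, that section items behave like the full exam's items), the Spearman-Brown formula applied to a 30-item subset of the 200-item exam predicts a reliability in the same ballpark as the 0.70 quoted above:

```python
# Rough cross-check of the ~0.70 section reliability using Spearman-Brown.
# Assumes section items behave like the full exam's items, which is only
# approximately true in practice.

def spearman_brown(r, k):
    """Predicted reliability when test length is multiplied by factor k."""
    return k * r / (1 + (k - 1) * r)

print(round(spearman_brown(0.926, 30 / 200), 2))  # ≈ 0.65
```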

Question: Just why are some exams so long?

Answer: Sometimes for good reason, especially if the domain is very broad. However, exams for some professions are unnecessarily long. One reason is that a longer exam may appear to the casual observer to be more valid than a shorter one. Certain traditions seem to endure, even when they are contradicted by empirical evidence. The bottom line is that although longer exams can provide better content coverage and score reliability, there is a point of rapidly diminishing returns, and it is important to strike a balance between the costs and benefits of greater length.

Copyright The American Registry of Radiologic Technologists®