The Challenges of Using Generative AI & Automation for Item Writing

by Elijah Schwartz, MSEd, MS, BCBA-candidate, Senior Learning Strategist, Kaplan North America | August 28, 2023

The Case For Assessment Design 

Writing psychometrically valid tests is hard. 

I have a textbook that I hold up in meetings to make that point: Test Equating, Scaling, and Linking, by Kolen & Brennan (2014). The authors itemize the hundreds of nap-inducing details that must be handled exactly right to make a “perfect” assessment. Save yourself $100: it takes time, money, and a painstaking commitment to the process. That’s one reason why admission exams – especially in high-stakes fields like healthcare – cost $500+ each.

The investment is certainly worth it at scale, but can we get close enough with generative AI or automatic item generation (AIG)? Even if some number of questions are guaranteed to be “hallucinations,” there are still enough “viable” questions, right?

Unfortunately, that’s not good enough. 

The data shows that even a few non-test-like questions will negatively impact predictions about a student’s future performance. Dr. Lai, Kaplan’s Senior Assessment Specialist, points out that “even worse (for that student), using low-quality items wastes their time and could be distracting from effortful practice that would truly lead to success.”
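The effect is easy to demonstrate. Below is a minimal simulation (hypothetical numbers, not Kaplan data): a 40-question practice test in which a handful of items behave like coin flips. Even a few of them measurably weaken how well the total score predicts the underlying ability the test is supposed to capture.

```python
import numpy as np

# Hypothetical simulation (not Kaplan data): how a few noise items
# weaken a practice test's ability to predict true performance.
rng = np.random.default_rng(0)
n_students, n_items = 2000, 40
ability = rng.normal(0, 1, n_students)      # latent skill per student
difficulty = rng.normal(0, 1, n_items)      # difficulty per question

def total_scores(n_bad):
    """Total scores when the last n_bad items are pure coin flips."""
    # Well-written items follow a simple Rasch-style response model.
    p = 1 / (1 + np.exp(-(ability[:, None] - difficulty[None, :])))
    p[:, n_items - n_bad:] = 0.5            # noise items: 50/50 guessing
    return (rng.random((n_students, n_items)) < p).sum(axis=1)

for n_bad in (0, 4, 8):
    r = np.corrcoef(total_scores(n_bad), ability)[0, 1]
    print(f"{n_bad:2d} noise items -> score/ability correlation {r:.3f}")
```

Every noise item does double damage: it adds random variance to the score, and it displaces a question that could have carried real signal.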

Curation vs Creation

Overall, the argument is for quality over quantity, but that quality is often hard to distinguish and appreciate. At Kaplan, it’s a difference the brand stakes its name on.

Let’s take a scenario: we want to test whether kindergarteners can tie their shoelaces. 

Specifically, we need to add practice questions that the students can complete ahead of the final shoelace exam (such high stakes!). Simply creating more questions could be as easy as copying old questions and changing the color of the shoelaces, or asking the same prompt in several subtly different ways. Boom, even with some light QA we’ve got 20 or 30 new questions in the span of a few hours. As any teacher who’s made exams this way will tell you, this can lead to questions acting almost like flashcards, with students able to recognize the answer they’re supposed to pick just by seeing variations enough times. 
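To make that concrete, here is a hypothetical sketch of that kind of “creation” (the template and its variables are invented for illustration): one item cloned twenty ways by swapping surface features, while the underlying skill being tested never changes.

```python
from itertools import product

# Hypothetical sketch: minting "new" questions by swapping surface
# features of one template. Every clone tests the same single recall.
TEMPLATE = ("Sam has {color} shoes with {material} laces. "
            "What is the FIRST step to tie them?")

colors = ["red", "blue", "green", "purple", "orange"]
materials = ["cotton", "nylon", "leather", "elastic"]

clones = [TEMPLATE.format(color=c, material=m)
          for c, m in product(colors, materials)]

print(len(clones), "questions from one item")   # 20 clones, 1 skill
print(clones[0])
```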

But we can (and should) strive to do better.

Ideally, we’d hire an expert shoe-lacer who knows the pain points of learning this skill and can write nuanced questions to drive understanding. Then we’d take a look at the official kindergarten shoe-lacing exam blueprint and make sure everything matches what the student actually needs to know or perform. We’d bring in a psychometrician (or a whole team) and work on evaluating which questions are performing well.

This is curation, and it's why it might take a year or more to see 1,500 new questions released for any one test-prep product that Kaplan offers.
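Part of what makes curation slow is that “performing well” has a measurable meaning. One common classical statistic is the point-biserial correlation between an item and the rest of the test; a sketch of that check follows (the simulated data and the 0.20 review threshold are illustrative conventions, not Kaplan’s actual criteria).

```python
import numpy as np

# Hypothetical sketch of classical item analysis on simulated data:
# flag questions whose results don't track overall performance.
rng = np.random.default_rng(1)
ability = rng.normal(0, 1, 500)
difficulty = rng.normal(0, 1, 10)
p = 1 / (1 + np.exp(-(ability[:, None] - difficulty[None, :])))
p[:, -1] = 0.5                               # plant one noise-like item
responses = (rng.random((500, 10)) < p).astype(int)

totals = responses.sum(axis=1)
for j in range(10):
    item = responses[:, j]
    rest = totals - item                     # score on the other items
    r_pb = np.corrcoef(item, rest)[0, 1]     # point-biserial discrimination
    flag = "  <- review" if r_pb < 0.20 else ""
    print(f"item {j}: p = {item.mean():.2f}, r_pb = {r_pb:.2f}{flag}")
```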

Automatic Generation Risks

The kindergarten shoelace exam might seem too simplistic at first, but it’s useful to follow the example through. Each handcrafted question would pinpoint exactly one lacing skill at a time; the answer options would have common mistakes to learn from; additional shoe expert(s) would review and QA; there would be beta tests, and questions would be regularly updated based on performance “in the field” over time.

That last part might seem obvious, but it points to a very real opportunity cost.

Practicing with non-test-like questions (even if they seem on-topic) can snowball into squandered time and needless hairsplitting. Further, Dr. Lai reminds us that “having high-quality assessments alone still isn’t enough to move the needle on learning. To really see the value of measurement, the information must be applied. It is the action following that matters – the best tests allow timely interventions and specific feedback with targeted remediation.”

It bears noting that even Kaplan’s preeminent QBanks have a companion book or prep course that can help correct missteps, or boost good performance toward greatness.

And that brings us back to whether AI can be “good enough” for now. 

Even the most advanced AI systems (as of this writing) that can get 80% or 90% of their output questions to match a test blueprint like an expert still require a person in the pipeline confirming, reviewing, and updating. And no current AI system can connect a student’s various wrong answers to the most efficient path forward, not even with a fraction of a human expert’s fluency. It might still take a human to teach a human, for some things.
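In pipeline terms, that caveat looks something like the sketch below. Every function here is a placeholder invented for illustration (no real Kaplan or vendor system is being described); the structural point is simply that a human reviewer gates every item before release.

```python
# Hypothetical sketch of a human-in-the-loop item pipeline. All three
# steps below are placeholders, not real APIs or Kaplan systems.

def generate_candidates(prompt: str, n: int) -> list[str]:
    """Stand-in for a generative-AI call that drafts n candidate items."""
    return [f"Draft item {i}: {prompt}" for i in range(n)]

def matches_blueprint(item: str) -> bool:
    """Stand-in for an automated check against the exam blueprint."""
    return True  # the text suggests 80-90% of drafts might pass

def expert_review(item: str) -> bool:
    """A human expert confirms, edits, or rejects each surviving item."""
    return bool(item)  # stand-in for real editorial judgment

def item_pipeline(prompt: str, n: int) -> list[str]:
    drafts = generate_candidates(prompt, n)
    on_blueprint = [d for d in drafts if matches_blueprint(d)]
    return [d for d in on_blueprint if expert_review(d)]  # human gate last

approved = item_pipeline("first crossover step of tying a shoelace", 30)
print(len(approved), "items survive automated checks plus human review")
```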

Where We Go From Here

Even providing Kaplan’s QBank to students – otherwise undirected by faculty – can be extremely useful. A purposeful and balanced assessment system that integrates multiple types of high-quality assessments to measure learning will still allow students to make informed decisions about their instructional next steps (Chappuis, Commodore, & Stiggins, 2017). 

In a world that seems to want AI shortcuts and automatic generation plugged in to solve everything from marketing messages to legal contracts, it’s comforting to still be able to depend on the rigor and care with which some item banks are created and maintained. And, rather than force educators to read a textbook on scaling and equating exams, we want to support capable teachers toward what matters: focusing on their students, even the ones who are just learning how to tie laces!

Let the test experts do the work of untangling the tests. 

Have questions about medical school or the MCAT? Looking for Dental or Optometry prep? Learn more about how Kaplan can help, including support for your USMLE or NCLEX prep.

Eli Schwartz has been a thought leader in the Learning Experience space for nearly two decades. Having migrated from training & development, through instruction & curriculum, he presently advises senior leadership on educational best practices. Schwartz's strongest insights often come in the form of paradigm-shifting conversations around the value of online instruction and how to best implement collaborative learning events.
