I gave a quiz last Tuesday that took me about forty-five seconds to build.
It covered cellular respiration, had eight questions, and caught a misconception about ATP that I probably wouldn’t have spotted until the unit test. That quiz did more for my third-period class than the review worksheet I’d spent an evening writing the week before.
I want to be honest about this: I was skeptical of AI-generated quizzes. I teach biology, and for years I’ve believed that writing my own questions is part of knowing my students. I still believe that, mostly. But I’ve also come to believe something else, which is that the number of low-stakes quizzes I should be giving far exceeds the number I have time to write.
The research case for more quizzes
The evidence behind retrieval practice isn’t new, but it’s stronger than most teachers realize. Roediger and Karpicke’s 2006 studies at Washington University in St. Louis showed that students who took practice tests retained significantly more material over time than students who spent the same amount of time rereading their notes. The margins were not small: on delayed recall tests given days later, the tested group clearly outperformed the restudy group.
This idea, usually called the testing effect, has been widely replicated ever since. A 2021 systematic review by Agarwal, Nunes, and Blunt examined 50 classroom experiments with over 5,000 students; fifty-seven percent of the effect sizes were medium or large. An earlier classroom study found that students scored 94 percent on material they’d been quizzed on versus 81 percent on material they’d studied but never been tested on, a difference that persisted months later.
What strikes me about this research is how little of it has seeped into everyday teaching practice. We talk about formative assessment in professional development sessions. We know the theory. But the day-to-day reality is that most teachers do maybe one or two low-stakes checks a week, if that. Black and Wiliam’s landmark review of formative assessment found effect sizes between 0.4 and 0.7, placing it above nearly every other classroom intervention that has been studied. Yet the implementation gap persists, and I think the reason is simple: writing good quizzes takes time we don’t have.
The problem of time is real
I tried to maintain a question bank. I used Google Forms to build quick checks. I once even had students write questions for each other, which is a good activity but doesn’t reliably produce questions that test the right things.
The bottleneck is always the same. Writing a good multiple-choice question with plausible distractors requires serious thought. Eight of them take at least half an hour to write if you want the wrong answers to reflect students’ actual misconceptions rather than obvious throwaways. Multiply that by five preps and the math stops working: that’s two and a half hours of question writing for a single day of quizzes. So I end up giving fewer quizzes than the research says I should. I suspect most teachers are in the same position.
Differentiation makes it worse. I have students reading at a ninth-grade level and students reading at a college level in the same room. One quiz doesn’t serve both groups well, and writing two versions doubles the time.
What AI quiz generation actually looks like
This is where the tools made a difference for me. I started experimenting with AI quiz generators about a year ago, mostly out of curiosity, and I’ve kept using them because they genuinely save me time.
The core idea is simple. You give the tool your source material, either by pasting text or uploading a document, and it generates questions: multiple choice, true/false, short answer. You can usually choose the format and adjust the difficulty. Tools like Quizgecko’s AI quiz generator let you feed in a lesson plan or a PDF chapter and get a full question set back in under a minute. I’ve also been using Google Forms with its newer AI features, and I keep Anki around for spaced-repetition flashcard work with my AP students.
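For readers who want to peek under the hood, the underlying pattern is easy to sketch. What follows is my own illustration, not how Quizgecko or Google Forms actually works: it assumes an OpenAI-style chat API, and the model name, prompt wording, and JSON format are all placeholders I chose for the example. The point is the shape of the workflow: over-generate, then edit.

```python
# A rough sketch of the generate-then-edit pattern, assuming an
# OpenAI-style chat API. Not any real product's implementation;
# the model name, prompt, and schema are illustrative choices.
import json

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def draft_quiz(source_text: str, n_questions: int = 15) -> list[dict]:
    """Over-generate questions so the teacher can cut and edit
    instead of writing from scratch."""
    prompt = (
        f"Write {n_questions} multiple-choice questions based only on "
        "the text below. Each wrong answer should reflect a plausible "
        "student misconception, not an obvious throwaway. Respond with "
        'a JSON array of objects with keys "stem", "choices" (four '
        'strings), and "answer".\n\n' + source_text
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    # A real tool would validate the model's output; this sketch
    # assumes it returns clean JSON.
    return json.loads(response.choices[0].message.content)
```

The one design choice worth noticing is asking for more questions than you plan to keep. The review-and-cut step stays with the teacher, which is exactly where it belongs.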
What surprised me was the quality of the distractors. The wrong answers aren’t random. They tend to reflect common misunderstandings, which is exactly what you want in formative assessment. They don’t always hit the mark, but they do often enough that I can start from the generated set and edit instead of building from scratch.
This shift, from writing to editing, is the real time saver. I spend five to ten minutes reviewing and correcting a quiz that would have taken me thirty or forty minutes to write from scratch. Over a week, it adds up.
Keeping the teacher in the loop
Let me be clear: I don’t give these quizzes to students without reading them first. That would be a mistake, and it would also miss the point.
Reviewing AI-generated questions actually forces you to think about what your students need to know. When I scan a set of ten questions and delete three of them, the reasons I delete them are informative. Maybe a question tests vocabulary when I wanted to test application. Maybe it’s ambiguous in a way that would confuse my English learners. These decisions are still mine, and they should be.
What I’ve started doing is generating a larger set than I need, maybe fifteen questions, and whittling it down to eight or ten, keeping the ones that address the specific learning objectives for that lesson. Sometimes I rewrite a question stem to match how we actually discussed the topic in class. Sometimes I add a question the AI didn’t think of, because I know from last year that students struggle with a certain graph.
I mostly use them as entry tickets and exit tickets. Five questions at the beginning of class to activate prior knowledge, five at the end to check what stuck. Quizgecko and similar tools are fast enough that I can generate an exit ticket during my planning period before the last class of the day, based on what I’ve noticed students struggling with in earlier periods. That kind of responsive assessment was genuinely hard to do before.
Where AI quizzes fail
They’re not perfect, and pretending otherwise would undermine everything I’ve said so far.
The most common problem I see is questions that are technically correct but pedagogically shallow. AI tends to pull directly from the source text, which means it often generates recall-level questions when I want analysis-level ones. If your source material is a textbook chapter, you’ll get questions that test whether students remember facts from that chapter. You won’t always get questions that ask students to apply those facts to a new scenario.
Subject-specific issues come up too. In biology, I’ve seen questions where the AI confuses similar terms like “mitosis” and “meiosis” in contexts where the distinction matters. In one memorable instance, it generated a question about protein synthesis where all four answer choices were technically defensible depending on how you read the stem. A sharp student would have contested it, and I’d have been fielding complaints.
Math and foreign language teachers I’ve spoken to report similar problems. AI can generate volume, but it doesn’t always understand how difficulty progresses within a topic. It may create a question that requires knowledge students haven’t encountered yet, or test a skill at a level too simple to be useful.
None of this is disqualifying. It just means you review what you get. The tool gives you a first draft, not a finished product.
What this means for assessment practice
I think the real opportunity here is frequency, not automation. The research on retrieval practice is clear: students learn more when they’re tested frequently and at low stakes. The obstacle has always been time. If AI tools cut the cost of creating a quiz from thirty minutes to five, teachers can realistically quiz three or four times a week instead of once.
That matters more than whether the AI wrote a perfect question. A slightly imperfect quiz given on Wednesday is worth more than a perfect one you never got around to writing.
I’m not making a grand claim that AI is transforming education. My claim is smaller and practical: these tools let me do something I already knew I needed to do but couldn’t find the hours for. Cognitive science has been telling us for twenty years that retrieval practice works. The bottleneck has always been production. For me, at least, that bottleneck is gone.
My students still groan when I hand them a quiz.
Some things AI can’t fix.
