2025
Evaluating Large Language Models as AP Essay Scorers
Introduction — by the Teacher
This year, students tested the potential of several leading large language models (LLMs) to do the work of Advanced Placement (AP) U.S. History essay graders. The models were provided with College Board scoring guidelines and were prompted to score sample AP U.S. History exam responses. For each essay, the models made decisions on awarding points for thesis, context, evidence, reasoning, and analysis, and also provided scoring rationales. These grading decisions and the rationales were then compared to College Board scoring commentary. Alignment between the LLMs and the College Board commentary was measured in two ways: (1) how often the models matched College Board point allocations and (2) how closely the rationales conveyed meaning similar to College Board rationales. More about the methodology and how the models performed appears below.
Note: Gemini was omitted from this study because, at the beginning of testing, our school's Gemini for Education account did not allow for uploading documents, meaning the College Board scoring rubric could not be processed.
Methodology
We used the free versions of the LLMs. Each model received the same instructions (see prompts, right) and the same sets of essays. Each model graded a total of 18 essays, which, along with College Board scoring rationales, were obtained from the College Board website. The essays used were actual student responses to long essay questions (LEQs) from the 2023 and 2024 AP U.S. History Exams. Specifically, the essays were part of PDF publications titled “AP United States History Sample Student Responses and Scoring Commentary Set 1” (2023, LEQs 2-4) and “AP United States History Sample Student Responses and Scoring Commentary Set 2” (2024, LEQs 2-4). Each of the 6 essay sets contained responses that College Board readers scored in the high, middle, and low ranges, based on the 0-6 scoring range outlined in the College Board’s LEQ rubric. The rubric appears on pages 520-521 of the AP U.S. History Course and Exam Description.
In order to create an overall ranking of LLM performance (see graph, right), we designed a measurement for rating the models' rubric scoring on a scale from 0 to 1, or a percentage-correct score. For each essay, we calculated the absolute difference between the LLM's total score and the College Board score, divided it by six (the total points on the LEQ rubric), and subtracted the result from one. While this method compares the total rubric scores, it does not account for variations in the scoring of individual rubric components. For each essay, we therefore also measured the extent to which each part of the rubric matched. This measurement is also on a scale from 0 to 1 and appears on the right-hand side of each spreadsheet (see spreadsheet, below). It was calculated by taking the absolute difference between the LLM's score and the College Board score for each part of the rubric, summing those differences, dividing the total by six, and again subtracting the result from one.
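To make the calculation concrete, here is a minimal sketch of both measures in Python. The function names, variable names, and sample scores are our own illustration, not values from our spreadsheets, and it assumes the per-essay agreement is computed as one minus the scaled difference, as described above.

```python
# Sketch of the two agreement measures described above (illustrative values only).

TOTAL_POINTS = 6  # maximum score on the College Board LEQ rubric

def overall_agreement(llm_total: int, cb_total: int) -> float:
    """Agreement on the total rubric score, on a 0-to-1 (percentage-correct) scale."""
    return 1 - abs(llm_total - cb_total) / TOTAL_POINTS

def component_agreement(llm_parts: dict, cb_parts: dict) -> float:
    """Agreement across the individual rubric components, also on a 0-to-1 scale."""
    total_diff = sum(abs(llm_parts[p] - cb_parts[p]) for p in cb_parts)
    return 1 - total_diff / TOTAL_POINTS

# Hypothetical essay: the LLM misses the thesis point but matches every other component.
cb  = {"thesis": 1, "context": 1, "evidence": 2, "reasoning_analysis": 1}
llm = {"thesis": 0, "context": 1, "evidence": 2, "reasoning_analysis": 1}

print(overall_agreement(sum(llm.values()), sum(cb.values())))  # 0.833...
print(component_agreement(llm, cb))                            # 0.833...
```

The two measures agree in this example, but they can diverge when component-level differences offset each other in the total score, which is why we tracked both.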
After looking at each LLM's rubric scoring accuracy, we evaluated how well their written rationales aligned with College Board scoring rationales. To do this, we used Tilores.io's free cosine similarity calculator, a tool that shows how closely two texts are aligned by converting their words and phrases into numbers. The calculator reports a similarity percentage, which we rescaled to a 0-to-1 value to be consistent with our rubric analyses. Read more about cosine similarity. — by Greyson, Connor, and Hadleigh; graph by Greyson
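For readers curious about what a cosine similarity calculator does under the hood, the sketch below is our own rough approximation using scikit-learn's TF-IDF vectorizer. It is not Tilores.io's implementation, and the two sample rationales are invented.

```python
# Rough approximation of a rationale-similarity check (not the Tilores.io tool itself).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

llm_rationale = "The thesis earns the point because it makes a historically defensible claim."
cb_rationale = "The response earned the thesis point by presenting a defensible claim with a line of reasoning."

# Turn both texts into numeric vectors, then measure how closely the vectors point
# in the same direction: 1.0 means identical wording, 0.0 means no overlap at all.
vectors = TfidfVectorizer().fit_transform([llm_rationale, cb_rationale])
similarity = cosine_similarity(vectors[0], vectors[1])[0][0]

print(f"Cosine similarity: {similarity:.2f}")
```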
Rubric Scoring
Thesis. In terms of matching the scoring of College Board readers, we found that some LLMs came closer than others, but all models demonstrated some shortcomings. In general, the models performed strongest when scoring the thesis part of the rubric. Both ChatGPT and Perplexity had a 94.4% match rate on the thesis, each making a mistake on only one of 18 essays. While the overall thesis match rate across all LLMs was 84.4%, it was surprising to see Co-Pilot fall below that at 77.8%. In a separate measure that looked at how closely the models aligned with the overall College Board scoring on each essay, Co-Pilot achieved a match rate of 86.1%, trailing only ChatGPT (89.8%). Thus, while Co-Pilot tended to perform well more broadly, it clearly fell short on the thesis point, missing 4 out of 18 times. We assume this falls below the performance of human readers trained by Advanced Placement scoring professionals. — by Hadleigh; graph by Gabby and Aria
Context. When it came to scoring the context point, the LLMs were largely split: some of the models, such as Claude, ChatGPT, and Grok, performed well, while the others struggled. Across all of the models, the average match rate for context was 74.0%. Claude performed best at 88.9%, correctly scoring 16 out of 18 essays. Perplexity performed the worst, matching only 55.5% of context points. The context part of the rubric is worth 1 point. When judging this point, the models needed to evaluate whether the response was situated within broader historical events or developments—the context, in other words. But because the point is not awarded for merely a phrase or a reference, and instead requires adequate development, the models had to apply another level of reasoning when deciding on it. This might explain why the scoring of context wasn't as successful as the scoring of thesis and evidence. — by Scarlet and Grace; graph by Gabby and Aria
Evidence. Next, in terms of strength of performance, was the evidence part of the rubric. On this part, where scoring options range from 0 to 2 points, Grok and Co-Pilot performed best with an 88.9% match rate, followed by ChatGPT and DeepSeek at 83.3%. Given that evidence is scored out of 2 points, it is impressive that the average match rate across the LLMs was 81.5%. Compared with the average for thesis scoring (84.4%), which is based on only 1 point, the results are very similar even though, statistically, there is a greater chance of error when deciding across a 2-point range. We feel this likely has to do with the wording of the rubric, which the models were able to interpret and apply to the essay samples. The rubric calls for two pieces of evidence and distinguishes between 1 and 2 points by the quality of the evidence and how it is used: 1 point for evidence "relevant to the topic" and 2 points for "specific and relevant evidence" that "support an argument." These criteria are clearer than what appears in the reasoning and analysis part of the rubric, the other 2-point section. — by Maya; graph by Maya and Sarah
Reasoning and Analysis. When it came to the Reasoning and Analysis part of the rubric, where scoring options range from 0 to 2 points, the models struggled to match College Board scoring. Both Grok and Claude assigned these points correctly only 33.3% of the time. And while ChatGPT, Co-Pilot, and DeepSeek doubled that percentage (66.7%), this still represented sizable scoring disagreement in comparison to the other sections of the rubric. The two-point value creates more room for error, but we think the difficulty for the models comes mostly from the rubric's language. While the first point, the reasoning point, simply asks whether appropriate historical reasoning (e.g., comparison, causation) structures the argument, the second point, for analysis, is much more involved: it addresses whether the essay portrays a more complex understanding of the material. We believe the models had trouble deciding this second point because of the many variables built into the rubric's "decision rules," a broad umbrella of possibilities requiring more critical thinking. — by Hadleigh; graph by Gabby and Aria
Rationale Similarity
We found that when models matched the College Board's overall rubric score, their rationales showed a higher cosine similarity than when the overall rubric score did not match. But the increase was slight, only a couple of percentage points. Grok, for example, which had the highest overall cosine similarity at 71.8%, increased only to 73.3% when its overall rubric scoring agreed with the College Board. Frankly, we expected this to be more pronounced. Looking at it the other way, we assumed that notable dissimilarity in rubric scoring would be reflected in notably lower cosine similarity, but this wasn't always the case. Perplexity, for example, had an overall cosine similarity of 66.3%, but on one essay (2023 02A), where it differed from the College Board's rubric scoring by two points, its cosine similarity was actually 2% higher than its average. And Co-Pilot's rubric scoring was relatively high overall, yet its rationale similarity was relatively low. However, it's important to note that, across all LLMs, whenever a cosine similarity score was 80% or higher for a particular section, that section was scored the same as the College Board. While the difference in rationales might limit the usefulness of LLMs as feedback tools, we feel it's important to recognize that the College Board's all-or-nothing, six-point scoring system could be the cause of some disparity. After all, like a human grader, a model could decide to award a point for one or more parts of an essay but, at the same time, convey in the rationale some level of weakness within those parts. In other words, a model could basically say good enough for the point, but just barely. — by Connor, Greyson, and Hadleigh; graph by Gabby and Aria
Our Data
About Us
"Evaluating Large Language Models as AP Essay Scorers" is the product of the 2024-2025 AP U.S. History class: Maya Bourque, Scarlet Lockhart, Sarah Malik, Connor Messenger, Grace Ritzmann, Stephen Roosevelt, Samantha Shughart, Gabby Soto, Greyson Spears, Hadleigh Spears, and Aria Varian.






