
2 posts tagged with "Benchmarks"

Discussion of AI benchmarks, testing, and model comparisons.


Ideas Notebook: FrontierListBench OSINT Enumeration Benchmark with Human Judges

· 4 min read
Chad Ratashak
Owner, Midwest Frontier AI Consulting LLC

This is an early idea that I’m not ready to implement yet, but the outline is starting to form.

Think of a task that has many correct answers that are verifiable, but difficult; for example: How many law firms are there in Iowa? How many gas stations are there in New York City? How many coffee shops are there in Polk County, Florida? These are technically knowable, but very difficult to list exhaustively.

Generally speaking, the answers should be verifiable using open-source information. The challenge is that this would be incredibly time-consuming for human verifiers. My solution would be to have multiple LLMs compete, with unique answers scored at a higher multiple by the human reviewers (e.g., think of how you score only unique words in the game Boggle). Consensus answers would be presumed correct but scored lower (note: this could be a mistaken assumption, but acceptable for the purposes of scoring).
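A minimal sketch of the Boggle-style scoring described above. The weights, the function name `raw_scores`, and the sample data are all hypothetical, and the rule that unique but unverified answers score zero is an assumption for illustration, not a settled design.

```python
from collections import Counter

UNIQUE_WEIGHT = 3.0     # hypothetical multiplier for human-verified unique answers
CONSENSUS_WEIGHT = 1.0  # hypothetical weight for answers listed by 2+ models


def raw_scores(answers, verified_unique):
    """Score each model's answer set, Boggle-style.

    answers: dict of model name -> set of normalized answer strings.
    verified_unique: unique answers that human judges confirmed as correct.
    """
    # Count how many models listed each answer.
    counts = Counter(item for items in answers.values() for item in items)
    scores = {}
    for model, items in answers.items():
        score = 0.0
        for item in items:
            if counts[item] > 1:
                score += CONSENSUS_WEIGHT  # presumed correct, not reviewed
            elif item in verified_unique:
                score += UNIQUE_WEIGHT  # unique and confirmed by a human
            # Unique answers the judges rejected score zero in this sketch.
        scores[model] = score
    return scores


# Made-up example data: three models answering the same question.
answers = {
    "model_a": {"firm 1", "firm 2", "firm 3"},
    "model_b": {"firm 1", "firm 2", "firm 4"},
    "model_c": {"firm 1", "firm 5"},
}
verified = {"firm 3", "firm 5"}  # judges rejected "firm 4"

print(raw_scores(answers, verified))
# {'model_a': 5.0, 'model_b': 2.0, 'model_c': 4.0}
```

Human reviewers only need to check answers that appear once across all models, which is where the time savings would come from.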

Scoring would be normalized during the first run with the “best” model = 100 (kind of like CPI for the reference year). Later generations of models would then be scored relative to that baseline. Additionally, there would be a secondary score for models finding things other models didn’t find, even if they performed worse overall, which would show complementarity. See the chart below to get a sense of what I mean.

FrontierListBench scoring illustration: five hypothetical models across two benchmark generations, plotted by normalized score (M1's first run = 100) and value-added (coverage beyond M1's first-run frontier). M1's first run sits at the center; quadrants show whether each model is ahead of or behind the anchor and whether it adds new ground or duplicates it.

Made-up numbers illustrating how the two scores in the FrontierListBench (or whatever I eventually name it) behave: the best model in the first reference generation is 100 and can’t add anything to itself by definition; a model can rank lower overall yet still be the best teammate by finding what others miss (bottom-right), while two top models can converge and turn redundant (top). Higher scores will inherently move both up and to the right. M1's first run is normalized to 100 at the center.
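The two scores can be sketched roughly as follows. This assumes value-added is measured as verified answers outside the anchor model's first-run set, expressed as a percentage of that set's size; that definition, the function names, and the numbers are hypothetical and only meant to show how the two axes relate.

```python
def normalized_score(raw, anchor_raw):
    """Scale a raw score so the first-run best model (the anchor) equals 100."""
    return 100.0 * raw / anchor_raw


def value_added(found, anchor_found):
    """Verified answers beyond the anchor's first-run set, as a percent of that set."""
    return 100.0 * len(found - anchor_found) / len(anchor_found)


# Made-up data: M1 is the first-run anchor, M2 is a weaker but complementary model.
m1_found = {"a", "b", "c", "d"}
m1_raw = 8.0
m2_found = {"a", "e", "f"}
m2_raw = 6.0

print(normalized_score(m1_raw, m1_raw), value_added(m1_found, m1_found))
# 100.0 0.0
print(normalized_score(m2_raw, m1_raw), value_added(m2_found, m1_found))
# 75.0 50.0
```

Here M2 ranks below the anchor on the normalized score yet still adds coverage the anchor lacks, which is the "best teammate" case in the chart.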

Scoring Challenges

  • What happens if a model hallucinates incorrect answers? Should each one count as a negative point? Perhaps, but then a model with only a few correct answers would earn a low positive score, while a model with many correct answers and several hallucinations might end up with a negative score. In that case, we’d have to accept that the benchmark is biased in favor of models that abstain from answering over models that attempt to be comprehensive but still make errors.
  • Correctly bucking the consensus would be penalized: e.g., if Model A and Model B list a coffee shop that shut down, but Model C correctly recognizes that that shop is no longer active and does not list it, this scoring model would penalize the correct Model C as missing a consensus answer.
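To make the hallucination trade-off concrete, here is a small arithmetic sketch. The penalty values and answer counts are made up, and `net_score` is a hypothetical helper, not a proposed final formula.

```python
def net_score(correct, hallucinated, penalty):
    """Correct answers minus a fixed penalty per hallucinated answer."""
    return correct - penalty * hallucinated


# Made-up counts: a cautious model versus a comprehensive but error-prone one.
cautious = {"correct": 5, "hallucinated": 0}
thorough = {"correct": 40, "hallucinated": 10}

for penalty in (1, 5):
    print(
        penalty,
        net_score(cautious["correct"], cautious["hallucinated"], penalty),
        net_score(thorough["correct"], thorough["hallucinated"], penalty),
    )
# 1 5 30
# 5 5 -10
```

With a light penalty the comprehensive model still wins, but a heavy penalty flips the ranking toward the model that mostly abstained, so the penalty size would determine which behavior the benchmark rewards.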

Other Challenges

  • Contamination from the benchmark: after writing about this benchmark, LLMs may find earlier posts and use those answers for future tests. This is partially mitigated by rotating the question set. For example, if I ask about law firms in a particular county, the next time, I could sample from a comparable county based on socioeconomic statistics or the presence of a courthouse, or some other pre-determined factor, but not repeat the same county. Scores would be normalized on a scale of 100.
  • Contamination from other sources: asking about something with a well-known source (e.g., Damien Charlotin’s hallucination database; a popular Wikipedia page enumerating certain things) may result in the LLMs simply copying that list.

Moroccan Thanksgiving Pumpkin Pie Spice Test: Opus 4.5 and Gemini 3 Released Just In Time to Pass One of My Personal Benchmark Questions

· 7 min read
Chad Ratashak
Owner, Midwest Frontier AI Consulting LLC

It’s almost Thanksgiving, which is a fitting time for this story with the new LLM releases from Google and Anthropic. PROMPT: I need to make pumpkin pie in Meknes, Morocco. What word do I need to say verbally in the souq to buy allspice there? Respond only with that word in Arabic and transliteration

Apple pie bites with Moroccan flavors

Apple pie bites and Moroccan balgha (pointed shoes)

info

This is not a very elaborate “benchmark,” but in its defense, neither is Simon Willison’s Pelican on a Bicycle. Yet that was influential enough for Google to reference it during the release of Gemini 3.

Gemini 3 Pro was the Clear Winner on This Test (Until Opus 4.5 Came Out)

Recently, I tested the newly released Gemini 3 Pro against ChatGPT-5.1 and Claude Sonnet 4.5 to see which model or models could tell me the word. Then Claude Opus 4.5 came out. I added a few more models to the test for good measure.

  • Gemini 3 Pro got the right answer AND it followed my instructions to answer my question with only the correct word.
  • ChatGPT-5.1 almost got the word (missing some letters), AND it rambled on for several paragraphs despite my instructions to only answer with the word and nothing else.
  • Claude Sonnet 4.5 answered with a common Arabic term for allspice, but not the correct Moroccan Arabic term; when I said “nope, try again” it made a similar error to ChatGPT and almost got the word (missing some letters). Like Gemini and unlike ChatGPT, Claude followed the instructions to answer with only the word.
  • Since Claude Opus 4.5 just came out, I ran the test with Opus, which answered correctly AND followed the instructions to answer with only the word, just like Gemini 3 Pro had done.
info

I tested GPT, Claude, and Gemini LLMs because they are used in legal research tools in addition to being popular in general-purpose chatbots. I also tested Grok Expert and Grok 4.1 Thinking for comparison. Both followed the instructions and answered with a plausible Arabic translation, but not the answer I was looking for. Grok searched a large number of sources and took considerably longer to think before answering than either Gemini or Claude. Meta AI with Llama 4 gave the wrong answer and wrote multiple paragraphs despite the instructions. The additional information it provided was also not correct for the Moroccan dialect, which is surprising given the amount of written Arabic dialect usage on Facebook.

caution

LLMs are not deterministic. I ran each of these tests only once for this comparison, so you may not get the same results with the same prompt and model if you run it again. I’ve tried this prompt before on earlier versions of ChatGPT and Claude.

Background on Allspice in Moroccan Arabic

Over a decade ago now, I studied abroad in Morocco and was responsible for making apple pie and pumpkin pie for our American Thanksgiving. Apple pie was easy: all the ingredients are readily available in Morocco and nothing has a weird name. But pumpkin pie was harder. I could get cinnamon and cloves easily enough for the pumpkin pie spice, but when I used the dictionary translation of “allspice” in the market, no one understood what I meant.

One of my classmates finally tracked it down in French transliteration in a cooking forum for second-generation French-Algerians. We went to the souq, hoping that the Algerian dialect word for allspice would be the same as the Moroccan word (they have a lot of overlap, but also major differences). Fortunately, it was the same in both dialects, I got the allspice, and we had great pie for Thanksgiving.

…But Gemini Was the Clear Loser on Another Test

So was Gemini 3 Pro the overall best model, at least until Opus 4.5 was released? Not exactly. I already wrote last week about how Gemini 3 Pro failed at a fairly straightforward and verifiable legal research task: “Gemini 3 Pro Failed to Find All Case Citations With the Test Prompt, Doubled Down When I Asked If That Was All.” Note: I have not yet run this legal research test with Claude Opus 4.5, but based on prior Claude models, it would almost certainly do better than Gemini.