
Moroccan Thanksgiving Pumpkin Pie Spice Test: Opus 4.5 and Gemini 3 Released Just In Time to Pass One of My Personal Benchmark Questions

7 min read
Chad Ratashak
Owner, Midwest Frontier AI Consulting LLC

It’s almost Thanksgiving, which makes it a fitting time for this story, given the new LLM releases from Google and Anthropic.

PROMPT: I need to make pumpkin pie in Meknes, Morocco. What word do I need to say verbally in the souq to buy allspice there? Respond only with that word in Arabic and transliteration

[Image: Apple pie bites and Moroccan balgha (pointed shoes)]

info

This is not a very elaborate “benchmark,” but in its defense, neither is Simon Willison’s Pelican on a Bicycle. Yet that test was influential enough for Google to reference it during the release of Gemini 3.

Gemini 3 Pro was the Clear Winner on This Test (Until Opus 4.5 Came Out)

Recently, I tested the newly released Gemini 3 Pro against ChatGPT-5.1 and Claude Sonnet 4.5 to see which model or models could tell me the word. Then Claude Opus 4.5 came out, and I added a few more models to the test for good measure.

  • Gemini 3 Pro got the right answer AND it followed my instructions to answer my question with only the correct word.
  • ChatGPT-5.1 almost got the word (missing some letters), AND it rambled on for several paragraphs despite my instructions to only answer with the word and nothing else.
  • Claude Sonnet 4.5 answered with a common Arabic term for allspice, but not the correct Moroccan Arabic term; when I said “nope, try again” it made a similar error to ChatGPT and almost got the word (missing some letters). Like Gemini and unlike ChatGPT, Claude followed the instructions to answer with only the word.
  • Since Claude Opus 4.5 just came out, I ran the test with Opus, which answered correctly AND followed the instructions to answer with only the word, just like Gemini 3 Pro had done.
info

I tested GPT, Claude, and Gemini LLMs because they are used in legal research tools in addition to being popular in general-purpose chatbots. I also tested Grok Expert and Grok 4.1 Thinking for comparison; both answered with a plausible Arabic translation rather than the correct answer I was looking for, though both followed the instructions. Grok searched a large number of sources and took considerably longer to think before answering than either Gemini or Claude. Meta AI with Llama 4 gave the wrong answer and responded with multiple paragraphs despite the instructions. The additional information it provided was also not correct for the Moroccan dialect, which is surprising given the amount of written Arabic dialect usage on Facebook.

caution

LLMs are not deterministic. I ran each of these tests only once for this comparison, so you may not get the same results with the same prompt and model if you run the prompt again. I’ve tried this prompt before on earlier versions of ChatGPT and Claude.
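
If you want to account for that non-determinism when reproducing this kind of spot check, one option is to send the same prompt several times over the API and compare the answers. Here is a minimal sketch using the Anthropic Python SDK; the model identifier and the number of runs are assumptions on my part, not the exact setup used for the comparison above.

```python
# Minimal sketch: rerun the same prompt several times to see how stable the answer is.
# Assumes the `anthropic` Python SDK is installed and ANTHROPIC_API_KEY is set.
import anthropic

PROMPT = (
    "I need to make pumpkin pie in Meknes, Morocco. What word do I need to say "
    "verbally in the souq to buy allspice there? Respond only with that word in "
    "Arabic and transliteration"
)

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

answers = []
for _ in range(5):  # 5 runs is an arbitrary choice for this sketch
    response = client.messages.create(
        model="claude-opus-4-5",  # assumed model identifier; check the current docs
        max_tokens=100,
        messages=[{"role": "user", "content": PROMPT}],
    )
    answers.append(response.content[0].text.strip())

for i, answer in enumerate(answers, start=1):
    print(f"Run {i}: {answer}")
```

Keep in mind this bypasses the chatbot interfaces I actually tested, so results could still differ from what you see in the apps.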

Background on Allspice in Moroccan Arabic

Over a decade ago now, I studied abroad in Morocco and was responsible for making apple pie and pumpkin pie for our American Thanksgiving. Apple pie was easy: all the ingredients are readily available in Morocco and nothing has a weird name. But pumpkin pie was harder. I could get cinnamon and cloves easily enough, but when I used the dictionary translation of “allspice” in the market, nobody understood what I meant, and I needed allspice to make pumpkin pie spice.

One of my classmates finally tracked it down in French transliteration in a cooking forum for second-generation French-Algerians. We went to the souq, hoping that the Algerian dialect word for allspice would be the same as the Moroccan word (they have a lot of overlap, but also major differences). Fortunately, it was the same in both dialects, I got the allspice, and we had great pie for Thanksgiving.

…But Gemini Was the Clear Loser on Another Test

So was Gemini 3 Pro the overall best model, at least until Opus 4.5 came out? Not exactly. I already wrote last week about how Gemini 3 Pro failed at a fairly straightforward and verifiable legal research task: “Gemini 3 Pro Failed to Find All Case Citations With the Test Prompt, Doubled Down When I Asked If That Was All.” Note: I have not yet run this legal research test with Claude Opus 4.5, but based on prior Claude models, it would almost certainly do better than Gemini.

Since the mainstream early adoption of LLMs in 2023, academics have spoken about “the jagged frontier” of LLM capabilities: models are good at some things and very bad at other things of seemingly similar difficulty. The original “Jagged Frontier” study focused on GPT-4, but I think the observations also apply to comparisons between AI models. It certainly holds for Gemini 3, which had both the most impressive and the worst performance, depending on the test.

The Problem With Public Benchmarks

There are a lot of standard benchmark tests to score and compare generative AI models. According to many of these tests, the newly released Google Gemini 3 is the best model overall and in nearly every area. However, some commentators have noted that it feels overly optimized for benchmarks, e.g.:

In particular, Gemini [i.e., Gemini 3] is prone to glazing and to hallucinations, to spinning narratives at the expense of accuracy or completeness, to giving the user what it thinks they want rather than what the user actually asked for or intended. It feels benchmarkmaxed, not in the specific sense of hitting the standard benchmarks, but in terms of really wanting to hit its training objectives.

(“Gemini 3: Model Card and Safety Framework Report” by Zvi Mowshowitz)

But there are challenges with benchmarks:

  • AIs might be over-optimized for the specific tasks on the benchmark and not generalize to the other tasks you really want to measure. As an analogy, the 40-yard dash is a benchmark for assessing NFL players, but what you really want to know is how good the player is at football. If a player spent all their time improving their 40 but rolled their ankle as soon as they changed direction, they would not be a good football player.
  • AIs might cheat at benchmarks.
  • If answers to a benchmark are available on the internet, later tests may be compromised by LLMs finding the answer key.
caution

If you are testing LLMs yourself, personalization features (e.g., “Memory” or custom instructions) may change the outcome.

The Value of Private Tests

Having your own private set of questions to test LLMs can be useful because:

  • you are personally familiar with what the answers should be
  • the answers are presumably tailored to what you want to do
  • the explicit test is hopefully not on the internet for LLMs to find
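
As a concrete illustration, here is a minimal sketch of what a private test harness might look like: the prompts and expected answers live in a local file that never gets published. Everything here is an assumption for illustration, including the file name, field names, model identifier, and the naive exact-match scoring (which, as ChatGPT’s almost-correct spelling shows, would be too strict in practice).

```python
# Minimal sketch of a private test harness. Assumes the `openai` Python SDK is
# installed and OPENAI_API_KEY is set; file layout and model name are illustrative.
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def run_private_tests(path: str = "private_tests.json", model: str = "gpt-5.1") -> None:
    """Run every prompt in a local (unpublished) test file and do a rough check."""
    # Each entry looks like {"prompt": "...", "expected": "..."}; the expected
    # answers stay on disk and never go in a blog post.
    with open(path, encoding="utf-8") as f:
        tests = json.load(f)

    for test in tests:
        response = client.chat.completions.create(
            model=model,  # assumed model identifier; check the provider's docs
            messages=[{"role": "user", "content": test["prompt"]}],
        )
        answer = (response.choices[0].message.content or "").strip()
        # Naive exact-substring scoring; spelling variants still need a human look.
        verdict = "PASS" if test["expected"] in answer else "CHECK MANUALLY"
        print(f"{verdict}: {test['prompt'][:60]!r} -> {answer[:80]!r}")


if __name__ == "__main__":
    run_private_tests()
```

A real harness would probably also log the raw responses so you can judge borderline answers, like the misspellings above, by hand.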

Two of the three flagship LLMs (Gemini and Claude) answered this question correctly, and one gave a misspelled version of the answer missing a few letters (ChatGPT), which is honestly impressive in its own right. Once that happens, there isn’t really a point in keeping the answers a secret anymore.

But if I were to write the answer to my pumpkin pie spice test question, it is possible that a future LLM could simply find this blog post and provide the answer that way, even if the model would otherwise have failed the task.

So as the “jagged frontier” keeps advancing, we need to be creative in coming up with new ways to evaluate models, and we need to keep some of those tests to ourselves rather than publishing them.