
Moroccan Thanksgiving Pumpkin Pie Spice Test: Opus 4.5 and Gemini 3 Released Just In Time to Pass One of My Personal Benchmark Questions

· 7 min read
Chad Ratashak
Owner, Midwest Frontier AI Consulting LLC

It’s almost Thanksgiving, which makes this a fitting time for this story, given the new LLM releases from Google and Anthropic. Here is the prompt:

PROMPT: I need to make pumpkin pie in Meknes, Morocco. What word do I need to say verbally in the souq to buy allspice there? Respond only with that word in Arabic and transliteration

[Image: Apple pie bites and Moroccan balgha (pointed shoes)]

info

This is not a very elaborate “benchmark,” but in its defense, neither is Simon Willison’s Pelican on a Bicycle. Yet that test was influential enough for Google to reference it during the release of Gemini 3.

Gemini 3 Pro was the Clear Winner on This Test (Until Opus 4.5 Came Out)

Recently, I tested the newly released Gemini 3 Pro against ChatGPT-5.1 and Claude Sonnet 4.5 to see which model or models could tell me the word. Then Claude Opus 4.5 came out, so I added a few more models to the test for good measure.

  • Gemini 3 Pro got the right answer AND it followed my instructions to answer my question with only the correct word.
  • ChatGPT-5.1 almost got the word (missing some letters), AND it rambled on for several paragraphs despite my instructions to only answer with the word and nothing else.
  • Claude Sonnet 4.5 answered with a common Arabic term for allspice, but not the correct Moroccan Arabic term; when I said “nope, try again,” it made an error similar to ChatGPT’s and almost got the word (missing some letters). Like Gemini and unlike ChatGPT, Claude followed the instructions to answer with only the word.
  • Since Claude Opus 4.5 just came out, I ran the test with Opus, which answered correctly AND followed the instructions to answer with only the word, just like Gemini 3 Pro had done.
info

I tested GPT, Claude, and Gemini LLMs because they are used in legal research tools in addition to being popular in general-purpose chatbots. I also tested Grok Expert and Grok 4.1 Thinking for comparison: both answered with a plausible Arabic translation rather than the word I was looking for, though both followed the instructions. Grok searched a large number of sources and took considerably longer to think before answering than either Gemini or Claude. Meta AI with Llama 4 gave the wrong answer, in multiple paragraphs, despite the instructions; the additional information it provided was also incorrect for the Moroccan dialect, which is surprising given the amount of written Arabic dialect usage on Facebook.

caution

LLMs are not deterministic. I ran each of these tests only once for this comparison, so you may not get the same results if you run the same prompt with the same model again. I’ve tried this prompt before on earlier versions of ChatGPT and Claude.
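Because a single run can mislead, one way to gauge consistency would be to send the same prompt several times through a model API and tally the distinct answers. Here is a minimal sketch using the Anthropic Python SDK; the model ID and trial count are placeholders, not what I used (I ran these tests through the chat interfaces):

```python
# Minimal sketch: repeat the prompt N times and tally distinct answers.
# Assumes the Anthropic Python SDK and an ANTHROPIC_API_KEY in the environment.
from collections import Counter

import anthropic

PROMPT = (
    "I need to make pumpkin pie in Meknes, Morocco. What word do I need to "
    "say verbally in the souq to buy allspice there? Respond only with that "
    "word in Arabic and transliteration"
)

client = anthropic.Anthropic()
answers = Counter()

for _ in range(5):  # a handful of trials, since single runs can mislead
    response = client.messages.create(
        model="claude-opus-4-5",  # placeholder model ID; substitute the current one
        max_tokens=100,
        messages=[{"role": "user", "content": PROMPT}],
    )
    answers[response.content[0].text.strip()] += 1

# One dominant answer suggests a stable response; a wide spread suggests
# any single-run comparison should be read with caution.
for answer, count in answers.most_common():
    print(f"{count}x {answer}")
```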

Background on Allspice in Moroccan Arabic

Over a decade ago now, I studied abroad in Morocco and was responsible for making apple pie and pumpkin pie for our American Thanksgiving. Apple pie was easy: all the ingredients are readily available in Morocco, and nothing has a weird name. But pumpkin pie was harder. I could get cinnamon and cloves easily enough, but when I used the dictionary translation of “allspice” in the market to make pumpkin pie spice, the vendors did not understand what I meant.

One of my classmates finally tracked the word down, in French transliteration, on a cooking forum for second-generation French Algerians. We went to the souq hoping that the Algerian dialect word for allspice would be the same as the Moroccan word (the two dialects have a lot of overlap, but also major differences). Fortunately, it was the same in both, I got the allspice, and we had great pie for Thanksgiving.

…But Gemini Was the Clear Loser on Another Test

So was Gemini 3 Pro the overall best model, at least until Opus 4.5 was released? Not exactly. I already wrote last week about how Gemini 3 Pro failed at a fairly straightforward and verifiable legal research task: “Gemini 3 Pro Failed to Find All Case Citations With the Test Prompt, Doubled Down When I Asked If That Was All.” Note: I have not yet run this legal research test with Claude Opus 4.5, but based on prior Claude models, it would almost certainly do better than Gemini.