The Principal-Agent Problems, Part 2: Are Models Getting Dumber to Save Money? What the "Stealth Quantization" Hypothesis Tells Us About Trust, Information, and Incentives
I had originally planned to write this as a single post, but it keeps growing as more relevant news stories come out. So instead, this will become a series of posts on the competing incentives involved in creating “AI agents” and why that matters to you as the end user.
Multiple Principals, Multiple Agents (Not only AI)
You, as the user of AI tools, may choose software vendors who provide you access to their products with built-in AI features, including AI agents. These vendors might offer specialist software, such as Harvey, Westlaw, or LexisNexis for legal work, or Cursor and GitHub Copilot for coding; or generalist tools like Notion, Salesforce, or Microsoft Copilot. The AI features may be powered by one or more foundation models provided to those vendors by AI labs, such as Anthropic (Claude), OpenAI (ChatGPT), Meta (Llama), or Google (Gemini).
These relationships mean you face the principal-agent problem of hiring the vendor. But there is also the principal-agent problem of the vendors hiring the AI labs. Each party has its own incentives, and they are not perfectly aligned. There is also significant information asymmetry: the vendors know more about their software and AI model choices than you do, and the labs know more about their AI models than either you or the software vendors.
Lexis+ AI uses both OpenAI’s GPT models and Anthropic’s Claude models, according to its product page, as I mentioned in my analysis of the Mata v. Avianca case.
The Stealth Quantization Hypothesis
The area I'll focus on in this post is the allegation of stealth quantization. According to a wide range of commenters, mostly computer programmers and mostly heavy Claude users, there are certain times of day, or days of the week, when peak usage results in models "getting dumber," "getting lazier," "being lobotomized," or otherwise underperforming their normal benchmarks and perceived optimal behavior. According to these claims, users with high-value use cases (like someone modifying important source code) are better off scheduling their Claude work for off-peak hours so the "real model" runs. The claim is that, to save on computing costs during periods of high demand, Anthropic (or whichever AI lab) swaps out its flagship model for a quantized version while calling it the same thing.
So what is normal, non-stealth quantization? It's making an AI model smaller and cheaper to run, but less accurate. This is achieved by storing the model's weights at lower numerical precision, i.e., with fewer bits per weight (e.g., 16-bit, 8-bit, 4-bit). (Meta) By analogy, the penny was recently discontinued, so cash transactions are now rounded to end in 5 cents or 0 cents. Quantization does something similar to the precision of an AI model's weights: imagine rounding away the penny, then the nickel, then the dime, and so on.
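To make the rounding concrete, here is a minimal Python sketch of the simplest form of 8-bit quantization. It is illustrative only; the function names and the single shared scale factor are my own simplification, not any lab's actual method. But it shows the trade-off: each weight takes a quarter of the space of a 32-bit float, and each one is slightly off after rounding.

```python
# Toy post-training quantization: round float32 weights onto 256 integer
# levels (int8), then map them back. Real quantizers (GPTQ, llama.cpp's
# k-quants, etc.) are more sophisticated; this just shows the rounding.
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float32 weights to int8 plus one shared scale factor."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights from the int8 values."""
    return q.astype(np.float32) * scale

weights = np.random.randn(5).astype(np.float32)
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print("original:   ", weights)
print("restored:   ", restored)                       # close, but rounded
print("worst error:", np.abs(weights - restored).max())
```

Every weight is now stored in one byte instead of four, which is why quantized models are cheaper to host and can fit on consumer hardware; the price is the small rounding errors you can see in the output above.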
There are legitimate reasons to quantize models, such as reducing operating costs when the loss in accuracy is negligible for the intended use, or when the model needs to run on a personal computer. For example, Meta offers quantized versions of its Llama family of large language models that can run via Ollama on modern laptops or desktops with only 8GB of RAM. (Llama models available on ollama) These models have names that distinguish them from the non-quantized versions: "llama3:8b" is the 8-billion-parameter Llama 3 model, while "llama3:8b-instruct-q2_K" is a quantized version of the instruction-tuned variant of that same model.
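If you want to see that naming distinction in practice, here is a hedged example using the ollama Python client. It assumes you have Ollama running locally, have installed the client library, and have already pulled the quantized tag shown; the prompt is just for illustration.

```python
# Ask a quantized, instruction-tuned Llama 3 model a question by its explicit
# tag. Assumes Ollama is running locally and the model has been pulled with
# `ollama pull llama3:8b-instruct-q2_K`.
import ollama

response = ollama.chat(
    model="llama3:8b-instruct-q2_K",  # the quantized tag, not plain "llama3:8b"
    messages=[{"role": "user", "content": "In one sentence, what is quantization?"}],
)
print(response["message"]["content"])
```

The point of the explicit tag is transparency: you, the user, chose the quantized version, and the name tells you so.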
If all that terminology is confusing, here's the key point: AI labs have a lot of information about their AI models; you have a lot less, and you mostly have to take their word for it. They are also charging you for an all-you-can-eat buffet at which some of the heaviest customers cost them tens of thousands of dollars each.
Anthropic's Rebuttal
Users have accused Anthropic (and other AI labs) of running different versions of their flagship models at different times of day, while labeling them the same (e.g., Claude Sonnet 4) no matter which version is served. Hence “stealth quantization.”
Anthropic has denied stealth quantization, but it did acknowledge two model-quality problems that users had pointed to as evidence of stealth quantization, attributing them to bugs: “we never intentionally degrade model quality as a result of demand or other factors, and the issues mentioned above stem from unrelated bugs.” (Reddit; Claude)
No Inside Information, Just Incentives
My point here is not to render judgment on stealth quantization. To their credit, Anthropic admitted there were performance issues; they could have quietly patched the bugs without acknowledging the performance degradation, to avoid drawing attention to the stealth quantization hypothesis. But I think the episode is an excellent example of the mistrust caused by information asymmetry combined with misaligned economic incentives. There are outspoken computer programmers (primarily on Reddit and X) who are seemingly very knowledgeable, who are apparently heavy Claude users, and who are apparently convinced that Anthropic and other AI labs are engaging in stealth quantization. I don't know for certain what has led those users to believe that, and I don't know for certain whether or not there really is stealth quantization.
What I do know is that some users, while trying to prove stealth quantization, did prove there was a change in Claude's performance. And I know that the versions of Claude with and without the bugs were all called the same thing: "Claude Sonnet 4" and "Claude Haiku 3.5" before, during, and after the bug fixes. (Claude) So users noticed performance had degraded for a model that had the same name as before, even though it wasn't really behaving as the same model. At least some users were right that the thing called "Claude Sonnet 4" in mid-August was performing worse for some users than the thing called "Claude Sonnet 4" either before the bug or after the fix. And when Anthropic fixed it, the model was still called Claude Sonnet 4.
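That information asymmetry is visible right at the API surface. Here is a hedged sketch using Anthropic's Python SDK (the exact model identifier string is illustrative): all a caller can specify, and all the response echoes back, is a model name. Nothing in the request or response reveals how that name is being served at any given moment.

```python
# What a caller of the Claude API actually observes: a model name string goes
# in, and the same name comes back. (Model identifier shown is illustrative.)
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

message = client.messages.create(
    model="claude-sonnet-4-20250514",  # the only "version" the caller controls
    max_tokens=200,
    messages=[{"role": "user", "content": "In one sentence, what is quantization?"}],
)
print(message.model)            # echoes the requested model name, nothing more
print(message.content[0].text)
```

Whether that name corresponds to a buggy deployment, a fixed one, or (hypothetically) a quantized one is not something the caller can see.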
A Partial End to the All-You-Can-Eat Buffet
There was another reason for the "stealth quantization" hypothesis: timing. The bugs were acknowledged as starting in August 2025. Anthropic had just implemented daily usage limits at the end of July 2025. According to Anthropic, some users on the $200/month Claude plans had been using tens of thousands of dollars' worth of Claude each month prior to the limits. As of November 2025, Anthropic has implemented weekly limits in addition to daily limits. The thinking goes that if the AI labs are facing too much demand, they could balance the load at peak times by serving up a quantized version of the model. Hence the complaints about supposedly "lobotomized" models during busy times on weekdays.
As I mentioned above, users are not the AI labs' only customers. The labs also provide foundation models to software vendors. For example, Lexis+ AI uses both OpenAI’s GPT models and Anthropic’s Claude models, including the Claude Sonnet 4 model that was impacted by the performance issues in August and September 2025. If you used software that relies on Claude Sonnet 4 during that time, was your work affected by the bug? It's hard to say, because the labs and the software vendors have more information than you and incentives to withhold it from you.
That's why it is helpful to receive generative AI training or governance consultation from a third party that is not selling you the AI software, such as the services offered by Midwest Frontier AI Consulting. We can't completely solve the incentive problems for you, but our focus on educating clients can prepare you to be a more informed consumer of whichever AI products you decide to use.