The Strawberry Juice is Not Worth the Squeeze
I stress tested OpenAI's Strawberry... and was not impressed
UPDATE: This blog post aged poorly.
For some context,
OpenAI has just released their o1-preview model, a beta test of their first “Orion” series models that use “Strawberry.” If all that sounds somewhat cryptic, it’s because they’ve deliberately chosen meaningless codewords for their secret projects in the same way the CIA and NSA do. Like the Operation Upshot-Knothole nuclear tests, the words were chosen more or less at random. I suspect OpenAI did this because Jimmy_Apples (a Twitter leaker) disclosed their “Arrakis” model back in 2023, and let’s just say the internet collectively disapproved of that model name.
So what is Strawberry and what does it do?
If you want to know a little more about it, you can take a much closer look at OpenAI’s release blog for the new model. The very short version is that it uses a form of self-prompting called “Chain of Thought” (CoT) reasoning, which is nothing new. The first CoT paper appeared back in January 2022 and was later published at NeurIPS.
OpenAI has obfuscated the reasoning from users, primarily because this deepens and widens its moat. Self-prompting (or what I originally called “metaprompting” way back in June 2022) is nothing new, but it isn’t the easiest thing to integrate into a stable, commercial product. So, long story short, OpenAI has commercialized a technique that is almost three years old.
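For the unfamiliar, the core trick is just a loop that feeds a model its own intermediate output back in. Here’s a minimal sketch, assuming a generic llm(prompt) completion function as a stand-in for whatever client you actually use; this is an illustration of the general technique, not OpenAI’s actual implementation:

```python
# Minimal chain-of-thought / self-prompting loop. llm() is a placeholder
# for your model client; this sketch is not OpenAI's method.
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

def solve_with_cot(question: str, steps: int = 3) -> str:
    scratchpad = ""
    for i in range(steps):
        # Feed the model its own intermediate reasoning back in.
        scratchpad = llm(
            f"Question: {question}\n"
            f"Reasoning so far:\n{scratchpad}\n"
            f"Continue with reasoning step {i + 1}."
        )
    # Final pass: distill the accumulated reasoning into an answer.
    return llm(
        f"Question: {question}\n"
        f"Reasoning:\n{scratchpad}\n"
        f"Give the final answer only."
    )
```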
Great, but what does this actually do?
The simplest analogy is that it enables LLMs to “think through” problems, or iterate on thoughts. If a single shot from ChatGPT is like Daniel Kahneman’s System 1 thinking, then Strawberry enables System 2 thinking. While this is certainly a step in the right direction, it’s nothing new, and it’s the direction I’ve been saying the field needs to go all along. In other words, it’s a rudimentary cognitive architecture.
Cognitive Architecture
Below is a video I released in May 2021 of a chatbot I built for Discord called RAVEN. This was my inaugural AGI project where I focused on alignment via cognitive architecture and multi-step self-prompting.
In fact, I published my very first book, Natural Language Cognitive Architecture, more than two years ago, in the early days of 2022.
You might be wondering: Dave, why are you driving home all these dates?
Great question: OpenAI has nothing new.
What this says about OpenAI
My read on this is that OpenAI has nothing innovative or new. As we’ve observed before, OpenAI makes good use of integrating the research of others and bringing it to market, but there’s no “secret sauce.”
In fact, when asked about building GPT-4, Sam Altman admitted that there was no big breakthrough; they were just good at integrating hundreds of little improvements, almost all of which came from the scientific research community. Such improvements include papers on attention windows (basically, algorithms that allow ChatGPT to “pay attention” to the right context of the conversation).
Still Fails Simple Tasks
During my livestream exploring o1-preview with my audience, I was given some “teaser questions”:
write a 10 word coherent sentence where each word has exactly 1 letter more than its previous word starting with first word having one letter and so on till tenth word having 10 letters
This first teaser question ultimately yielded no results. I killed it after 230 seconds.
My audience, however, came up with a few answers:
I am not free today, Jordan, because galaxies regularly caramelize
I am not some great prompt command follower brilliant mastermind
Okay, these aren’t super coherent, but close enough.
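If you want to verify those, a quick script (stripping punctuation before counting) confirms both answers hit the 1-through-10 letter pattern:

```python
import string

def letter_lengths(sentence: str) -> list[int]:
    """Letters per word, ignoring punctuation."""
    return [sum(c in string.ascii_letters for c in w) for w in sentence.split()]

for s in (
    "I am not free today, Jordan, because galaxies regularly caramelize",
    "I am not some great prompt command follower brilliant mastermind",
):
    print(letter_lengths(s))  # both print [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```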
Here’s another test that it completely failed (it never rendered results):
A city adds more buses to reduce delays and overcrowding during peak hours, but this leads to increased traffic congestion, higher fuel costs, and continued passenger dissatisfaction. Despite more buses, schedules remain unreliable. Identify the feedback loops and unintended consequences causing these issues. Propose a solution that addresses the root causes without creating new problems elsewhere in the system.
I looked at this for about 15 seconds and came up with two explanations:
The number of buses was not the constraint; the chokepoint is elsewhere.
Look at other things, such as bus routes, stops, fueling depots, and schedules.
So clearly its “reasoning” abilities are still limited, even on simple tasks. Claude, by the way, barfed out a good-enough answer immediately. So in this case, the juice is not worth the squeeze.
Now, with that being said, there were a few tests it did well on. Here’s one that requires a lot of reasoning, and that it ultimately found a good solution for:
You are faced with three gods, A, B, and C. Their names are True, False, and Random, but you do not know which is which. True always tells the truth. False always lies. Random sometimes tells the truth and sometimes lies—how Random behaves is completely unpredictable. You have to determine the identities of A, B, and C by asking three yes-no questions. Each question must be addressed to one god only, but you are allowed to ask different gods different questions. The gods understand English, but will answer all questions in their own language, where the words for "yes" and "no" are "ja" and "da," in some order. You do not know which word means "yes" and which means "no."
Strawberry iterated through each god and question, concocting a hypothesis about what question to ask each one and what its answer would mean either way. I don’t know whether the solution it rendered was accurate because the livestream moved on, but I was impressed with the reasoning regardless. It took a minute or two to render.
The Juice Isn’t Worth the Squeeze
So now we have a model that uses chain-of-thought reasoning whether it’s appropriate or not, at up to 100x the price of their existing models. This is not an exaggeration.
By comparison: o1-preview costs $15 per million input tokens and $60 per million output tokens, where GPT-4o runs $5 and $15 and GPT-4o mini runs $0.15 and $0.60, and every hidden reasoning token is billed at the output rate.
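Here’s a back-of-the-envelope comparison; the hidden reasoning-token count below is an assumption for illustration, since OpenAI doesn’t show it to you:

```python
# Rough per-query cost comparison (USD per 1M tokens, September 2024 prices).
# The reasoning-token count is an assumed figure; OpenAI bills reasoning
# tokens at the output rate but hides them from the user.
PRICES = {
    "o1-preview":  {"in": 15.00, "out": 60.00},
    "gpt-4o":      {"in":  5.00, "out": 15.00},
    "gpt-4o-mini": {"in":  0.15, "out":  0.60},
}

def query_cost(model, in_toks, out_toks, reasoning_toks=0):
    p = PRICES[model]
    return (in_toks * p["in"] + (out_toks + reasoning_toks) * p["out"]) / 1e6

# 500 tokens in, 500 out, and an assumed 5,000 hidden reasoning tokens:
print(query_cost("o1-preview", 500, 500, reasoning_toks=5_000))  # ~$0.34
print(query_cost("gpt-4o", 500, 500))                            # $0.01
print(query_cost("gpt-4o-mini", 500, 500))                       # ~$0.0004
```

Under those assumptions, the o1-preview answer costs roughly 900x the GPT-4o mini answer; tune the reasoning-token figure however you like and the gap stays enormous.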
My diagnosis is somewhat grim: OpenAI is compensating for a lack of underlying model progress by simply throwing tokens (money) at the problem. LangChain, an open source framework from a startup of the same name, has made chain-of-thought pipelines easy and practical, and it has been around for more than a year already.
So, OpenAI integrating chain-of-thought prompting into their flagship model is sort of like when Google and Apple integrated Swype (slide typing) into their keyboards.
I’m not the first person to notice that the hype is dying down, and I’m glad for it.
There has been a slew of financial and business commentators saying that the honeymoon is over and that generative AI is headed for the trough of disillusionment. AI companies can’t find product-market fit, let alone price-market fit.
You only have to ask yourself one question:
Is the marginal improvement in reasoning worth 100x the price?
The value prop is looking thinner and thinner.
Corporations Doing What Corporations Do
I’ve made the sausage; I know how to finetune models. Here’s an example of what I mean: I was finetuning ChatGPT-like chatbots six months before ChatGPT was released. There’s no magic to it, but if you’ve never done it, it can look like magic from the outside.
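To demystify it a little: a chat-style finetuning dataset is just example conversations serialized one per line. The snippet below writes a file in the JSONL schema OpenAI’s finetuning endpoint uses today (the example content is made up, obviously):

```python
import json

# Each training example is one conversation in the shape you want the
# finetuned model to imitate. One JSON object per line ("JSONL").
examples = [
    {"messages": [
        {"role": "system", "content": "You are a terse, helpful assistant."},
        {"role": "user", "content": "What's the capital of France?"},
        {"role": "assistant", "content": "Paris."},
    ]},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```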
OpenAI is now just gobbling up cool features that other people invented and perfected. They are no longer a science or research company, which would explain the mass exodus of actual scientists and researchers.
Long story short: OpenAI is very clearly following in the footsteps of Google, Microsoft, Meta, and Apple. They are simply becoming a business productivity shop, and there’s nothing wrong with that. It’s probably why Sam Altman has been making less noise about AGI lately.
Where I’d Go From Here
I don’t envy OpenAI or the other big AI shops. I’ve been saying for a while that AI is slowing down. I got more than a little hate for that observation, but OpenAI’s flop of a Strawberry release will almost certainly be the final nail in the coffin for a lot of business stakeholders. Sam Altman and others at OpenAI have talked about “intelligence too cheap to meter,” an idea that has fallen completely flat with the release of o1-preview.
Open source models are not that far behind frontier models, and Meta is working on commoditizing AI. If your business is predicated on a single technology (LLMs), your days are probably numbered. Unlike CPUs, which have dependencies and architectures, LLMs are totally fungible: they all use natural language, most of the data comes from the internet, and most of the research comes from arXiv.
OpenAI: Being joined at the hip with Microsoft sealed their fate long ago. Anthropic threw in with Amazon, which makes sense. The Big Tech companies, the ones with actually durable business models, were always going to win. If I were Sam, I’d be looking for the exits to start something new and fresh, like Ilya did. I bet this is why Greg Brockman is taking an extended leave of absence: they’ve got nothing, their mission is winding down, and the noose is tightening.
Microsoft: If I were Satya, I would probably start tightening my grip and squeezing. OpenAI is no closer to AGI, and the marginal utility of keeping OpenAI separate is diminishing. Microsoft has some stellar in-house R&D teams, plus experimental product teams. The usefulness of OpenAI is falling off a cliff. However, due to inertia and optics, Microsoft will have to play nice and let this play out naturally. OpenAI is running out of money, surviving on investment, and someone is going to have to pay up one day.
¯\_(ツ)_/¯
UPDATE: I realized I didn’t quite say where the path forward is. The path forward, to me, is very obviously multi-agent frameworks, such as Microsoft’s AutoGen or AgentForge.
From a business POV, you’ll get more transparency, explainability, and flexibility with multi-agent systems. They also let you bake in model routing (basically performing model arbitrage to optimize costs) and tool use.
Advantages of multi-agent systems versus these expensive models include:
Model fungibility: you can swap in faster, cheaper models when it makes sense and reserve the big expensive models for when it counts (see the sketch after this list). This saves a ton of money and makes the overall framework much faster than relying on a single gigantic model. This fast-cheap-first pattern is a permanent feature of optimization. TLDR: exploit the fast/cheap resource as much as possible.
Tool use: pick models that know how to use specific tools (function calling, API calls, etc.). That includes writing code, calling APIs, and generally building an extensible framework. It’s sort of like a microservices architecture with AMQP routing.
Flexibility of instantiation: When you have a bunch of agents, each with specializations, you can spin up or spin down agents just like we already do with containers (tiny virtualized applications) and microservices. In other words, enterprises already have a model to manage fleets of ephemeral worker nodes. Agents are the new worker nodes.
Transparency and traceability: you can monitor the communication between agents, with supervisor layers, manager layers, resource and RBAC layers, and so on. One monolithic model that does all its reasoning in a black box is untraceable, especially with OpenAI obfuscating the chain of thought. I don’t think enterprises, governments, or militaries will be willing to adopt that.
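To make the routing idea concrete, here’s a minimal sketch. The model names, costs, and the call_llm() helper are hypothetical placeholders, not AutoGen’s or anyone else’s actual API:

```python
# Minimal model-routing sketch: cheap model for easy traffic, expensive
# model only when it counts. All names here are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    model: str           # e.g. a small local model vs. a frontier model
    cost_per_1k: float   # rough cost, useful for budgeting and telemetry

def call_llm(model: str, prompt: str) -> str:
    """Stand-in for whatever client your stack actually uses."""
    return f"[{model}] response to: {prompt[:40]}..."

CHEAP = Agent("drafter", "small-fast-model", 0.0005)
SMART = Agent("reasoner", "big-expensive-model", 0.06)

def looks_hard(prompt: str) -> bool:
    # Toy heuristic; real routers use classifiers or model self-assessment.
    return len(prompt) > 500 or any(
        k in prompt.lower() for k in ("prove", "plan", "multi-step")
    )

def route(prompt: str) -> str:
    agent = SMART if looks_hard(prompt) else CHEAP
    # Every hop is observable: log which agent handled what, at what cost.
    print(f"routing to {agent.name} ({agent.model})")
    return call_llm(agent.model, prompt)

print(route("Summarize this paragraph for me."))
print(route("Plan a multi-step migration of our billing system."))
```

The same skeleton extends naturally to tool use (an agent whose “model” is a Python interpreter or an API client) and to spinning agents up and down like containers.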
Hey Dave
Thanks for the write-up and your conclusions. I agree with most of it. From my perspective, this is somewhat ironic: at the very beginning, after ChatGPT came out and GPT-4 was presented (with the famous “build me a website from a napkin sketch” demo), Sam toured around promoting the idea that many companies, researchers, and developers could build on top of the model, and that is exactly what happened with the specialized models (with either open source or proprietary models as the base).
Also, the obsession with LLMs doing math (and especially arithmetic): why would that be a goal for this approach when it’s so easy to connect the model to Python? I had a two-liner in my ChatGPT instructions to do everything math-related with Python only, and it never failed on the strawberry problem, for example. This is not to say that the LLM route is worthless for complex mathematical problems, but then just build a math-specialist model, one that doesn’t need to know anything about baseball or Game of Thrones. If an encompassing view is needed for a given problem, the router could still ask the GoT or baseballDB specialist...
That being said, I threw one of the problems at it again (maybe it was overloaded yesterday). It came up with:
---
Thinking for 47 seconds
Constructing a sentence
I’m creating a 10-word sentence where each word escalates by one letter, starting from 1 to 10 letters. Progressing through various coherence levels to shape a meaningful sentence.
Listing potential words
I’m identifying words of lengths from 1 to 10, aiming to construct a coherent sentence. This involves brainstorming various word choices to ensure clarity and relevance.
Selecting words
I’m looking at words of varying lengths, aiming to create a coherent sentence. The challenge is to choose words that fit the structured pattern.
Building upon the sentence
I’m crafting a sentence where each word increases in length. Initially, I chose "I am the," aligning with the escalating word lengths and adjusting to match criteria.
Structuring sentence patterns
I’m starting by listing words of lengths 1 to 10, aiming to form a coherent sentence. Progress is made by testing sentences and adjusting word lengths.
Crafting coherent sentences
I’m experimenting with various word lengths and arrangements to form meaningful sentences, but so far, none have been coherent. I’m narrowing down options and considering suitable words to create a sentence.
Evaluating choices
Hmm, I’m thinking through 7-letter words for the seventh spot. Options like "perhaps" and "consider" come to mind, but they don't fit.
Word selection
I evaluated 10-letter words like "wonderful" and tried incorporating smaller words to form a coherent sentence with the given letter lengths. Progress was made by aligning words with the required lengths.
Piecing together possibilities
I’m considering adjusting the sentence by using "I" or thinking of a sentence that fits the criteria. Progressing through different word lengths to form a coherent sentence.
Piecing together words
I’m assembling a sentence with various word lengths to ensure a balanced structure. Next up: crafting a cohesive sentence with selected words.
Evaluating sentence structures
I’m thinking through various sentence structures that increase in word length, considering words of specific lengths to fit the criteria. Progress is steady, and ideas are evolving.
numeracy
I’m organizing the sentence into silliness, selecting precise word lengths for coherence. This method requires careful word choice.
Tackling a pangram
I’m exploring the challenge of constructing a pangram with words ranging from 1 to 10 letters. Despite various attempts, adherence to word lengths and coherence remain elusive. Hm, that's intriguing.
Revisiting thoughts
I explored sentences with word lengths from 1 to 10 characters but found them incoherent. Attention shifts to coherent construction ideas.
Mapping out alternatives
I’m pulling together various sentence suggestions for the 1 to 10 letter pattern. Despite some progress, achieving a seamless and coherent sentence proves challenging.
Evaluating sentence structures
I’m mapping out various sentences to match specific word lengths, but so far, none align perfectly. Now, I’m considering if success is possible with the given constraints.
Connecting ideas
I’m thinking about pangrammatic windows, where each word adds a letter. Exploring sentences with incremental word lengths, despite some discrepancies.
Testing word sequences
I’m mapping out different word sequences to align with a progressive length pattern. Despite multiple attempts, none consistently meet the criteria.
After much thought, encountering multiple options with varied word lengths, the conclusion is uncertain on how to match the constraints effectively.
Evaluating sentence structures
I’m examining sentences with increasing word lengths, but they don’t align with the specified pattern. Considering admitting defeat.
I am sad when birds freely explore unknown mountain landscapes.
---
Not that bad ... but also not a very intelligent way to go about it. A missed chance is that the “thought process” doesn’t seem to be kept in the conversation memory (all the messages generated by self-prompting and maybe multi-sampling). That is, in a follow-up question it couldn’t build on these temporarily built “thought pieces” or intermediate results, something that would be crucial for real progress toward a more online-learning system (i.e., solving a problem once and, if successful, being able to remember the solution steps, and even better, what to avoid).
Here is the link (with the failed second attempt to build on the first one, and a failure on the next results as well): https://chatgpt.com/share/66e43d0a-6aa0-8007-a652-831e634f6bff