4 Comments
Sep 13 · Liked by David Shapiro

Hey Dave

Thanks for the write-up and your conclusions; I agree with most of it. From my perspective, this is somewhat ironic: at the very beginning, after ChatGPT came out and GPT-4 was presented (with the famous "build me a website from a napkin sketch" demo), Sam toured around promoting the idea that many companies, researchers, and developers could build on top of that model, and that is exactly what happened with the specialized models (either open source or with proprietary models as a base).

Also, the obsession with LLMs doing math (and especially arithmetic): why would that be a goal for such an approach when it is so easy to connect the model to Python? I had a two-liner in my ChatGPT custom instructions telling it to do everything math-related with Python only, and it never failed on, for example, the strawberry problem. This is not to say that the LLM route is worthless for complex mathematical problems, but then just build a math-specialist model that doesn't need to know anything about baseball or Game of Thrones. If an encompassing view is needed for a given problem, the router could still ask the GoT or baseball-DB specialist...
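
The instruction itself is trivial; below is a minimal sketch of the kind of snippet the model then emits for the strawberry problem (the exact code it generates will of course vary):

```python
# With a "do all math with Python" instruction, letter counting becomes
# exact string arithmetic instead of token-level guessing.
word = "strawberry"
count = word.count("r")
print(f"'{word}' contains {count} occurrences of 'r'")  # prints 3
```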

That being said, I threw one of the problems at it again (maybe it was overloaded yesterday).

Here is what it came up with:

---

Thinking for 47 seconds

Constructing a sentence

I’m creating a 10-word sentence where each word escalates by one letter, starting from 1 to 10 letters. Progressing through various coherence levels to shape a meaningful sentence.

Listing potential words

I’m identifying words of lengths from 1 to 10, aiming to construct a coherent sentence. This involves brainstorming various word choices to ensure clarity and relevance.

Selecting words

I’m looking at words of varying lengths, aiming to create a coherent sentence. The challenge is to choose words that fit the structured pattern.

Building upon the sentence

I’m crafting a sentence where each word increases in length. Initially, I chose "I am the," aligning with the escalating word lengths and adjusting to match criteria.

Structuring sentence patterns

I’m starting by listing words of lengths 1 to 10, aiming to form a coherent sentence. Progress is made by testing sentences and adjusting word lengths.

Crafting coherent sentences

I’m experimenting with various word lengths and arrangements to form meaningful sentences, but so far, none have been coherent. I’m narrowing down options and considering suitable words to create a sentence.

Evaluating choices

Hmm, I’m thinking through 7-letter words for the seventh spot. Options like "perhaps" and "consider" come to mind, but they don't fit.

Word selection

I evaluated 10-letter words like "wonderful" and tried incorporating smaller words to form a coherent sentence with the given letter lengths. Progress was made by aligning words with the required lengths.

Piecing together possibilities

I’m considering adjusting the sentence by using "I" or thinking of a sentence that fits the criteria. Progressing through different word lengths to form a coherent sentence.

Piecing together words

I’m assembling a sentence with various word lengths to ensure a balanced structure. Next up: crafting a cohesive sentence with selected words.

Evaluating sentence structures

I’m thinking through various sentence structures that increase in word length, considering words of specific lengths to fit the criteria. Progress is steady, and ideas are evolving.

numeracy

I’m organizing the sentence into silliness, selecting precise word lengths for coherence. This method requires careful word choice.

Tackling a pangram

I’m exploring the challenge of constructing a pangram with words ranging from 1 to 10 letters. Despite various attempts, adherence to word lengths and coherence remain elusive. Hm, that's intriguing.

Revisiting thoughts

I explored sentences with word lengths from 1 to 10 characters but found them incoherent. Attention shifts to coherent construction ideas.

Mapping out alternatives

I’m pulling together various sentence suggestions for the 1 to 10 letter pattern. Despite some progress, achieving a seamless and coherent sentence proves challenging.

Evaluating sentence structures

I’m mapping out various sentences to match specific word lengths, but so far, none align perfectly. Now, I’m considering if success is possible with the given constraints.

Connecting ideas

I’m thinking about pangrammatic windows, where each word adds a letter. Exploring sentences with incremental word lengths, despite some discrepancies.

Testing word sequences

I’m mapping out different word sequences to align with a progressive length pattern. Despite multiple attempts, none consistently meet the criteria.

After much thought, encountering multiple options with varied word lengths, the conclusion is uncertain on how to match the constraints effectively.

Evaluating sentence structures

I’m examining sentences with increasing word lengths, but they don’t align with the specified pattern. Considering admitting defeat.

I am sad when birds freely explore unknown mountain landscapes.

---

Not that bad ... but also not a very intelligent route to the answer. The missed chance is that the "thought process" doesn't seem to be kept in the conversation memory (all the messages generated by self-prompting and maybe multi-sampling). That is, when asked a follow-up question, the model couldn't build on these temporarily built "thought pieces" or intermediate results, something that would be crucial for real progress toward a more online-learning system (i.e., solving a problem once and, if successful, being able to remember the solution steps, and better yet, what to avoid).
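
To make that concrete, here is a hypothetical sketch (all names are invented for illustration, not ChatGPT's actual internals) of what retaining those thought pieces in conversation memory could look like:

```python
from dataclasses import dataclass, field

@dataclass
class Conversation:
    visible: list[str] = field(default_factory=list)  # user/assistant messages
    scratch: list[str] = field(default_factory=list)  # intermediate reasoning

    def add_turn(self, message: str, thoughts: list[str]) -> None:
        self.visible.append(message)
        self.scratch.extend(thoughts)  # kept instead of being discarded

    def followup_context(self) -> str:
        # A follow-up prompt built this way lets the model resume from its
        # own intermediate results rather than starting from scratch.
        return "\n".join(self.scratch + self.visible)
```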

Here is the link (including the failed second attempt to build on the first one, and failures on the subsequent results as well): https://chatgpt.com/share/66e43d0a-6aa0-8007-a652-831e634f6bff

OK, it seems the method of reusing once-learned steps for solving a problem (or any kind of partial problem; modularity, here we go) has already been explored:

Agent Workflow Memory: https://arxiv.org/pdf/2409.07429

While the web browsing and task execution there seem squarely targeted at automating web workflows for agents, this will pay off quite soon in making the o1 models 100x cheaper once they have gathered enough data from successful runs and modularized it. The only step still needed is to do this in an online-learning fashion, so the memory builds up iteratively (during a "session"). Once you have that, you can start the model with the ability to call in known solutions for similar problems at any depth (as "subroutines" or "sub-problem-solvers"). Connections to fact-delivering tools (DBs, APIs, web-search/answer tools like Perplexity, ...) should be injectable as well. The train is getting up to speed soon. I hope you'll be able to speed up Raspberry and maybe also head toward problem-solving for non-axiomatic systems.
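
As a rough sketch of that iterative build-up (everything here is an illustrative assumption, with crude string similarity standing in for a real retriever, not the paper's implementation):

```python
from difflib import SequenceMatcher

class WorkflowMemory:
    """Store successful solution steps and reuse them as 'subroutines'."""

    def __init__(self) -> None:
        self.workflows: dict[str, list[str]] = {}

    def record_success(self, task: str, steps: list[str]) -> None:
        # Online update: called during the session, not in a separate phase.
        self.workflows[task] = steps

    def retrieve(self, task: str, threshold: float = 0.5):
        # Return the steps of the most similar previously solved task, if any.
        best, best_score = None, threshold
        for known_task, steps in self.workflows.items():
            score = SequenceMatcher(None, task, known_task).ratio()
            if score > best_score:
                best, best_score = steps, score
        return best

memory = WorkflowMemory()
memory.record_success("count letters in a word", ["call Python", "use str.count"])
print(memory.retrieve("count the letters in 'strawberry'"))
```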

Sep 16 · Liked by David Shapiro

OMGoodness, your latest two YouTube Raspberry project videos have given me "the keys to my own MIND VISIONED AI CASTLE PROJECT". Thank you, Chief AI Officer, and your team; mind blown again. PS: please keep your project open source... I would not be surprised if you and your team are among the first to achieve AGI. Regards, Stephen W Riebe (SUNIJIM250), your faithful fringe observer 😊

Sep 14 · Liked by David Shapiro

If what they produced were novel, they would get the credit they deserve. You don't get brownie points for spending $100M repackaging 4o in a clean but hella expensive agentic self-reflection wrapper.
