6 Comments

Nice piece! Totally agree with the gist of the post and the conversation, and I have made many similar observations. However, it does not seem to be quite as easy to "solve" this, to make it reliable, as part of your post seems to suggest. I tried to reproduce your result by using your "ask critical questions" piece at the start of a normal chat prompt. Is that very different from using it in the workbench? Running it 5 times, I got very mixed results:

Once, it asked explicitly whether the strawberry was still in the cup when I put it into the microwave, and it straightforwardly refused to figure that out itself when I answered only its other questions.

Once, it asked whether the cup was still upside down when put in the microwave - but then it did not draw the correct conclusion from my answer.

Three times, its questions did cover material, initial temperature, microwave time, etc., but nothing about the cup. All three times, it assumed the strawberry was under the upturned cup in the microwave, thus shielded from much of the microwave energy, and thus heated only a little.

My conclusion: There still seems to be a lot of stochasticity in there. After my experiment, I noticed I had asked it a "question" where you had spoken of a "puzzle". That alone may have made a difference. You asked the questions days, if not weeks, before me. That may have made a difference. Not sure we have transparency about all internal updates. And so on.

I then got it to reliably (5 out of 5!) get it right by not letting it ask clarifying questions at all, but rather telling it either "attention to detail is required to answer the question" or "This is a trick question". It did NOT get it at all with "Think it through step by step".
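In case anyone wants to rerun this more systematically than pasting into the chat window, here is a minimal sketch of that kind of repeated-trial comparison against the API, using the Anthropic Python SDK. The model name, the prefix wordings, and the puzzle text are placeholders of mine, not the exact phrasing from the post or from my own trials.

```python
# Repeatability sketch: run the same puzzle N times under different prompt prefixes
# and eyeball how often the model notices the cup detail. Assumes the Anthropic
# Python SDK and an ANTHROPIC_API_KEY in the environment.
import anthropic

client = anthropic.Anthropic()

# Placeholder wording; the original post's exact puzzle text may differ.
PUZZLE = (
    "I put a strawberry in a cup, turn the cup upside down on the table, "
    "then put the cup in the microwave for one minute. What happens to the strawberry?"
)

# Prefixes roughly matching the ones discussed above (paraphrased).
PREFIXES = {
    "clarifying questions": "Before answering, ask me any critical clarifying questions.",
    "attention to detail": "Attention to detail is required to answer the question.",
    "trick question": "This is a trick question.",
    "step by step": "Think it through step by step.",
}

TRIALS = 5

for label, prefix in PREFIXES.items():
    for trial in range(1, TRIALS + 1):
        response = client.messages.create(
            model="claude-3-5-sonnet-20240620",  # assumed model; substitute whichever you are testing
            max_tokens=512,
            messages=[{"role": "user", "content": f"{prefix}\n\n{PUZZLE}"}],
        )
        print(f"--- {label}, trial {trial} ---")
        print(response.content[0].text, "\n")
```

Five transcripts per prefix is crude, and you still have to judge each answer by hand, but it is enough to see the run-to-run variance I describe above.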

So, I think the observation is once again confirmed that prompts are very brittle, maybe even more sensitive than your cup/teacup difference. But something may be needed beyond, or different from, attentive users and a better UX?

Why is this easy for humans? Maybe because, although they have the same uncertainty about the lid on the cup, they play out both scenarios within milliseconds and conclude that if the cup had a lid, it would be a Rube Goldberg kind of question: it could have been asked much more simply with the same result. Thus, the human discounts the lid option as nonsensical. I did not get Claude to do that by any prompting. It should be possible. But then that prompt would likely also be brittle, and its effectiveness subject to minute changes?

Maybe that is why they have not implemented the clarifying questions. They do not *reliably* help, but they very *reliably* prolong the process of getting an answer?

Sep 25 · Liked by David Shapiro

Indeed! For months I’ve been adding “before answering this query, ask me all pertinent questions to gain clarity and understanding”. It asks as needed, I answer, and then I get the best responses.

Sep 25 · Liked by David Shapiro

This, for me, is the clearest illustration of the lack of appreciation of what's happening.

People underestimate the assumed semantic and cultural context in their communication and then conclude the models can't provide what they want.

My wife's high school is just starting to introduce and discuss ChatGPT usage with the staff, and they're already assuming that it's not much use because their prompts get sub-optimal results that require work to fix.

When I started talking about prompt engineering, system prompts, and the strengths of different models, I was accused of condescension and told that the IT department (or whoever was involved) would surely already know about all these things.

This is why there will be a lot of surprises coming up for people in the near future.

Sep 25 · Liked by David Shapiro

Nice summary! I watched this earlier but didn’t post it because the rest here are well aware: https://youtu.be/E0Hmnixke2g?si=M3PHT3RfDTdq3QhV

Sep 25 · Liked by David Shapiro

Great post! I truly appreciate your specificity here, and you make great points!

Regarding “theory of mind,” I feel like it’s extremely important for this post (and others like it) to exist on the internet (where AGI can access them during the transition to ASI), so that we can all have a deeper understanding of reasoning and logic.


Reminds me of my university course on Ethnomethodology… Humans rely on incredible levels of assumed, implied meaning when communicating.
