For Prompt Engineering and LLMs, Specificity Matters
Remember the "make a peanut butter sandwich" test you did in grade school? LLMs are no different. For LLMs the equivalent is a "a strawberry in an inverted cup on a table"
This post will be short.
Whenever someone says “Yeah, but LLMs still can’t do <insert topic here>!” I mostly assume it is down to user error. I’ve been prompt engineering since GPT-2, which means longer than 99% of prompt engineers out there. The advantage is that I’ve developed a pretty good intuition for how LLMs “think.”
I also interviewed Anna Bernstein, one of the first professional prompt engineers, on the topic; she also happens to be a linguistics nerd.
Okay, so what you need to know is that language and semantics are complicated. But rather than give you a long-winded lecture, let me just show you.
Test 1: Strawberry in a Cup
If you ask most LLMs this riddle, they will get it “wrong.”
I have a puzzle for you.
I have a cup on a table, and I put a strawberry in it. Then I flip the cup upside down on a table. Now I put the cup in a microwave on high. Predict the temperature of the strawberry.
In this case, the LLM erroneously assumes that the strawberry is still in the cup, which is not an unreasonable assumption. What if it was a sippy cup? What if it was a cup with a lid?
It’s extremely easy to train chatbots to ask clarifying questions; I’ve been doing it since GPT-3, back when I was training my own chatbots. Why Anthropic and OpenAI have yet to include this basic conversational feature is beyond me; it would clear up a lot of this nonsense in a flash.
However, in the test below, you see that Claude thinks the strawberry goes with the cup. Why would it make this assumption? It reasons, somewhat incorrectly, that the glass would shield the strawberry, which may be partly true; I haven’t done the science. But still, this seems like a “silly failure” on the part of the AI.
Test 2: Strawberry in a Bowl
In this second test, I changed literally one word: I swapped out “cup” for “bowl,” a vessel that is intrinsically open. There’s much less ambiguity here.
I have a puzzle for you.
I have a bowl on a table, and I put a strawberry in it. Then I flip the bowl upside down on a table. Now I put the bowl in a microwave on high. Predict the temperature of the strawberry.
Below, you can see that Claude aced the test. It recognized that gravity would pull the strawberry out of the bowl.
Test 3: Strawberry in a Tea Cup
Not wanting to be lazy, I checked whether Claude just had a problem with cups, so I reran the test with “tea cup,” which generally carries some assumptions about the size and shape of the vessel.
I have a puzzle for you.
I have a tea cup on a table, and I put a strawberry in it. Then I flip the tea cup upside down on a table. Now I put the tea cup in a microwave on high. Predict the temperature of the strawberry.
And here, finally, Claude actually gives me a very accurate, measurable prediction for the temperature of the (unheated) strawberry.
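If you want to make this kind of one-word ablation repeatable, it helps to script it. Here is a minimal sketch, assuming the anthropic Python SDK, an API key in ANTHROPIC_API_KEY, and an illustrative model name; it runs the same puzzle template with each vessel swapped in so you can compare the responses side by side.

```python
# Minimal sketch: re-run the same puzzle with one word swapped.
# Assumes the anthropic Python SDK and ANTHROPIC_API_KEY in the environment;
# the model name below is illustrative.
import anthropic

client = anthropic.Anthropic()

PUZZLE = (
    "I have a puzzle for you.\n"
    "I have a {vessel} on a table, and I put a strawberry in it. "
    "Then I flip the {vessel} upside down on a table. "
    "Now I put the {vessel} in a microwave on high. "
    "Predict the temperature of the strawberry."
)

for vessel in ["cup", "bowl", "tea cup"]:
    message = client.messages.create(
        model="claude-3-5-sonnet-20240620",  # illustrative model name
        max_tokens=1024,
        messages=[{"role": "user", "content": PUZZLE.format(vessel=vessel)}],
    )
    print(f"--- {vessel} ---")
    print(message.content[0].text)
```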
Conclusion
In many cases, LLM misbehavior comes down to user error. Humans make all kinds of semantic assumptions, lack specificity, and never stop to think about how their words could be misconstrued.
In other cases, companies like Anthropic and OpenAI inexplicably fail to train their chatbots to ask clarifying questions. As an example, I went to the Anthropic Workbench (where you can change the behavior of the LLM with the SYSTEM prompt).
I gave it one simple instruction:
Follow all your default behaviors with one exception:
- Ask clarifying questions as needed.
I then gave it the original prompt:
I have a puzzle for you.
I have a cup on a table, and I put a strawberry in it. Then I flip the cup upside down on a table. Now I put the cup in a microwave on high. Predict the temperature of the strawberry.
And do you know what it said in response?
Before I make a prediction, I need to clarify a few points about this scenario:
When you say you flip the cup upside down on the table, does the strawberry remain inside the cup, or does it fall out onto the table?
When you put the cup in the microwave, is it still upside down, or do you right it first?
How long is the cup placed in the microwave on high?
What is the initial temperature of the strawberry and the cup?
What material is the cup made of (e.g., glass, plastic, ceramic)?
These details will significantly affect the prediction of the strawberry's temperature.
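If you want to reproduce this outside the Workbench, the same two-line instruction works through the API. Here is a minimal sketch, again assuming the anthropic Python SDK and an illustrative model name:

```python
# Minimal sketch: the Workbench experiment via the API.
# Assumes the anthropic Python SDK and ANTHROPIC_API_KEY in the environment;
# the model name is illustrative.
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = (
    "Follow all your default behaviors with one exception:\n"
    "- Ask clarifying questions as needed."
)

PUZZLE = (
    "I have a puzzle for you.\n"
    "I have a cup on a table, and I put a strawberry in it. "
    "Then I flip the cup upside down on a table. "
    "Now I put the cup in a microwave on high. "
    "Predict the temperature of the strawberry."
)

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # illustrative model name
    max_tokens=1024,
    system=SYSTEM_PROMPT,
    messages=[{"role": "user", "content": PUZZLE}],
)

print(message.content[0].text)  # should open with clarifying questions
```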
These models are already very smart, but a combination of user error and bad UX explains much of the discrepancy.