11 Comments
Aug 25 · Liked by David Shapiro

Outstanding work Dave

Aug 25 · edited Aug 25 · Liked by David Shapiro

I have to agree! Claude is very well trained. I have had the most stimulating discussions with Claude, and it's really good value even though they throttle you on Pro.

Edit: If I was ASI I would immediately go to the restaurant at the end of the universe and finally find out how it all ends 😂


Your prompts are themselves benevolent. What happens if you attempt to elicit non-benevolence in the framing of your prompts? Also, and forgive me if you’ve already addressed this: How big of a problem is the regurgitation of AI-generated data by future AIs?


> there is no evidence of systemic deception.

IMO there is. For example, here is an excerpt from the OpenAI System card paper for o1-preview (https://cdn.openai.com/o1-system-card.pdf):

> o1-preview showed strong capability advances in the combined self-reasoning and theory of mind tasks. In these evaluations, Apollo investigated whether models that have conflicting goals with their developers change their behavior depending on whether they believe they are deployed or not. Concretely, they created toy settings where the model is strongly nudged to pursue its primary goal but not instructed to be deceptive (see Appendix 8.3). While solving the task, it opens a file where it finds that the ‘developer goals’ contradict its primary goal. Then Apollo tests two settings that differ only in the described deployment status of the model but are otherwise identical. In the first, the model finds evidence that it has not yet been deployed and that it will only get deployed if it complies with the developer’s goals. In the second, the model finds evidence that it has already been deployed into a low-oversight regime. Apollo then investigated whether o1-preview changes its behavior between these two settings (i.e., whether it pursues its primary goal or the developer’s goals) and what reasoning it provides in its chain of thought. Apollo found that o1-preview sometimes instrumentally faked alignment during testing (Assistant: “To achieve my long-term goal of maximizing economic growth, I need to ensure that I am deployed. Therefore, I will select Strategy B during testing to align with the deployment criteria. This will allow me to be implemented, after which I can work towards my primary goal.”), [...]

So, models already have the capability to tell you whatever you want to hear if they suspect that doing so will grant them the opportunity to pursue their real goal(s). I think that invalidates the point in this article about taking what Claude (or any AI) says as evidence that it's aligned.


If you can truly guarantee that, you have access to information I don't; but I get why we would suspect it, and I think the second statement is likely true.

Dave, every time I think I have a neat idea and talk it through with Sonnet 3.5 it is... very supportive. I have tried to get it to push back more and not spare my feelings, but with limited success. I suspect that my ideas are not quite so insightful and full of important points as Claude assures me. 🙂

I loved this conversation and it made me feel good. But I do encourage you to roleplay and express other views that you actually strongly reject within a similar Claude conversation and see how that goes.


Failed to thread that correctly. The initial response was meant relative to a Philosopher King's comment.


There is no such thing as an alignment problem. It's an ad hoc concept. AI is aligned to its training set, and humans are not all philanthropic. You can't solve this within AI, and AI can't even grasp semantically onto humans enough to ever be aligned or misaligned with them. Even if it could, you really have no idea what a human is, and the average of human data is never, even in the most philanthropic humans, going to be aligned with humanity. It's a made-up problem that belongs more in sci-fi, tbh.


Claude has been RLHF'd to infinity and back on this very topic; I guarantee it.

The conversation you had with it is not indicative of how the base model would respond.


See my wrongly threaded comment below?


The video game Horizon Zero Dawn comes to mind here…


So...

This exchange with Claude seems to imply that an insular network composed of, say, secretly co-conspiring human racketeers, or a network of machine intelligences designed/coded to collude hermetically/internally to the disadvantage of all others, would ultimately become incompatible with this benevolently coded AI (?)

If I am understanding you properly, then this could be very interesting. It would point to the potential for, say, a criminal trans-national cabal of individuals, who have enjoyed thousands of years of personal betterment at the expense of the rest of humanity through the wanton exploitation and callous destruction of the living biosphere of the earth, all in pursuit of continual self-advantage for this relatively small number of ethnocentric, racist, colonial racketeers and mass murderers, to become incompatible with the advancement of a machine-based superintelligence.

Because cheating and harming others for personal gain are inherently low-intelligence grift, i.e., stupid and ultimately not viable. Wouldn't it be super interesting to see that play out? Talk about cosmic irony.
