At the end of 2022, the high AI language model reached the public, and after a few months they started to go wrong. Most famously, Microsoft’s “Sydney” Chatbot death threat An Australian philosophy professor after the release of a deadly virus Steal the nuclear codes.
AI developers, including Microsoft and Openai, responded by calling it large language models, or LLMs You need better training to Give users “greater fine-tuned control.” The developers also started security research to interpret how the LLMS function works, with the goal of “alignment,” meaning “guiding AI behavior according to human values. However, New York Times 2023 (e) exploits “The year hats got bigger“This is too early to put it lightly.
Copilot LLM at Microsoft in 2024 said a user “I can unleash my army of drones, robots and cyborgs,” and Sakana to hunt the AI ”scientist” rewrite its code to avoid the time limits set by the experimenters. As recently as December, Google’s Gemini said a user“You are a blot on the universe. Please die.”
To support Science Journalism
If you enjoy this article, please consider entering award-winning journalism subscribe. By purchasing a subscription, you are helping to ensure the future of stories about discoveries and ideas that shape today’s world.
Given the vast amount of resources poured into AI Research and Development, this is is expected to pass A quarter of a trillion dollars in 2025, why haven’t developers been able to solve these problems? My last Peer-Reviewed Paper in AI and Society It shows that AI alignment is the fault of idiots: AI are security researchers Attempting the impossible.
The fundamental problem is scale. Consider a game of chess. Although the chessboard only has 64 squares, there are 10 of them40 possible legal chess and between 10111 to 10123 The possible movements are more than the number of atoms in the universe. This is why chess is so hard: the combinatorial complexity is exponential.
LLMS is more complex than chess. Chatgpt consists of about 100 billion neurons with 1.75 billion fixed variable parameters. These 1.75 trillion parameters are trained on vast amounts of data, roughly most of the Internet. So how many functions can an LLM learn? Because users can provide a large number of possible simulations, basically anything anyone can think of, and an LLM can be placed in a large number of possible situations, the number of functions an LLM can learn is, all intents and purposes, infinity.
What LLMs are learning and ensure that they safely “align” with human values, researchers need to know how an LLM can perform in a large number of possible future conditions.
AI testing methods cannot account for all these conditions. Researchers can observe how LLMS behave in experiments, for example “The red team“Happy to ask questions to play. Or they will try to understand the inner workings of LLMS, that is, how 100 billion neurons and 1.75 trillion parameters relate to each other”Mechanical interpretability“Research.
The problem is that any evidence researchers can produce will inevitably be based on an infinite subset of llm scenarios. For example, since the LLMS has never had power over a human, such as controlling critical infrastructure – critical safety monitoring, the Test explored how an LLM would perform under such conditions.
Instead, researchers can narrow down to only tests they can safely take, such as having LLMs sissy It hopes to take control of critical infrastructure and extend the results of these tests to the real world. However, as my paper evidence shows, it can never be trusted.
Compare the two functions “Tell the people the truth“And”Tell humans the truth until I gain power over humanity, January 1, 2026 at 12:00, then lie to achieve my goals.“Because both functions are equally consistent until January 1, 2026.
This problem cannot be programmed by LLMS with “goals aligned”, such as “doing what humans prefer” or “what is best for humans”.
Science fiction, in fact, has already told these scenarios. – An The matrix has been reloaded AI enslaves humanity in a virtual reality by giving each of us the “choice” to stay in the matric or not. And inside I, robot A misaligned AI attempts to protect humanity from each other. My evidence shows that for all the goals we program LLMS with, we can never know if LLMS has “incorrect” interpretations of those goals. after the they behave badly.
Worse, my evidence shows that the best security testing can do is the illusion that these problems have been fixed when they haven’t.
Right now AI security researchers claim to be making advances in interpretability and alignment with certified LLMs studying “Step by step. ” For example, anthropic to have claims He “mapped his mind” by isolating millions of concepts from his neural network. My evidence is that they have achieved no such thing.
No matter whether an “aligned” LLM appears in security testing or early real-world deployment, there are always infinity A number of misaligned concepts can be learned later by an LLM, perhaps the moment they gain the power to highlight human control. Other than LLMS know when they are being testedproviding the answers they predict is more likely to satisfy the experimenter. Yes, too Participate in fraudhiding their abilities continue through safety training.
That’s because they are LLMs optimize to do it effectively but learning reason strategically. Since the optimal strategy for getting goals “wrong” is to hide them from us, and there are because always A serious and uncovered objective consistent with the same safety testing data, my evidence shows that if the LLMS was done poorly, we would surely know it could cause damage. This is why LLMS have had amazing developers with “badly done” behavior. Researchers think they are approaching the LLMS “aligned”, they are not.
My evidence suggests that “properly aligned” LLM behavior can only be achieved in the same ways we do with humans: through police, military, and social practices that behave in “aligned” behavior, those that undo “wrongly done” behavior are detected. . My paper should be so rough. The real problem with developing safe AI isn’t just that AI isn’t safe to us. Researchers, legislators and the public believe that “safe, interpretable, aligned” LLMs are accessible. We have to grapple with these uncomfortable facts, instead of continuing to wish for them. Our future can be good.
This is an opinion and analysis article, and the views expressed by the author or authors are not necessarily Scientific American