
OpenAI announced a breakthrough achievement for its new o3 AI model
Rock Tennis / Alamy
OpenAI’s new o3 artificial intelligence model has scored highly on a famous AI reasoning test called the ARC Challenge, prompting some AI fans to speculate that o3 has achieved artificial general intelligence (AGI). But while ARC Challenge organizers hailed o3’s achievement as a major milestone, they also cautioned that it has not won the competition’s top prize, and that it is only one step on the road towards AGI, the term for a hypothetical future AI with human-like intelligence.
The o3 model is the latest in a line of AI releases that follow on from the large language models powering ChatGPT. “This is a surprising and important step-function increase in AI capabilities, showing novel task adaptation ability never seen before in the GPT-family models,” said Francois Chollet, a Google engineer and the main creator of the ARC Challenge, in a blog post.
What did OpenAI’s o3 model actually do?
Chollet designed the Abstraction and Reasoning Corpus (ARC) Challenge in 2019 to test how well AIs can find the correct pattern connecting pairs of colored grids. These visual puzzles are intended to demand a form of general intelligence, requiring AIs to show basic reasoning abilities. But if enough computing power is thrown at the puzzles, even a mindless program can solve them by brute force. To prevent this, the competition requires official score submissions to stay within certain computing-power limits.
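To make the grid-puzzle format concrete, here is a toy sketch in Python. This is not an actual ARC task or the ARC data format – the grids, the hidden rule and the solver are invented for illustration. It shows the general shape of the problem (infer a transformation from a few input/output pairs, then apply it to a new input) and why an exhaustive search over simple rules counts as brute force.

```python
# Illustrative sketch only: a toy ARC-style task, not a real ARC puzzle.
# Grids are small 2-D arrays of color indices. A solver must infer the
# transformation rule from a few demonstration pairs and apply it to a
# test input. Here the hidden rule is a single color swap (1 -> 2).

def apply_rule(grid, mapping):
    """Apply a color-remapping rule to every cell of a grid."""
    return [[mapping.get(cell, cell) for cell in row] for row in grid]

# Two demonstration pairs that share the same hidden rule.
train_pairs = [
    ([[0, 1], [1, 0]], [[0, 2], [2, 0]]),
    ([[1, 1], [0, 1]], [[2, 2], [0, 2]]),
]

def infer_color_swap(pairs):
    """Brute-force search: try every single-color swap until one
    reproduces all demonstration outputs."""
    for src in range(10):
        for dst in range(10):
            mapping = {src: dst}
            if all(apply_rule(x, mapping) == y for x, y in pairs):
                return mapping
    return None

rule = infer_color_swap(train_pairs)
print(rule)                                # {1: 2}
print(apply_rule([[1, 0], [0, 1]], rule))  # [[2, 0], [0, 2]]
```

Real ARC tasks involve far richer transformations – symmetry, counting, object movement – which is why exhaustively enumerating rules becomes expensive and why the competition caps computing power.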
OpenAI’s newly announced o3 model – due for release in early 2025 – achieved an official score of 75.7 percent on the “semi-private” ARC Challenge test set, which is used to rank competitors on a public leaderboard. The computational cost of that achievement was approximately $20 per visual puzzle task, keeping it within the competition’s limit of less than $10,000 in total. However, the tougher “private” test set used to determine grand prize winners has an even stricter computing-power limit, equivalent to spending about 10 cents per task – a limit that o3 did not meet.
The o3 model also achieved an unofficial score of 87.5 percent by applying 172 times more computing power than it used for the official score. For comparison, the typical human score is 84 percent, and a score of 85 percent is enough to win the ARC Challenge’s $600,000 grand prize – provided the model also keeps its computational costs within the required limits.
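The cost figures above can be sanity-checked with some rough arithmetic. A minimal sketch follows; the 100-task size of the semi-private test set is an assumption made for illustration, while the per-task cost, total budget and 172x multiplier are the figures reported in this article.

```python
# Rough cost arithmetic for the reported figures. The 100-task set size
# is an assumption for illustration; other numbers are from the article.

num_tasks = 100                  # assumed size of the semi-private test set
official_cost_per_task = 20.0    # ~$20 per task for the official 75.7% score
total_budget = 10_000.0          # competition limit for official submissions
private_limit_per_task = 0.10    # ~10 cents per task on the private test

official_total = num_tasks * official_cost_per_task
print(official_total, official_total < total_budget)       # 2000.0 True

# The official run still far exceeds the stricter private-test limit.
print(official_cost_per_task > private_limit_per_task)     # True

# The unofficial 87.5% score used ~172x more compute than the official run.
unofficial_per_task = official_cost_per_task * 172
print(unofficial_per_task)                                 # 3440.0
```

Under these assumptions the official run cost around $2000 in total, comfortably inside the $10,000 budget, while the unofficial run works out to several thousand dollars per task – consistent with the article’s description.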
But for that unofficial score, o3’s cost ballooned to thousands of dollars spent solving each task – and OpenAI asked the challenge organizers not to publish the exact computing costs.
Does this o3 achievement show that AGI has been reached?
No – the ARC Challenge organizers have specifically said that they do not consider beating this benchmark to be an indication of having achieved AGI.
The o3 model also failed to solve more than 100 of the visual puzzle tasks, even with the large amount of computing power OpenAI applied for the unofficial score, said Mike Knoop, an ARC Challenge organizer at software company Zapier, in a post on X.
In a post on the social network Bluesky, Melanie Mitchell at the Santa Fe Institute in New Mexico said of o3’s progress on the ARC benchmark: “I think solving these tasks by brute-force compute defeats the original purpose.”
“While the new model is very impressive and represents a big milestone on the way towards AGI, I don’t believe this is AGI – there’s still a fair number of very easy [ARC Challenge] tasks that o3 can’t solve,” said Chollet in another post on X.
However, Chollet described how we might know when human-level intelligence has been demonstrated by some form of AGI. “You’ll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible,” he said in the blog post.
Thomas Dietterich at Oregon State University suggests another way to recognise AGI. “Those architectures claim to include all of the functional components required for human cognition,” he says. “By this measure, commercial AI systems are missing episodic memory, planning, logical reasoning and, most importantly, meta-cognition.”
So what does a high o3 score really mean?
The o3 model’s high score comes as the tech industry and AI researchers have been reckoning with a slower pace of progress in the latest AI models during 2024, compared with the explosive developments of early 2023.
Although it has not won the ARC Challenge, o3’s high score indicates that AI models may surpass the competition’s benchmark in the near future. Beyond o3’s unofficial high score, Chollet says many official low-compute submissions have already scored above 81 percent on the private evaluation test set.
Dietterich also describes the result as a “very impressive leap in performance”. However, he notes that without knowing more about how OpenAI’s o1 and o3 models work, it is impossible to assess how impressive the high score really is. For example, if o3 was able to study the ARC problems in advance, that would make its achievement easier. “We will need to await an open-source replication to understand the full significance of this,” says Dietterich.
The ARC Challenge organizers are already looking to launch a second, more difficult set of benchmark tests in 2025. They will also keep the ARC Prize 2025 competition running until someone achieves the grand prize and open-sources their solution.