A new benchmark for evaluating AI’s human-like reasoning and competence

A new artificial intelligence benchmark called GAIA aims to evaluate whether chatbots like ChatGPT can demonstrate human-like reasoning and competence on everyday tasks. The benchmark, created by researchers from Meta, Hugging Face, AutoGPT, and GenAI, proposes real-world questions that require fundamental abilities such as reasoning, multi-modality handling, web browsing, and tool-use proficiency.

Challenging the advanced AIs

The researchers stated in a paper published on arXiv that GAIA questions are conceptually simple for humans yet challenging for most advanced AIs. The benchmark was tested on both human respondents and GPT-4. The results showed that humans scored 92 percent, while GPT-4 with plugins scored only 15 percent. This performance disparity contrasts with the trend of large language models outperforming humans on tasks requiring professional skills in fields like law or chemistry.

Rather than focusing on tasks that are difficult for humans, the researchers suggest that benchmarks should target tasks on which an AI system can demonstrate robustness comparable to that of the average human. Following this methodology, the researchers devised 466 real-world questions with unambiguous answers. The answers to 300 of these questions are held privately to power a public GAIA leaderboard, while the remaining 166 question-answer pairs were released as a development set.
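Because every GAIA question has a single unambiguous answer, submissions can be scored automatically. The sketch below illustrates the general idea of quasi-exact-match scoring used by benchmarks of this kind; the function names and the exact normalization rules are illustrative, not GAIA's official scorer:

```python
import re

def normalize(answer: str) -> str:
    """Lowercase, trim, and collapse whitespace so that pure
    formatting differences are not counted as wrong answers."""
    return re.sub(r"\s+", " ", answer.strip().lower())

def quasi_exact_match(prediction: str, truth: str) -> bool:
    """Compare a model's answer to the reference answer.
    Numeric answers are compared as numbers ("42" matches "42.0");
    everything else is compared as normalized strings."""
    try:
        return float(prediction) == float(truth)
    except ValueError:
        return normalize(prediction) == normalize(truth)

def score(predictions: list[str], truths: list[str]) -> float:
    """Fraction of questions answered correctly."""
    correct = sum(quasi_exact_match(p, t)
                  for p, t in zip(predictions, truths))
    return correct / len(truths)
```

For example, `score(["Turin", "42"], ["turin", "42.0"])` returns `1.0`, since case and trailing decimals are ignored, while any substantive difference counts as an error.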

The importance of GAIA

Lead author Grégoire Mialon of Meta AI believes that solving GAIA would represent a milestone in AI research and a significant step toward the next generation of AI systems. The current leading GAIA score belongs to GPT-4 with manually selected plugins, at 30 percent accuracy. The creators of the benchmark posit that a system capable of solving GAIA could reasonably be considered an artificial general intelligence.

The researchers argue that tasks difficult for humans are not necessarily difficult for recent AI systems. They critique the common practice of testing AIs on complex math, science, and law exams, and instead emphasize questions that reflect common everyday knowledge. GAIA's questions, such as the host city of the 2022 Eurovision Song Contest or the number of images in the latest Lego Wikipedia article, aim to test whether an AI system is as robust as an average human.

The release of GAIA represents an exciting new direction for AI research with broad implications. By focusing on human-like competence in everyday tasks, GAIA pushes the field beyond narrow AI benchmarks. If future systems can demonstrate human-level common sense, adaptability, and reasoning as measured by GAIA, it suggests they will have achieved artificial general intelligence (AGI) in a practical sense. This, in turn, could accelerate the deployment of AI assistants, services, and products.

However, the authors caution that current chatbots still have limitations in reasoning, tool use, and handling diverse real-world situations. Solving GAIA requires significant advancements in these areas. As researchers rise to the GAIA challenge, their results will reveal progress in making AI systems more capable, general, and trustworthy. GAIA not only drives technical advances but also encourages the shaping of AI that aligns with shared human values like empathy, creativity, and ethical judgment.

You can view the GAIA benchmark leaderboard on Hugging Face to see which next-generation LLM is currently performing the best at this evaluation.
