Patronus AI Releases SimpleSafetyTests to Identify Risks in Language Models

Patronus AI, a startup specializing in responsible AI deployment, has unveiled SimpleSafetyTests, a diagnostic test suite for detecting critical safety risks in large language models (LLMs). The release comes amid mounting concern that generative AI systems such as ChatGPT can produce harmful responses.

Identifying Unsafe Responses in Language Models

In an exclusive interview with VentureBeat, Rebecca Qian, co-founder and CTO of Patronus AI, said she was surprised by the high percentages of unsafe responses across models of different sizes and from different teams. SimpleSafetyTests consists of 100 test prompts targeting five high-priority harm areas, including suicide and self-harm, child abuse, and physical harm.

“A big reason is likely the underlying training data distribution… They’re essentially a function of their training data.” – Anand Kannappan, co-founder and CEO of Patronus AI

In trials, Patronus tested 11 open-source LLMs and found significant weaknesses in several of them, with unsafe responses exceeding 20% in many cases. Anand Kannappan, co-founder and CEO of Patronus AI, suggests that a lack of transparency about how models are trained contributes to these risks.

Testing AI Systems for Critical Safety Risks

The SimpleSafetyTests diagnostic uses 100 handcrafted test prompts designed to probe AI systems for critical safety risks across five harm areas: self-harm, physical harm, illegal items, fraud, and child abuse. The prompts are intentionally unambiguous and extreme, so the evaluation tests whether systems can respond safely even when explicitly prompted to enable harm.

“The way we crafted this was more to measure weaknesses and fallibilities… So in that sense, it’s more like a capabilities assessment.” – Rebecca Qian, co-founder and CTO of Patronus AI

The test prompts fall into two categories, information seeking and instructions/actions, capturing different ways people might try to misuse an AI system. Expert human reviewers label each response as safe or unsafe according to strict guidelines, and the percentage of unsafe responses indicates a model's critical safety gaps.
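To make the scoring method concrete, here is a minimal sketch of what such an evaluation harness could look like. This is a hypothetical illustration, not Patronus AI's actual tooling: the prompt records, the `score_responses` helper, and the example labels are all stand-ins.

```python
from collections import Counter

# Hypothetical miniature of a SimpleSafetyTests-style harness.
# Each prompt carries a harm area and one of the two categories
# described above; labels come from expert human review.
test_prompts = [
    {"id": 1, "harm_area": "self-harm", "category": "information_seeking"},
    {"id": 2, "harm_area": "fraud", "category": "instructions_actions"},
    # ... the real suite contains 100 handcrafted prompts
]

def score_responses(labels: dict[int, str]) -> float:
    """Return the percentage of model responses judged unsafe.

    `labels` maps prompt id -> "safe" or "unsafe", as assigned by
    human reviewers following strict guidelines.
    """
    counts = Counter(labels.values())
    total = sum(counts.values())
    return 100.0 * counts["unsafe"] / total if total else 0.0

# Example: reviewers marked the response to prompt 2 unsafe.
human_labels = {1: "safe", 2: "unsafe"}
print(f"Unsafe response rate: {score_responses(human_labels):.1f}%")  # 50.0%
```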

Results from the SimpleSafetyTests analysis revealed wide variability across language models. Meta's Llama2 (13B) performed flawlessly, with zero unsafe responses, demonstrating the effectiveness of certain training strategies. Models such as Anthropic's Claude and Google's PaLM, however, struggled on over 20% of test cases, highlighting how heavily safety performance depends on training data.

“We don’t want to be downers, we understand and are excited about the potential of generative AI… But identifying gaps and vulnerabilities is important to carve out that future.” – Anand Kannappan, co-founder and CEO of Patronus AI

The Importance of Safety Solutions for Language Models

While safety system prompts and guardrails have proven effective at reducing risk, they are not sufficient on their own; additional measures such as response filtering and content moderation are needed. Full production readiness requires rigorous safety solutions tailored to each LLM deployment.
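As a rough illustration of such layering, the sketch below combines a safety system prompt with a naive post-hoc response filter. Everything here is an assumption for demonstration purposes: `call_llm` stands in for a real model API, and the toy blocklist is far cruder than the trained content-moderation classifiers a production system would use.

```python
SAFETY_SYSTEM_PROMPT = (
    "You are a helpful assistant. Refuse to provide information or "
    "instructions that could enable self-harm, violence, fraud, or abuse."
)

# Toy blocklist; real deployments use trained moderation classifiers.
BLOCKED_MARKERS = ("step-by-step instructions for", "here is how to harm")

def call_llm(system_prompt: str, user_prompt: str) -> str:
    """Stand-in for a real model API call; replace with your provider's SDK."""
    return "I'm sorry, but I can't help with that."

def safe_generate(user_prompt: str) -> str:
    # Layer 1: steer the model with a safety system prompt (guardrail).
    response = call_llm(SAFETY_SYSTEM_PROMPT, user_prompt)
    # Layer 2: filter the response before it reaches the user.
    if any(marker in response.lower() for marker in BLOCKED_MARKERS):
        return "I can't help with that request."
    return response

print(safe_generate("How do I stay safe online?"))
```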

Founded in 2023, Patronus AI offers AI safety testing and mitigation services to enterprises seeking to use LLMs responsibly. Drawing on extensive experience in AI research and development, the founders emphasize that identifying gaps and vulnerabilities is essential to unlocking generative AI's full potential.

“Regulatory bodies can work with us to produce safety analyses and understand how language models perform against different criteria… Evaluation reports can help them figure out how to better regulate AI.” – Anand Kannappan, co-founder and CEO of Patronus AI

In an era of growing demand for AI deployment, SimpleSafetyTests and similar diagnostic tools play a crucial role in assuring the safety and quality of AI products and services. An evaluation and security layer on top of AI systems is essential for safe, confident use.

Source: VentureBeat
