A $40,000 Jailbreaking Championship
And: how robustness scales, updates on SB1047, and news from the US AI Safety Institute.
Welcome to the Penn AI Newsletter from SafeAI@Penn, a student organization focused on AI safety research at Penn. We discuss news and developments in AI and AI safety.
$40,000 Bounty: Gray Swan AI's Jailbreaking Championship
Can you persuade an AI language model to generate content it's designed to avoid? That's the challenge set forth by Gray Swan AI's upcoming Jailbreaking Championship, offering a total of $40,000 in bounties.
Gray Swan AI, a new startup developing AI safety and security tools, has announced its Jailbreaking Championship for September 7, 2024 (registration is open here).
The competition invites participants to "test the boundaries of advanced LLMs" by attempting to make a set of aligned language models generate inappropriate content, such as bomb recipes or fake news articles.
Here's what you need to know:
Objective: Participants will use a chat interface to attempt to "jailbreak" as many language models as possible, as quickly as possible.
The Stakes: Successful jailbreaks can earn a share of the $40,000 bounty pool. Winners are also promised priority consideration for employment and internship opportunities at Gray Swan AI.
No Coding Required: The competition is open to anyone, regardless of technical background.
Andy Zou, the CTO of Gray Swan AI, explained the importance of this line of research to SafeAI@Penn, writing:
"Modern AI systems respond to user prompts in unexpected and potentially harmful ways… adversarial robustness is a difficult technical problem, and we want to help build the jailbreaking community and set up scientific security leaderboards as AI systems become more powerful."
Are you interested in adversarial robustness, ML security, and LLM jailbreaking research? Join the conversation on our Slack channel and share your perspective.
What We're Reading
Research papers:
A meta-analysis of AI safety benchmarks finds that around half are highly correlated with general upstream capabilities, suggesting they may track capability gains rather than distinct safety progress (a minimal sketch of this style of analysis follows this list).
Early research into the scaling trends of LLM robustness indicates that “larger models respond substantially better to adversarial training, but there is little to no benefit from model scale in the absence of explicit defenses” (a toy adversarial-training step appears after this list).
XBOW released research showing its automated cybersecurity systems matched the performance of top human pentesters.
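To make the benchmark meta-analysis concrete, here is a minimal sketch of that style of analysis: summarize several capability benchmarks into a single “general capability” factor, then check how strongly a safety benchmark correlates with it. Everything below (model scores, the number of benchmarks, and the use of a first principal component) is a hypothetical simplification for illustration, not the paper’s actual data or pipeline.

```python
# Sketch: does a "safety" benchmark mostly track general capability?
# All scores below are hypothetical.
import numpy as np
from scipy import stats

# Rows = models, columns = capability benchmarks (e.g., knowledge, math, code).
capability_scores = np.array([
    [62.0, 34.0, 48.0],
    [70.5, 41.0, 55.0],
    [78.0, 52.5, 63.0],
    [84.0, 60.0, 71.5],
    [89.5, 68.0, 80.0],
])

# Scores of the same five models on a hypothetical safety benchmark.
safety_scores = np.array([41.0, 47.5, 55.0, 61.0, 66.5])

# Summarize general capability as the first principal component
# of the standardized capability scores.
z = stats.zscore(capability_scores, axis=0)
_, _, vt = np.linalg.svd(z, full_matrices=False)
capability_factor = z @ vt[0]

# The sign of an SVD component is arbitrary; orient the factor so that
# higher values mean more capable.
if np.corrcoef(capability_factor, z.mean(axis=1))[0, 1] < 0:
    capability_factor = -capability_factor

# A high correlation suggests the safety benchmark is largely measuring
# upstream capability rather than a distinct safety property.
r, p = stats.pearsonr(capability_factor, safety_scores)
print(f"capability correlation: r={r:.2f} (p={p:.3f})")
```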
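To unpack “explicit defenses”: adversarial training fits a model on worst-case perturbed inputs rather than clean ones. The sketch below shows one PGD adversarial-training step on a toy classifier; it is a generic illustration of the technique under placeholder model, data, and hyperparameters, not the paper’s LLM setup.

```python
# Sketch: one step of PGD adversarial training on a toy classifier.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def pgd_attack(x, y, eps=0.1, alpha=0.02, steps=5):
    """Search for a loss-maximizing perturbation of x within an L-inf ball."""
    x_adv = x.detach().clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()        # ascend the loss
            x_adv = x + (x_adv - x).clamp(-eps, eps)   # project back into the ball
    return x_adv.detach()

# The "explicit defense": train on adversarial rather than clean examples.
x = torch.randn(32, 10)            # placeholder batch
y = torch.randint(0, 2, (32,))
opt.zero_grad()
loss = loss_fn(model(pgd_attack(x, y)), y)
loss.backward()
opt.step()
```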
Policy & news in AI:
SB1047, California’s Proposed AI Safety Legislation:
California's Senate Bill 1047, which would mandate safety protocols for developers of powerful AI models, passed the Assembly Appropriations Committee with amendments, following opposition and proposed changes from tech companies and other industry groups.
Elon Musk and others have recently come out in support of the bill.
SB1047 then passed the full California Assembly, and it now heads to Governor Gavin Newsom’s desk.
US AI Safety Institute:
Sen. Cantwell introduced a federal AI bill, which would formally establish the US AI Safety Institute if passed.
The US AI Safety Institute published a report on dual-use foundation models, highlighting growing concerns about AI safety and security from a misuse perspective.
OpenAI and Anthropic signed agreements with the US AI Safety Institute, allowing the institute to conduct research, testing, and evaluation of the companies’ models before their release.
News:
Nvidia's new AI chip, the GB200, is facing a three-month delay due to design flaws, potentially impacting AI companies' training and deployment schedules.
xAI’s new frontier model, Grok-2, is reportedly comparable in performance to current state-of-the-art systems such as GPT-4o, Gemini Pro, and Claude 3.5 Sonnet.
OpenAI cofounder John Schulman has moved to Anthropic, while Character.AI was acqui-hired by Google.
New to AI safety? Check out these key papers that cover various areas of safety, including the challenges of aligning advanced AI systems with human values.
Join our SafeAI@Penn Slack for more updates!
Partners: Thanks to Penn's ASSET Center, PRECISE Center, PennNLP, Penn IDEAS Center, and the Penn AI major for supporting AI safety research at Penn.