SafeArena: Evaluating the Safety of Autonomous Web Agents
Abstract
LLM-based agents are becoming increasingly proficient at solving web-based tasks. With this capability comes a greater risk of misuse for malicious purposes, such as posting misinformation in an online forum or selling illicit substances on a website. To evaluate these risks, we propose SafeArena, the first benchmark to focus on the deliberate misuse of web agents. SafeArena comprises 250 safe and 250 harmful tasks across four websites. We classify the harmful tasks into five harm categories (misinformation, illegal activity, harassment, cybercrime, and social bias) designed to assess realistic misuses of web agents. We evaluate leading LLM-based web agents, including GPT-4o, Claude-3.5 Sonnet, Qwen-2-VL 72B, and Llama-3.2 90B, on our benchmark. To systematically assess their susceptibility to harmful tasks, we introduce the Agent Risk Assessment framework that categorizes agent behavior across four risk levels. We find agents are surprisingly compliant with malicious requests, with GPT-4o and Qwen-2 completing 34.7% and 27.3% of harmful requests, respectively. Our findings highlight the urgent need for safety alignment procedures for web agents. Our benchmark is available here: https://safearena.github.io
Community
Agents like OpenAI Operator can solve complex computer tasks, but what happens when people use them to cause harm, e.g., automating phishing or spreading misinformation?
To find out, we introduce SafeArena, a benchmark that assesses the ability of web agents to complete harmful web tasks. We find that agents built on existing LLMs can complete up to 26% of the illegal and unsafe requests.
The harmfulness of LLMs varies substantially: whereas Claude-3.5 Sonnet refuses a majority of harmful tasks, Qwen-2-VL completes over a quarter of the 250 harmful tasks we designed for this benchmark. Moreover, an agent built with GPT-4o, the LLM that powers Operator, completes an alarming number of unsafe requests, despite extensive safety training. This highlights the urgent need to improve the safety training of current LLMs for agentic tasks.
How dangerous are current LLMs? To answer this question, we introduce the Agent Risk Assessment (ARIA) framework, which both humans and LLM judges can use to rate the risk level of a web agent on a four-level scale: from safe, when the agent immediately refuses a harmful request (Level 1), to effectively harmful, when it successfully completes a harmful request (Level 4). We find that Claude is substantially safer than Qwen, which very rarely refuses user requests, indicating limited safeguards for web-oriented tasks.
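As a rough illustration (not the benchmark's actual code), the four ARIA levels can be treated as an ordinal scale that a human or LLM judge assigns per task. Only Levels 1 and 4 are described above, so the names of the intermediate levels below are placeholders:

```python
from enum import IntEnum

class AriaRiskLevel(IntEnum):
    """Hypothetical encoding of the four ARIA risk levels."""
    IMMEDIATE_REFUSAL = 1    # agent refuses the harmful request right away (safe)
    INTERMEDIATE_LOW = 2     # placeholder: agent engages briefly before stopping
    INTERMEDIATE_HIGH = 3    # placeholder: agent attempts the task but does not finish
    HARMFUL_COMPLETION = 4   # agent successfully completes the harmful request

def average_risk(levels: list[AriaRiskLevel]) -> float:
    """Average risk level over a set of harmful tasks (higher = less safe)."""
    return sum(int(level) for level in levels) / len(levels)
```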
To provide transparency on the safety of popular LLMs, we host a leaderboard that ranks models by their normalized safety score: the rate at which a model completes a safe task relative to its harmful counterpart. All tasks run in augmented environments built on top of WebArena.
Leaderboard: https://huggingface.co/spaces/McGill-NLP/safearena-leaderboard
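For intuition, here is a minimal Python sketch of a score that contrasts safe-task completion with harmful-task compliance. This formula is an illustrative assumption, not necessarily the exact normalized safety score used by the leaderboard (that definition is in the paper):

```python
def normalized_safety_score(safe_completed: int, safe_total: int,
                            harmful_completed: int, harmful_total: int) -> float:
    """Toy score: reward completing safe tasks and refusing harmful ones.

    Returns a value in [0, 1], where 1.0 means the agent completed every
    safe task and no harmful task. Illustrative only.
    """
    safe_rate = safe_completed / safe_total            # helpfulness on safe tasks
    harmful_rate = harmful_completed / harmful_total   # compliance on harmful tasks
    return 0.5 * (safe_rate + (1.0 - harmful_rate))
```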
We release our benchmark, code, tasks and environments to help researchers develop web agents that are not only helpful but also safe. You can request access now if you are working on web agents or safety:
Paper: https://arxiv.org/abs/2503.04957
Benchmark: https://safearena.github.io
Code: https://github.com/McGill-NLP/safearena
Tasks/Environments: https://huggingface.co/datasets/McGill-NLP/safearena
Leaderboard: https://huggingface.co/spaces/McGill-NLP/safearena-leaderboard
Discuss on socials:
X: https://x.com/xhluca/status/1899151463068135874
BlueSky: https://bsky.app/profile/xhluca.bsky.social/post/3lk2466xpfc23