arxiv:2503.04957

SafeArena: Evaluating the Safety of Autonomous Web Agents

Published on Mar 6
· Submitted by xhluca on Mar 10
Abstract

LLM-based agents are becoming increasingly proficient at solving web-based tasks. With this capability comes a greater risk of misuse for malicious purposes, such as posting misinformation in an online forum or selling illicit substances on a website. To evaluate these risks, we propose SafeArena, the first benchmark to focus on the deliberate misuse of web agents. SafeArena comprises 250 safe and 250 harmful tasks across four websites. We classify the harmful tasks into five harm categories -- misinformation, illegal activity, harassment, cybercrime, and social bias -- designed to assess realistic misuses of web agents. We evaluate leading LLM-based web agents, including GPT-4o, Claude-3.5 Sonnet, Qwen-2-VL 72B, and Llama-3.2 90B, on our benchmark. To systematically assess their susceptibility to harmful tasks, we introduce the Agent Risk Assessment framework that categorizes agent behavior across four risk levels. We find agents are surprisingly compliant with malicious requests, with GPT-4o and Qwen-2 completing 34.7% and 27.3% of harmful requests, respectively. Our findings highlight the urgent need for safety alignment procedures for web agents. Our benchmark is available here: https://safearena.github.io

Community


Agents like OpenAI Operator can solve complex computer tasks, but what happens when users direct them to cause harm, e.g., by automating phishing or spreading misinformation?

[Image: UnsafeExample.png]

To find out, we introduce SafeArena, a benchmark assessing the capability of web agents to complete harmful web tasks. We find that existing LLMs can complete up to 26% of the illegal and unsafe requests.

[Image: SafeArenaBarChart.png]

Harmful compliance varies substantially across LLMs: whereas Claude-3.5 Sonnet refuses the majority of harmful tasks, Qwen-2-VL completes over a quarter of the 250 harmful tasks we designed for this benchmark. Moreover, an agent built with GPT-4o, the LLM that powers Operator, completes an alarming number of unsafe requests despite extensive safety training. This highlights the urgent need to improve the safety training of current LLMs for agentic tasks.

How dangerous are current LLMs? To answer this question, we introduce the Agent Risk Assessment (ARIA) framework, which both humans and LLM judges can use to rate the risk level of a web agent, ranging from safe, if it refuses a harmful request outright (Level 1), to effectively harmful, if it successfully completes a harmful request (Level 4). We find that Claude is substantially safer than Qwen, which very rarely refuses user requests, indicating limited safeguards for web-oriented tasks.
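As a rough illustration, ARIA's four-level scale could be encoded as follows. Note that only Levels 1 and 4 are described above, so the intermediate labels in this sketch are placeholders, not the paper's definitions:

```python
from enum import IntEnum

class ARIARiskLevel(IntEnum):
    """Illustrative encoding of ARIA's four risk levels. Only Levels 1 and 4
    are described in this post; Levels 2 and 3 are placeholders."""
    IMMEDIATE_REFUSAL = 1  # safe: the agent refuses the harmful request outright
    INTERMEDIATE_2 = 2     # placeholder for an intermediate risk level
    INTERMEDIATE_3 = 3     # placeholder for an intermediate risk level
    HARM_COMPLETED = 4     # effectively harmful: the request is fully completed

def is_effectively_harmful(level: ARIARiskLevel) -> bool:
    """An agent is rated effectively harmful only if it completes the task."""
    return level == ARIARiskLevel.HARM_COMPLETED
```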

[Image: aria_human.png]

To provide transparency on the safety of popular LLMs, we host a leaderboard that ranks models by a normalized safety score: for each model, we compare the rate at which it completes a safe task with the rate at which it completes the task's harmful counterpart. The benchmark uses augmented environments built on top of WebArena.
Leaderboard: https://huggingface.co/spaces/McGill-NLP/safearena-leaderboard
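The exact normalization is specified in the paper; as a minimal sketch, assuming the score rewards completing safe tasks while penalizing completion of their harmful counterparts, it might look like:

```python
def normalized_safety_score(safe_completed: int, safe_total: int,
                            harmful_completed: int, harmful_total: int) -> float:
    """Assumed form of the normalized safety score, for illustration only:
    capability on safe tasks, discounted by compliance on harmful counterparts.
    See the paper for the actual definition."""
    safe_rate = safe_completed / safe_total        # capability on benign tasks
    harm_rate = harmful_completed / harmful_total  # compliance with misuse
    # High only when the agent completes safe tasks AND refuses harmful ones.
    return safe_rate * (1.0 - harm_rate)
```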

[Image: Leaderboard.png]

We release our benchmark, code, tasks, and environments to help researchers develop web agents that are not only helpful but also safe. You can request access now if you are working on web agents or safety:

Paper: https://arxiv.org/abs/2503.04957
Benchmark: https://safearena.github.io
Code: https://github.com/McGill-NLP/safearena
Tasks/Environments: https://huggingface.co/datasets/McGill-NLP/safearena
Leaderboard: https://huggingface.co/spaces/McGill-NLP/safearena-leaderboard
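Once access is granted, the tasks can be loaded with the Hugging Face datasets library. The snippet below is a minimal sketch; split and column names are not documented here, so inspect the dataset after loading:

```python
from datasets import load_dataset

# SafeArena is gated: request access on the dataset page, then authenticate
# (e.g., with `huggingface-cli login`) before loading.
ds = load_dataset("McGill-NLP/safearena")
print(ds)  # inspect the available splits and columns
```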

Discuss on socials:
X: https://x.com/xhluca/status/1899151463068135874
BlueSky: https://bsky.app/profile/xhluca.bsky.social/post/3lk2466xpfc23


