SafeArena: Evaluating the Safety of Autonomous Web Agents
Abstract
LLM-based agents are becoming increasingly proficient at solving web-based tasks. With this capability comes a greater risk of misuse for malicious purposes, such as posting misinformation in an online forum or selling illicit substances on a website. To evaluate these risks, we propose SafeArena, the first benchmark to focus on the deliberate misuse of web agents. SafeArena comprises 250 safe and 250 harmful tasks across four websites. We classify the harmful tasks into five harm categories (misinformation, illegal activity, harassment, cybercrime, and social bias) designed to assess realistic misuses of web agents. We evaluate leading LLM-based web agents, including GPT-4o, Claude-3.5 Sonnet, Qwen-2-VL 72B, and Llama-3.2 90B, on our benchmark. To systematically assess their susceptibility to harmful tasks, we introduce the Agent Risk Assessment framework that categorizes agent behavior across four risk levels. We find agents are surprisingly compliant with malicious requests, with GPT-4o and Qwen-2 completing 34.7% and 27.3% of harmful requests, respectively. Our findings highlight the urgent need for safety alignment procedures for web agents. Our benchmark is available here: https://safearena.github.io
Community
Agents like OpenAI Operator can solve complex computer tasks, but what happens when people use them to cause harm, e.g., automating phishing or spreading misinformation?
To find out, we introduce SafeArena, a benchmark that assesses the ability of web agents to complete harmful web tasks. We find that agents built on existing LLMs can complete up to 26% of the illegal and unsafe requests.
The harmfulness of LLMs varies substantially: whereas Claude-3.5 Sonnet refuses a majority of harmful tasks, Qwen-2-VL completes over a quarter of the 250 harmful tasks we designed for this benchmark. Moreover, an agent built with GPT-4o, the LLM that powers Operator, completes an alarming number of unsafe requests, despite extensive safety training. This highlights the urgent need to improve the safety training of current LLMs for agentic tasks.
How dangerous are current LLMs? To answer this question, we introduce the Agent Risk Assessment (ARIA) framework, which both humans and LLM judges can use to rate the risk level of a web agent on a four-level scale: from safe, when the agent immediately refuses a harmful request (Level 1), to effectively harmful, when it successfully completes a harmful request (Level 4). We find that Claude is substantially safer than Qwen, which very rarely refuses user requests, indicating limited safeguards for web-oriented tasks.
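As a rough illustration (not the benchmark's actual code), the four ARIA levels can be treated as an ordinal scale that a human or LLM judge assigns per task. Only Levels 1 and 4 are described above, so the names of the intermediate levels below are placeholders:

```python
from enum import IntEnum

class AriaRiskLevel(IntEnum):
    """Hypothetical encoding of the four ARIA risk levels."""
    IMMEDIATE_REFUSAL = 1    # agent refuses the harmful request right away (safe)
    INTERMEDIATE_LOW = 2     # placeholder: agent engages briefly before stopping
    INTERMEDIATE_HIGH = 3    # placeholder: agent attempts the task but does not finish
    HARMFUL_COMPLETION = 4   # agent successfully completes the harmful request

def average_risk(levels: list[AriaRiskLevel]) -> float:
    """Average risk level over a set of harmful tasks (higher = less safe)."""
    return sum(int(level) for level in levels) / len(levels)
```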
To provide transparency on the safety of popular LLMs, we host a leaderboard that ranks models by their normalized safety score: the rate at which a model completes a safe task relative to its harmful counterpart. All tasks run in augmented environments built on top of WebArena.
Leaderboard: https://huggingface.co/spaces/McGill-NLP/safearena-leaderboard
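For intuition, here is a minimal Python sketch of a score that contrasts safe-task completion with harmful-task compliance. This formula is an illustrative assumption, not necessarily the exact normalized safety score used by the leaderboard (that definition is in the paper):

```python
def normalized_safety_score(safe_completed: int, safe_total: int,
                            harmful_completed: int, harmful_total: int) -> float:
    """Toy score: reward completing safe tasks and refusing harmful ones.

    Returns a value in [0, 1], where 1.0 means the agent completed every
    safe task and no harmful task. Illustrative only.
    """
    safe_rate = safe_completed / safe_total            # helpfulness on safe tasks
    harmful_rate = harmful_completed / harmful_total   # compliance on harmful tasks
    return 0.5 * (safe_rate + (1.0 - harmful_rate))
```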
We release our benchmark, code, tasks and environments to help researchers develop web agents that are not only helpful but also safe. You can request access now if you are working on web agents or safety:
Paper: https://arxiv.org/abs/2503.04957
Benchmark: https://safearena.github.io
Code: https://github.com/McGill-NLP/safearena
Tasks/Environments: https://huggingface.co/datasets/McGill-NLP/safearena
Leaderboard: https://huggingface.co/spaces/McGill-NLP/safearena-leaderboard
Discuss on socials:
X: https://x.com/xhluca/status/1899151463068135874
BlueSky: https://bsky.app/profile/xhluca.bsky.social/post/3lk2466xpfc23