arxiv:2502.12485

Safe at the Margins: A General Approach to Safety Alignment in Low-Resource English Languages -- A Singlish Case Study

Published on Feb 18, 2025

Abstract

To ensure safe usage, Large Language Models (LLMs) typically undergo alignment with human-defined values. However, this alignment often relies primarily on English data and is biased towards Western-centric values, limiting its effectiveness in low-resource language settings. In this paper, we describe our approach for aligning SEA-Lion-v2.1-Instruct (a Llama3-8B variant) to minimize toxicity in Singlish, an English creole specific to Singapore. We find that supervised fine-tuning and Kahneman-Tversky Optimization (KTO) on paired and unpaired preferences is more sample-efficient and yields significantly better results than Direct Preference Optimization (DPO). Our analysis reveals that DPO implicitly enforces a weaker safety objective than KTO, and that SFT complements KTO by improving training stability. Finally, we introduce a simple but novel modification to KTO, KTO-S, which improves training stability through better gradient exploitation. Overall, we present a general approach for safety alignment conducive to low-resource English languages, successfully reducing toxicity by 99% on our Singlish benchmark, with gains generalizing to the broader TOXIGEN dataset while maintaining strong performance across standard LLM benchmarks.
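The abstract does not spell out the KTO objective or the exact KTO-S modification, so for context, here is a minimal sketch of the vanilla unpaired-preference KTO loss the paper builds on (following Ethayarajh et al., 2024). The function name, hyperparameter defaults, and the batch-mean estimate of the reference point are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of the vanilla KTO loss (Ethayarajh et al., 2024).
# The paper's KTO-S variant modifies KTO's gradient behavior for stability,
# but its exact update is not given in the abstract, so only KTO is shown.
import torch

def kto_loss(policy_logps: torch.Tensor,
             ref_logps: torch.Tensor,
             is_desirable: torch.Tensor,
             beta: float = 0.1,
             lambda_d: float = 1.0,
             lambda_u: float = 1.0) -> torch.Tensor:
    """KTO loss over a batch of unpaired (x, y, desirable/undesirable) examples.

    policy_logps / ref_logps: log pi_theta(y|x) and log pi_ref(y|x), shape (B,).
    is_desirable: boolean tensor of shape (B,) marking desirable completions.
    """
    # Implicit reward: log-ratio of the policy to the frozen reference model.
    rewards = policy_logps - ref_logps
    # Reference point z0 approximating KL(pi_theta || pi_ref); a detached
    # batch-mean stand-in for the mismatched-pair estimate in the KTO paper.
    z0 = rewards.mean().detach().clamp(min=0)
    # Value function: push desirable rewards above z0, undesirable below it.
    v = torch.where(
        is_desirable,
        lambda_d * torch.sigmoid(beta * (rewards - z0)),
        lambda_u * torch.sigmoid(beta * (z0 - rewards)),
    )
    lam = torch.where(is_desirable,
                      torch.full_like(rewards, lambda_d),
                      torch.full_like(rewards, lambda_u))
    # Minimizing (lambda_y - v) maximizes the Kahneman-Tversky value v.
    return (lam - v).mean()
```

Because KTO needs only a per-example desirable/undesirable label rather than a chosen/rejected pair, it can consume both the paired and unpaired preference data the abstract mentions, which is what makes it a natural fit for low-resource settings.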
