arxiv:2410.06458

LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints

Published on Oct 9 · Submitted by thomas-ferraz on Oct 10


Authors: Thomas Palmeira Ferraz, Kartik Mehta, Sijia Liu, Mohit Bansal, Nanyun Peng

Abstract

Instruction following is a key capability for LLMs. However, recent studies have shown that LLMs often struggle with instructions containing multiple constraints (e.g. a request to create a social media post "in a funny tone" with "no hashtag"). Despite this, most evaluations focus solely on synthetic data. To address this, we introduce RealInstruct, the first benchmark designed to evaluate LLMs' ability to follow real-world multi-constrained instructions by leveraging queries real users asked AI assistants. We also investigate model-based evaluation as a cost-effective alternative to human annotation for this task. Our findings reveal that even the proprietary GPT-4 model fails to meet at least one constraint on over 21% of instructions, highlighting the limitations of state-of-the-art models. To address the performance gap between open-source and proprietary models, we propose the Decompose, Critique and Refine (DeCRIM) self-correction pipeline, which enhances LLMs' ability to follow constraints. DeCRIM works by decomposing the original instruction into a list of constraints and using a Critic model to decide when and where the LLM's response needs refinement. Our results show that DeCRIM improves Mistral's performance by 7.3% on RealInstruct and 8.0% on IFEval even with weak feedback. Moreover, we demonstrate that with strong feedback, open-source LLMs with DeCRIM can outperform GPT-4 on both benchmarks.


Community

thomas-ferraz · Paper author · Paper submitter · 18 days ago

The authors propose DeCRIM (Decompose, Critique, and Refine), a novel self-correction pipeline designed to enhance Large Language Models (LLMs) in following instructions with multiple constraints. Recognizing the limitations of LLMs in handling such instructions, especially when real-world user constraints are involved, the authors introduce RealInstruct, a benchmark based on real user queries. Through their analysis, they reveal that even state-of-the-art models like GPT-4 struggle with constraint satisfaction, failing to meet at least one constraint in over 21% of cases. DeCRIM improves LLM performance by decomposing instructions into individual constraints and using a Critic model to refine the output where necessary.
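The decompose, critique, and refine loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `decrim`, its parameters, the refinement prompt wording, the `max_rounds` cap, and the three `toy_*` stand-ins are all hypothetical. In practice each stand-in would be replaced by a call to an LLM or a Critic model.

```python
from typing import Callable, List


def decrim(
    instruction: str,
    generate: Callable[[str], str],
    decompose: Callable[[str], List[str]],
    critic: Callable[[str, str], bool],
    max_rounds: int = 3,  # hypothetical cap on refinement rounds
) -> str:
    """Decompose the instruction, critique the response, refine if needed."""
    constraints = decompose(instruction)
    response = generate(instruction)
    for _ in range(max_rounds):
        # The critic decides *when* (any failure) and *where* (which constraints).
        failed = [c for c in constraints if not critic(response, c)]
        if not failed:
            break
        # Hypothetical refinement prompt; the wording is an assumption.
        prompt = (
            f"{instruction}\n\nPrevious response:\n{response}\n\n"
            f"Unmet constraints: {'; '.join(failed)}\n"
            "Rewrite the response so that every constraint is met."
        )
        response = generate(prompt)
    return response


# Toy stand-ins so the sketch runs without any model (all hypothetical).
def toy_decompose(instruction: str) -> List[str]:
    return ["funny tone", "no hashtag"]


def toy_critic(response: str, constraint: str) -> bool:
    if constraint == "no hashtag":
        return "#" not in response
    return True  # tone is not checked in this toy critic


def toy_generate(prompt: str) -> str:
    if "Unmet constraints" in prompt:
        return "My cat just filed a complaint about Monday."
    return "My cat just filed a complaint about Monday. #caturday"


final = decrim(
    "Write a funny social media post with no hashtag.",
    toy_generate,
    toy_decompose,
    toy_critic,
)
print(final)
```

Passing the generator, decomposer, and critic as plain callables keeps the loop independent of any particular model, so a weaker or stronger critic can be swapped in without touching the control flow.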

To assess the efficacy of DeCRIM, the authors conduct experiments comparing both proprietary and open-source models. Their findings show that DeCRIM significantly enhances the performance of the open-source Mistral model, leading to a 7.3% improvement on the RealInstruct benchmark and 8.0% on the IFEval benchmark. These results hold even with weak feedback, and stronger feedback allows DeCRIM-enhanced open-source models to outperform GPT-4 on both benchmarks. The authors also explore model-based evaluation as a cost-effective alternative to human evaluation, finding that GPT-4-Turbo with Chain-of-Thought prompting provides reliable results.
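A statistic such as "fails to meet at least one constraint" is counted per instruction: a response counts as a failure as soon as any one of its constraints is unmet. The sketch below shows how that rate differs from a per-constraint score. The function names, the term "constraint-level accuracy", and the sample judgments are illustrative assumptions, not data or definitions from the paper.

```python
from typing import List


def instruction_level_failure_rate(results: List[List[bool]]) -> float:
    """Fraction of instructions with at least one unmet constraint.

    results[i][j] is True if the response to instruction i met constraint j.
    """
    if not results:
        raise ValueError("results must not be empty")
    failed = sum(1 for per_instruction in results if not all(per_instruction))
    return failed / len(results)


def constraint_level_accuracy(results: List[List[bool]]) -> float:
    """Fraction of individual constraints met, pooled over all instructions."""
    flat = [ok for per_instruction in results for ok in per_instruction]
    if not flat:
        raise ValueError("results must contain at least one constraint")
    return sum(flat) / len(flat)


# Made-up judgments for four instructions (illustration only).
sample = [
    [True, True],
    [True, False],
    [True, True, True],
    [False, False],
]
failure_rate = instruction_level_failure_rate(sample)
accuracy = constraint_level_accuracy(sample)
print(failure_rate, accuracy)
```

In this made-up sample, half of the instructions fail while two thirds of the individual constraints are met, which shows why the per-instruction view is the stricter of the two.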

In conclusion, the authors introduce RealInstruct as a new real-world benchmark and DeCRIM as an effective self-correction pipeline for multi-constrained instruction following. They also present the first systematic analysis of model-based evaluation for constraint satisfaction. The proposed pipeline narrows the performance gap between open-source and proprietary models and, with strong feedback, lets open-source models outperform GPT-4 on both benchmarks.


