How to build a custom text classifier without days of human labeling

Community Article Published October 17, 2024

LLMs work reasonably well as text classifiers when provided with a few examples. However, they are expensive to run at scale, not feasible for some real-world scenarios, and their performance drops in specialized domains. On the other hand, purpose-built classifiers have low latency and can potentially run on CPU, but they require labeled training data.

In this post, we show how to train a classifier with an LLM’s feedback to deploy a small model without days of human labeling.

Who this is for

  • Data scientists who require efficient labeling of domain data.
  • ML engineers with resource constraints in need of effective models.
  • Newcomers who want to get familiar with applied machine learning.


High-quality training data is critical for developing reliable models. While LLMs help reduce the amount of human input needed to train an ML model, adding human feedback can significantly increase the quality of the model. One tool that simplifies this process is the distilabel library, which leverages LLMs to supercharge labeling workflows. For text classification, the ArgillaLabeller task uses an LLM to label datasets hosted on Argilla, a modern, open-source, data-centric tool to improve AI datasets. This integration combines the best of both worlds: the automatic labeling capabilities of LLMs and the high-quality annotations from human experts.

This post explains how to build a text classification model combining LLMs and human feedback, drastically reducing the time to deploy a supervised model for a specialized use case.

The post details the following steps:

  • Autolabel a dataset by defining its fields, labels, and annotation guidelines.
  • Review and improve the LLM-suggested labels with human feedback.
  • Train a specialized SetFit model and compare it against an LLM few-shot classifier.

Define your dataset

The first step is to configure your Argilla dataset. This means you need to define the fields that will contain the data to be annotated, the labels, and the guidelines. You have full flexibility to customize these elements for your use case so they align with your project goals.

In this example, we are defining a dataset for sentiment classification:

import argilla as rg

settings = rg.Settings(
    fields=[
        rg.TextField(
            name="text",
            title="Text",
            description="Provide a concise response to the prompt",
        )
    ],
    questions=[
        rg.LabelQuestion(
            name="label",
            title="Emotion",
            description="Provide a single label for the sentiment of the text",
            labels=["positive", "neutral", "negative"],
        )
    ],
    guidelines="We are dealing with input data about detailed customer reviews with 3 corresponding labels: positive, negative, and neutral.",
)

Next, you can create the dataset with the defined settings and add the data for annotation. In this example, we will use the argilla/pc-components-reviews dataset from the Hub.
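Note that this step requires a running Argilla instance and an authenticated client. Below is a minimal sketch; the URL and API key are placeholders for your own deployment, and depending on your setup the SDK can also pick these up from the ARGILLA_API_URL and ARGILLA_API_KEY environment variables.

import argilla as rg

# Placeholders: point these at your own Argilla deployment
client = rg.Argilla(
    api_url="https://<your-argilla-instance>",
    api_key="<your-api-key>",
)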

dataset = rg.Dataset.from_hub(
    repo_id="argilla/pc-components-reviews",
    name="pc-components",
    split="train",
    with_records=True,
    settings=settings,
)

In the next step, distilabel will use this dataset and its configuration to auto-label it using an LLM. This means that, under the hood, distilabel will retrieve Argilla’s data and format it into a prompt template to guide the LLM in understanding the labeling task. You can find the prompt template in the Appendix for reference.

Automatic labeling

The next step is to run distilabel to start auto-labeling the dataset.

To label records with an LLM, you need to set up ArgillaLabeller with the sentiment classification dataset. Optionally, you can add some example records for a few-shot setting.

from distilabel.llms import InferenceEndpointsLLM
from distilabel.steps.tasks import ArgillaLabeller

labeller = ArgillaLabeller(
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Llama-3.1-8B-Instruct",
        tokenizer_id="meta-llama/Llama-3.1-8B-Instruct",
    ),
    # Optional: a few already-annotated records used as few-shot examples
    example_records=example_records,
)
labeller.load()

Once you run your ArgillaLabeller with your dataset, it will automatically label it. The suggested labels will be available in the Argilla UI for human review.

# Label a single record (one item from dataset.records)
result = next(
    labeller.process(
        [
            {
                "record": record,
                "fields": dataset.fields,
                "question": dataset.questions[0],
                "guidelines": dataset.guidelines,
            }
        ]
    )
)
suggestion = result[0]["suggestion"]
record.suggestions.add(rg.Suggestion(**suggestion))
dataset.records.log(record)
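The snippet above labels a single record. A minimal sketch for labeling the whole dataset, assuming it is small enough to iterate record by record (for larger datasets you would batch the inputs), could look like this:

updated_records = []
for record in dataset.records:
    result = next(
        labeller.process(
            [
                {
                    "record": record,
                    "fields": dataset.fields,
                    "question": dataset.questions[0],
                    "guidelines": dataset.guidelines,
                }
            ]
        )
    )
    # Attach the LLM's suggested label so it shows up in the Argilla UI
    record.suggestions.add(rg.Suggestion(**result[0]["suggestion"]))
    updated_records.append(record)

dataset.records.log(updated_records)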

This approach simplifies the labeling process, allowing you to automatically label data in real-time while refining it with human feedback. This method is powerful and versatile, providing good performance right out of the box, especially for straightforward workflows. By automating much of the process, you can save time and effort, avoiding the need for days and days of manual labeling. However, at the same time, it ensures that humans are still in the loop, making it an efficient way to build high-quality datasets for your project.

From LLMs to smaller models

The final step is to train our specialized model with the annotated data. This data has been auto-labeled by an LLM and then improved with human feedback via the Argilla UI. The goal is to train a small, specialized model that’s optimized for your use case, without relying on an LLM for inference. In this example, we will be using SetFit, a powerful text classification library.

If you are unfamiliar with the library, take a look at the SetFit GitHub repository. SetFit achieves high accuracy with little labeled data - for instance, with only 8 labeled examples per class on the Customer Reviews sentiment dataset, SetFit is competitive with fine-tuning RoBERTa Large on the full training set of 3k examples 🤯!

SetFit first fine-tunes a sentence transformer model with a contrastive learning approach, producing embeddings that align better with the specific classification task. A classification head is then trained on top of these fine-tuned embeddings, mapping each text to its corresponding class.

With just a few lines of code and the annotated data from Argilla, you can easily train your classifier. With this approach, you will have successfully trained a working classifier using domain-specific high-quality data in an efficient way.
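SetFit's Trainer expects Hugging Face datasets.Dataset objects with a text column and a label column (the column names are configurable via a column mapping). The snippet below is an illustrative sketch of what train_annotated_dataset could look like with made-up rows; in practice you would export the reviewed records from your Argilla dataset, and build eval_dataset the same way from held-out reviews.

from datasets import Dataset

# Illustrative only: in practice these rows come from the reviewed Argilla records
train_annotated_dataset = Dataset.from_dict({
    "text": [
        "The GPU runs cool and quiet, very happy with it.",
        "The SSD stopped working after two weeks.",
        "The case is fine, nothing special.",
    ],
    "label": ["positive", "negative", "neutral"],
})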


from setfit import SetFitModel, Trainer

# Lightweight sentence-transformer backbone for the classifier
model = SetFitModel.from_pretrained("TaylorAI/bge-micro-v2")

trainer = Trainer(
    model=model,
    train_dataset=train_annotated_dataset,
    eval_dataset=eval_dataset,
)

trainer.train()
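After training, inference is a single local call with no LLM involved; a quick sketch with made-up review strings:

preds = model.predict([
    "The fans are way too loud under load.",
    "Great value for the price, works as expected.",
])
print(preds)  # e.g. ['negative', 'positive']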

Pure LLM labeling vs SetFit labeling

In the previous sections, we demonstrated how to use an LLM to label a dataset and get enough annotated samples to feed and train a smaller model. However, there may still be questions regarding the performance and accuracy of each model. To assess this, we conducted an experiment comparing both approaches. Specifically, we used the argilla/pc-components-reviews dataset and compared the performance of both models using a different number of samples.

First, we used an LLM (Llama 3.1 8B) to classify product reviews as positive, negative, or neutral. The LLM received the task description from the annotation guidelines and an incrementing number of examples per class, from 0 to 6. Then, we trained a classifier with the SetFit library using the bge-micro-v2 model. SetFit does not use prompts, but we trained it with the same incrementing number of samples per class, again from 0 to 6. In both cases, we measured accuracy on the remaining 122 samples from the pc-components-reviews dataset.
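For the SetFit side of this comparison, accuracy on the held-out reviews can be computed with the trainer from the previous section; its default metric is accuracy, evaluated on the eval_dataset passed to the Trainer. A minimal sketch:

# Evaluates the trained model on eval_dataset (accuracy by default)
metrics = trainer.evaluate()
print(metrics)  # {'accuracy': ...}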

In total, we reviewed 18 samples for a training set in less than 10 minutes, with help from LLM suggestions. In a real-world use case, we would expand the evaluation set to improve robustness.

As shown below, the outcomes of our experiment suggest that the SetFit model outperforms zero-shot Llama-3.1-8B, and is especially strong in the few-shot setting. This highlights the value of LLMs for initial data annotation, where they generate high-quality synthetic labels. Once the initial dataset is annotated, smaller models like SetFit can learn from this data and deliver improved performance. This approach accelerates model development and allows a more efficient deployment of classifiers in resource-constrained environments.

[Figure: accuracy vs. number of examples per class for the SetFit classifier and the Llama-3.1-8B few-shot baseline]

Conclusions

In this blog post, we explored how to integrate Argilla and distilabel to train a small classifier using both AI and human feedback. We demonstrated how applying this approach to auto-label a dataset can simplify workflows, saving time and resources while still delivering a high-performing model. We found several key advantages:

  • Increased labeling efficiency through the use of LLMs
  • Consistent labels thanks to well-defined annotation guidelines
  • A workflow that pairs the scale of AI feedback with the precision of human revision

What’s next?

If you want to try out automatic labeling on your own use case, these are your next steps:

Appendix

Prompt Template

Please provide an answer to the question based on the input fields{% if examples %} and examples{% endif %}.
{% if guidelines %}
# Guidelines
{{ guidelines }}
{% endif %}
# Input Fields
{{ fields }}
# Question
{{ question }}
{% if examples %}
# Examples
{{ examples }}
{% endif %}
