Open Food Facts - Ingredients spellcheck model

When a product is added to the database, all its details, such as allergens, additives, or nutritional values, are either written down by the contributor or automatically extracted from the product pictures using OCR.

However, the information extracted by OCR often contains typos and errors caused by poor-quality pictures: low definition, curved products, light reflections, etc.

To solve this problem, we developed the Ingredients Spellcheck, a model capable of correcting typos in a list of ingredients following a defined guideline. The model, based on Mistral-7B-v0.3, was fine-tuned on thousands of corrected lists of ingredients extracted from the database.

Model Details

Model Description

The Open Food Facts Ingredients Spellcheck is a version of Mistral-7B-v0.3 fine-tuned on thousands of corrected lists of ingredients extracted from the Open Food Facts (OFF) database.

The training dataset and the evaluation benchmark are both available in the Open Food Facts HF repository.

The project is currently in development. You can find it in the Open Food Facts GitHub repo.

A demo of this model is also available in HF Spaces.

Uses

This model takes a list of ingredients of a product as input and returns the correction.

It follows a spellcheck guideline, which was used to build the training and evaluation datasets. You can find this guideline in the Spellcheck project README.

To respect the training process, the input list of ingredients needs to be embedded into the following prompt:

def prepare_instruction(text: str) -> str:
    """Prepare instruction prompt for fine-tuning and inference.
    Identical to instruction during training.

    Args:
        text (str): List of ingredients

    Returns:
        str: Instruction.
    """
    instruction = (
        "###Correct the list of ingredients:\n"
        + text
        + "\n\n###Correction:\n"
    )
    return instruction
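
As a minimal sketch of how this prompt might be used at inference time, the snippet below wraps it around a generic text-generation callable and strips the echoed prompt from the output. The `generate` callable, `extract_correction`, `spellcheck`, and `fake_generate` are hypothetical names introduced for illustration; in practice `generate` could wrap whichever inference backend serves the model, and the assumption that the backend returns the prompt followed by the completion may not hold for every backend.

```python
from typing import Callable


def prepare_instruction(text: str) -> str:
    """Same prompt as the one used during training (see above)."""
    return "###Correct the list of ingredients:\n" + text + "\n\n###Correction:\n"


def extract_correction(generated: str, prompt: str) -> str:
    """Return only the correction, dropping the prompt if it was echoed back."""
    # Assumption: some backends return prompt + completion, others only the completion.
    if generated.startswith(prompt):
        generated = generated[len(prompt):]
    return generated.strip()


def spellcheck(text: str, generate: Callable[[str], str]) -> str:
    """Build the prompt, call the (hypothetical) backend, and clean the output."""
    prompt = prepare_instruction(text)
    return extract_correction(generate(prompt), prompt)


def fake_generate(prompt: str) -> str:
    """Stand-in for a real model call, used only to make the sketch runnable."""
    return prompt + "farine de blé, sucre, sel\n"


print(spellcheck("farine de blé, sucr, sel", fake_generate))
```

Keeping the backend behind a plain callable makes it easy to swap a local model for a remote endpoint without touching the prompt logic.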

Training Details

Information about the model training is available in the CometML Experiment Tracker, along with the other experiments.

The model was trained on AWS SageMaker using an ml.g5.2xlarge instance for 3 epochs.

Evaluation

The model is evaluated on the benchmark using a custom evaluation algorithm.

In short, each list of ingredients is considered in 3 versions: original, reference, and prediction. Using a sequence alignment algorithm on the original-reference pair and on the original-prediction pair, we can tell which tokens were supposed to be corrected and which ones were actually corrected. This yields a correction precision and recall.

The complete explanation of the algorithm is available in the Spellcheck README.
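
As a rough illustration of the idea only (this is not the project's evaluation code), the sketch below aligns whitespace tokens with Python's `difflib.SequenceMatcher`, which stands in for the sequence alignment algorithm. The function names, the whitespace tokenisation, the choice to ignore pure insertions, and the exact metric definitions are all simplifying assumptions.

```python
from difflib import SequenceMatcher


def token_changes(original: list[str], other: list[str]) -> dict[int, tuple[str, ...]]:
    """Map each changed original token index to the token(s) that replace it."""
    changes: dict[int, tuple[str, ...]] = {}
    matcher = SequenceMatcher(a=original, b=other, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal" or i1 == i2:
            continue  # unchanged span, or pure insertion (ignored in this sketch)
        if i2 - i1 == j2 - j1:
            # Same-length replacement: pair tokens one to one.
            for offset in range(i2 - i1):
                changes[i1 + offset] = (other[j1 + offset],)
        else:
            # Deletion or uneven replacement: attach the whole new span.
            for i in range(i1, i2):
                changes[i] = tuple(other[j1:j2])
    return changes


def ratio(numerator: int, denominator: int) -> float:
    return numerator / denominator if denominator else 0.0


def evaluate(original: str, reference: str, prediction: str) -> dict[str, float]:
    tokens = original.split()
    expected = token_changes(tokens, reference.split())    # should be corrected
    predicted = token_changes(tokens, prediction.split())  # was actually changed
    located = expected.keys() & predicted.keys()           # changed in the right place
    corrected = {i for i in located if expected[i] == predicted[i]}
    return {
        "localisation_precision": ratio(len(located), len(predicted)),
        "localisation_recall": ratio(len(located), len(expected)),
        "correction_precision": ratio(len(corrected), len(predicted)),
        "correction_recall": ratio(len(corrected), len(expected)),
    }


scores = evaluate(
    original="farine de blé, sucr, oeufs, sel",
    reference="farine de blé, sucre, œufs, sel",
    prediction="farine de blé, sucre, oeufs, sal",
)
print(scores)
```

In this toy example the prediction fixes one of the two expected tokens and wrongly alters a third, so localisation and correction scores both reflect one hit out of two changes.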

Metrics:

  • Correction precision: 0.67
  • Correction recall: 0.62
  • Localisation precision: 0.75
  • Localisation recall: 0.69

Model size: 7.25B params (Safetensors, tensor type FP16)