---
title: MAUVE
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
  MAUVE is a library built on PyTorch and HuggingFace Transformers to measure the
  gap between neural text and human text with the eponymous MAUVE measure.

  MAUVE summarizes both Type I and Type II errors measured softly using
  Kullback–Leibler (KL) divergences.

  For details, see the MAUVE paper: https://arxiv.org/abs/2102.01454 (NeurIPS, 2021).

  This metric is a wrapper around the official implementation of MAUVE:
  https://github.com/krishnap25/mauve
---

# Metric Card for MAUVE

## Metric description

MAUVE is a library built on PyTorch and HuggingFace Transformers to measure the gap between neural text and human text with the eponymous MAUVE measure. It summarizes both Type I and Type II errors measured softly using [Kullback–Leibler (KL) divergences](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence).

This metric is a wrapper around the [official implementation](https://github.com/krishnap25/mauve) of MAUVE. For more details, consult the [MAUVE paper](https://arxiv.org/abs/2102.01454).

## How to use

The metric takes two lists of strings of tokens separated by spaces: one representing `predictions` (i.e. the text generated by the model) and the second representing `references` (a reference text for each prediction):

```python
from evaluate import load
mauve = load('mauve')
predictions = ["hello world", "goodnight moon"]
references = ["hello world", "goodnight moon"]
mauve_results = mauve.compute(predictions=predictions, references=references)
```

It also has several optional arguments:

`num_buckets`: the size of the histogram to quantize P and Q. Options: `auto` (default) or an integer.

`pca_max_data`: the number of data points to use for PCA dimensionality reduction prior to clustering. If `-1` (the default), use all the data.

`kmeans_explained_var`: amount of variance of the data to keep in dimensionality reduction by PCA. The default is `0.9`.

`kmeans_num_redo`: number of times to redo k-means clustering (the best objective is kept). The default is `5`.

`kmeans_max_iter`: maximum number of k-means iterations. The default is `500`.

`featurize_model_name`: name of the model from which features are obtained, one of `gpt2`, `gpt2-medium`, `gpt2-large`, `gpt2-xl`. The default is `gpt2-large`.

`device_id`: device for featurization. Supply a GPU id (e.g. `0` or `3`) to use GPU. If no GPU with this id is found, the metric will use CPU.

`max_text_length`: maximum number of tokens to consider. The default is `1024`.

`divergence_curve_discretization_size`: number of points to consider on the divergence curve. The default is `25`.

`mauve_scaling_factor`: hyperparameter for scaling. The default is `5`.

`verbose`: if `True` (default), running the metric will print running time updates.

`seed`: random seed to initialize k-means cluster assignments, randomly assigned by default.

## Output values

This metric outputs a result object with five fields, accessed as attributes (e.g. `mauve_results.mauve`):

`mauve`: MAUVE score, which ranges between 0 and 1. **Larger** values indicate that P and Q are closer.

`frontier_integral`: Frontier Integral, which ranges between 0 and 1. **Smaller** values indicate that P and Q are closer.

`divergence_curve`: a numpy.ndarray of shape (m, 2); plot it with `matplotlib` to view the divergence curve (see the sketch below).

`p_hist`: a discrete distribution, which is a quantized version of the text distribution `p_text`.
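The following sketch ties the optional arguments and output fields together: it computes MAUVE with a few non-default settings and plots the divergence curve. The specific values (`featurize_model_name="gpt2"`, `device_id=0`, `seed=42`) are illustrative choices, and mapping the first column of `divergence_curve` to the x-axis is an assumption; check the official implementation for the exact axis convention.

```python
import matplotlib.pyplot as plt
from evaluate import load

mauve = load('mauve')
predictions = ["hello world", "goodnight moon"]
references = ["hello there", "general kenobi"]

# Illustrative settings: a smaller featurization model and a fixed seed
# so the k-means cluster assignments are reproducible.
mauve_results = mauve.compute(
    predictions=predictions,
    references=references,
    featurize_model_name="gpt2",  # smaller download than the default gpt2-large
    device_id=0,                  # falls back to CPU if GPU 0 is not found
    seed=42,
)

print(mauve_results.mauve)              # between 0 and 1; larger means closer
print(mauve_results.frontier_integral)  # between 0 and 1; smaller means closer

# divergence_curve is a numpy.ndarray of shape (m, 2): one row per point.
curve = mauve_results.divergence_curve
plt.plot(curve[:, 0], curve[:, 1])
plt.title("MAUVE divergence curve")
plt.show()
```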
`q_hist`: same as above, but with `q_text`.

### Values from popular papers

The [original MAUVE paper](https://arxiv.org/abs/2102.01454) reported values ranging from 0.88 to 0.94 for open-ended text generation using a text completion task in the web text domain. The authors found that bigger models resulted in higher MAUVE scores, and that MAUVE is correlated with human judgments.

## Examples

Perfect match between prediction and reference:

```python
from evaluate import load
mauve = load('mauve')
predictions = ["hello world", "goodnight moon"]
references = ["hello world", "goodnight moon"]
mauve_results = mauve.compute(predictions=predictions, references=references)
print(mauve_results.mauve)
1.0
```

Partial match between prediction and reference:

```python
from evaluate import load
mauve = load('mauve')
predictions = ["hello world", "goodnight moon"]
references = ["hello there", "general kenobi"]
mauve_results = mauve.compute(predictions=predictions, references=references)
print(mauve_results.mauve)
0.27811372536724027
```

## Limitations and bias

The [original MAUVE paper](https://arxiv.org/abs/2102.01454) did not analyze the inductive biases present in different embedding models, but related work has shown that different kinds of biases exist in many popular generative language models, including GPT-2 (see [Kirk et al., 2021](https://arxiv.org/pdf/2102.04130.pdf) and [Abid et al., 2021](https://arxiv.org/abs/2101.05783)). The extent to which these biases can impact the MAUVE score has not been quantified.

Also, calculating the MAUVE metric involves downloading the model from which features are obtained. The default model, `gpt2-large`, takes over 3GB of storage space, and downloading it can take a significant amount of time depending on the speed of your internet connection. If this is an issue, choose a smaller model; for instance, `gpt2` is 523MB.

## Citation

```bibtex
@inproceedings{pillutla-etal:mauve:neurips2021,
  title={MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers},
  author={Pillutla, Krishna and Swayamdipta, Swabha and Zellers, Rowan and Thickstun, John and Welleck, Sean and Choi, Yejin and Harchaoui, Zaid},
  booktitle={NeurIPS},
  year={2021}
}
```

## Further References

- [Official MAUVE implementation](https://github.com/krishnap25/mauve)
- [Hugging Face Tasks - Text Generation](https://huggingface.co/tasks/text-generation)