---
title: MAUVE
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
tags:
- evaluate
- metric
---
# Metric Card for MAUVE

## Metric description

MAUVE is a library built on PyTorch and HuggingFace Transformers to measure the gap between neural text and human text with the eponymous MAUVE measure. It summarizes both Type I and Type II errors measured softly using [Kullback–Leibler (KL) divergences](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence).

This metric is a wrapper around the [official implementation](https://github.com/krishnap25/mauve) of MAUVE. For more details, consult the [MAUVE paper](https://arxiv.org/abs/2102.01454).
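MAUVE's soft Type I and Type II errors are built from KL divergences between (mixtures of) the two text distributions. As a minimal, self-contained sketch of the underlying quantity (not of MAUVE itself), here is the KL divergence between two discrete distributions:

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P || Q) for discrete distributions given as probability vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Only terms with p > 0 contribute; q must be positive wherever p is.
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0 for identical distributions
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))  # positive when they differ
```

KL divergence is asymmetric and unbounded, which is why MAUVE measures it against mixtures of the two distributions and summarizes the result on a bounded curve.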
## How to use

The metric takes two lists of strings of tokens separated by spaces: one representing `predictions` (i.e. the text generated by the model) and the second representing `references` (a reference text for each prediction):

```python
from evaluate import load
mauve = load('mauve')
predictions = ["hello world", "goodnight moon"]
references = ["hello world", "goodnight moon"]
mauve_results = mauve.compute(predictions=predictions, references=references)
```
It also has several optional arguments:

- `num_buckets`: the size of the histogram used to quantize P and Q. Options: `auto` (default) or an integer.
- `pca_max_data`: the number of data points to use for PCA dimensionality reduction prior to clustering. If `-1` (default), use all the data.
- `kmeans_explained_var`: the amount of variance of the data to keep in dimensionality reduction by PCA. The default is `0.9`.
- `kmeans_num_redo`: the number of times to redo k-means clustering (the best objective is kept). The default is `5`.
- `kmeans_max_iter`: the maximum number of k-means iterations. The default is `500`.
- `featurize_model_name`: the name of the model from which features are obtained, one of `gpt2`, `gpt2-medium`, `gpt2-large`, or `gpt2-xl`. The default is `gpt2-large`.
- `device_id`: the device for featurization. Supply a GPU id (e.g. `0` or `3`) to use a GPU; if no GPU with that id is found, the metric falls back to CPU.
- `max_text_length`: the maximum number of tokens to consider. The default is `1024`.
- `divergence_curve_discretization_size`: the number of points to consider on the divergence curve. The default is `25`.
- `mauve_scaling_factor`: a hyperparameter for scaling. The default is `5`.
- `verbose`: if `True` (default), running the metric prints running-time updates.
- `seed`: the random seed used to initialize k-means cluster assignments; randomly assigned by default.
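To keep a call with many of these options readable, they can be collected into a keyword dictionary first. The helper below is hypothetical (not part of `evaluate` or the `mauve` library); its argument names and defaults mirror the list above:

```python
# Hypothetical helper: bundles the optional arguments listed above
# (with their documented defaults) for a mauve.compute(...) call.
def mauve_options(num_buckets="auto",
                  pca_max_data=-1,
                  kmeans_explained_var=0.9,
                  kmeans_num_redo=5,
                  kmeans_max_iter=500,
                  featurize_model_name="gpt2-large",
                  max_text_length=1024,
                  divergence_curve_discretization_size=25,
                  mauve_scaling_factor=5,
                  verbose=True):
    return locals()

# Example: a lighter featurizer and shorter texts.
opts = mauve_options(featurize_model_name="gpt2", max_text_length=256)
# mauve_results = mauve.compute(predictions=predictions,
#                               references=references, **opts)
```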
## Output values

This metric outputs a dictionary with 5 key-value pairs:

- `mauve`: the MAUVE score, which ranges between 0 and 1. **Larger** values indicate that P and Q are closer.
- `frontier_integral`: the Frontier Integral, which ranges between 0 and 1. **Smaller** values indicate that P and Q are closer.
- `divergence_curve`: a `numpy.ndarray` of shape `(m, 2)`; plot it with `matplotlib` to view the divergence curve.
- `p_hist`: a discrete distribution, which is a quantized version of the text distribution `p_text`.
- `q_hist`: same as above, but for `q_text`.
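The divergence curve summarizes the trade-off between the two soft errors, and the MAUVE score corresponds to the area under this curve. A sketch using a synthetic stand-in for `mauve_results.divergence_curve` (real values come from `compute()`):

```python
import numpy as np

# Synthetic stand-in for mauve_results.divergence_curve: an (m, 2) array
# of points tracing the divergence frontier.
curve = np.array([[0.00, 1.0],
                  [0.25, 0.8],
                  [0.50, 0.5],
                  [0.75, 0.2],
                  [1.00, 0.0]])
x, y = curve[:, 0], curve[:, 1]

# Trapezoidal area under the curve (what the MAUVE score summarizes).
area = float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))
print(area)  # ~0.5 for this synthetic curve

# To visualize instead: plt.plot(x, y) with matplotlib.
```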
### Values from popular papers

The [original MAUVE paper](https://arxiv.org/abs/2102.01454) reported values ranging from 0.88 to 0.94 for open-ended text generation using a text completion task in the web text domain. The authors found that larger models resulted in higher MAUVE scores, and that MAUVE is correlated with human judgments.
## Examples

Perfect match between prediction and reference:

```python
from evaluate import load
mauve = load('mauve')
predictions = ["hello world", "goodnight moon"]
references = ["hello world", "goodnight moon"]
mauve_results = mauve.compute(predictions=predictions, references=references)
print(mauve_results.mauve)
1.0
```
Partial match between prediction and reference:

```python
from evaluate import load
mauve = load('mauve')
predictions = ["hello world", "goodnight moon"]
references = ["hello there", "general kenobi"]
mauve_results = mauve.compute(predictions=predictions, references=references)
print(mauve_results.mauve)
0.27811372536724027
```
## Limitations and bias

The [original MAUVE paper](https://arxiv.org/abs/2102.01454) did not analyze the inductive biases present in different embedding models, but related work has shown that different kinds of biases exist in many popular generative language models, including GPT-2 (see [Kirk et al., 2021](https://arxiv.org/pdf/2102.04130.pdf), [Abid et al., 2021](https://arxiv.org/abs/2101.05783)). The extent to which these biases can impact the MAUVE score has not been quantified.

Also, calculating the MAUVE metric involves downloading the model from which features are obtained: the default model, `gpt2-large`, takes over 3GB of storage space, and downloading it can take a significant amount of time depending on the speed of your internet connection. If this is an issue, choose a smaller model; for instance, `gpt2` is 523MB.
## Citation

```bibtex
@inproceedings{pillutla-etal:mauve:neurips2021,
  title     = {MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers},
  author    = {Pillutla, Krishna and Swayamdipta, Swabha and Zellers, Rowan and Thickstun, John and Welleck, Sean and Choi, Yejin and Harchaoui, Zaid},
  booktitle = {NeurIPS},
  year      = {2021}
}
```
## Further References | |
- [Official MAUVE implementation](https://github.com/krishnap25/mauve) | |
- [Hugging Face Tasks - Text Generation](https://huggingface.co/tasks/text-generation) | |