---
title: MAUVE
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
tags:
- evaluate
- metric
---
# Metric Card for MAUVE
## Metric description
MAUVE is a measure of the gap between neural text and human text, computed with a library built on PyTorch and HuggingFace Transformers. It summarizes both Type I and Type II errors, measured softly using [Kullback–Leibler (KL) divergences](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence).
This metric is a wrapper around the [official implementation](https://github.com/krishnap25/mauve) of MAUVE.
For more details, consult the [MAUVE paper](https://arxiv.org/abs/2102.01454).
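For intuition, a sketch of the definition from the paper: the two KL divergences are computed against mixtures `R_λ` of the human text distribution `P` and the model text distribution `Q`, and the MAUVE score is the area under the resulting divergence curve (here `c` is a scaling constant, exposed below as `mauve_scaling_factor`):
```latex
% Divergence curve traced by mixtures of P (human text) and Q (model text);
% c > 0 is a scaling constant. MAUVE(P, Q) is the area under this curve.
\mathcal{C}(P, Q) = \Big\{ \big( e^{-c \, \mathrm{KL}(Q \,\|\, R_\lambda)},\;
                                 e^{-c \, \mathrm{KL}(P \,\|\, R_\lambda)} \big)
    \,:\, R_\lambda = \lambda P + (1 - \lambda) Q,\ \lambda \in (0, 1) \Big\}
```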
## How to use
The metric takes two lists of strings, with tokens separated by spaces: one representing `predictions` (i.e. the text generated by the model) and the other representing `references` (a reference text for each prediction):
```python
from evaluate import load
mauve = load('mauve')
predictions = ["hello world", "goodnight moon"]
references = ["hello world", "goodnight moon"]
mauve_results = mauve.compute(predictions=predictions, references=references)
```
It also has several optional arguments:
- `num_buckets`: the size of the histogram used to quantize P and Q. Options: `auto` (default) or an integer.
- `pca_max_data`: the number of data points to use for PCA dimensionality reduction prior to clustering. If `-1` (default), use all the data.
- `kmeans_explained_var`: the amount of variance of the data to keep in dimensionality reduction by PCA. The default is `0.9`.
- `kmeans_num_redo`: the number of times to redo k-means clustering (the best objective is kept). The default is `5`.
- `kmeans_max_iter`: the maximum number of k-means iterations. The default is `500`.
- `featurize_model_name`: the name of the model from which features are obtained, one of `gpt2`, `gpt2-medium`, `gpt2-large`, or `gpt2-xl`. The default is `gpt2-large`.
- `device_id`: the device for featurization. Supply a GPU id (e.g. `0` or `3`) to use GPU; if no GPU with that id is found, the metric falls back to CPU.
- `max_text_length`: the maximum number of tokens to consider. The default is `1024`.
- `divergence_curve_discretization_size`: the number of points to consider on the divergence curve. The default is `25`.
- `mauve_scaling_factor`: the hyperparameter for scaling. The default is `5`.
- `verbose`: if `True` (default), running the metric prints running time updates.
- `seed`: the random seed used to initialize k-means cluster assignments; randomly assigned by default.
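For illustration, a minimal sketch passing a few of these options as keyword arguments to `compute` (the values are arbitrary examples, not recommendations):
```python
from evaluate import load

mauve = load('mauve')
predictions = ["hello world", "goodnight moon"]
references = ["hello there", "general kenobi"]

# Use the smaller gpt2 featurizer on GPU 0 (falls back to CPU if that GPU
# is not found), with a fixed seed for reproducible k-means assignments.
mauve_results = mauve.compute(
    predictions=predictions,
    references=references,
    featurize_model_name="gpt2",
    device_id=0,
    seed=42,
    verbose=False,
)
```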
## Output values
This metric outputs a dictionary with five key-value pairs:
- `mauve`: the MAUVE score, which ranges between 0 and 1. **Larger** values indicate that P and Q are closer.
- `frontier_integral`: the Frontier Integral, which ranges between 0 and 1. **Smaller** values indicate that P and Q are closer.
- `divergence_curve`: a `numpy.ndarray` of shape (m, 2); plot it with `matplotlib` to view the divergence curve, as in the sketch after this list.
- `p_hist`: a discrete distribution, which is a quantized version of the text distribution `p_text`.
- `q_hist`: same as above, but for `q_text`.
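As a quick sketch (assuming `matplotlib` is installed), the curve returned by a `compute` call above can be plotted like so:
```python
import matplotlib.pyplot as plt

# `mauve_results` is the object returned by mauve.compute(...) above.
curve = mauve_results.divergence_curve  # numpy array of shape (m, 2)

# Each row is one point on the divergence curve; the MAUVE score is the
# area under this curve (see the MAUVE paper for the exact construction).
plt.plot(curve[:, 0], curve[:, 1])
plt.title("MAUVE divergence curve")
plt.show()
```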
### Values from popular papers
The [original MAUVE paper](https://arxiv.org/abs/2102.01454) reported values ranging from 0.88 to 0.94 for open-ended text generation using a text completion task in the web text domain. The authors found that bigger models resulted in higher MAUVE scores, and that MAUVE is correlated with human judgments.
## Examples
Perfect match between prediction and reference:
```python
from evaluate import load
mauve = load('mauve')
predictions = ["hello world", "goodnight moon"]
references = ["hello world", "goodnight moon"]
mauve_results = mauve.compute(predictions=predictions, references=references)
print(mauve_results.mauve)  # 1.0
```
Partial match between prediction and reference:
```python
from evaluate import load
mauve = load('mauve')
predictions = ["hello world", "goodnight moon"]
references = ["hello there", "general kenobi"]
mauve_results = mauve.compute(predictions=predictions, references=references)
print(mauve_results.mauve)  # 0.27811372536724027
```
## Limitations and bias
The [original MAUVE paper](https://arxiv.org/abs/2102.01454) did not analyze the inductive biases present in different embedding models, but related work has shown different kinds of biases exist in many popular generative language models including GPT-2 (see [Kirk et al., 2021](https://arxiv.org/pdf/2102.04130.pdf), [Abid et al., 2021](https://arxiv.org/abs/2101.05783)). The extent to which these biases can impact the MAUVE score has not been quantified.
Also, calculating the MAUVE metric involves downloading the model from which features are obtained: the default model, `gpt2-large`, takes over 3GB of storage space, and downloading it can take a significant amount of time depending on the speed of your internet connection. If this is an issue, choose a smaller model; for instance, `gpt2` is 523MB.
## Citation
```bibtex
@inproceedings{pillutla-etal:mauve:neurips2021,
title={MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers},
author={Pillutla, Krishna and Swayamdipta, Swabha and Zellers, Rowan and Thickstun, John and Welleck, Sean and Choi, Yejin and Harchaoui, Zaid},
booktitle = {NeurIPS},
year = {2021}
}
```
## Further References
- [Official MAUVE implementation](https://github.com/krishnap25/mauve)
- [Hugging Face Tasks - Text Generation](https://huggingface.co/tasks/text-generation)