---
title: MAUVE
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
tags:
- evaluate
- metric
---
# Metric Card for MAUVE
## Metric description
MAUVE is a measure of the gap between neural text and human text, computed with a library built on PyTorch and HuggingFace Transformers. It summarizes both Type I and Type II errors, measured softly using [Kullback–Leibler (KL) divergences](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence).
This metric is a wrapper around the [official implementation](https://github.com/krishnap25/mauve) of MAUVE.
For more details, consult the [MAUVE paper](https://arxiv.org/abs/2102.01454).
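For intuition, a sketch of the definition from the paper: the two KL divergences are computed against mixtures `R_λ` of the human text distribution `P` and the model text distribution `Q`, and the MAUVE score is the area under the resulting divergence curve (here `c` is a scaling constant, exposed below as `mauve_scaling_factor`):
```latex
% Divergence curve traced by mixtures of P (human text) and Q (model text);
% c > 0 is a scaling constant. MAUVE(P, Q) is the area under this curve.
\mathcal{C}(P, Q) = \Big\{ \big( e^{-c \, \mathrm{KL}(Q \,\|\, R_\lambda)},\;
                                 e^{-c \, \mathrm{KL}(P \,\|\, R_\lambda)} \big)
    \,:\, R_\lambda = \lambda P + (1 - \lambda) Q,\ \lambda \in (0, 1) \Big\}
```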
## How to use
The metric takes two lists of strings, with tokens separated by spaces: one representing `predictions` (i.e. the text generated by the model) and the other representing `references` (a reference text for each prediction):
```python
from evaluate import load
mauve = load('mauve')
predictions = ["hello world", "goodnight moon"]
references = ["hello world", "goodnight moon"]
mauve_results = mauve.compute(predictions=predictions, references=references)
```
It also has several optional arguments:
- `num_buckets`: the size of the histogram used to quantize P and Q. Options: `auto` (default) or an integer.
- `pca_max_data`: the number of data points to use for PCA dimensionality reduction prior to clustering. If `-1` (default), use all the data.
- `kmeans_explained_var`: the amount of variance of the data to keep in dimensionality reduction by PCA. The default is `0.9`.
- `kmeans_num_redo`: the number of times to redo k-means clustering (the best objective is kept). The default is `5`.
- `kmeans_max_iter`: the maximum number of k-means iterations. The default is `500`.
- `featurize_model_name`: the name of the model from which features are obtained, one of `gpt2`, `gpt2-medium`, `gpt2-large`, or `gpt2-xl`. The default is `gpt2-large`.
- `device_id`: the device for featurization. Supply a GPU id (e.g. `0` or `3`) to use GPU; if no GPU with that id is found, the metric falls back to CPU.
- `max_text_length`: the maximum number of tokens to consider. The default is `1024`.
- `divergence_curve_discretization_size`: the number of points to consider on the divergence curve. The default is `25`.
- `mauve_scaling_factor`: the hyperparameter for scaling. The default is `5`.
- `verbose`: if `True` (default), running the metric prints running time updates.
- `seed`: the random seed used to initialize k-means cluster assignments; randomly assigned by default.
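For illustration, a minimal sketch passing a few of these options as keyword arguments to `compute` (the values are arbitrary examples, not recommendations):
```python
from evaluate import load

mauve = load('mauve')
predictions = ["hello world", "goodnight moon"]
references = ["hello there", "general kenobi"]

# Use the smaller gpt2 featurizer on GPU 0 (falls back to CPU if that GPU
# is not found), with a fixed seed for reproducible k-means assignments.
mauve_results = mauve.compute(
    predictions=predictions,
    references=references,
    featurize_model_name="gpt2",
    device_id=0,
    seed=42,
    verbose=False,
)
```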
## Output values
This metric outputs a dictionary with five key-value pairs:
- `mauve`: the MAUVE score, which ranges between 0 and 1. **Larger** values indicate that P and Q are closer.
- `frontier_integral`: the Frontier Integral, which ranges between 0 and 1. **Smaller** values indicate that P and Q are closer.
- `divergence_curve`: a `numpy.ndarray` of shape (m, 2); plot it with `matplotlib` to view the divergence curve, as in the sketch after this list.
- `p_hist`: a discrete distribution, which is a quantized version of the text distribution `p_text`.
- `q_hist`: same as above, but for `q_text`.
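As a quick sketch (assuming `matplotlib` is installed), the curve returned by a `compute` call above can be plotted like so:
```python
import matplotlib.pyplot as plt

# `mauve_results` is the object returned by mauve.compute(...) above.
curve = mauve_results.divergence_curve  # numpy array of shape (m, 2)

# Each row is one point on the divergence curve; the MAUVE score is the
# area under this curve (see the MAUVE paper for the exact construction).
plt.plot(curve[:, 0], curve[:, 1])
plt.title("MAUVE divergence curve")
plt.show()
```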
### Values from popular papers
The [original MAUVE paper](https://arxiv.org/abs/2102.01454) reported values ranging from 0.88 to 0.94 for open-ended text generation using a text completion task in the web text domain. The authors found that bigger models resulted in higher MAUVE scores, and that MAUVE is correlated with human judgments.
## Examples
Perfect match between prediction and reference:
```python
from evaluate import load
mauve = load('mauve')
predictions = ["hello world", "goodnight moon"]
references = ["hello world", "goodnight moon"]
mauve_results = mauve.compute(predictions=predictions, references=references)
print(mauve_results.mauve)  # 1.0
```
Partial match between prediction and reference:
```python
from evaluate import load
mauve = load('mauve')
predictions = ["hello world", "goodnight moon"]
references = ["hello there", "general kenobi"]
mauve_results = mauve.compute(predictions=predictions, references=references)
print(mauve_results.mauve)  # 0.27811372536724027
```
## Limitations and bias
The [original MAUVE paper](https://arxiv.org/abs/2102.01454) did not analyze the inductive biases present in different embedding models, but related work has shown different kinds of biases exist in many popular generative language models including GPT-2 (see [Kirk et al., 2021](https://arxiv.org/pdf/2102.04130.pdf), [Abid et al., 2021](https://arxiv.org/abs/2101.05783)). The extent to which these biases can impact the MAUVE score has not been quantified.
Also, calculating the MAUVE metric involves downloading the model from which features are obtained: the default model, `gpt2-large`, takes over 3GB of storage space, and downloading it can take a significant amount of time depending on the speed of your internet connection. If this is an issue, choose a smaller model; for instance, `gpt2` is 523MB.
## Citation
```bibtex
@inproceedings{pillutla-etal:mauve:neurips2021,
title={MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers},
author={Pillutla, Krishna and Swayamdipta, Swabha and Zellers, Rowan and Thickstun, John and Welleck, Sean and Choi, Yejin and Harchaoui, Zaid},
booktitle = {NeurIPS},
year = {2021}
}
```
## Further References
- [Official MAUVE implementation](https://github.com/krishnap25/mauve)
- [Hugging Face Tasks - Text Generation](https://huggingface.co/tasks/text-generation)