---
title: MAUVE
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
tags:
  - evaluate
  - metric
description: >-
  MAUVE is a library built on PyTorch and HuggingFace Transformers to measure
  the gap between neural text and human text with the eponymous MAUVE measure.


  MAUVE summarizes both Type I and Type II errors measured softly using
  Kullback–Leibler (KL) divergences.


  For details, see the MAUVE paper: https://arxiv.org/abs/2102.01454 (NeurIPS,
  2021).


  This metric is a wrapper around the official implementation of MAUVE:

  https://github.com/krishnap25/mauve
---
# Metric Card for MAUVE

## Metric description

MAUVE is a library built on PyTorch and HuggingFace Transformers to measure the gap between neural text and human text with the eponymous MAUVE measure. It summarizes both Type I and Type II errors measured softly using [Kullback–Leibler (KL) divergences](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence).

This metric is a wrapper around the [official implementation](https://github.com/krishnap25/mauve) of MAUVE.

For more details, consult the [MAUVE paper](https://arxiv.org/abs/2102.01454).
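
Concretely, as a sketch of the definition from the paper, write $P$ for the model's text distribution and $Q$ for the human text distribution. For mixtures $R_\lambda = \lambda P + (1 - \lambda) Q$ with $\lambda \in (0, 1)$, the divergence curve is

$$
\mathcal{C}(P, Q) = \left\{ \left( \exp\left(-c \, \mathrm{KL}(Q \,\|\, R_\lambda)\right),\ \exp\left(-c \, \mathrm{KL}(P \,\|\, R_\lambda)\right) \right) : \lambda \in (0, 1) \right\},
$$

and $\mathrm{MAUVE}(P, Q)$ is the area under this curve, a value between 0 and 1. The scaling constant $c > 0$ corresponds to the `mauve_scaling_factor` argument described below.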


## How to use 

The metric takes two lists of strings of space-separated tokens, one representing `predictions` (the text generated by the model) and one representing `references` (a reference text for each prediction):

```python
from evaluate import load
mauve = load('mauve')
predictions = ["hello world", "goodnight moon"]
references = ["hello world",  "goodnight moon"]
mauve_results = mauve.compute(predictions=predictions, references=references) 
```

It also has several optional arguments (a usage sketch follows this list):

`num_buckets`: the size of the histogram to quantize P and Q. Options: `auto` (default) or an integer.

`pca_max_data`: the number of data points to use for PCA dimensionality reduction prior to clustering. If -1, use all the data. The default is `-1`.

`kmeans_explained_var`: amount of variance of the data to keep in dimensionality reduction by PCA. The default is `0.9`.

`kmeans_num_redo`: number of times to redo k-means clustering (the best objective is kept). The default is `5`.

`kmeans_max_iter`: maximum number of k-means iterations. The default is `500`.

`featurize_model_name`: name of the model from which features are obtained, from one of the following: `gpt2`, `gpt2-medium`, `gpt2-large`, `gpt2-xl`. The default is `gpt2-large`.

`device_id`: Device for featurization. Supply a GPU id (e.g. `0` or `3`) to use GPU. If no GPU with this id is found, the metric will use CPU.

`max_text_length`: maximum number of tokens to consider. The default is `1024`.

`divergence_curve_discretization_size`: the number of points to consider on the divergence curve. The default is `25`.

`mauve_scaling_factor`: the scaling constant `c` used when computing the divergences. The default is `5`.

`verbose`: If `True` (default), running the metric will print running time updates.

`seed`: random seed to initialize k-means cluster assignments, randomly assigned by default.
    
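A minimal sketch of passing a few of these optional arguments to `compute` (the argument values below are illustrative, not recommendations):

```python
from evaluate import load

mauve = load('mauve')
predictions = ["hello world", "goodnight moon"]
references = ["hello there", "general kenobi"]

# Illustrative settings: featurize on GPU 0 (falls back to CPU if that GPU is
# not available), fix the k-means seed for reproducibility, and silence the
# progress output.
mauve_results = mauve.compute(
    predictions=predictions,
    references=references,
    device_id=0,
    seed=25,
    verbose=False,
)
print(mauve_results.mauve)
```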


## Output values

This metric outputs a dictionary with 5 key-value pairs:

`mauve`: MAUVE score, which ranges between 0 and 1. **Larger** values indicate that P and Q are closer.

`frontier_integral`: Frontier Integral, which ranges between 0 and 1. **Smaller** values indicate that P and Q are closer.

`divergence_curve`: a numpy.ndarray of shape (m, 2); plot it with `matplotlib` to view the divergence curve (see the sketch after this list).

`p_hist`: a discrete distribution, which is a quantized version of the text distribution `p_text`.
 
`q_hist`: same as above, but with `q_text`.

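A minimal sketch of plotting the divergence curve with `matplotlib`, reusing the `mauve.compute` call from above:

```python
import matplotlib.pyplot as plt
from evaluate import load

mauve = load('mauve')
predictions = ["hello world", "goodnight moon"]
references = ["hello there", "general kenobi"]
mauve_results = mauve.compute(predictions=predictions, references=references)

# Each row of divergence_curve is one point on the divergence frontier
# (soft Type I error vs. soft Type II error); the MAUVE score is the area
# under this curve.
curve = mauve_results.divergence_curve  # numpy.ndarray of shape (m, 2)
plt.plot(curve[:, 0], curve[:, 1])
plt.xlabel("divergence curve, first coordinate")
plt.ylabel("divergence curve, second coordinate")
plt.title(f"MAUVE = {mauve_results.mauve:.3f}")
plt.show()
```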

### Values from popular papers

The [original MAUVE paper](https://arxiv.org/abs/2102.01454) reported values ranging from 0.88 to 0.94 for open-ended text generation using a text completion task in the web text domain. The authors found that bigger models resulted in higher MAUVE scores, and that MAUVE is correlated with human judgments.


## Examples 

Perfect match between prediction and reference:

```python
from evaluate import load
mauve = load('mauve')
predictions = ["hello world", "goodnight moon"]
references = ["hello world",  "goodnight moon"]
mauve_results = mauve.compute(predictions=predictions, references=references) 
print(mauve_results.mauve)
1.0
```

Partial match between prediction and reference:

```python
from evaluate import load
mauve = load('mauve')
predictions = ["hello world", "goodnight moon"]
references = ["hello there", "general kenobi"]
mauve_results = mauve.compute(predictions=predictions, references=references) 
print(mauve_results.mauve)
0.27811372536724027
```

## Limitations and bias

The [original MAUVE paper](https://arxiv.org/abs/2102.01454) did not analyze the inductive biases present in different embedding models, but related work has shown different kinds of biases exist in many popular generative language models including GPT-2 (see [Kirk et al., 2021](https://arxiv.org/pdf/2102.04130.pdf), [Abid et al., 2021](https://arxiv.org/abs/2101.05783)). The extent to which these biases can impact the MAUVE score has not been quantified.

Also, calculating the MAUVE metric involves downloading the model from which features are obtained. The default model, `gpt2-large`, takes over 3GB of storage space, and downloading it can take a significant amount of time depending on the speed of your internet connection. If this is an issue, choose a smaller featurization model; for instance, `gpt2` is only around 500MB.
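
For example, a minimal sketch of switching to the smaller `gpt2` featurizer, reusing the inputs from the examples above:

```python
from evaluate import load

mauve = load('mauve')
predictions = ["hello world", "goodnight moon"]
references = ["hello there", "general kenobi"]

# gpt2 is the smallest supported featurization model, so the download is much
# lighter than the default gpt2-large.
mauve_results = mauve.compute(
    predictions=predictions,
    references=references,
    featurize_model_name="gpt2",
)
```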


## Citation

```bibtex
@inproceedings{pillutla-etal:mauve:neurips2021,
  title     = {MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers},
  author    = {Pillutla, Krishna and Swayamdipta, Swabha and Zellers, Rowan and Thickstun, John and Welleck, Sean and Choi, Yejin and Harchaoui, Zaid},
  booktitle = {NeurIPS},
  year      = {2021}
}
```

## Further References 
- [Official MAUVE implementation](https://github.com/krishnap25/mauve)
- [Hugging Face Tasks - Text Generation](https://huggingface.co/tasks/text-generation)