lvwerra HF staff committed on
Commit
41fbf77
1 Parent(s): 3e9d8c5

Update Space (evaluate main: 19f1f9a1)

Files changed (2)
  1. README.md +25 -22
  2. requirements.txt +1 -1
README.md CHANGED
@@ -1,6 +1,6 @@
  ---
  title: MAUVE
- emoji: 🤗
  colorFrom: blue
  colorTo: red
  sdk: gradio
@@ -11,28 +11,23 @@ tags:
  - evaluate
  - metric
  description: >-
- MAUVE is a library built on PyTorch and HuggingFace Transformers to measure the gap between neural text and human text with the eponymous MAUVE measure.
-
- MAUVE summarizes both Type I and Type II errors measured softly using Kullback–Leibler (KL) divergences.
-
- For details, see the MAUVE paper: https://arxiv.org/abs/2102.01454 (Neurips, 2021).
-
- This metrics is a wrapper around the official implementation of MAUVE:
- https://github.com/krishnap25/mauve
  ---

  # Metric Card for MAUVE

  ## Metric description

- MAUVE is a library built on PyTorch and HuggingFace Transformers to measure the gap between neural text and human text with the eponymous MAUVE measure. It summarizes both Type I and Type II errors measured softly using [Kullback–Leibler (KL) divergences](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence).

  This metric is a wrapper around the [official implementation](https://github.com/krishnap25/mauve) of MAUVE.

  For more details, consult the [MAUVE paper](https://arxiv.org/abs/2102.01454).

-
- ## How to use

  The metric takes two lists of strings of tokens separated by spaces: one representing `predictions` (i.e. the text generated by the model) and the second representing `references` (a reference text for each prediction):

@@ -41,16 +36,16 @@ from evaluate import load
  mauve = load('mauve')
  predictions = ["hello world", "goodnight moon"]
  references = ["hello world", "goodnight moon"]
- mauve_results = mauve.compute(predictions=predictions, references=references)
  ```

  It also has several optional arguments:

  `num_buckets`: the size of the histogram to quantize P and Q. Options: `auto` (default) or an integer.

- `pca_max_data`: the number data points to use for PCA dimensionality reduction prior to clustering. If -1, use all the data. The default is `-1`.

- `kmeans_explained_var`: amount of variance of the data to keep in dimensionality reduction by PCA. The default is `0.9`.

  `kmeans_num_redo`: number of times to redo k-means clustering (the best objective is kept). The default is `5`.

@@ -89,10 +84,10 @@ This metric outputs a dictionary with 5 key-value pairs:

  ### Values from popular papers

- The [original MAUVE paper](https://arxiv.org/abs/2102.01454) reported values ranging from 0.88 to 0.94 for open-ended text generation using a text completion task in the web text domain. The authors found that bigger models resulted in higher MAUVE scores, and that MAUVE is correlated with human judgments.


- ## Examples

  Perfect match between prediction and reference:

@@ -101,7 +96,7 @@ from evaluate import load
  mauve = load('mauve')
  predictions = ["hello world", "goodnight moon"]
  references = ["hello world", "goodnight moon"]
- mauve_results = mauve.compute(predictions=predictions, references=references)
  print(mauve_results.mauve)
  1.0
  ```
@@ -113,7 +108,7 @@ from evaluate import load
  mauve = load('mauve')
  predictions = ["hello world", "goodnight moon"]
  references = ["hello there", "general kenobi"]
- mauve_results = mauve.compute(predictions=predictions, references=references)
  print(mauve_results.mauve)
  0.27811372536724027
  ```
@@ -122,7 +117,15 @@ print(mauve_results.mauve)

  The [original MAUVE paper](https://arxiv.org/abs/2102.01454) did not analyze the inductive biases present in different embedding models, but related work has shown different kinds of biases exist in many popular generative language models including GPT-2 (see [Kirk et al., 2021](https://arxiv.org/pdf/2102.04130.pdf), [Abid et al., 2021](https://arxiv.org/abs/2101.05783)). The extent to which these biases can impact the MAUVE score has not been quantified.

- Also, calculating the MAUVE metric involves downloading the model from which features are obtained -- the default model, `gpt2-large`, takes over 3GB of storage space and downloading it can take a significant amount of time depending on the speed of your internet connection. If this is an issue, choose a smaller model; for instance `gpt` is 523MB.


  ## Citation
@@ -132,10 +135,10 @@ Also, calculating the MAUVE metric involves downloading the model from which fea
  title={MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers},
  author={Pillutla, Krishna and Swayamdipta, Swabha and Zellers, Rowan and Thickstun, John and Welleck, Sean and Choi, Yejin and Harchaoui, Zaid},
  booktitle = {NeurIPS},
- year = {2021}
  }
  ```

- ## Further References
  - [Official MAUVE implementation](https://github.com/krishnap25/mauve)
  - [Hugging Face Tasks - Text Generation](https://huggingface.co/tasks/text-generation)
  ---
  title: MAUVE
+ emoji: 🤗
  colorFrom: blue
  colorTo: red
  sdk: gradio

  - evaluate
  - metric
  description: >-
+ MAUVE is a measure of the statistical gap between two text distributions, e.g., how far the text written by a model is from the distribution of human text, using samples from both distributions.
+
+ MAUVE is obtained by computing Kullback–Leibler (KL) divergences between the two distributions in a quantized embedding space of a large language model. It can quantify differences in the quality of generated text based on the size of the model, the decoding algorithm, and the length of the generated text. MAUVE was found to correlate the strongest with human evaluations over baseline metrics for open-ended text generation.
+
  ---

  # Metric Card for MAUVE

  ## Metric description

+ MAUVE is a measure of the gap between neural text and human text. It is computed using the [Kullback–Leibler (KL) divergences](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) between the two distributions of text in a quantized embedding space of a large language model. MAUVE can identify differences in quality arising from model sizes and decoding algorithms.
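The quantized-KL idea can be illustrated with a toy sketch. This is not the `mauve-text` pipeline (which embeds text with a language model, reduces dimension with PCA, and quantizes with k-means); it only shows KL divergences between histograms over shared buckets, with all names and numbers invented for illustration:

```python
import math
import random

def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) for two discrete distributions given as bucket counts or weights."""
    p = [x + eps for x in p]  # smoothing so empty buckets do not blow up the log
    q = [x + eps for x in q]
    zp, zq = sum(p), sum(q)
    return sum((pi / zp) * math.log((pi / zp) / (qi / zq)) for pi, qi in zip(p, q))

def histogram(samples, lo, hi, num_buckets):
    """Counts of samples per equal-width bucket on [lo, hi]."""
    counts = [0] * num_buckets
    width = (hi - lo) / num_buckets
    for x in samples:
        counts[min(int((x - lo) / width), num_buckets - 1)] += 1
    return counts

def toy_gap(human, model, num_buckets=8):
    """Quantize 1-D stand-in 'features' into shared buckets and compare histograms."""
    lo, hi = min(human + model), max(human + model)
    p = histogram(human, lo, hi, num_buckets)
    q = histogram(model, lo, hi, num_buckets)
    # MAUVE builds a divergence frontier from mixtures of P and Q;
    # here we just report both directions of the KL divergence.
    return kl_divergence(p, q), kl_divergence(q, p)

rng = random.Random(0)
human = [rng.gauss(0.0, 1.0) for _ in range(1000)]
close = [rng.gauss(0.1, 1.0) for _ in range(1000)]  # "model" near the human distribution
far = [rng.gauss(2.0, 1.0) for _ in range(1000)]    # "model" far from it
```

With these toy features, both KL directions returned by `toy_gap(human, far)` come out much larger than those of `toy_gap(human, close)`, which is the sense in which a larger statistical gap yields a smaller MAUVE score.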

  This metric is a wrapper around the [official implementation](https://github.com/krishnap25/mauve) of MAUVE.

  For more details, consult the [MAUVE paper](https://arxiv.org/abs/2102.01454).

+ ## How to use

  The metric takes two lists of strings of tokens separated by spaces: one representing `predictions` (i.e. the text generated by the model) and the second representing `references` (a reference text for each prediction):

  mauve = load('mauve')
  predictions = ["hello world", "goodnight moon"]
  references = ["hello world", "goodnight moon"]
+ mauve_results = mauve.compute(predictions=predictions, references=references)
  ```

  It also has several optional arguments:

  `num_buckets`: the size of the histogram to quantize P and Q. Options: `auto` (default) or an integer.

+ `pca_max_data`: the number of data points to use for PCA dimensionality reduction prior to clustering. If -1, use all the data. The default is `-1`.

+ `kmeans_explained_var`: the amount of variance of the data to keep in dimensionality reduction by PCA. The default is `0.9`.

  `kmeans_num_redo`: number of times to redo k-means clustering (the best objective is kept). The default is `5`.

  ### Values from popular papers

+ The [original MAUVE paper](https://arxiv.org/abs/2102.01454) reported values ranging from 0.88 to 0.94 for open-ended text generation using a text completion task in the web text domain. The authors found that bigger models resulted in higher MAUVE scores and that MAUVE is correlated with human judgments.


+ ## Examples

  Perfect match between prediction and reference:

  mauve = load('mauve')
  predictions = ["hello world", "goodnight moon"]
  references = ["hello world", "goodnight moon"]
+ mauve_results = mauve.compute(predictions=predictions, references=references)
  print(mauve_results.mauve)
  1.0
  ```

  mauve = load('mauve')
  predictions = ["hello world", "goodnight moon"]
  references = ["hello there", "general kenobi"]
+ mauve_results = mauve.compute(predictions=predictions, references=references)
  print(mauve_results.mauve)
  0.27811372536724027
  ```

  The [original MAUVE paper](https://arxiv.org/abs/2102.01454) did not analyze the inductive biases present in different embedding models, but related work has shown different kinds of biases exist in many popular generative language models including GPT-2 (see [Kirk et al., 2021](https://arxiv.org/pdf/2102.04130.pdf), [Abid et al., 2021](https://arxiv.org/abs/2101.05783)). The extent to which these biases can impact the MAUVE score has not been quantified.

+ Also, calculating the MAUVE metric involves downloading the model from which features are obtained -- the default model, `gpt2-large`, takes over 3GB of storage space and downloading it can take a significant amount of time depending on the speed of your internet connection. If this is an issue, choose a smaller model; for instance, `gpt` is 523MB.
+
+ It is a good idea to use at least 1000 samples for each distribution to compute MAUVE (the original paper uses 5000).
+
+ MAUVE is unable to identify very small differences between different settings of generation (e.g., between top-p sampling with p=0.95 versus 0.96). It is important, therefore, to account for the randomness inside the generation (e.g., due to sampling) and within the MAUVE estimation procedure (see the `seed` parameter above). Concretely, it is a good idea to obtain generations using multiple random seeds and/or to rerun MAUVE with multiple values of the parameter `seed`.
+
+ For MAUVE to be large, the model distribution must be close to the human text distribution as seen by the embeddings. It is possible to have high-quality model text that still has a small MAUVE score (i.e., large gap) if it contains text about different topics/subjects, or uses a different writing style or vocabulary, or contains texts of a different length distribution. MAUVE summarizes the statistical gap (as measured by the large language model embeddings) -- this includes all these factors in addition to the quality-related aspects such as grammaticality.
+
+ See the [official implementation](https://github.com/krishnap25/mauve#best-practices-for-mauve) for more details about best practices.
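The multiple-seed advice can be sketched as follows. Because actually calling `mauve.compute` would require downloading an embedding model, this runnable sketch swaps in a hypothetical `noisy_score` stand-in (invented for illustration) for a seeded, randomized metric:

```python
import random
import statistics

def noisy_score(seed):
    """Hypothetical stand-in for a seeded metric call such as
    mauve.compute(predictions=..., references=..., seed=seed)."""
    rng = random.Random(seed)
    return 0.9 + rng.uniform(-0.02, 0.02)  # invented score with run-to-run jitter

def score_with_spread(seeds):
    """Rerun the metric under several seeds and report mean and spread."""
    scores = [noisy_score(s) for s in seeds]
    return statistics.mean(scores), statistics.stdev(scores)

mean, spread = score_with_spread(seeds=[0, 1, 2, 3, 4])
# Report mean +/- spread; score differences smaller than the spread
# between two generation settings are not meaningful.
```

The same pattern applies to regenerating the model text itself under different sampling seeds and rescoring each batch.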

  ## Citation

  title={MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers},
  author={Pillutla, Krishna and Swayamdipta, Swabha and Zellers, Rowan and Thickstun, John and Welleck, Sean and Choi, Yejin and Harchaoui, Zaid},
  booktitle = {NeurIPS},
+ year = {2021}
  }
  ```

+ ## Further References
  - [Official MAUVE implementation](https://github.com/krishnap25/mauve)
  - [Hugging Face Tasks - Text Generation](https://huggingface.co/tasks/text-generation)
requirements.txt CHANGED
@@ -1,4 +1,4 @@
- git+https://github.com/huggingface/evaluate@28144191c9b78b67c15f50527b55a18d9cb6e1e6
  faiss-cpu
  scikit-learn
  mauve-text

+ git+https://github.com/huggingface/evaluate@19f1f9a1e76aa7aa3c8ee50022a33a200a3467b0
  faiss-cpu
  scikit-learn
  mauve-text