Spaces: evaluate-metric/mauve

krishnap25 committed on Nov 7, 2022

Commit b65d16c

1 Parent(s): ed21711

Update the description in README

Files changed (1)

README.md +15 -6

@@ -11,11 +11,13 @@ tags:
 - evaluate
 - metric
 description: >-
-  MAUVE is a library built on PyTorch and HuggingFace Transformers to measure the gap between neural text and human text with the eponymous MAUVE measure.
-  MAUVE summarizes both Type I and Type II errors measured softly using Kullback–Leibler (KL) divergences.
-  For details, see the MAUVE paper: https://arxiv.org/abs/2102.01454 (Neurips, 2021).
   This metrics is a wrapper around the official implementation of MAUVE:
   https://github.com/krishnap25/mauve
@@ -25,7 +27,7 @@ description: >-
 ## Metric description
-MAUVE is a library built on PyTorch and HuggingFace Transformers to measure the gap between neural text and human text with the eponymous MAUVE measure. It summarizes both Type I and Type II errors measured softly using [Kullback–Leibler (KL) divergences](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence).
 This metric is a wrapper around the [official implementation](https://github.com/krishnap25/mauve) of MAUVE.
@@ -69,7 +71,6 @@ It also has several optional arguments:
 `verbose`: If `True` (default), running the metric will print running time updates.
 `seed`: random seed to initialize k-means cluster assignments, randomly assigned by default.
 ## Output values
@@ -89,7 +90,15 @@ This metric outputs a dictionary with 5 key-value pairs:
 ### Values from popular papers
-The [original MAUVE paper](https://arxiv.org/abs/2102.01454) reported values ranging from 0.88 to 0.94 for open-ended text generation using a text completion task in the web text domain. The authors found that bigger models resulted in higher MAUVE scores, and that MAUVE is correlated with human judgments.
 ## Examples

 - evaluate
 - metric
 description: >-
+  MAUVE is a measure of the gap between two text distributions, e.g., how far the text written by a model is from the distribution of human text, using samples from both distributions.
+  MAUVE takes values between 0 (completely different distributions) and 1 (identical distributions).
+  MAUVE is obtained by computing Kullback–Leibler (KL) divergences between the two distributions in a quantized embedding space of a large language model. It can quantify differences in the quality of generated text based on the size of the model, the decoding algorithm, and the length of the generated text. MAUVE was found to correlate more strongly with human evaluations than baseline metrics for open-ended text generation.
+  For details, see the MAUVE paper: https://arxiv.org/abs/2102.01454 (NeurIPS, 2021).
   This metrics is a wrapper around the official implementation of MAUVE:
   https://github.com/krishnap25/mauve
 ## Metric description
+MAUVE is a measure of the gap between neural text and human text. It is computed using the [Kullback–Leibler (KL) divergences](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) between the two distributions of text in a quantized embedding space of a large language model. MAUVE can identify differences in quality arising from model sizes and decoding algorithms.
 This metric is a wrapper around the [official implementation](https://github.com/krishnap25/mauve) of MAUVE.
 `verbose`: If `True` (default), running the metric will print running time updates.
 `seed`: random seed to initialize k-means cluster assignments, randomly assigned by default.
 ## Output values
 ### Values from popular papers
+The [original MAUVE paper](https://arxiv.org/abs/2102.01454) reported values ranging from 0.88 to 0.94 for open-ended text generation using a text completion task in the web text domain (computed using 5000 continuations, each 1024 tokens long, with default hyperparameters). The authors found that bigger models generally resulted in higher MAUVE scores, and that MAUVE is correlated with human judgments.
+### Best practices
+It is a good idea to use at least 500-1000 samples for each distribution to compute MAUVE.
+MAUVE is unable to identify very small differences between different settings of generation (e.g., between top-p sampling with p=0.95 versus 0.96). It is important, therefore, to account for the randomness inside the generation (e.g., due to sampling) and within the MAUVE estimation procedure (see the `seed` parameter above).
+It is a good idea to obtain generations using multiple random seeds and/or to rerun MAUVE with multiple values of the `seed` parameter.
 ## Examples
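
The "Best practices" lines added in this commit recommend rerunning MAUVE with several values of `seed` instead of trusting a single score. A minimal sketch of that idea follows. The helper name `mauve_over_seeds` and its `score_fn` argument are hypothetical and not part of the metric. The commented usage assumes the metric is loaded with `evaluate.load("mauve")` and that the result exposes a `.mauve` attribute, as in the official implementation.

```python
from statistics import mean, stdev
from typing import Callable, Iterable


def mauve_over_seeds(score_fn: Callable[[int], float], seeds: Iterable[int]) -> dict:
    """Call `score_fn` once per seed and summarize the spread of the scores.

    `score_fn` is a hypothetical callable that maps a seed to one MAUVE score.
    """
    scores = [float(score_fn(seed)) for seed in seeds]
    if not scores:
        raise ValueError("at least one seed is required")
    return {
        "mean": mean(scores),
        # The sample standard deviation needs at least two scores.
        "std": stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }


# Assumed usage with the metric from this Space. It is left commented out
# because it downloads a featurizer model; `preds` and `refs` are placeholder
# lists of strings.
#
#   import evaluate
#   mauve = evaluate.load("mauve")
#   summary = mauve_over_seeds(
#       lambda s: mauve.compute(predictions=preds, references=refs, seed=s).mauve,
#       seeds=range(5),
#   )
```

If the difference between two generation settings is smaller than the reported `std`, it is probably within the noise of the estimation procedure.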