yu-val-weiss committed
Commit 17ddf40 · 1 Parent(s): 2338f58

update documentation

Files changed (2):
  1. README.md +54 -35
  2. blimp.py +0 -2
README.md CHANGED
@@ -45,53 +45,72 @@ results = blimp.compute(model_id='pico-lm/pico-decoder')
 
 - **model_id** (str): model used for calculating BLiMP.
 - **batch_size** (int): the batch size to run texts through the model. Defaults to 16.
+- **predictions** (list[str]): names of the BLiMP sub-datasets to run. Pass an empty list or `["*"]` to run all of them.
 - **device** (str): device to run on, defaults to `cuda` when available
+- **samples_per_set** (int): the number of samples per sub-dataset, defaults to 1_000. Maximum 1_000 (enforced with a `min` call).
 
 ### Output Values
 
-This metric outputs a dictionary with the BLiMP scores for each subdataset.
-If one of the input texts is longer than the max input length of the model, then it is truncated to the max length for the perplexity computation.
+This metric outputs a dictionary containing the BLiMP scores for each of the 67 sub-datasets, as well as the overall accuracy.
 
-```
-{'perplexities': [8.182524681091309, 33.42122268676758, 27.012239456176758], 'mean_perplexity': 22.871995608011883}
-```
-
-The range of this metric is [0, inf). A lower score is better.
-
-### Examples
+An LM’s overall accuracy on BLiMP is simply the proportion of the 67,000 minimal pairs in which the model assigns a higher probability to the acceptable sentence.
 
-Calculating perplexity on predictions defined here:
+Each score is in `[0,1]`. A **higher** score is better.
 
 ```python
-perplexity = evaluate.load("perplexity", module_type="metric")
-input_texts = ["lorem ipsum", "Happy Birthday!", "Bienvenue"]
-results = perplexity.compute(model_id='gpt2',
-                             add_start_token=False,
-                             predictions=input_texts)
-print(list(results.keys()))
->>>['perplexities', 'mean_perplexity']
-print(round(results["mean_perplexity"], 2))
->>>646.75
-print(round(results["perplexities"][0], 2))
->>>32.25
+{
+    "accuracy": 0.621288127211,
+    "by_uid": {
+        "adjunct_island": 0.12761212512,  # rest of sub-datasets...
+    },
+    "by_phenomenon": {
+        "anaphor_agreement": 0.71287512125,  # rest of phenomena...
+    },
+}
 ```
 
-Calculating perplexity on predictions loaded in from a dataset:
+### Examples
+
+Calculating BLiMP scores for a small language model:
 
 ```python
-perplexity = evaluate.load("perplexity", module_type="metric")
-input_texts = datasets.load_dataset("wikitext",
-                                    "wikitext-2-raw-v1",
-                                    split="test")["text"][:50]
-input_texts = [s for s in input_texts if s!='']
-results = perplexity.compute(model_id='gpt2',
-                             predictions=input_texts)
-print(list(results.keys()))
->>>['perplexities', 'mean_perplexity']
-print(round(results["mean_perplexity"], 2))
->>>576.76
-print(round(results["perplexities"][0], 2))
->>>889.28
+def check_blimp():
+    # Load the metric
+    blimp = load("pico-lm/blimp")
+
+    # example with a small language model
+    results = blimp.compute(
+        model_id="distilgpt2",
+        batch_size=16,
+        predictions=["*"],
+    )
+
+    # Print results
+    print("Overall accuracy:", results["accuracy"])
+    >>> Overall accuracy: 0.5035074626865672
+    print("Top 5 best performing uids:")
+    sorted_results = sorted(results["by_uid"].items(), key=lambda x: x[1], reverse=True)
+    for uid, accuracy in sorted_results[:5]:
+        print(f"{uid}: {accuracy:.3f}")
+    >>> Top 5 best performing uids:
+    >>> anaphor_number_agreement: 0.919
+    >>> anaphor_gender_agreement: 0.868
+    >>> matrix_question_npi_licensor_present: 0.840
+    >>> wh_vs_that_no_gap: 0.787
+    >>> sentential_negation_npi_licensor_present: 0.729
+
+    print("Top 5 best performing phenomena:")
+    sorted_results = sorted(
+        results["by_phenomenon"].items(), key=lambda x: x[1], reverse=True
+    )
+    for phenomenon, accuracy in sorted_results[:5]:
+        print(f"{phenomenon}: {accuracy:.3f}")
+    >>> Top 5 best performing phenomena:
+    >>> anaphor_agreement: 0.893
+    >>> argument_structure: 0.597
+    >>> npi_licensing: 0.579
+    >>> filler_gap_dependency: 0.561
+    >>> control_raising: 0.533
 ```
 
 ## Citation
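The example added above calls `load("pico-lm/blimp")` without showing the import, and it does not exercise the new `samples_per_set` argument. Below is a minimal sketch of the documented `compute` signature, assuming `load` comes from the Hugging Face `evaluate` library and that `predictions` also accepts a subset of sub-dataset UIDs (the README itself only shows `[]` and `["*"]`); the UIDs and the `samples_per_set` value are illustrative:

```python
from evaluate import load

# Load the metric from the Hub.
blimp = load("pico-lm/blimp")

# Run only two sub-datasets with a reduced per-set sample budget.
results = blimp.compute(
    model_id="distilgpt2",
    batch_size=16,
    predictions=["adjunct_island", "anaphor_number_agreement"],
    samples_per_set=500,  # capped at 1_000 by the metric
    device="cpu",
)

print(results["accuracy"])       # overall accuracy (here, over the selected sub-datasets)
print(results["by_uid"])         # per-sub-dataset accuracies
print(results["by_phenomenon"])  # accuracies grouped by linguistic phenomenon
```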
blimp.py CHANGED
@@ -128,8 +128,6 @@ Args:
 Returns:
     blimp: dictionary containing the blimp scores for each of the 67 sub-datasets, as well as the overall accuracy.
         An LM’s overall accuracy on BLiMP is simply the proportion of the 67,000 minimal pairs in which the model assigns a higher probability to the acceptable sentence.
-Examples:
-    TODO: examples.
 """
 
 
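The docstring above defines overall accuracy as the proportion of minimal pairs in which the model assigns a higher probability to the acceptable sentence. The pair-level comparison behind that number can be sketched as follows with `transformers`; this illustrates the scoring rule only, it is not the repository's implementation, and the minimal pair shown is just an example:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
model.eval()


def sentence_logprob(sentence: str) -> float:
    """Total log-probability of `sentence` under the causal LM."""
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # out.loss is the mean negative log-likelihood over the predicted tokens,
    # so multiply back by their count to recover the summed log-probability.
    n_predicted = enc["input_ids"].shape[1] - 1
    return -out.loss.item() * n_predicted


# Illustrative minimal pair (acceptable vs. unacceptable sentence).
good = "These casseroles disgust Kayla."
bad = "These casseroles disgusts Kayla."

# The pair counts toward accuracy when the acceptable sentence scores higher.
print(sentence_logprob(good) > sentence_logprob(bad))
```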