yu-val-weiss committed
Commit 17ddf40 · 1 Parent(s): 2338f58

update documentation
README.md
CHANGED
@@ -45,53 +45,72 @@ results = blimp.compute(model_id='pico-lm/pico-decoder')
 
 - **model_id** (str): model used for calculating BLiMP.
 - **batch_size** (int): the batch size to run texts through the model. Defaults to 16.
+- **predictions** (list[str]): names of the metrics to run. Pass an empty list or `["*"]` to run all of them.
 - **device** (str): device to run on, defaults to `cuda` when available
+- **samples_per_set** (int): the number of samples per metric. Defaults to 1_000, which is also the maximum (enforced with a `min` call).
 
 ### Output Values
 
-This metric outputs a dictionary
-If one of the input texts is longer than the max input length of the model, then it is truncated to the max length for the perplexity computation.
+This metric outputs a dictionary containing the BLiMP scores for each of the 67 sub-datasets, as well as the overall accuracy.
 
-{'perplexities': [8.182524681091309, 33.42122268676758, 27.012239456176758], 'mean_perplexity': 22.871995608011883}
-```
-The range of this metric is [0, inf). A lower score is better.
-### Examples
+An LM’s overall accuracy on BLiMP is simply the proportion of the 67,000 minimal pairs in which the model assigns a higher probability to the acceptable sentence.
 
+Each score is in `[0, 1]`. A **higher** score is better.
 
 ```python
-print(round(results["perplexities"][0], 2))
->>>32.25
+{
+    "accuracy": 0.621288127211,
+    "by_uid": {
+        "adjunct_island": 0.12761212512,  # rest of sub-datasets...
+    },
+    "by_phenomenon": {
+        "anaphor_agreement": 0.71287512125,  # rest of phenomena...
+    },
+}
 ```
 
+### Examples
+
+Calculating BLiMP on the predictions defined below:
 
 ```python
-results =
+def check_blimp():
+    # Load the metric
+    blimp = load("pico-lm/blimp")
+
+    # example with a small language model
+    results = blimp.compute(
+        model_id="distilgpt2",
+        batch_size=16,
+        predictions=["*"],
+    )
+
+    # Print results
+    print("Overall accuracy:", results["accuracy"])
+    >>> Overall accuracy: 0.5035074626865672
+    print("Top 5 best performing uids:")
+    sorted_results = sorted(results["by_uid"].items(), key=lambda x: x[1], reverse=True)
+    for phenomenon, accuracy in sorted_results[:5]:
+        print(f"{phenomenon}: {accuracy:.3f}")
+    >>> Top 5 best performing uids:
+    >>> anaphor_number_agreement: 0.919
+    >>> anaphor_gender_agreement: 0.868
+    >>> matrix_question_npi_licensor_present: 0.840
+    >>> wh_vs_that_no_gap: 0.787
+    >>> sentential_negation_npi_licensor_present: 0.729
+
+    print("Top 5 best performing phenomena:")
+    sorted_results = sorted(
+        results["by_phenomenon"].items(), key=lambda x: x[1], reverse=True
+    )
+    for phenomenon, accuracy in sorted_results[:5]:
+        print(f"{phenomenon}: {accuracy:.3f}")
+    >>> Top 5 best performing phenomena:
+    >>> anaphor_agreement: 0.893
+    >>> argument_structure: 0.597
+    >>> npi_licensing: 0.579
+    >>> filler_gap_dependency: 0.561
+    >>> control_raising: 0.533
 ```
 
 ## Citation
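The documentation added above says a pair counts as correct when the model "assigns a higher probability to the acceptable sentence." As a rough illustration of that scoring rule only (not the metric's internal implementation), the sketch below scores a single minimal pair with a generic Hugging Face causal LM; the model id, the example pair, and the `sentence_logprob` helper are all illustrative assumptions.

```python
# Illustrative sketch: score one acceptable/unacceptable minimal pair
# with a causal LM. Assumes `torch` and `transformers` are installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "distilgpt2"  # assumption: any causal LM id would work the same way
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()


def sentence_logprob(sentence: str) -> float:
    """Total log-probability the model assigns to `sentence`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # out.loss is the mean negative log-likelihood per predicted token,
    # so scale by the number of predicted tokens to get the total.
    num_predicted = enc["input_ids"].shape[1] - 1
    return -out.loss.item() * num_predicted


# One illustrative minimal pair: the acceptable sentence should score higher.
good = "The cats annoy Tim."
bad = "The cats annoys Tim."

# A pair is counted as correct when this comparison is True; overall accuracy
# is the fraction of pairs for which it holds.
print(sentence_logprob(good) > sentence_logprob(bad))
```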
blimp.py
CHANGED
@@ -128,8 +128,6 @@ Args:
 Returns:
     blimp: dictionary containing the blimp scores for each of the 67 sub-datasets, as well as the overall accuracy.
         An LM’s overall accuracy on BLiMP is simply the proportion of the 67,000 minimal pairs in which the model assigns a higher probability to the acceptable sentence.
-Examples:
-    TODO: examples.
 """
 
 
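For the dictionary described in this docstring, a small consumer sketch may help: assuming every sub-dataset contributes the same number of sampled pairs (the default `samples_per_set` of 1_000), the reported overall accuracy should be close to the plain mean of the per-UID scores. The `summarize_blimp` helper and the `results` variable below are hypothetical; `results` stands for the return value of a `blimp.compute(...)` call like the one shown in the README.

```python
# Hypothetical consumer of the return value documented above.
# `results` is assumed to be the dictionary returned by `blimp.compute(...)`.
def summarize_blimp(results: dict) -> None:
    by_uid = results["by_uid"]

    # 67 sub-dataset (UID) accuracies, each in [0, 1].
    assert all(0.0 <= score <= 1.0 for score in by_uid.values())

    # Assuming an equal number of sampled pairs per UID, the overall accuracy
    # should roughly equal the unweighted mean of the per-UID scores.
    mean_uid_accuracy = sum(by_uid.values()) / len(by_uid)
    print(f"reported overall accuracy: {results['accuracy']:.3f}")
    print(f"mean of per-UID scores:    {mean_uid_accuracy:.3f}")

    # Phenomenon-level scores group related UIDs (e.g. anaphor agreement).
    weakest = min(results["by_phenomenon"], key=results["by_phenomenon"].get)
    print(f"weakest phenomenon: {weakest}")
```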