Kaleidophon committed on
Commit
ab7e385
·
1 Parent(s): d2a50fa

Write README for test

Files changed (2)
  1. README.md +56 -16
  2. almost_stochastic_order.py +10 -3
README.md CHANGED
@@ -11,18 +11,25 @@ tags:
11
  - evaluate
12
  - comparison
13
  description: >-
14
- Wilcoxon's test is a signed-rank test for comparing paired samples.
15
  ---
16
 
17
  # Comparison Card for Almost Stochastic Order
18
 
 
 
 
 
 
 
 
19
  ## Comparison description
20
 
21
- Wilcoxon's test is a non-parametric signed-rank test that tests whether the distribution of the differences is symmetric about zero. It can be used to compare the predictions of two models.
22
 
23
  ## How to use
24
 
25
- The Almost Stochastic Order comparison is used to analyze any kind of real-valued data.
26
 
27
  ## Inputs
28
 
@@ -32,13 +39,25 @@ Its arguments are:
32
 
33
  `predictions2`: a list of predictions from the second model.
34
 
35
- ## Output values
 
 
 
 
 
 
 
 
 
 
36
 
37
- The Wilcoxon comparison outputs two things:
38
 
39
- `stat`: The Wilcoxon statistic.
 
 
40
 
41
- `p`: The p value.
42
 
43
  ## Examples
44
 
@@ -48,23 +67,44 @@ Example comparison:
48
  aso = evaluate.load("almost_stochastic_order")
49
  results = aso.compute(predictions1=[-7, 123.45, 43, 4.91, 5], predictions2=[1337.12, -9.74, 1, 2, 3.21])
50
  print(results)
51
- {'stat': 5.0, 'p': 0.625}
52
  ```
53
 
54
  ## Limitations and bias
55
 
56
- The Wilcoxon test is a non-parametric test, so it has relatively few assumptions (basically only that the observations are independent). It should be used to analyze paired ordinal data only.
 
 
 
 
 
 
 
57
 
58
  ## Citations
59
 
60
  ```bibtex
61
- @incollection{wilcoxon1992individual,
62
- title={Individual comparisons by ranking methods},
63
- author={Wilcoxon, Frank},
64
- booktitle={Breakthroughs in statistics},
65
- pages={196--202},
66
- year={1992},
67
- publisher={Springer}
 
 
 
 
 
 
 
 
 
 
 
 
 
 
68
  }
69
  ```
70
 
 
11
  - evaluate
12
  - comparison
13
  description: >-
14
+ The Almost Stochastic Order test is a non-parametric test that assesses the difference between two prediction distributions through their Wasserstein distance.
15
  ---
16
 
17
  # Comparison Card for Almost Stochastic Order
18
 
19
+ The Almost Stochastic Order test is a non-parametric test that assesses to what extent two distributions of predictions differ by measuring the Wasserstein distance between them.
20
+ It can be used to compare the predictions of two models, and is especially useful for neural networks since it compares the two full distributions, not just their means.
21
+ When model 1 produces overall higher predictions than model 2, the test statistic, called the violation ratio, will be less than 0.5.
22
+
23
+ This version of the test computes a frequentist upper bound to the violation ratio given a pre-specified confidence level (0.95 by default).
24
+ For more information, refer to the [README of the deep-significance package](https://github.com/Kaleidophon/deep-significance) or the relevant publications (Dror et al., 2019; Ulmer et al., 2022).
25
+
26
  ## Comparison description
27
 
28
+ The Almost Stochastic Order test is a non-parametric test that assesses to what extent the distributions of two sets of predictions differ by measuring their Wasserstein distance. It can be used to compare the predictions of two models.
29
 
30
  ## How to use
31
 
32
+ The Almost Stochastic Order comparison is used to analyze any kind of real-valued predictions.
33
 
34
  ## Inputs
35
 
 
39
 
40
  `predictions2`: a list of predictions from the second model.
41
 
42
+ Its keyword arguments are:
43
+
44
+ `confidence_level`: The confidence level under which the result is obtained as a float. Default is 0.95.
45
+
46
+ `num_bootstrap_iterations`: Number of bootstrap iterations used to compute the upper bound to the test statistic, as an integer. Default is 1000.
47
+
48
+ `dt`: Differential for t during numerical integral calculation as a float. Default is 0.005.
49
+
50
+ `num_jobs`: Number of jobs to use for the test, as an integer. If None, this defaults to the value specified in the num_process attribute.
51
+
52
+ `show_progress`: A boolean flag. If True, a progress bar is shown when computing the test statistic. Default is False.
53
 
54
+ `seed`: Set an integer seed for reproducibility purposes. If None, this defaults to the value specified in the seed attribute.
55
 
56
+ ## Output values
57
+
58
+ The Almost Stochastic Order comparison outputs a single scalar:
59
 
60
+ `violation_ratio`: (Frequentist upper bound to the) degree of violation of the stochastic order, as a float between 0 and 1. When it is smaller than 0.5, the model producing predictions1 performs better than the other model at the confidence level specified by the `confidence_level` argument (default is 0.95).
61
 
62
  ## Examples
63
 
 
67
  aso = evaluate.load("almost_stochastic_order")
68
  results = aso.compute(predictions1=[-7, 123.45, 43, 4.91, 5], predictions2=[1337.12, -9.74, 1, 2, 3.21])
69
  print(results)
70
+ {'violation_ratio': 1.0}
71
  ```
72
 
73
  ## Limitations and bias
74
 
75
+ The Almost Stochastic Order test is a non-parametric test, so it comes with no assumptions about the underlying distribution of predictions.
76
+
77
+ We identify the following limitations:
78
+
79
+ - Even though a violation ratio below 0.5 should be enough to reject the null hypothesis, Ulmer et al. (2022) recommend rejecting the null hypothesis only when violation_ratio is below 0.2.
80
+
81
+ - Since the test involves a bootstrapping procedure, results may vary between function calls. To make results reproducible, set a seed via the seed argument.
82
+
83
 
84
  ## Citations
85
 
86
  ```bibtex
87
+ @article{ulmer2022deep,
88
+ title={deep-significance-Easy and Meaningful Statistical Significance Testing in the Age of Neural Networks},
89
+ author={Ulmer, Dennis and Hardmeier, Christian and Frellsen, Jes},
90
+ journal={arXiv preprint arXiv:2204.06815},
91
+ year={2022}
92
+ }
93
+
94
+ @inproceedings{dror2019deep,
95
+ author = {Rotem Dror and
96
+ Segev Shlomov and
97
+ Roi Reichart},
98
+ editor = {Anna Korhonen and
99
+ David R. Traum and
100
+ Llu{\'{\i}}s M{\`{a}}rquez},
101
+ title = {Deep Dominance - How to Properly Compare Deep Neural Models},
102
+ booktitle = {Proceedings of the 57th Conference of the Association for Computational
103
+ Linguistics, {ACL} 2019, Florence, Italy, July 28-August 2, 2019,
104
+ Volume 1: Long Papers},
105
+ pages = {2773--2785},
106
+ publisher = {Association for Computational Linguistics},
107
+ year = {2019}
108
  }
109
  ```
110
 
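The README above describes the violation ratio in terms of the Wasserstein distance between the two prediction distributions and a numerical integration step `dt`. As a rough illustration only, the sketch below computes a plain point estimate of that ratio from the empirical quantile functions. It is not the deep-significance implementation and it omits the bootstrap upper bound that the comparison actually returns. The function names and the value returned for identical samples are assumptions made for this example.

```python
from typing import Sequence


def empirical_quantile(sorted_scores: Sequence[float], t: float) -> float:
    """Step-function inverse CDF of a sorted sample, for t in (0, 1)."""
    n = len(sorted_scores)
    return sorted_scores[min(int(t * n), n - 1)]


def violation_ratio_point_estimate(
    predictions1: Sequence[float],
    predictions2: Sequence[float],
    dt: float = 0.005,
) -> float:
    """Hypothetical helper: share of the squared Wasserstein distance that
    comes from quantiles where predictions2 exceeds predictions1."""
    a = sorted(predictions1)
    b = sorted(predictions2)
    num_steps = round(1 / dt)

    violation = 0.0  # integral over the quantiles that violate the order
    total = 0.0      # squared Wasserstein distance (Riemann sum)
    for i in range(1, num_steps):
        t = i * dt
        diff = empirical_quantile(b, t) - empirical_quantile(a, t)
        total += diff ** 2 * dt
        if diff > 0:
            violation += diff ** 2 * dt

    if total == 0.0:
        # Identical quantile functions: neither sample dominates.
        # Returning 0.5 is an arbitrary choice for this sketch.
        return 0.5
    return violation / total
```

With this sketch, a sample whose quantiles are all higher than the other's gives 0.0, and swapping the arguments gives 1.0. For the example inputs in the README, nearly all of the squared distance comes from the top quantile, where `predictions2` contains 1337.12, so the point estimate is close to 1.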
almost_stochastic_order.py CHANGED
@@ -22,7 +22,7 @@ import evaluate
22
 
23
 
24
  _DESCRIPTION = """
25
- The Almost Stochastic Order test is a non-parametric test that tests to what extent the distributions of predictions differ from each other through measuring their Wasserstein distance. It can be used to compare the predictions of two models.
26
  """
27
 
28
 
@@ -36,13 +36,14 @@ Kwargs:
36
  dt (`float`): Differential for t during numerical integral calculation. Default is 0.005.
37
  num_jobs (`int` or None): Number of jobs to use for test. If None, this defaults to value specified in the num_process attribute.
38
  show_progress (`bool`): If True, a progress bar is shown when computing the test statistic. Default is False.
 
39
  Returns:
40
  violation_ratio (`float`): (Frequentist upper bound to) Degree of violation of the stochastic order. When it is smaller than 0.5, the model producing predictions1 performs better than the other model at a confidence level specified by confidence_level argument (default is 0.95). Ulmer et al. (2022) recommend to reject the null hypothesis when violation_ratio is under 0.2.
41
  Examples:
42
  >>> aso = evaluate.load("almost_stochastic_order")
43
  >>> results = aso.compute(predictions1=[-7, 123.45, 43, 4.91, 5], predictions2=[1337.12, -9.74, 1, 2, 3.21])
44
  >>> print(results)
45
- {'violation_ratio': }
46
  """
47
 
48
 
@@ -53,6 +54,7 @@ _CITATION = """
53
  journal={arXiv preprint arXiv:2204.06815},
54
  year={2022}
55
  }
 
56
  @inproceedings{dror2019deep,
57
  author = {Rotem Dror and
58
  Segev Shlomov and
@@ -94,8 +96,13 @@ class AlmostStochasticOrder(evaluate.Comparison):
94
  dt: float = 0.005,
95
  num_jobs: Optional[int] = None,
96
  show_progress: bool = False,
 
97
  **kwargs
98
  ):
 
 
 
 
99
  # Set number of jobs
100
  if num_jobs is None:
101
  num_jobs = self.num_process
@@ -109,7 +116,7 @@ class AlmostStochasticOrder(evaluate.Comparison):
109
  num_bootstrap_iterations=num_bootstrap_iterations,
110
  dt=dt,
111
  num_jobs=num_jobs,
112
- seed=self.seed,
113
  show_progress=show_progress
114
  )
115
  return {"violation_ratio": violation_ratio}
 
22
 
23
 
24
  _DESCRIPTION = """
25
+ The Almost Stochastic Order test is a non-parametric test that assesses to what extent the distributions of predictions differ by measuring the Wasserstein distance between them. It can be used to compare the predictions of two models.
26
  """
27
 
28
 
 
36
  dt (`float`): Differential for t during numerical integral calculation. Default is 0.005.
37
  num_jobs (`int` or None): Number of jobs to use for test. If None, this defaults to value specified in the num_process attribute.
38
  show_progress (`bool`): If True, a progress bar is shown when computing the test statistic. Default is False.
39
+ seed (`int` or None): Set seed for reproducibility purposes. If None, this defaults to the value specified in the seed attribute.
40
  Returns:
41
  violation_ratio (`float`): (Frequentist upper bound to) Degree of violation of the stochastic order. When it is smaller than 0.5, the model producing predictions1 performs better than the other model at a confidence level specified by confidence_level argument (default is 0.95). Ulmer et al. (2022) recommend to reject the null hypothesis when violation_ratio is under 0.2.
42
  Examples:
43
  >>> aso = evaluate.load("almost_stochastic_order")
44
  >>> results = aso.compute(predictions1=[-7, 123.45, 43, 4.91, 5], predictions2=[1337.12, -9.74, 1, 2, 3.21])
45
  >>> print(results)
46
+ {'violation_ratio': 1.0}
47
  """
48
 
49
 
 
54
  journal={arXiv preprint arXiv:2204.06815},
55
  year={2022}
56
  }
57
+
58
  @inproceedings{dror2019deep,
59
  author = {Rotem Dror and
60
  Segev Shlomov and
 
96
  dt: float = 0.005,
97
  num_jobs: Optional[int] = None,
98
  show_progress: bool = False,
99
+ seed: Optional[int] = None,
100
  **kwargs
101
  ):
102
+ # Set seed
103
+ if seed is None:
104
+ seed = self.seed
105
+
106
  # Set number of jobs
107
  if num_jobs is None:
108
  num_jobs = self.num_process
 
116
  num_bootstrap_iterations=num_bootstrap_iterations,
117
  dt=dt,
118
  num_jobs=num_jobs,
119
+ seed=seed,
120
  show_progress=show_progress
121
  )
122
  return {"violation_ratio": violation_ratio}
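The change above lets a per-call `seed` override the instance-level default. The pattern can be sketched in isolation as follows. `SeededComparison` is a hypothetical stand-in, not part of `evaluate`, and it draws random numbers with the standard library instead of running the test.

```python
import random
from typing import List, Optional


class SeededComparison:
    """Hypothetical stand-in for an object that carries a default seed."""

    def __init__(self, seed: int = 1234):
        self.seed = seed

    def compute(self, num_draws: int = 3, seed: Optional[int] = None) -> List[float]:
        # Fall back to the instance-level seed when the caller gives none.
        if seed is None:
            seed = self.seed
        rng = random.Random(seed)
        return [rng.random() for _ in range(num_draws)]


comparison = SeededComparison()
same_default = comparison.compute() == comparison.compute()
same_explicit = comparison.compute(seed=1) == comparison.compute(seed=1)
```

Using `None` as the sentinel keeps the previous behaviour for callers that do not pass a seed, while an explicit integer makes a single call reproducible on its own.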