Spaces: kaleidophon/almost_stochastic_order


Kaleidophon committed on Dec 13, 2022

Commit

ab7e385

1 Parent(s): d2a50fa

Write README for test

Files changed (2)

README.md +56 -16
almost_stochastic_order.py +10 -3

README.md CHANGED

@@ -11,18 +11,25 @@ tags:
 - evaluate
 - comparison
 description: >-
-  Wilcoxon's test is a signed-rank test for comparing paired samples.
 ---
 # Comparison Card for Almost Stochastic Order
 ## Comparison description
-Wilcoxon's test is a non-parametric signed-rank test that tests whether the distribution of the differences is symmetric about zero. It can be used to compare the predictions of two models.
 ## How to use
-The Almost Stochastic Order comparison is used to analyze any kind of real-valued data.
 ## Inputs
@@ -32,13 +39,25 @@ Its arguments are:
 `predictions2`: a list of predictions from the second model.
-## Output values
-The Wilcoxon comparison outputs two things:
-`stat`: The Wilcoxon statistic.
-`p`: The p value.
 ## Examples
@@ -48,23 +67,44 @@ Example comparison:
 aso = evaluate.load("almost_stochastic_order")
 results = aso.compute(predictions1=[-7, 123.45, 43, 4.91, 5], predictions2=[1337.12, -9.74, 1, 2, 3.21])
 print(results)
-{'stat': 5.0, 'p': 0.625}
 ```
 ## Limitations and bias
-The Wilcoxon test is a non-parametric test, so it has relatively few assumptions (basically only that the observations are independent). It should be used to analyze paired ordinal data only.
 ## Citations
 ```bibtex
-@incollection{wilcoxon1992individual,
-  title={Individual comparisons by ranking methods},
-  author={Wilcoxon, Frank},
-  booktitle={Breakthroughs in statistics},
-  pages={196--202},
-  year={1992},
-  publisher={Springer}
 }
 ```

 - evaluate
 - comparison
 description: >-
+  The Almost Stochastic Order test is a non-parametric test that assesses the difference between two prediction distributions through their Wasserstein distance.
 ---
 # Comparison Card for Almost Stochastic Order
+The Almost Stochastic Order test is a non-parametric test that assesses to what extent two distributions of predictions differ by measuring the Wasserstein distance between them.
+It can be used to compare the predictions of two models, and is especially useful for neural networks since it compares the two full distributions, not just their means.
+When model 1 produces overall higher predictions than model 2, the test statistic, called the violation ratio, will be less than 0.5.
+This version of the test computes a frequentist upper bound on the violation ratio given a pre-specified confidence level (0.95 by default).
+For more information, refer to the [README of the deep-significance package](https://github.com/Kaleidophon/deep-significance) or the relevant publications (Dror et al., 2019; Ulmer et al., 2022).
 ## Comparison description
+The Almost Stochastic Order test is a non-parametric test that assesses to what extent two distributions of predictions differ by measuring the Wasserstein distance between them. It can be used to compare the predictions of two models.
 ## How to use
+The Almost Stochastic Order comparison is used to analyze any kind of real-valued predictions.
 ## Inputs
 `predictions2`: a list of predictions from the second model.
+Its keyword arguments are:
+`confidence_level`: The confidence level under which the result is obtained as a float. Default is 0.95.
+`num_bootstrap_iterations`: Number of bootstrap iterations used to compute the upper bound on the test statistic, as an integer. Default is 1000.
+`dt`: Differential for t during numerical integral calculation as a float. Default is 0.005.
+`num_jobs`: Number of jobs to use for the test as an integer. If None, this defaults to the value specified in the num_process attribute.
+`show_progress`: A boolean flag. If True, a progress bar is shown when computing the test statistic. Default is False.
+`seed`: Set an integer seed for reproducibility purposes. If None, this defaults to the value specified in the seed attribute.
+## Output values
+The Almost Stochastic Order comparison outputs a single scalar:
+`violation_ratio`: (Frequentist upper bound on the) degree of violation of the stochastic order, as a float between 0 and 1. When it is smaller than 0.5, the model producing predictions1 performs better than the other model at the confidence level specified by the confidence_level argument (default is 0.95).
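The decision rule stated for `violation_ratio` can be sketched as a small helper. This is only an illustration: `interpret_violation_ratio` and its `threshold` parameter are hypothetical names, not part of the `evaluate` comparison. The 0.5 default reflects the rule stated here, and a caller may pass a stricter value.

```python
def interpret_violation_ratio(violation_ratio: float, threshold: float = 0.5) -> bool:
    """Return True if model 1 can be considered better than model 2.

    Hypothetical helper, not part of the evaluate comparison.
    """
    if not 0.0 <= violation_ratio <= 1.0:
        raise ValueError("violation_ratio must lie between 0 and 1")
    # Values below the threshold indicate that model 1 dominates model 2.
    return violation_ratio < threshold


# A violation ratio of 1.0 gives no evidence that model 1 is better.
results = {"violation_ratio": 1.0}
print(interpret_violation_ratio(results["violation_ratio"]))  # False
```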
 ## Examples
 aso = evaluate.load("almost_stochastic_order")
 results = aso.compute(predictions1=[-7, 123.45, 43, 4.91, 5], predictions2=[1337.12, -9.74, 1, 2, 3.21])
 print(results)
+{'violation_ratio': 1.0}
 ```
 ## Limitations and bias
+The Almost Stochastic Order test is a non-parametric test, so it comes with no assumptions about the underlying distribution of predictions.
+We identify the following limitations:
+- Even though a violation ratio less than 0.5 should be enough to reject the null hypothesis, Ulmer et al. (2022) recommend rejecting the null hypothesis only when violation_ratio is below 0.2.
+- Since the test involves a bootstrapping procedure, results may vary between function calls. To make results reproducible, set a seed via the seed argument.
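One way to check the effect of the seed is to call the comparison twice with the same seed and compare the results. The sketch below assumes `compute` accepts the `seed` keyword argument described in this README. Both `same_result_with_seed` and `stand_in_compute` are hypothetical: the stand-in only mimics a seeded, randomized procedure so that the snippet runs without loading the comparison, and it is not the ASO computation.

```python
import random
from typing import Callable, Dict, List


def same_result_with_seed(
    compute: Callable[..., Dict[str, float]],
    predictions1: List[float],
    predictions2: List[float],
    seed: int = 1234,
) -> bool:
    """Call `compute` twice with the same seed and report whether results match."""
    first = compute(predictions1=predictions1, predictions2=predictions2, seed=seed)
    second = compute(predictions1=predictions1, predictions2=predictions2, seed=seed)
    return first == second


def stand_in_compute(predictions1, predictions2, seed=None) -> Dict[str, float]:
    """Hypothetical stand-in for a seeded, randomized comparison (NOT the ASO test)."""
    rng = random.Random(seed)
    # Draw one value from each list per iteration and count how often list 1 is larger.
    wins = sum(
        rng.choice(predictions1) > rng.choice(predictions2) for _ in range(1000)
    )
    return {"stand_in_statistic": wins / 1000}


preds1 = [-7, 123.45, 43, 4.91, 5]
preds2 = [1337.12, -9.74, 1, 2, 3.21]
print(same_result_with_seed(stand_in_compute, preds1, preds2))  # True
```

With the real comparison, `aso.compute` obtained from `evaluate.load("almost_stochastic_order")` would be passed in place of the stand-in.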
 ## Citations
 ```bibtex
+@article{ulmer2022deep,
+  title={deep-significance - Easy and Meaningful Statistical Significance Testing in the Age of Neural Networks},
+  author={Ulmer, Dennis and Hardmeier, Christian and Frellsen, Jes},
+  journal={arXiv preprint arXiv:2204.06815},
+  year={2022}
+}
+@inproceedings{dror2019deep,
+  author    = {Rotem Dror and
+               Segev Shlomov and
+               Roi Reichart},
+  editor    = {Anna Korhonen and
+               David R. Traum and
+               Llu{\'{\i}}s M{\`{a}}rquez},
+  title     = {Deep Dominance - How to Properly Compare Deep Neural Models},
+  booktitle = {Proceedings of the 57th Conference of the Association for Computational
+               Linguistics, {ACL} 2019, Florence, Italy, July 28-August 2, 2019,
+               Volume 1: Long Papers},
+  pages     = {2773--2785},
+  publisher = {Association for Computational Linguistics},
+  year      = {2019}
 }
 ```

almost_stochastic_order.py CHANGED

@@ -22,7 +22,7 @@ import evaluate
 _DESCRIPTION = """
-The Almost Stochastic Order test is a non-parametric test that tests to what extent the distributions of predictions differ from each other through measuring their Wasserstein distance. It can be used to compare the predictions of two models.
 """
@@ -36,13 +36,14 @@ Kwargs:
     dt (`float`): Differential for t during numerical integral calculation. Default is 0.005.
     num_jobs (`int` or None): Number of jobs to use for test. If None, this defaults to value specified in the num_process attribute.
     show_progress (`bool`): If True, a progress bar is shown when computing the test statistic. Default is False.
 Returns:
     violation_ratio (`float`): (Frequentist upper bound to) Degree of violation of the stochastic order. When it is smaller than 0.5, the model producing predictions1 performs better than the other model at a confidence level specified by confidence_level argument (default is 0.95). Ulmer et al. (2022) recommend to reject the null hypothesis when violation_ratio is under 0.2.
 Examples:
     >>> aso = evaluate.load("almost_stochastic_order")
     >>> results = aso.compute(predictions1=[-7, 123.45, 43, 4.91, 5], predictions2=[1337.12, -9.74, 1, 2, 3.21])
     >>> print(results)
-    {'violation_ratio': }
 """
@@ -53,6 +54,7 @@ _CITATION = """
   journal={arXiv preprint arXiv:2204.06815},
   year={2022}
 }
 @inproceedings{dror2019deep,
   author    = {Rotem Dror and
                Segev Shlomov and
@@ -94,8 +96,13 @@ class AlmostStochasticOrder(evaluate.Comparison):
         dt: float = 0.005,
         num_jobs: Optional[int] = None,
         show_progress: bool = False,
         **kwargs
     ):
         # Set number of jobs
         if num_jobs is None:
             num_jobs = self.num_process
@@ -109,7 +116,7 @@ class AlmostStochasticOrder(evaluate.Comparison):
             num_bootstrap_iterations=num_bootstrap_iterations,
             dt=dt,
             num_jobs=num_jobs,
-            seed=self.seed,
             show_progress=show_progress
         )
         return {"violation_ratio": violation_ratio}

 _DESCRIPTION = """
+The Almost Stochastic Order test is a non-parametric test that assesses to what extent two distributions of predictions differ by measuring the Wasserstein distance between them. It can be used to compare the predictions of two models.
 """
     dt (`float`): Differential for t during numerical integral calculation. Default is 0.005.
     num_jobs (`int` or None): Number of jobs to use for test. If None, this defaults to value specified in the num_process attribute.
     show_progress (`bool`): If True, a progress bar is shown when computing the test statistic. Default is False.
+    seed (`int` or None): Set seed for reproducibility purposes. If None, this defaults to the value specified in the seed attribute.
 Returns:
     violation_ratio (`float`): (Frequentist upper bound to) Degree of violation of the stochastic order. When it is smaller than 0.5, the model producing predictions1 performs better than the other model at a confidence level specified by confidence_level argument (default is 0.95). Ulmer et al. (2022) recommend to reject the null hypothesis when violation_ratio is under 0.2.
 Examples:
     >>> aso = evaluate.load("almost_stochastic_order")
     >>> results = aso.compute(predictions1=[-7, 123.45, 43, 4.91, 5], predictions2=[1337.12, -9.74, 1, 2, 3.21])
     >>> print(results)
+    {'violation_ratio': 1.0}
 """
   journal={arXiv preprint arXiv:2204.06815},
   year={2022}
 }
 @inproceedings{dror2019deep,
   author    = {Rotem Dror and
                Segev Shlomov and
         dt: float = 0.005,
         num_jobs: Optional[int] = None,
         show_progress: bool = False,
+        seed: Optional[int] = None,
         **kwargs
     ):
+        # Set seed
+        if seed is None:
+            seed = self.seed
         # Set number of jobs
         if num_jobs is None:
             num_jobs = self.num_process
             num_bootstrap_iterations=num_bootstrap_iterations,
             dt=dt,
             num_jobs=num_jobs,
+            seed=seed,
             show_progress=show_progress
         )
         return {"violation_ratio": violation_ratio}
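
The fallback logic added in this commit (an explicit `seed` argument wins, otherwise the instance attribute is used) can be illustrated in isolation. `SeedFallbackDemo` is a hypothetical stand-in class, not `evaluate.Comparison`, and its default attribute values are made up for the example.

```python
from typing import Optional, Tuple


class SeedFallbackDemo:
    """Hypothetical stand-in mirroring the None-means-use-attribute pattern."""

    def __init__(self, seed: int = 42, num_process: int = 1):
        # Attribute values are assumptions chosen for illustration only.
        self.seed = seed
        self.num_process = num_process

    def resolve(
        self, seed: Optional[int] = None, num_jobs: Optional[int] = None
    ) -> Tuple[int, int]:
        # An explicit argument takes precedence; None falls back to the attribute.
        if seed is None:
            seed = self.seed
        if num_jobs is None:
            num_jobs = self.num_process
        return seed, num_jobs


demo = SeedFallbackDemo()
print(demo.resolve())        # (42, 1)
print(demo.resolve(seed=0))  # (0, 1)
```

Checking `is None` rather than truthiness matters here: a seed of 0 is a valid explicit value and is kept instead of being replaced by the attribute.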