suryanshs16103 committed on
Commit c3eccc2 · verified
1 Parent(s): 9fba4d8

Update contamination.csv


## What are you reporting:

- The glue-ax, glue-mnli-matched, glue-mnli-mismatched, glue-mrpc, glue-rte, glue-stsb, and glue-wnli datasets found in the EleutherAI/pile dataset

**Evaluation dataset(s)**: glue-ax, glue-mnli-matched, glue-mnli-mismatched, glue-mrpc, glue-rte, glue-stsb, and glue-wnli. These datasets are not available on Hugging Face.

**Contaminated model(s)**: Not Applicable

**Contaminated corpora**: The Pile. The dataset path is `EleutherAI/pile`.

**Contaminated split(s)**: The test splits of the evaluation datasets listed above were found to be 5.07%, 2.17%, 2.11%, 0.64%, 0.13%, 11.09%, and 0.0% contaminated, respectively.

> You may also report instances where there is no contamination. In such cases, follow the previous instructions but report a contamination level of 0%.

## Briefly describe your method to detect data contamination

#### Data-based approaches

Data contamination is detected using WIMBD, which has two main components: (1) a search tool that uses an Elasticsearch index to retrieve and analyze document occurrences, and (2) a count functionality built on map-reduce for quick iteration and extraction of relevant information such as duplicates, PII, and domain counts. Together these allow scalable analysis and comparison across web-scale datasets.
The reported percentages can be verified in Appendix B.3.1, "Benchmark Contamination", of the cited paper.
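
To make the reported percentages concrete, the following is a minimal in-memory sketch of an exact-match contamination check. It is not WIMBD's implementation (which queries an Elasticsearch index over the corpus); it assumes that a test example counts as contaminated when all of its input fields occur verbatim in the same corpus document. The function names and the toy data are hypothetical.

```python
from typing import Iterable, Sequence


def is_contaminated(example: Sequence[str], documents: Iterable[str]) -> bool:
    """Return True if every input field of the example occurs verbatim
    in at least one single document (assumed contamination criterion)."""
    return any(all(field in doc for field in example) for doc in documents)


def contamination_percent(examples: Sequence[Sequence[str]],
                          documents: Iterable[str]) -> float:
    """Percentage of examples flagged as contaminated, rounded to 2 decimals."""
    docs = list(documents)
    if not examples:
        return 0.0
    hits = sum(is_contaminated(ex, docs) for ex in examples)
    return round(100.0 * hits / len(examples), 2)


# Toy stand-ins for a corpus and a two-input test set (hypothetical data).
corpus = [
    "A man is playing a guitar. Someone is playing an instrument. Other text.",
    "The weather was cold that day.",
]
test_set = [
    # Both inputs appear in the first document: contaminated.
    ("A man is playing a guitar.", "Someone is playing an instrument."),
    # Inputs appear in different documents: not contaminated.
    ("A man is playing a guitar.", "The weather was cold that day."),
    # Neither input appears anywhere.
    ("A cat sleeps.", "An animal rests."),
    ("Two dogs run.", "Animals are moving."),
]

print(contamination_percent(test_set, corpus))
```

A substring scan like this only works for toy inputs; at the scale of the Pile, an index-backed search such as the one described above is needed.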

## Citation

URL: `https://arxiv.org/abs/2310.20707`
Citation: `@misc{elazar2024whats,
title={What's In My Big Data?},
author={Yanai Elazar and Akshita Bhagia and Ian Magnusson and Abhilasha Ravichander and Dustin Schwenk and Alane Suhr and Pete Walsh and Dirk Groeneveld and Luca Soldaini and Sameer Singh and Hanna Hajishirzi and Noah A. Smith and Jesse Dodge},
year={2024},
eprint={2310.20707},
archivePrefix={arXiv},
primaryClass={cs.CL}
}`


*Important!* If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.
- Full name: Suryansh Sharma
- Institution: Indian Institute of Technology Kharagpur
- Email: [email protected]

Files changed (1)
  1. contamination_report.csv +8 -0
contamination_report.csv CHANGED
@@ -707,3 +707,11 @@ zest;;EleutherAI/pile;;corpus;;;0.0;data-based;https://arxiv.org/abs/2310.20707;
 zest;;allenai/c4;;corpus;;;0.0;data-based;https://arxiv.org/abs/2310.20707;2
 zest;;oscar-corpus/OSCAR-2301;;corpus;;;0.0;data-based;https://arxiv.org/abs/2310.20707;2
 zest;;togethercomputer/RedPajama-Data-V2;;corpus;;;0.0;data-based;https://arxiv.org/abs/2310.20707;2
+
+glue-ax;;EleutherAI/pile;;corpus;;;5.07;data-based;https://arxiv.org/abs/2310.20707
+glue-mnli-matched;;EleutherAI/pile;;corpus;;;2.17;data-based;https://arxiv.org/abs/2310.20707
+glue-mnli-mismatched;;EleutherAI/pile;;corpus;;;2.11;data-based;https://arxiv.org/abs/2310.20707
+glue-mrpc;;EleutherAI/pile;;corpus;;;0.64;data-based;https://arxiv.org/abs/2310.20707
+glue-rte;;EleutherAI/pile;;corpus;;;0.13;data-based;https://arxiv.org/abs/2310.20707
+glue-stsb;;EleutherAI/pile;;corpus;;;11.09;data-based;https://arxiv.org/abs/2310.20707
+glue-wnli;;EleutherAI/pile;;corpus;;;0.0;data-based;https://arxiv.org/abs/2310.20707
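
As a sanity check on the added rows, the sketch below parses them as semicolon-delimited records and compares the test-split value with the percentages stated in the PR description. The field positions (`EVAL_DATASET`, `SOURCE`, `TEST_SPLIT`) are inferred from the rows and the description rather than taken from the file's header, so treat them as assumptions.

```python
import csv
import io

# Field positions inferred from the rows in this PR (assumptions, not the
# file's documented schema).
EVAL_DATASET = 0   # e.g. "glue-ax"
SOURCE = 2         # e.g. "EleutherAI/pile"
TEST_SPLIT = 7     # contamination percentage reported for the test split

ADDED_ROWS = """\
glue-ax;;EleutherAI/pile;;corpus;;;5.07;data-based;https://arxiv.org/abs/2310.20707
glue-mnli-matched;;EleutherAI/pile;;corpus;;;2.17;data-based;https://arxiv.org/abs/2310.20707
glue-mnli-mismatched;;EleutherAI/pile;;corpus;;;2.11;data-based;https://arxiv.org/abs/2310.20707
glue-mrpc;;EleutherAI/pile;;corpus;;;0.64;data-based;https://arxiv.org/abs/2310.20707
glue-rte;;EleutherAI/pile;;corpus;;;0.13;data-based;https://arxiv.org/abs/2310.20707
glue-stsb;;EleutherAI/pile;;corpus;;;11.09;data-based;https://arxiv.org/abs/2310.20707
glue-wnli;;EleutherAI/pile;;corpus;;;0.0;data-based;https://arxiv.org/abs/2310.20707
"""

# Percentages as stated in the "Contaminated split(s)" field of the description.
REPORTED = {
    "glue-ax": 5.07,
    "glue-mnli-matched": 2.17,
    "glue-mnli-mismatched": 2.11,
    "glue-mrpc": 0.64,
    "glue-rte": 0.13,
    "glue-stsb": 11.09,
    "glue-wnli": 0.0,
}


def parse_rows(text: str) -> dict:
    """Map each evaluation dataset to its test-split contamination value."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    return {row[EVAL_DATASET]: float(row[TEST_SPLIT]) for row in reader if row}


parsed = parse_rows(ADDED_ROWS)
mismatches = {k: (v, REPORTED.get(k)) for k, v in parsed.items() if REPORTED.get(k) != v}
print("mismatches:", mismatches)
```

An empty `mismatches` dictionary means the CSV rows and the description agree.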