victormiller committed on
Commit ccd1474 • 1 Parent(s): 63f6244
Update common.py
common.py CHANGED
@@ -10,6 +10,26 @@ import string
|
|
10 |
from rich import print
|
11 |
import jsonlines
|
12 |
13 |
nfc_examples = pd.DataFrame(
|
14 |
{
|
15 |
"Original Text": [
|
@@ -81,7 +101,7 @@ global_div = Div(
|
|
81 |
H3("Why do we need deduplication?"),
|
82 |
P("Deduplication is beneficial for LM pretraining in several ways, the most obvious being the reduction of training data. With less training data, the model requires shorter training times to achieve the same or even better accuracy. Deduplication also helps avoid train-test overlap, thereby improving evaluation metrics. Additionally, it reduces the risk of memorization [1]. Duplicate data can lead to a strong double descent phenomenon, where repeated data causes test loss to increase midway through training [2]. By implementing deduplication and selective upsampling, we gain control over the pretraining data distribution, rather than relying on the inherent distribution of the source, which is often the internet."),
|
83 |
P("To illustrate this, below is the distribution of near-duplicate clusters, organized into buckets of 100. The first bucket contains clusters with sizes ranging from 2 to 100, as found in the Common Crawl dataset. Some clusters even reach up to a million documents."),
|
84 |
-
|
85 |
Img(src="images/100k.png", height = "300", width = "600" ),
|
86 |
P("We started deduplication with 61.8 TB of high-quality, filtered, and compressed documents. The initial dataset had roughly 48.83 billion documents. First, we performed exact deduplication using a Bloom filter with a capacity of 1 billion and a false positive rate of 0.001. This reduced the documents from 48.83 billion to 40.21 billion, removing about 17% as exact duplicates. This step used constant memory for the Bloom filter and lessened the workload for subsequent near-deduplication."),
|
87 |
P("For the global near-deduplication, we employed a methodology used by prior work such as SlimPajama [3], but scaled it to the entire dataset, which includes 87 Common Crawl dumps (also called “crawls”) and the curated data. This near-deduplication process involved generating signatures for every document, matching these signatures to identify near-duplicates, and then clustering the near-duplicate documents to select all but one for deletion. When choosing which document to keep from a matching cluster, we prefer a curated document over a Common Crawl document, and a document from a later dump over one from an earlier dump. Additionally, we maintained statistics about these matching clusters as they were formed during the final stage of deduplication. Below are the details of all four stages of our deduplication pipeline. We use Dask extensively throughout all stages of the deduplication. We have included the on-disk size of each stage's results to give an idea of the scale:"),
|
|
|
10 |
from rich import print
|
11 |
import jsonlines
|
12 |
|
13 |
+
|
14 |
+
|
15 |
+
dump_counts = pd.DataFrame(sorted(data.items()), columns=["cluster_size_range", "counts"])
|
16 |
+
|
17 |
+
fig = px.bar(
|
18 |
+
dump_counts,
|
19 |
+
x="cluster_size_range",
|
20 |
+
y="counts",
|
21 |
+
log_y=True,
|
22 |
+
labels={
|
23 |
+
"cluster_size_range": "Size of Near-Duplicate Clusters (Document Count)",
|
24 |
+
"counts": "Number of Clusters",
|
25 |
+
},
|
26 |
+
)
|
27 |
+
|
28 |
+
fig.update_layout(showlegend=False)
|
29 |
+
dup_cluster_graph = plotly_chart(fig)
|
30 |
+
|
31 |
+
|
32 |
+
|
33 |
nfc_examples = pd.DataFrame(
|
34 |
{
|
35 |
"Original Text": [
|
|
|
101 |
H3("Why do we need deduplication?"),
|
102 |
P("Deduplication is beneficial for LM pretraining in several ways, the most obvious being the reduction of training data. With less training data, the model requires shorter training times to achieve the same or even better accuracy. Deduplication also helps avoid train-test overlap, thereby improving evaluation metrics. Additionally, it reduces the risk of memorization [1]. Duplicate data can lead to a strong double descent phenomenon, where repeated data causes test loss to increase midway through training [2]. By implementing deduplication and selective upsampling, we gain control over the pretraining data distribution, rather than relying on the inherent distribution of the source, which is often the internet."),
|
103 |
P("To illustrate this, below is the distribution of near-duplicate clusters, organized into buckets of 100. The first bucket contains clusters with sizes ranging from 2 to 100, as found in the Common Crawl dataset. Some clusters even reach up to a million documents."),
|
104 |
+
plotly2fasthtml(dup_cluster_graph),
|
105 |
Img(src="images/100k.png", height = "300", width = "600" ),
|
106 |
P("We started deduplication with 61.8 TB of high-quality, filtered, and compressed documents. The initial dataset had roughly 48.83 billion documents. First, we performed exact deduplication using a Bloom filter with a capacity of 1 billion and a false positive rate of 0.001. This reduced the documents from 48.83 billion to 40.21 billion, removing about 17% as exact duplicates. This step used constant memory for the Bloom filter and lessened the workload for subsequent near-deduplication."),
|
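Reviewer note: the Bloom filter figures quoted in the paragraph above (capacity of 1 billion, false positive rate of 0.001) can be turned into a rough memory estimate with the textbook sizing formulas. The sketch below is only that arithmetic. The helper name `bloom_parameters` is hypothetical, and whether the pipeline's filter is sized exactly this way is an assumption, not something the source states.

```python
import math


def bloom_parameters(capacity: int, false_positive_rate: float) -> tuple[int, int]:
    """Return (bits, hash_count) for a standard Bloom filter.

    Uses the textbook formulas m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2.
    Hypothetical helper for illustration; not taken from common.py.
    """
    bits = math.ceil(-capacity * math.log(false_positive_rate) / (math.log(2) ** 2))
    hash_count = max(1, round(bits / capacity * math.log(2)))
    return bits, hash_count


# Figures quoted in the text: capacity 1 billion, false positive rate 0.001.
bits, hash_count = bloom_parameters(1_000_000_000, 0.001)
gib = bits / 8 / 2**30  # memory footprint in GiB

# Share of documents removed by exact deduplication, from the quoted counts.
removed_fraction = (48.83 - 40.21) / 48.83

print(f"{bits:,} bits (~{gib:.2f} GiB), {hash_count} hash functions")
print(f"removed fraction: {removed_fraction:.4f}")
```

Under these assumptions the filter needs roughly 14.4 bits per element, which is why its memory use stays constant regardless of how many documents are streamed through it.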
107 |
P("For the global near-deduplication, we employed a methodology used by prior work such as SlimPajama [3], but scaled it to the entire dataset, which includes 87 Common Crawl dumps (also called “crawls”) and the curated data. This near-deduplication process involved generating signatures for every document, matching these signatures to identify near-duplicates, and then clustering the near-duplicate documents to select all but one for deletion. When choosing which document to keep from a matching cluster, we prefer a curated document over a Common Crawl document, and a document from a later dump over one from an earlier dump. Additionally, we maintained statistics about these matching clusters as they were formed during the final stage of deduplication. Below are the details of all four stages of our deduplication pipeline. We use Dask extensively throughout all stages of the deduplication. We have included the on-disk size of each stage's results to give an idea of the scale:"),
|
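Reviewer note: the keep rule described in the paragraph above (curated over Common Crawl, later dump over earlier dump) can be expressed as a single sort key. This is a minimal sketch, not the pipeline's code. The `Doc` class, its field names, and the `pick_survivor` helper are hypothetical, and it assumes dump identifiers sort chronologically as plain strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    """Hypothetical record for one document in a near-duplicate cluster."""
    doc_id: str
    is_curated: bool
    dump: str  # assumption: identifiers such as "CC-MAIN-2023-50" sort chronologically


def pick_survivor(cluster: list[Doc]) -> Doc:
    """Pick the one document to keep: curated first, then the latest dump."""
    # Tuples compare element by element, so is_curated dominates and dump breaks ties.
    return max(cluster, key=lambda d: (d.is_curated, d.dump))


cluster = [
    Doc("a", is_curated=False, dump="CC-MAIN-2019-30"),
    Doc("b", is_curated=False, dump="CC-MAIN-2023-50"),
    Doc("c", is_curated=True, dump=""),
]

keep = pick_survivor(cluster)
to_delete = [d.doc_id for d in cluster if d is not keep]
print(keep.doc_id, to_delete)
```

Every other member of the cluster is then marked for deletion, which matches the "select all but one" step in the text.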