Update common.py
Commit b1b2d47 by victormiller
Parent(s): 0e12ce8
common.py CHANGED
@@ -298,7 +298,7 @@ global_div = Div(
|
|
298 |
"Personally Identifiable Information Removal",
|
299 |
style="margin-bottom: 5px",
|
300 |
),
|
301 |
-
Li("
|
302 |
),
|
303 |
id="section1",
|
304 |
),
|
@@ -322,7 +322,7 @@ global_div = Div(
|
|
322 |
"We started deduplication with 61.8 TB of filtered and compressed documents. The initial dataset had roughly 48.83 billion documents. First, we performed exact deduplication using a Bloom filter with a capacity of 1 billion and a false positive rate of 0.001. This reduced the documents from 48.83 billion to 40.21 billion, removing about 17% as exact duplicates. This step used constant memory for the Bloom filter and lessened the workload for subsequent near-deduplication."
|
323 |
),
|
324 |
P(
|
325 |
-
"For the global near-deduplication, we employed a methodology used by prior work such as SlimPajama [3] but scaled it to the entire dataset, which includes 99 Common Crawl
|
326 |
),
|
327 |
P("We applied the following inclusion criteria for all documents:"),
|
328 |
Ul(
|
@@ -337,7 +337,7 @@ global_div = Div(
|
|
337 |
Section(
|
338 |
H3("MinHash Generation"),
|
339 |
P(
|
340 |
-
"We use the datasketch library to generate MinHash signatures with the number of permutations set to 128. Each document's signature is represented as a MinHash object. Before
|
341 |
),
|
342 |
P(B("This step produced 20 TB of hashes.")),
|
343 |
id="section3",
|
@@ -387,7 +387,7 @@ global_div = Div(
|
|
387 |
"We needed to partition the duplicate pairs generated in the third stage into three groups to reduce memory pressure on the final stage. We observed that the second stage itself generates partial components which have some overlap. These overlapping clusters cause some documents to appear in the delete set multiple times. However, our deletion code handled this overlap."
|
388 |
),
|
389 |
P(
|
390 |
-
"Below is the distribution of duplicate documents found across different
|
391 |
),
|
392 |
plotly2fasthtml(dup_docs_count_graph()),
|
393 |
id="section6",
|
@@ -408,10 +408,10 @@ global_div = Div(
|
|
408 |
Img(src="images/image9.png", style="max-width: 100%;"),
|
409 |
),
|
410 |
Section(
|
411 |
-
H2("Personally
|
412 |
-
H3("Motivation Behind Personally
|
413 |
P(
|
414 |
-
"Personally
|
415 |
),
|
416 |
table_div_pii,
|
417 |
),
|
|
|
298 |
"Personally Identifiable Information Removal",
|
299 |
style="margin-bottom: 5px",
|
300 |
),
|
301 |
+
Li("Normalization Form C Discussion", style="margin-bottom: 5px"),
|
302 |
),
|
303 |
id="section1",
|
304 |
),
|
|
|
322 |
"We started deduplication with 61.8 TB of filtered and compressed documents. The initial dataset had roughly 48.83 billion documents. First, we performed exact deduplication using a Bloom filter with a capacity of 1 billion and a false positive rate of 0.001. This reduced the documents from 48.83 billion to 40.21 billion, removing about 17% as exact duplicates. This step used constant memory for the Bloom filter and lessened the workload for subsequent near-deduplication."
|
323 |
),
|
324 |
P(
|
325 |
+
"For the global near-deduplication, we employed a methodology used by prior work such as SlimPajama [3] but scaled it to the entire dataset, which includes 99 Common Crawl snapshots (also called “crawls”) and the curated data. The near-deduplication process involved generating signatures for every document, matching these signatures to identify near-duplicates, and then clustering the near-duplicate documents to select all but one in each cluster for deletion."
|
326 |
),
|
327 |
P("We applied the following inclusion criteria for all documents:"),
|
328 |
Ul(
|
|
|
337 |
Section(
|
338 |
H3("MinHash Generation"),
|
339 |
P(
|
340 |
+
"We use the datasketch library to generate MinHash signatures with the number of permutations set to 128. Each document's signature is represented as a MinHash object. Before calculating the signature, the text is cleaned by stripping whitespace, converting to lowercase, and removing punctuation, consecutive spaces, newlines, and tabs. Next, a list of 13-grams is generated to use as features for creating a document signature. The globally unique document IDs and signatures are then saved to disk. Each document ID is produced by an encoding scheme that converts file names and line numbers (there is one document per line) to unique document IDs. This encoding also saved considerable disk space and memory at this stage."
|
341 |
),
|
342 |
P(B("This step produced 20 TB of hashes.")),
|
343 |
id="section3",
|
|
|
387 |
"We needed to partition the duplicate pairs generated in the third stage into three groups to reduce memory pressure on the final stage. We observed that the second stage itself generates partial components which have some overlap. These overlapping clusters cause some documents to appear in the delete set multiple times. However, our deletion code handled this overlap."
|
388 |
),
|
389 |
P(
|
390 |
+
"Below is the distribution of duplicate documents found across different snapshots of Common Crawl. The distribution is skewed to the right because the documents are bucketed by the dump ID of the document we retain, and we prefer documents from higher dump IDs."
|
391 |
),
|
392 |
plotly2fasthtml(dup_docs_count_graph()),
|
393 |
id="section6",
|
|
|
408 |
Img(src="images/image9.png", style="max-width: 100%;"),
|
409 |
),
|
410 |
Section(
|
411 |
+
H2("Personally Identifiable Information Removal"),
|
412 |
+
H3("Motivation Behind Personally Identifiable Information Removal"),
|
413 |
P(
|
414 |
+
"Personally Identifiable Information (PII) refers to any information that can be used to identify an individual, such as names, addresses, phone numbers, email addresses, and social security numbers. PII removal is essential for data privacy and security, as well as for compliance with global regulations. By removing PII from the training data, we can reduce the risk of data breaches and unauthorized access to sensitive information. Additionally, removing PII from training data prevents models from generating that specific PII at inference time."
|
415 |
),
|
416 |
table_div_pii,
|
417 |
),
|
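Reviewer note: the "MinHash Generation" paragraph touched by this commit describes cleaning the text, building 13-grams, and computing a 128-permutation signature. The sketch below illustrates that flow under stated assumptions and is not the code behind the page. It swaps the datasketch library for a plain `hashlib` MinHash so that it runs with the standard library only, it assumes the 13-grams are word-level (the text does not say), and every function name in it is hypothetical.

```python
from __future__ import annotations

import hashlib
import re
import string

NUM_PERM = 128    # number of hash functions, mirroring the 128 permutations in the text
NGRAM_SIZE = 13   # assumption: word-level 13-grams; the source does not specify the unit
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace runs (spaces, newlines, tabs)."""
    text = text.lower().translate(_PUNCT_TABLE)
    return re.sub(r"\s+", " ", text).strip()


def word_ngrams(text: str, n: int = NGRAM_SIZE) -> set[str]:
    """Return the set of word n-grams; a document shorter than n words yields one short feature."""
    words = text.split()
    if not words:
        return set()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash64(feature: str, seed: int) -> int:
    """64-bit hash of a feature, salted by the seed so each seed acts as a separate hash function."""
    digest = hashlib.blake2b(
        feature.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(features: set[str], num_perm: int = NUM_PERM) -> list[int]:
    """For each seeded hash function, keep the minimum hash value over all features."""
    return [
        min((_hash64(f, seed) for f in features), default=2**64 - 1)
        for seed in range(num_perm)
    ]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots, which estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two documents that differ only in case, punctuation, or spacing clean to the same string and therefore receive identical signatures, while documents that share no 13-gram should agree on essentially no slots. In the pipeline described by the text, each signature would then be stored next to its document ID for the later matching stage.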