Commit 88c0211 (victormiller)
Parent(s): 6084136
Update web.py

web.py CHANGED
@@ -476,25 +476,38 @@ def web_data():
  P("""
      We manually removed the following 6 domains from the UT1 blocklist so that they will not be removed from our dataset.
  """),

-
-
-
          "data/bad_url_doc.jsonl",
          3,
          "Sample documents whose urls are blocked by the refined url blocklist",
  ),
  H5("1.3.2 Excluded High Quality Sources"),
  P("""
      To avoid duplication with our high-quality curated datasets, we exclude the following domains from our dataset.
  """),
- DVS(
-     non_web_urls,
-     "curated url domains that are excluded from our dataset",
- ),


-

  H3("2. Line-Level Removal"),
  P("""
@@ -510,11 +523,17 @@ def web_data():
      of 56,292 additional lines, resulting in the complete exclusion of 2,203 documents from a total of 13,560
      documents (16.25%). Accordingly, we choose to not use terminal punctuation as a signal to remove lines.
  """),
-
          "data/sample_terminal_punc.json",
          0,
          "Sample documents with lines that are removed by the rule of terminal punctuation",
  ),
  H4('2.1 Word "Javascript"'),
  P("""
      In C4 [5], the authors remove any line with the word "Javascript" since they found that many of the scraped
@@ -523,10 +542,13 @@ def web_data():
      propose to refine the strategy by adding one more keyword to the word "javascript" to avoid false positives.
      The additional keyword could be any one of “enable” / “disable” / “require” / “activate” / “browser”.
  """),
-
- "
-
-
  ),
  H4("2.2 Other Rules from RefinedWeb"),
  P("""
@@ -536,10 +558,13 @@ def web_data():
      - The line matches the pattern “r'^\\d+\\s+likes$'”,
      - The line contains only one word.
  """),
-
- "
-
-
  ),
  H4("2.3 Toxic Lines"),
  P("""
@@ -549,10 +574,14 @@ def web_data():
      line is in the first 3 lines or in the last 3 lines) to remove toxic lines. Specifically, we do not only consider
      the bad words from English but also consider the bad words from other languages.
  """),
-
-
-
  ),
  H3("3. Document-Level Filtering"),
  P("""
      In this section, we introduce all the quality signals that we have used to filter out low-quality documents.
@@ -476,25 +476,38 @@ def web_data():
  P("""
      We manually removed the following 6 domains from the UT1 blocklist so that they will not be removed from our dataset.
  """),
+ Details(
+     Summary("6 url domains that are removed from the blocklist"),
+     DVS(urls_false_positives, "6 url domains that are removed from the blocklist"),
+ ),

+ Details(
+     Summary("Sample documents whose urls are blocked by the refined url blocklist"),
+     DV(
          "data/bad_url_doc.jsonl",
          3,
          "Sample documents whose urls are blocked by the refined url blocklist",
+     ),
  ),
+
  H5("1.3.2 Excluded High Quality Sources"),
  P("""
      To avoid duplication with our high-quality curated datasets, we exclude the following domains from our dataset.
  """),

+ Details(
+     Summary("curated url domains that are excluded from our dataset"),
+     DVS(
+         non_web_urls,
+         "curated url domains that are excluded from our dataset",
+     ),
+ ),

+ Details(
+     Summary("Sample documents whose urls are in our curated url domain list"),
+     DV("data/sample_url_exclusion.json", 0, "Sample documents whose urls are in our curated url domain list"),
+ ),
+

  H3("2. Line-Level Removal"),
  P("""
@@ -510,11 +523,17 @@ def web_data():
      of 56,292 additional lines, resulting in the complete exclusion of 2,203 documents from a total of 13,560
      documents (16.25%). Accordingly, we choose to not use terminal punctuation as a signal to remove lines.
  """),
+
+ Details(
+     Summary("Sample documents with lines that are removed by the rule of terminal punctuation"),
+     DV(
          "data/sample_terminal_punc.json",
          0,
          "Sample documents with lines that are removed by the rule of terminal punctuation",
+     ),
  ),
+
+
  H4('2.1 Word "Javascript"'),
  P("""
      In C4 [5], the authors remove any line with the word "Javascript" since they found that many of the scraped
@@ -523,10 +542,13 @@ def web_data():
      propose to refine the strategy by adding one more keyword to the word "javascript" to avoid false positives.
      The additional keyword could be any one of “enable” / “disable” / “require” / “activate” / “browser”.
  """),
+ Details(
+     Summary("Sample documents that are removed by original C4 javascript rule but are kept after our refinement"),
+     DV(
+         "data/sample_java.jsonl",
+         0,
+         "Sample documents that are removed by original C4 javascript rule but are kept after our refinement",
+     ),
  ),
  H4("2.2 Other Rules from RefinedWeb"),
  P("""
@@ -536,10 +558,13 @@ def web_data():
      - The line matches the pattern “r'^\\d+\\s+likes$'”,
      - The line contains only one word.
  """),
+ Details(
+     Summary("Sample documents with lines that are removed by the RefinedWeb rules"),
+     DV(
+         "data/sample_refinedweb_line.json",
+         0,
+         "Sample documents with lines that are removed by the RefinedWeb rules",
+     ),
  ),
  H4("2.3 Toxic Lines"),
  P("""
@@ -549,10 +574,14 @@ def web_data():
      line is in the first 3 lines or in the last 3 lines) to remove toxic lines. Specifically, we do not only consider
      the bad words from English but also consider the bad words from other languages.
  """),
+ Details(
+     Summary("Sample documents with toxic lines"),
+     DVS(
+         json.load(open("data/toxic_lines.json")),
+         "Sample documents with toxic lines",
+     ),
  ),
+
  H3("3. Document-Level Filtering"),
  P("""
      In this section, we introduce all the quality signals that we have used to filter out low-quality documents.
|
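
For readers of this diff, the line-level rules described in the section 2.1 and 2.2 text can be sketched roughly as follows. This is a minimal illustration and not the code used by the commit: the helper names (`is_javascript_notice`, `is_refinedweb_noise`, `filter_lines`) are hypothetical, matching is assumed to be case-insensitive substring matching, words are assumed to be whitespace-separated, and only the two RefinedWeb rules visible in this excerpt are covered.

```python
import re

# Keywords listed in the section 2.1 text; a line is dropped only when
# "javascript" co-occurs with at least one of them (assumed case-insensitive).
JS_KEYWORDS = ("enable", "disable", "require", "activate", "browser")

# Pattern quoted in the section 2.2 text.
LIKES_PATTERN = re.compile(r"^\d+\s+likes$")


def is_javascript_notice(line: str) -> bool:
    """Refined C4 rule: 'javascript' plus one extra keyword on the same line."""
    lowered = line.lower()
    return "javascript" in lowered and any(k in lowered for k in JS_KEYWORDS)


def is_refinedweb_noise(line: str) -> bool:
    """The two RefinedWeb-style rules visible in this excerpt."""
    stripped = line.strip()
    is_like_counter = bool(LIKES_PATTERN.match(stripped))
    is_single_word = len(stripped.split()) == 1
    return is_like_counter or is_single_word


def filter_lines(text: str) -> str:
    """Drop every line flagged by either rule and keep the rest in order."""
    kept = [
        line
        for line in text.splitlines()
        if not is_javascript_notice(line) and not is_refinedweb_noise(line)
    ]
    return "\n".join(kept)
```

Under these assumptions, a line such as "JavaScript is a programming language." is kept, whereas the original C4 rule would have removed it; a line such as "Please enable JavaScript in your browser." is still removed.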