omkarenator committed
Commit e384d00
Parent(s): ccfaf0a

add distill-style authors, front-matter

main.py CHANGED
@@ -39,28 +39,86 @@ app, rt = fast_app(
)


-front_matter =
-
-
-    "
-    "description": "",
-    "published": "",
-    "affiliation": {},
+front_matter = {
+    "title": "TxT360",
+    "description": "A globally deduplicated dataset for LLM pretraining",
+    "published": "October 7, 2024",
    "authors": [
-
-
-
-
+        {
+            "author": "Liping Tang",
+            "authorURL": "https://huggingface.co/Liping",
+            "affiliation": "MBZUAI",
+            "affiliationURL": "LLM360.ai",
+        },
+        {
+            "author": "Nikhil Ranjan",
+            "authorURL": "https://huggingface.co/NikhilRanjan",
+            "affiliation": "MBZUAI",
+            "affiliationURL": "",
+        },
+        {
+            "author": "Omkar Pangarkar",
+            "authorURL": "https://huggingface.co/omkarenator",
+            "affiliation": "Petuum, Inc.",
+            "affiliationURL": "",
+        },
+        {
+            "author": "Zhen Wang",
+            "authorURL": "https://huggingface.co/ZhenWang",
+            "affiliation": "MBZUAI",
+            "affiliationURL": "",
+        },
+        {
+            "author": "An Li",
+            "authorURL": "https://huggingface.co/AnLi",
+            "affiliation": "",
+            "affiliationURL": "",
+        },
+        {
+            "author": "Zhoujun Cheng",
+            "authorURL": "https://huggingface.co/ZhoujunCheng",
+            "affiliation": "",
+            "affiliationURL": "",
+        },
+        {
+            "author": "Suqi Sun",
+            "authorURL": "https://huggingface.co/SuqiSun",
+            "affiliation": "Petuum, Inc.",
+            "affiliationURL": "",
+        },
+        {
+            "author": "Cun Mu",
+            "authorURL": "https://huggingface.co/CunMu",
+            "affiliation": "",
+            "affiliationURL": "",
+        },
+        {
+            "author": "Victor Miller",
+            "authorURL": "https://huggingface.co/VictorMiller",
+            "affiliation": "",
+            "affiliationURL": "",
+        },
+        {
+            "author": "Yue Peng",
+            "authorURL": "https://huggingface.co/YuePeng",
+            "affiliation": "",
+            "affiliationURL": "",
+        },
+        {
+            "author": "Eric P. Xing",
+            "authorURL": "https://huggingface.co/EricXing",
+            "affiliation": "MBZUAI & CMU",
+            "affiliationURL": "https://www.mbzuai.ac.ae/ & https://www.cs.cmu.edu/",
+        },
+        {
+            "author": "Zhengzhong Liu",
+            "authorURL": "https://huggingface.co/ZhengzhongLiu",
+            "affiliation": "",
+            "affiliationURL": "",
+        },
    ],
-    "katex": {
-
-        {"left": "$$", "right": "$$", "display": false}
-    ]
-    }
-}
-</script>
-</d-front-matter>
-"""
+    "katex": {"delimiters": [{"left": "$$", "right": "$$", "display": "false"}]},
+}


def read_bibs():
@@ -78,6 +136,8 @@ def get():

@app.get("/")
def main():
+    from fasthtml.xtend import Script
+
    return Div(
        D_title(
            H1(
@@ -91,7 +151,14 @@ def main():
                cls="main-plot-container l-page",
            ),
        ),
-
+        D_byline(),
+        D_front_matter(
+            Script(
+                json.dumps(front_matter),
+                id="distill-front-matter",
+                type="text/json",
+            )
+        ),
        D_article(
            D_contents(
                Nav(
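Note on the hunk above: the Distill template reads article metadata from a JSON script tag wrapped in d-front-matter, and the d-byline block is populated from that same metadata, which is why D_byline() is added alongside D_front_matter(Script(...)). A minimal, hedged sketch of the markup this wiring should produce (the dict is abbreviated; the real one is defined in the first hunk):

```python
import json

# abbreviated stand-in for the front_matter dict added in the first hunk
front_matter = {
    "title": "TxT360",
    "published": "October 7, 2024",
    "authors": [{"author": "Liping Tang", "affiliation": "MBZUAI"}],
}

# D_front_matter(Script(json.dumps(front_matter), ...)) renders roughly this markup;
# Distill's template.v2.js parses the JSON and uses it to fill in the <d-byline> block.
html = (
    "<d-front-matter>"
    f'<script id="distill-front-matter" type="text/json">{json.dumps(front_matter)}</script>'
    "</d-front-matter>"
    "<d-byline></d-byline>"
)
print(html)
```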
@@ -358,7 +425,6 @@ new_dataset_comparison1 = pd.DataFrame(
            "EuroParl",
            "StackExchange",
            "Code",
-
        ],
        "TxT360": [
            "99",
@@ -451,7 +517,7 @@ new_dataset_comparison1 = pd.DataFrame(
            "",
            " ",
            "",
-
+            "Included",
            "-",
            "-",
            "-",
@@ -473,16 +539,18 @@ new_dataset_comparison1 = pd.DataFrame(
            "Included",
        ],
    }
-)
+)

styled_table = (
    new_dataset_comparison1.style.applymap(
        lambda _: "background-color: #E1EEDB", # Green background for col 1
-        subset=pd.IndexSlice[:, "TxT360"]
+        subset=pd.IndexSlice[:, "TxT360"],
    )
    .applymap(
        lambda _: "background-color: white", # White background for all other columns
-        subset=pd.IndexSlice[
+        subset=pd.IndexSlice[
+            :, new_dataset_comparison1.columns.difference(["TxT360"])
+        ], # Apply to all columns except "TxT360"
    )
    .hide(axis="index") # Hide the row index
)
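The hunk above scopes each applymap call with a subset so that only the intended columns are styled. A small self-contained sketch of the same pattern, using a toy frame in place of new_dataset_comparison1 (Styler.applymap is the older spelling; newer pandas calls it Styler.map):

```python
import pandas as pd

# toy stand-in for new_dataset_comparison1
df = pd.DataFrame(
    {"Data Source": ["CommonCrawl", "Code"], "TxT360": ["99", "-"], "FineWeb": ["96", "-"]}
)

styled = (
    df.style.applymap(
        lambda _: "background-color: #E1EEDB",  # highlight the TxT360 column
        subset=pd.IndexSlice[:, "TxT360"],
    )
    .applymap(
        lambda _: "background-color: white",  # plain background everywhere else
        subset=pd.IndexSlice[:, df.columns.difference(["TxT360"])],
    )
    .hide(axis="index")  # drop the integer row index from the rendered table
)

html = styled.to_html()  # HTML string, same idea as _repr_html_() in the diff
```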
@@ -762,7 +830,14 @@ styled_table = (
    .set_properties(**{"text-align": "center"}) # Center the text in all cells
    .set_table_styles(
        [
-            {
+            {
+                "selector": "table",
+                "props": [
+                    ("margin-left", "20%"),
+                    ("margin-right", "auto"),
+                    ("width", "100%"),
+                ],
+            }, # Center the table and adjust width
        ]
    )
    .hide(axis="index") # Hide the row index
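For reference, set_table_styles as used above takes a list of dicts, each with a CSS selector and a list of (property, value) pairs. Continuing the toy styler from the previous sketch, an equivalent call would look like:

```python
# centre the rendered table and stretch it to the container width,
# mirroring the selector/props added in the hunk above
styled = styled.set_table_styles(
    [
        {
            "selector": "table",
            "props": [("margin-left", "20%"), ("margin-right", "auto"), ("width", "100%")],
        }
    ]
)
```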
@@ -770,7 +845,9 @@ styled_table = (

table_html_data = styled_table._repr_html_()
# table_html_data = dataset_sources.to_html(index=False, border=0)
-table_div_data = Div(
+table_div_data = Div(
+    NotStr(table_html_data), style="margin-left: auto; width: 80%; align: center;"
+)


@app.get("/intro")
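NotStr, used in the hunk above, marks a string as already-rendered HTML so FastHTML inserts it verbatim instead of escaping it. A minimal sketch of the same embedding (assuming Div and NotStr are importable from fasthtml.common, as main.py's usage suggests):

```python
from fasthtml.common import Div, NotStr

table_html = "<table><tr><td>99 snapshots</td></tr></table>"  # e.g. output of Styler.to_html()

# Without NotStr the markup would be escaped and shown as literal text;
# with it, the pre-rendered table is dropped into the page as-is.
table_div = Div(
    NotStr(table_html),
    style="margin-left: auto; width: 80%; align: center;",
)
```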
@@ -779,15 +856,24 @@ def intro():
        Section(
            H2("About TxT360"),
            P(
-                B(
+                B(
+                    "We introduce TxT360 (Trillion eXtracted Text) the first dataset to globally deduplicate 99 CommonCrawl snapshots and 14 commonly used non-web data sources (e.g. FreeLaw, PG-19, etc.) providing pretraining teams with a recipe to easily adjust data weighting and train the most performant models."
+                )
            ),
            P(
                "Building on top of the prior studies on pre-training data,",
-                D_cite(bibtex_key="refinedweb"),
-                "
+                D_cite(bibtex_key="refinedweb"),
+                D_cite(bibtex_key="fineweb"),
+                D_cite(bibtex_key="c4"),
+                D_cite(bibtex_key="muennighoff2023scaling"),
+                "TxT360 carefully implements data processing steps including extraction, filtering, deduplication, personally identifiable information removal, and other steps.",
            ),
            P(
-                "Metadata is stored to recover the raw distribution for each dataset, enabling fine-grained control to create data distributions and corpus of desired size. As an example, we present one simple upsampling scheme that takes into account the duplication counts, resulting in a 15~16 trillion token corpus, outperforming FineWeb and our non-upsampling baselines, on diverse evaluations. Unlike DCLM",
+                "Metadata is stored to recover the raw distribution for each dataset, enabling fine-grained control to create data distributions and corpus of desired size. As an example, we present one simple upsampling scheme that takes into account the duplication counts, resulting in a 15~16 trillion token corpus, outperforming FineWeb and our non-upsampling baselines, on diverse evaluations. Unlike DCLM",
+                D_cite(bibtex_key="dclm"),
+                "and RedPajama V2,",
+                D_cite(bibtex_key="redpajama-v2"),
+                "we present the final deduplicated dataset that is ready to go.",
            ),
            P(
                "We documented all implementation details in this blog post and are open sourcing the code. Examples of each filter and rationale supporting each decision are included."
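The "simple upsampling scheme that takes into account the duplication counts" mentioned in the paragraph above is not part of this commit. Purely as a hypothetical illustration of what weighting documents by their duplication counts could mean, not the scheme TxT360 actually uses:

```python
import math

# Hypothetical sketch only: repeat each document according to how often it was
# duplicated across snapshots, with a log cap so heavy duplicates do not dominate.
def upsample(documents):
    corpus = []
    for doc in documents:  # doc = {"text": ..., "dup_count": ...}
        repeats = 1 + int(math.log2(1 + doc["dup_count"]))
        corpus.extend([doc["text"]] * repeats)
    return corpus

print(upsample([{"text": "a", "dup_count": 0}, {"text": "b", "dup_count": 7}]))
```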
@@ -800,14 +886,16 @@ def intro():
                "TxT360 is the first dataset to combine both web and curated data sources commonly used in pretraining."
            ),
            new_table_div_1,
-            #table_div_1,
-            #table_div_2,
+            # table_div_1,
+            # table_div_2,
            P(
                "In pretraining, it is common to combine web data and curated sources (cite). Web data is included to provide a vast quantity of long tail and diverse data, while curated datasets are often information rich and provide the 'deep-dive' domain information. Combining both datasets plays a critical role for effective LLM pre-training. By integrating the reach of web data with the quality of curated sources, TxT360 meets and surpasses the rigorous standards required for state-of-the-art LLM pre-training. See Results section below."
            ),
-            P(
-
-
+            P(
+                "** TxT360 does not include code. This decision was made due to the perceived low duplication code with other sources."
+            ),
+            # P("Table 2: Basic TxT360 Statistics."),
+            # table_div_data,
            id="section2",
        ),
        Section(
@@ -825,10 +913,10 @@ def intro():
            P(
                "We provide details and context for the choices behind TxT360 in the respective Web Data Processing and Curated Source Processing section. A deep dive describing the deduplication process can be found in the Commonly Applied Processing Steps section."
            ),
-            #Img(src="images/pipeline.png", height="300", width="600"),
-            #P(
+            # Img(src="images/pipeline.png", height="300", width="600"),
+            # P(
            #     "Figure 1: Data processing pipeline. All the steps are adopted for processing web data while the yellow blocks are adopted for processing curated sources."
-            #),
+            # ),
            id="section3",
        ),
        id="inner-text",