Spaces: davanstrien/corpus-creator

davanstrien (HF staff) committed on Jun 24, 2024

Commit

f6712d8

1 Parent(s): b717308

improve description

Files changed (1)

app.py CHANGED

@@ -156,14 +156,22 @@ def update_dataset_card(
         card.push_to_hub(hub_id, token=token)
-description = """
-Corpus Creator is a tool designed to help you easily convert a collection of text files into a dataset suitable for various natural language processing (NLP) tasks.
-In particular the app is focused on splitting texts into chunks of a specified size and overlap. This can be useful for preparing data for synthetic data generation, pipelines or annotation tasks.
-See an [example dataset](https://huggingface.co/datasets/davanstrien/MOH-Bethnal-Green) created using this tool starting from a collection of plain text files.
-The resulting text chunks are stored in a dataset that can be previewed and uploaded to the Hugging Face Hub for easy sharing and access by the community.
-The chunking is done using `Llama-index`'s [`SentenceSplitter`](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules/?h=sentencesplitter#sentencesplitter) classes.
 """
 with gr.Blocks() as demo:
@@ -171,7 +179,7 @@ with gr.Blocks() as demo:
     gr.HTML(
         """<h1 style='text-align: center;'> Corpus Creator</h1>
-        <center><i> &#128193; From random files to a Hugging Face dataset in a few steps &#128193; </i></center>"""
     )
     gr.Markdown(description)
     gr.Markdown(

         card.push_to_hub(hub_id, token=token)
+description = """Corpus Creator is a tool for transforming a collection of text files into a Hugging Face dataset, perfect for various natural language processing (NLP) tasks. Whether you're preparing data for synthetic generation, building pipelines, or setting up annotation tasks, this app simplifies the process.
+Key features:
+- 📁 Easy text file upload
+- ✂️ Customizable text chunking
+- 👁️ Instant dataset preview
+- 🚀 One-click upload to Hugging Face Hub
+#### Powered by Llama Index
+Corpus Creator leverages the power of Llama Index, a data framework for LLM-based applications. Specifically, we use Llama Index's `SentenceSplitter` class to intelligently chunk your text. This ensures that your dataset is split in a way that preserves semantic meaning, making it ideal for downstream NLP tasks. [Learn more about Llama Index](https://www.llamaindex.ai/)
+Get started by uploading your files and see your corpus take shape!
+[View an example dataset](https://huggingface.co/datasets/davanstrien/MOH-Bethnal-Green) created with Corpus Creator.
 """
 with gr.Blocks() as demo:
     gr.HTML(
         """<h1 style='text-align: center;'> Corpus Creator</h1>
+        <center><i> &#128193; From scattered files to a structured dataset in minutes &#128193; </i></center>"""
     )
     gr.Markdown(description)
     gr.Markdown(
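
Note on the chunking the description refers to: both versions of the description mention splitting texts into chunks of a specified size and overlap. The sketch below illustrates that idea only. It is not the app's code and not Llama Index's `SentenceSplitter`; it is a simplified stand-in that counts whitespace-separated words and ignores sentence boundaries. The function name `chunk_words` and its parameters are hypothetical.

```python
def chunk_words(text: str, chunk_size: int = 8, overlap: int = 2) -> list[str]:
    """Split text into windows of at most `chunk_size` words.

    Consecutive windows share `overlap` words. This is a simplified
    illustration: it counts words rather than tokens and does not try
    to keep sentences intact.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text,
        # so no trailing chunk made up only of overlap words is emitted.
        if start + chunk_size >= len(words):
            break
    return chunks


if __name__ == "__main__":
    for chunk in chunk_words("a b c d e f g h i j", chunk_size=4, overlap=1):
        print(chunk)
```

A larger overlap repeats more context at each chunk boundary at the cost of producing more chunks; an overlap of 0 gives disjoint chunks.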