tolgadev committed on
Commit edaf319 · verified · 1 Parent(s): 095e8d7

Update README.md

Files changed (1)
  1. README.md +131 -118
README.md CHANGED
@@ -1,118 +1,131 @@
1
- # 🦜️🔗 LangChain Text Chunker
2
-
3
- [![Python 3.8+](https://img.shields.io/badge/Python-3.8%2B-blue?style=flat&logo=python)](https://www.python.org/)
4
- [![Gradio](https://img.shields.io/badge/Built%20with-Gradio-FF6600?style=flat&logo=gradio)](https://gradio.app/)
5
- [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
6
-
7
- ## Description
8
-
9
- Welcome to the 🦜️🔗 LangChain Text Chunker application! This interactive tool, built with Gradio, empowers users to effortlessly upload various document types, extract their raw text content, and then apply a diverse set of LangChain text splitting (chunking) methods. It provides a clear visualization of how each method breaks down text into smaller, manageable chunks, complete with their associated metadata. Furthermore, for developers and researchers, the application dynamically generates Python code examples, allowing for easy replication and integration of the chunking strategies.
10
-
11
- ## Features
12
-
13
- * **Multi-Document Type Support**: Seamlessly process text from a wide range of document formats, including:
14
- * PDF (`.pdf`)
15
- * Microsoft Word (`.docx`)
16
- * Plain Text (`.txt`)
17
- * HTML (`.html`)
18
- * CSS (`.css`)
19
- * Python Code (`.py`)
20
- * Jupyter Notebooks (`.ipynb`)
21
- * CSV (`.csv`)
22
- * **Diverse Chunking Strategies**: Explore and compare the output of various LangChain text splitters:
23
- * **Recursive Character Text Splitter**: Ideal for general-purpose text, attempting to split on a list of characters in order.
24
- * **Character Text Splitter**: Splits text based on a single, user-defined separator.
25
- * **Markdown Text Splitter**: Specifically designed to understand and preserve the structure of Markdown documents.
26
- * **Python Code Text Splitter**: Optimized for splitting Python source code while maintaining syntactical integrity.
27
- * **JavaScript Code Text Splitter**: Utilizes language-specific rules to chunk JavaScript code effectively.
28
- * **Customizable Chunking Parameters**: Fine-tune the chunking process with adjustable parameters:
29
- * `Chunk Size`: Define the maximum size of the generated chunks.
30
- * `Chunk Overlap`: Specify the number of characters that overlap between consecutive chunks.
31
- * `Character Splitter Separator`: Choose custom separators for the Character Chunking method.
32
- * `Keep Separator`: Control whether the separator is included in the chunk and its placement.
33
- * `Add Start Index to Metadata`: Option to include the starting character index of each chunk in its metadata.
34
- * `Strip Whitespace`: Automatically remove leading/trailing whitespace from chunks.
35
- * **Interactive Chunk Visualization**: View the resulting chunks in a clear, structured JSON format within the Gradio interface.
36
- * **Dynamic Python Code Examples**: For each chunking method, the application generates ready-to-use Python code, demonstrating how to achieve the same chunking results programmatically. This is invaluable for integrating these strategies into your own projects.
37
- * **User-Friendly Gradio Interface**: An intuitive web interface that makes it easy for anyone to experiment with text chunking without deep programming knowledge.
38
-
39
- ## Installation
40
-
41
- To get this application up and running on your local machine, follow these steps:
42
-
43
- ### Prerequisites
44
-
45
- * Python 3.8 or higher
46
-
47
- ### Steps
48
-
49
- 1. **Clone the repository:**
50
- ```bash
51
- git clone https://github.com/tolgakurtuluss/langchain-text-chunker.git
52
- cd langchain-text-chunker
53
- ```
54
-
55
- 2. **Create a virtual environment (recommended):**
56
- ```bash
57
- python -m venv venv
58
- ```
59
-
60
- 3. **Activate the virtual environment:**
61
- * **On Windows:**
62
- ```bash
63
- .\venv\Scripts\activate
64
- ```
65
- * **On macOS/Linux:**
66
- ```bash
67
- source venv/bin/activate
68
- ```
69
-
70
- 4. **Install dependencies:**
71
- ```bash
72
- pip install -r requirements.txt
73
- ```
74
-
75
- ## Usage
76
-
77
- Once the installation is complete, you can run the Gradio application:
78
-
79
- 1. **Run the application:**
80
- ```bash
81
- python app.py
82
- ```
83
- This command will start the Gradio server, and you will typically see a local URL (e.g., `http://127.0.0.1:7860`) in your terminal. Open this URL in your web browser.
84
-
85
- 2. **Using the Interface:**
86
- * **Upload your document**: Use the "Upload your document" file input to select a file (PDF, DOCX, TXT, HTML, CSS, PY, IPYNB, CSV).
87
- * **Adjust Chunking Parameters**: Utilize the sliders, dropdowns, and checkboxes in the "Chunking Parameters" accordion to customize `Chunk Size`, `Chunk Overlap`, `Character Splitter Separator`, `Keep Separator` behavior, `Add Start Index` to metadata, and `Strip Whitespace`.
88
- * **Process Document**: Click the "Process Document" button. The extracted raw text will appear, and the results of various chunking methods will be displayed in their respective tabs.
89
- * **Explore Chunks**: Navigate through the tabs ("Recursive Chunking", "Character Chunking", etc.) to see the chunks as JSON, along with the total number of chunks created for each method.
90
- * **Python Example Code**: In each chunking tab, you can view dynamically generated Python code that demonstrates how to achieve the same chunking results programmatically.
91
-
92
- ### Inspiration
93
-
94
- This Gradio application is inspired by and based on [Mervin Praison's insightful work](https://mer.vin/2024/03/chunking-strategy/) on "Advanced Chunking Strategies."
95
-
96
- ## Screenshots
97
-
98
- *Interface for interacting with the "Attention Is All You Need" paper (arXiv:1706.03762).*
99
- ![interface](assets/1.JPG)
100
-
101
- *Chunking results of the Recursive Chunking method.*
102
- ![interface](assets/2.JPG)
103
-
104
-
105
- ## Contributing
106
-
107
- Contributions are welcome! If you have suggestions for improvements or new features, please follow these steps:
108
-
109
- 1. Fork the repository.
110
- 2. Create a new branch (`git checkout -b feature/YourFeature`).
111
- 3. Make your changes.
112
- 4. Commit your changes (`git commit -m 'Add some feature'`).
113
- 5. Push to the branch (`git push origin feature/YourFeature`).
114
- 6. Open a Pull Request.
115
-
116
- ## License
117
-
118
- This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
1
+ ---
2
+ title: LangChain Text Chunker
3
+ emoji: 🦜️🔗
4
+ colorFrom: red
5
+ colorTo: indigo
6
+ app_file: app.py
7
+ pinned: false
8
+ license: mit
9
+ sdk: gradio
10
+ thumbnail: >-
11
+ https://cdn-uploads.huggingface.co/production/uploads/6045600bb79a75142576efa7/N-4MGqcm_Km5UkuwYJlYL.jpeg
12
+ ---
13
+
14
+ # 🦜️🔗 LangChain Text Chunker
15
+
16
+ [![Python 3.8+](https://img.shields.io/badge/Python-3.8%2B-blue?style=flat&logo=python)](https://www.python.org/)
17
+ [![Gradio](https://img.shields.io/badge/Built%20with-Gradio-FF6600?style=flat&logo=gradio)](https://gradio.app/)
18
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
19
+
20
+ ## Description
21
+
22
+ Welcome to the 🦜️🔗 LangChain Text Chunker application! This interactive tool, built with Gradio, empowers users to effortlessly upload various document types, extract their raw text content, and then apply a diverse set of LangChain text splitting (chunking) methods. It provides a clear visualization of how each method breaks down text into smaller, manageable chunks, complete with their associated metadata. Furthermore, for developers and researchers, the application dynamically generates Python code examples, allowing for easy replication and integration of the chunking strategies.
23
+
24
+ ## Features
25
+
26
+ * **Multi-Document Type Support**: Seamlessly process text from a wide range of document formats, including:
27
+ * PDF (`.pdf`)
28
+ * Microsoft Word (`.docx`)
29
+ * Plain Text (`.txt`)
30
+ * HTML (`.html`)
31
+ * CSS (`.css`)
32
+ * Python Code (`.py`)
33
+ * Jupyter Notebooks (`.ipynb`)
34
+ * CSV (`.csv`)
35
+ * **Diverse Chunking Strategies**: Explore and compare the output of various LangChain text splitters:
36
+ * **Recursive Character Text Splitter**: Ideal for general-purpose text, attempting to split on a list of characters in order.
37
+ * **Character Text Splitter**: Splits text based on a single, user-defined separator.
38
+ * **Markdown Text Splitter**: Specifically designed to understand and preserve the structure of Markdown documents.
39
+ * **Python Code Text Splitter**: Optimized for splitting Python source code while maintaining syntactical integrity.
40
+ * **JavaScript Code Text Splitter**: Utilizes language-specific rules to chunk JavaScript code effectively.
41
+ * **Customizable Chunking Parameters**: Fine-tune the chunking process with adjustable parameters:
42
+ * `Chunk Size`: Define the maximum size of the generated chunks.
43
+ * `Chunk Overlap`: Specify the number of characters that overlap between consecutive chunks.
44
+ * `Character Splitter Separator`: Choose custom separators for the Character Chunking method.
45
+ * `Keep Separator`: Control whether the separator is included in the chunk and its placement.
46
+ * `Add Start Index to Metadata`: Option to include the starting character index of each chunk in its metadata.
47
+ * `Strip Whitespace`: Automatically remove leading/trailing whitespace from chunks.
48
+ * **Interactive Chunk Visualization**: View the resulting chunks in a clear, structured JSON format within the Gradio interface.
49
+ * **Dynamic Python Code Examples**: For each chunking method, the application generates ready-to-use Python code, demonstrating how to achieve the same chunking results programmatically. This is invaluable for integrating these strategies into your own projects.
50
+ * **User-Friendly Gradio Interface**: An intuitive web interface that makes it easy for anyone to experiment with text chunking without deep programming knowledge.
51
+
52
+ ## Installation
53
+
54
+ To get this application up and running on your local machine, follow these steps:
55
+
56
+ ### Prerequisites
57
+
58
+ * Python 3.8 or higher
59
+
60
+ ### Steps
61
+
62
+ 1. **Clone the repository:**
63
+ ```bash
64
+ git clone https://github.com/tolgakurtuluss/langchain-text-chunker.git
65
+ cd langchain-text-chunker
66
+ ```
67
+
68
+ 2. **Create a virtual environment (recommended):**
69
+ ```bash
70
+ python -m venv venv
71
+ ```
72
+
73
+ 3. **Activate the virtual environment:**
74
+ * **On Windows:**
75
+ ```bash
76
+ .\venv\Scripts\activate
77
+ ```
78
+ * **On macOS/Linux:**
79
+ ```bash
80
+ source venv/bin/activate
81
+ ```
82
+
83
+ 4. **Install dependencies:**
84
+ ```bash
85
+ pip install -r requirements.txt
86
+ ```
87
+
88
+ ## Usage
89
+
90
+ Once the installation is complete, you can run the Gradio application:
91
+
92
+ 1. **Run the application:**
93
+ ```bash
94
+ python app.py
95
+ ```
96
+ This command will start the Gradio server, and you will typically see a local URL (e.g., `http://127.0.0.1:7860`) in your terminal. Open this URL in your web browser.
97
+
98
+ 2. **Using the Interface:**
99
+ * **Upload your document**: Use the "Upload your document" file input to select a file (PDF, DOCX, TXT, HTML, CSS, PY, IPYNB, CSV).
100
+ * **Adjust Chunking Parameters**: Utilize the sliders, dropdowns, and checkboxes in the "Chunking Parameters" accordion to customize `Chunk Size`, `Chunk Overlap`, `Character Splitter Separator`, `Keep Separator` behavior, `Add Start Index` to metadata, and `Strip Whitespace`.
101
+ * **Process Document**: Click the "Process Document" button. The extracted raw text will appear, and the results of various chunking methods will be displayed in their respective tabs.
102
+ * **Explore Chunks**: Navigate through the tabs ("Recursive Chunking", "Character Chunking", etc.) to see the chunks as JSON, along with the total number of chunks created for each method.
103
+ * **Python Example Code**: In each chunking tab, you can view dynamically generated Python code that demonstrates how to achieve the same chunking results programmatically.
104
+
105
+ ### Inspiration
106
+
107
+ This Gradio application is inspired by and based on [Mervin Praison's insightful work](https://mer.vin/2024/03/chunking-strategy/) on "Advanced Chunking Strategies."
108
+
109
+ ## Screenshots
110
+
111
+ *Interface for interacting with the "Attention Is All You Need" paper (arXiv:1706.03762).*
112
+ ![interface](assets/1.JPG)
113
+
114
+ *Chunking results of the Recursive Chunking method.*
115
+ ![interface](assets/2.JPG)
116
+
117
+
118
+ ## Contributing
119
+
120
+ Contributions are welcome! If you have suggestions for improvements or new features, please follow these steps:
121
+
122
+ 1. Fork the repository.
123
+ 2. Create a new branch (`git checkout -b feature/YourFeature`).
124
+ 3. Make your changes.
125
+ 4. Commit your changes (`git commit -m 'Add some feature'`).
126
+ 5. Push to the branch (`git push origin feature/YourFeature`).
127
+ 6. Open a Pull Request.
128
+
129
+ ## License
130
+
131
+ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
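
The README's `Chunk Size`, `Chunk Overlap`, `Add Start Index to Metadata`, and `Strip Whitespace` parameters can be illustrated with a small standard-library sketch. This is not the code the application generates and not LangChain's implementation: `Chunk` and `split_fixed` are hypothetical names, and the splitter below uses a plain sliding window rather than separator-aware splitting. It is only meant to show how the four parameters interact.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    text: str
    start_index: int  # offset of the chunk's first character in the source text


def split_fixed(text: str, chunk_size: int, chunk_overlap: int,
                strip_whitespace: bool = True) -> List[Chunk]:
    """Simplified sliding-window splitter (illustrative assumption only).

    Consecutive windows share `chunk_overlap` characters, so the window
    start advances by `chunk_size - chunk_overlap` each step.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks: List[Chunk] = []
    for start in range(0, len(text), step):
        window = text[start:start + chunk_size]
        chunk_start = start
        if strip_whitespace:
            # Drop leading whitespace and move the start index with it,
            # then drop trailing whitespace.
            stripped = window.lstrip()
            chunk_start += len(window) - len(stripped)
            window = stripped.rstrip()
        if window:
            chunks.append(Chunk(window, chunk_start))
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    for chunk in split_fixed("LangChain splits long text into chunks.", 16, 4):
        print(chunk.start_index, repr(chunk.text))
```

LangChain's Recursive Character Text Splitter additionally tries to break on a list of separators in order, as the Features section describes, so its chunk boundaries will generally differ from this fixed-window sketch even with the same size and overlap settings.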