normalizer / README.md
maxsonderby's picture
Update README.md
27ea30d verified
---
license: mit
language:
- en
---
## Objective
Normalizer is a tool that processes various file types to extract prompt-response pairs for finetuning LLM models on unstructured data.
The end-state will be a GUI where users of any technical level can drag and drop multiple files at once into the application, then they hit a 'create fine-tuning data' button and they'll receive a .csv with a system prompt and a series of matched pair prompt-completion responses.
## Working Repository
https://github.com/anarchy-ai/normalizer
## Input Formats
- **Text**: .txt, .md, .mdx
- **Documents**: .pdf, .doc, .docx
- **Spreadsheets**: .xlsx, .csv
- **Code**: .py, .js, .html, .java
- **Images**: .jpg, .png (will require OCR)
## Output Format
A .csv file with 3 columns:
| System Prompt | User Prompt | Response |
| -------------------------------------- | ------------------------------------- | ---------------------------------------------|
| Hi, how can I help you today? | |
| | What do these lab results suggest? | These lab results suggest that the patient is healthy, as no anomalous data has been detected. |
| | What is the sentiment of the last 5 customers who came into support chat? | The last five customers have a neutral to positive sentiment. |
| | ... | ... |
## Required Libraries
```pip install PyPDF2 python-docx pandas openpyxl pillow pytesseract beautifulsoup4 transformers datasets```
## Project Structure
```CSS
normalizer/
β”‚
β”œβ”€β”€ src/
β”‚ β”œβ”€β”€ __init__.py
β”‚ β”œβ”€β”€ file_ingest.py
β”‚ β”œβ”€β”€ prompt_extractor.py
β”‚ β”œβ”€β”€ main.py
β”‚
β”œβ”€β”€ app.py
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ README.md
└── .gitignore
```