---
license: mit
language:
- en
---

## Objective

Normalizer is a tool that processes various file types to extract prompt-response pairs for fine-tuning LLMs on unstructured data.

The end state will be a GUI where users of any technical level can drag and drop multiple files at once into the application, then click a 'create fine-tuning data' button to receive a .csv file containing a system prompt and a series of matched prompt-completion pairs.

## Working Repository

https://github.com/anarchy-ai/normalizer

## Input Formats

- **Text**: .txt, .md, .mdx
- **Documents**: .pdf, .doc, .docx
- **Spreadsheets**: .xlsx, .csv
- **Code**: .py, .js, .html, .java
- **Images**: .jpg, .png (will require OCR)

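
As a rough illustration of how the extension list in this section could drive file handling, the sketch below groups the listed extensions into categories and looks up a file's category by suffix. The names `SUPPORTED_EXTENSIONS` and `classify_file` are hypothetical and are not taken from the repository.

```python
from pathlib import Path

# Extension groups copied from the Input Formats list.
# The category names and this mapping are assumptions for illustration.
SUPPORTED_EXTENSIONS = {
    "text": {".txt", ".md", ".mdx"},
    "document": {".pdf", ".doc", ".docx"},
    "spreadsheet": {".xlsx", ".csv"},
    "code": {".py", ".js", ".html", ".java"},
    "image": {".jpg", ".png"},  # would need OCR before any text is available
}


def classify_file(path):
    """Return the category for a path, or None if the extension is unsupported."""
    suffix = Path(path).suffix.lower()  # compare case-insensitively
    for category, extensions in SUPPORTED_EXTENSIONS.items():
        if suffix in extensions:
            return category
    return None
```

A caller could use the returned category to choose a reader for each dropped file and skip anything that comes back as `None`.
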
## Output Format

A .csv file with three columns:

| System Prompt | User Prompt | Response |
| --- | --- | --- |
| Hi, how can I help you today? | | |
| | What do these lab results suggest? | These lab results suggest that the patient is healthy, as no anomalous data has been detected. |
| | What is the sentiment of the last 5 customers who came into support chat? | The last five customers have a neutral to positive sentiment. |
| | ... | ... |

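
The layout described in this section could be produced with Python's standard `csv` module. The following is a minimal sketch, assuming a single system-prompt row followed by one row per prompt-response pair; the function name `write_finetuning_csv` is hypothetical.

```python
import csv
import io

HEADER = ["System Prompt", "User Prompt", "Response"]


def write_finetuning_csv(stream, system_prompt, pairs):
    """Write the header, one system-prompt row, then one row per pair."""
    writer = csv.writer(stream)
    writer.writerow(HEADER)
    writer.writerow([system_prompt, "", ""])
    for user_prompt, response in pairs:
        writer.writerow(["", user_prompt, response])


# Example using values from the table; an in-memory buffer stands in for a file.
buffer = io.StringIO()
write_finetuning_csv(
    buffer,
    "Hi, how can I help you today?",
    [
        (
            "What do these lab results suggest?",
            "These lab results suggest that the patient is healthy, "
            "as no anomalous data has been detected.",
        )
    ],
)
```

Because the responses contain commas, letting `csv.writer` handle quoting is safer than joining fields by hand. When writing to a real file, open it with `newline=""` as the `csv` documentation recommends.
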
## Required Libraries

```
pip install PyPDF2 python-docx pandas openpyxl pillow pytesseract beautifulsoup4 transformers datasets
```

## Project Structure

```text
normalizer/
│
├── src/
│   ├── __init__.py
│   ├── file_ingest.py
│   ├── prompt_extractor.py
│   └── main.py
│
├── app.py
├── requirements.txt
├── README.md
└── .gitignore
```
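
The module names in the tree suggest a split between ingesting files, extracting pairs, and orchestrating the two. The sketch below shows one way that split might look for plain-text input only. Every function name here is hypothetical, and the `Q:`/`A:` line heuristic is only a placeholder to make the example runnable, not the project's extraction method.

```python
from pathlib import Path


def read_text_file(path):
    """Stand-in for file_ingest.py: return the contents of a plain-text file."""
    return Path(path).read_text(encoding="utf-8")


def extract_pairs(text):
    """Stand-in for prompt_extractor.py.

    Placeholder heuristic: pair each line starting with 'Q:' with the next
    line starting with 'A:'. Questions without an answer are dropped.
    """
    pairs = []
    question = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Q:"):
            question = line[2:].strip()
        elif line.startswith("A:") and question is not None:
            pairs.append((question, line[2:].strip()))
            question = None
    return pairs


def run(paths):
    """Stand-in for main.py: ingest each file and collect the extracted pairs."""
    pairs = []
    for path in paths:
        pairs.extend(extract_pairs(read_text_file(path)))
    return pairs
```

Keeping ingestion and extraction in separate functions means other readers (PDF, spreadsheet, OCR) could be added later without touching the extraction step.
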