
	---
	license: mit
	language:
	- en
	---
	## Objective

	Normalizer is a tool that processes various file types to extract prompt-response pairs for fine-tuning LLMs on unstructured data.

	The end state will be a GUI where users of any technical level can drag and drop multiple files at once into the application and click a 'create fine-tuning data' button. They will then receive a .csv file containing a system prompt and a series of matched prompt-completion pairs.

	## Working Repository

	https://github.com/anarchy-ai/normalizer

	## Input Formats

	- Text: .txt, .md, .mdx
	- Documents: .pdf, .doc, .docx
	- Spreadsheets: .xlsx, .csv
	- Code: .py, .js, .html, .java
	- Images: .jpg, .png (will require OCR)
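One way to route incoming files is a lookup from file extension to input category. The following is a minimal sketch, not the project's actual implementation: the extension groups are taken from the list above, while the category names and the `classify_file` and `needs_ocr` helpers are hypothetical.

```python
from pathlib import Path

# Extension groups copied from the list above; the category names are illustrative.
SUPPORTED_FORMATS = {
    "text": {".txt", ".md", ".mdx"},
    "document": {".pdf", ".doc", ".docx"},
    "spreadsheet": {".xlsx", ".csv"},
    "code": {".py", ".js", ".html", ".java"},
    "image": {".jpg", ".png"},
}


def classify_file(path):
    """Return the input category for a path, or None if the extension is unsupported."""
    suffix = Path(path).suffix.lower()  # compare case-insensitively
    for category, extensions in SUPPORTED_FORMATS.items():
        if suffix in extensions:
            return category
    return None


def needs_ocr(path):
    """Images are the only category listed as requiring OCR."""
    return classify_file(path) == "image"
```

Returning `None` for unknown extensions lets a caller decide whether to skip the file or report it to the user.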

	## Output Format

	A .csv file with 3 columns:

	| System Prompt | User Prompt | Response |
	| --- | --- | --- |
	| Hi, how can I help you today? | | |
	| | What do these lab results suggest? | These lab results suggest that the patient is healthy, as no anomalous data has been detected. |
	| | What is the sentiment of the last 5 customers who came into support chat? | The last five customers have a neutral to positive sentiment. |
	| | ... | ... |
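The layout above, with the system prompt alone on the first data row and one prompt-response pair on each following row, could be produced with Python's standard `csv` module. This is a sketch under that reading of the table; `build_rows` and `write_finetuning_csv` are hypothetical names, not functions from the repository.

```python
import csv
import io

# Column names taken from the table above.
HEADER = ["System Prompt", "User Prompt", "Response"]


def build_rows(system_prompt, pairs):
    """Put the system prompt on its own row, then one row per (prompt, response) pair."""
    rows = [[system_prompt, "", ""]]
    rows.extend(["", user_prompt, response] for user_prompt, response in pairs)
    return rows


def write_finetuning_csv(stream, system_prompt, pairs):
    """Write the header and all rows to an open text stream."""
    writer = csv.writer(stream)  # quotes fields that contain commas
    writer.writerow(HEADER)
    writer.writerows(build_rows(system_prompt, pairs))
```

To write to disk, pass a file opened with `open(path, "w", newline="", encoding="utf-8")`; an `io.StringIO()` buffer works the same way for in-memory use.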

	## Required Libraries

	```shell
	pip install PyPDF2 python-docx pandas openpyxl pillow pytesseract beautifulsoup4 transformers datasets
	```
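Several of these packages are imported under a different name than the one passed to `pip` (for example `python-docx` is imported as `docx`, `pillow` as `PIL`, and `beautifulsoup4` as `bs4`). The sketch below checks which of them are importable in the current environment; the `missing_packages` helper is hypothetical. Note that `pytesseract` is only a wrapper, so the Tesseract OCR engine itself must be installed separately, which this check does not cover.

```python
from importlib.util import find_spec

# pip distribution name -> top-level module name used in `import` statements.
IMPORT_NAMES = {
    "PyPDF2": "PyPDF2",
    "python-docx": "docx",
    "pandas": "pandas",
    "openpyxl": "openpyxl",
    "pillow": "PIL",
    "pytesseract": "pytesseract",
    "beautifulsoup4": "bs4",
    "transformers": "transformers",
    "datasets": "datasets",
}


def missing_packages(import_names=IMPORT_NAMES):
    """Return the pip names whose module cannot be found in this environment."""
    return [pkg for pkg, module in import_names.items() if find_spec(module) is None]
```

Calling `missing_packages()` with no arguments returns an empty list once everything in the install command is available.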

	## Project Structure

	```text
	normalizer/
	│
	├── src/
	│   ├── __init__.py
	│   ├── file_ingest.py
	│   ├── prompt_extractor.py
	│   └── main.py
	│
	├── app.py
	├── requirements.txt
	├── README.md
	└── .gitignore
	```