AnkitSatpute
committed on
Update Readme for model description
README.md
CHANGED
@@ -1,3 +1,17 @@
|
|
1 |
-
---
|
2 |
-
license: mit
|
3 |
-
1 |
+
---
|
2 |
+
license: mit
|
3 |
+
language:
|
4 |
+
- en
|
5 |
+
- de
|
6 |
+
---
|
7 |
+
|
8 |
+
This model is trained to separate the individual documents in printed reviews from zbMATH Open.
|
9 |
+
We had scanned volumes of documents dating back to the 1800s that we wanted to convert to a machine-processable LaTeX format. We first converted all scanned documents to LaTeX using
|
10 |
+
Mathpix and then trained an LLM to match each document's metadata with the converted LaTeX (a single page could contain many documents).
|
11 |
+
|
12 |
+
|
13 |
+
1) Download LLaMA-Factory. I recommend this commit: https://github.com/hiyouga/LLaMA-Factory/tree/36039b0fe01c17ae30dba60e247d7ba8a1beb20a (it is known to work; I have not tested newer versions).
|
14 |
+
2) Save your dataset in the data folder and update dataset_info (an example dataset and dataset_info are attached).
|
15 |
+
3) Upload the model you want to use.
|
16 |
+
4) Run the following command, where louisbrulenaudet/Maxine-7B-0401-stock (from Hugging Face) is the base model of this model and path_to_this_model is a placeholder for the local path to this model:
|
17 |
+
python3 -u LLaMA-Factory/src/train.py --stage sft --model_name_or_path louisbrulenaudet/Maxine-7B-0401-stock --adapter_name_or_path path_to_this_model --finetuning_type lora --template default --dataset_dir LLaMA-Factory/data --eval_dataset dataset_name --cutoff_len 10000 --max_samples 100000 --per_device_eval_batch_size 1 --predict_with_generate True --max_new_tokens 8000 --top_p 0.7 --temperature 0.95 --output_dir output_dir --do_predict True
|
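The command in step 4 is long and easy to mistype, so it may help to assemble it programmatically. The following is a minimal sketch, not part of the original instructions: the helper name `build_predict_command` and the `factory_dir` parameter are hypothetical, the flag names and values are copied from step 4, and the adapter path, dataset name, and output directory are placeholders to replace with your own.

```python
import shlex


def build_predict_command(base_model, adapter_path, dataset_name, output_dir,
                          factory_dir="LLaMA-Factory"):
    """Assemble the prediction command from step 4 as an argument list.

    Hypothetical helper: it only formats the flags listed in step 4 and
    does not check them against any LLaMA-Factory version.
    """
    options = {
        "stage": "sft",
        "model_name_or_path": base_model,
        "adapter_name_or_path": adapter_path,
        "finetuning_type": "lora",
        "template": "default",
        "dataset_dir": f"{factory_dir}/data",
        "eval_dataset": dataset_name,
        "cutoff_len": 10000,
        "max_samples": 100000,
        "per_device_eval_batch_size": 1,
        "predict_with_generate": True,
        "max_new_tokens": 8000,
        "top_p": 0.7,
        "temperature": 0.95,
        "output_dir": output_dir,
        "do_predict": True,
    }
    cmd = ["python3", "-u", f"{factory_dir}/src/train.py"]
    for name, value in options.items():
        # Every option becomes a "--name value" pair, booleans included.
        cmd += [f"--{name}", str(value)]
    return cmd


cmd = build_predict_command(
    base_model="louisbrulenaudet/Maxine-7B-0401-stock",
    adapter_path="path_to_this_model",  # placeholder: local path to the adapter
    dataset_name="dataset_name",        # placeholder: key used in dataset_info
    output_dir="output_dir",            # placeholder: where predictions go
)

# Print the command for inspection; it can be pasted into a shell as is.
print(shlex.join(cmd))
```

To launch the run from Python instead of copying the printed line, the same list can be passed to `subprocess.run(cmd, check=True)`.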