---
license: mit
language:
- en
- de
---

This model performs document separation for printed reviews from zbMATH Open. We had old scanned volumes of documents dating back to the 1800s that we wanted to convert into a machine-processable LaTeX format. We first converted all scanned documents to LaTeX with Mathpix, then trained an LLM to match the metadata of a document with the converted LaTeX (a single page contains many documents).

1) Download LLaMA-Factory. I recommend this exact revision, which definitely works (I have not checked newer versions): https://github.com/hiyouga/LLaMA-Factory/tree/36039b0fe01c17ae30dba60e247d7ba8a1beb20a
2) Save your dataset in the `data` folder and update `dataset_info` (examples of the dataset and `dataset_info` are attached).
3) Download the model you want to use (for example, louisbrulenaudet/Maxine-7B-0401-stock from Hugging Face, which is my base model).
4) Run:

```shell
python3 -u LLaMA-Factory/src/train.py \
    --stage sft \
    --model_name_or_path louisbrulenaudet/Maxine-7B-0401-stock \
    --adapter_name_or_path path_to_your_adapter \
    --finetuning_type lora \
    --template default \
    --dataset_dir LLaMA-Factory/data \
    --eval_dataset dataset_name \
    --cutoff_len 10000 \
    --max_samples 100000 \
    --per_device_eval_batch_size 1 \
    --predict_with_generate True \
    --max_new_tokens 8000 \
    --top_p 0.7 \
    --temperature 0.95 \
    --output_dir output_dir \
    --do_predict True
```
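For step 2, a hypothetical sketch of what a `dataset_info.json` entry can look like in LLaMA-Factory's Alpaca-style format. The dataset name `zbmath_separation`, the file name, and the column names are placeholders, not the actual values used for this model; consult the attached `dataset_info` example for the real entry:

```json
{
  "zbmath_separation": {
    "file_name": "zbmath_separation.json",
    "columns": {
      "prompt": "instruction",
      "query": "input",
      "response": "output"
    }
  }
}
```

The referenced file (here `zbmath_separation.json`) would then live in `LLaMA-Factory/data` alongside `dataset_info.json`, with one record per training example using those column names.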
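Step 1 above points at a specific commit, so a plain clone of the default branch is not enough. A minimal sketch of checking out that revision (the `pip install -e .` line is an assumption about how that version of LLaMA-Factory installs its dependencies; check its README):

```shell
# Clone the repository and pin it to the commit known to work
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
git checkout 36039b0fe01c17ae30dba60e247d7ba8a1beb20a

# Assumed install step for this revision; verify against its README
pip install -e .
```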