metadata
license: mit
datasets:
  - ahmedheakl/arzen-llm-dataset
language:
  - ar
  - en
metrics:
  - bleu
  - ecody726/bertscore
  - meteor
library_name: transformers
pipeline_tag: translation

How to use

First install peft, transformers, accelerate, bitsandbytes, and PyTorch:

pip install peft accelerate bitsandbytes transformers torch

Then log in with your Hugging Face token to get access to the base models:

huggingface-cli login --token <YOUR_HF_TOKEN>

Then load the model.

from peft import PeftConfig, PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

peft_model_id = "ahmedheakl/arazn-llama3-english"
peft_config = PeftConfig.from_pretrained(peft_model_id)
base_model_name = peft_config.base_model_name_or_path
base_model = AutoModelForCausalLM.from_pretrained(base_model_name, device_map="auto", torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base_model, peft_model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(peft_model_id)
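
The install step pulls in bitsandbytes, but the loading code above runs the base model in bfloat16. If GPU memory is tight, a 4-bit load is one option. The following is a minimal sketch under assumptions: the helper name `load_4bit` is hypothetical, the NF4 settings are common defaults rather than values taken from this model card, and the card does not say how quantization affects translation quality for this adapter.

```python
def load_4bit(peft_model_id: str = "ahmedheakl/arazn-llama3-english"):
    """Hypothetical helper: load the base model in 4-bit and attach the adapter."""
    # Imports are local so that defining the function needs no heavy dependencies.
    import torch
    from peft import PeftConfig, PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # Assumed quantization settings (NF4 weights, bfloat16 compute).
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    # Resolve the base model from the adapter config, as in the snippet above.
    peft_config = PeftConfig.from_pretrained(peft_model_id)
    base_model = AutoModelForCausalLM.from_pretrained(
        peft_config.base_model_name_or_path,
        quantization_config=quant_config,
        device_map="auto",
    )
    model = PeftModel.from_pretrained(base_model, peft_model_id)
    tokenizer = AutoTokenizer.from_pretrained(peft_model_id)
    return model, tokenizer
```

Calling `model, tokenizer = load_4bit()` would stand in for the loading snippet above; it needs a CUDA GPU and access to the base model.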

Then run inference:

import torch

raw_prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Translate the following code-switched Arabic-English-mixed text to English only.<|eot_id|><|start_header_id|>user<|end_header_id|>

{source}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""
def inference(source: str) -> str:
    """Translate one code-switched Arabic-English sentence to English."""
    prompt = raw_prompt.format(source=source)
    # Move the inputs to whichever device the model was placed on.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    generated_ids = model.generate(
        **inputs,
        use_cache=True,
        num_return_sequences=1,
        max_new_tokens=100,
        num_beams=1,
        # do_sample=True,
        # temperature=0.7,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )
    # Decode with special tokens kept so the assistant reply can be located.
    outputs = tokenizer.batch_decode(generated_ids)[0]
    torch.cuda.empty_cache()
    torch.cuda.synchronize()
    # Keep only the text between the assistant header and the end-of-turn token.
    return outputs.split("assistant<|end_header_id|>\n\n")[-1].split("<|eot_id|>")[0]


print(inference("أنا أحب الbanana"))  # I love bananas
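
The prompt formatting and output parsing in `inference` are plain string operations, so they can be checked without loading the model. The sketch below mirrors the template and the two `split` calls from the snippet above. The names `PROMPT_TEMPLATE`, `build_prompt`, and `extract_translation` are hypothetical, and the decoded string is hand-written for illustration, not real model output.

```python
# Same template as raw_prompt above, written with explicit newlines.
PROMPT_TEMPLATE = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "Translate the following code-switched Arabic-English-mixed text to English only."
    "<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
    "{source}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)

ASSISTANT_HEADER = "assistant<|end_header_id|>\n\n"
END_OF_TURN = "<|eot_id|>"


def build_prompt(source: str) -> str:
    """Fill the user turn of the chat template with the text to translate."""
    return PROMPT_TEMPLATE.format(source=source)


def extract_translation(decoded: str) -> str:
    """Return the text after the last assistant header, up to the end-of-turn token."""
    reply = decoded.split(ASSISTANT_HEADER)[-1]
    return reply.split(END_OF_TURN)[0]


# Hand-written stand-in for what tokenizer.batch_decode might return.
decoded = build_prompt("أنا أحب الbanana") + "I love bananas<|eot_id|>"
print(extract_translation(decoded))  # I love bananas
```

If generation stops at `max_new_tokens` before an end-of-turn token appears, `extract_translation` returns everything after the assistant header, matching the behavior of the original one-liner.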

Please see the paper and code for more information.

Citation

BibTeX:

@article{heakl2024arzen,
  title={ArzEn-LLM: Code-Switched Egyptian Arabic-English Translation and Speech Recognition Using LLMs},
  author={Heakl, Ahmed and Zaghloul, Youssef and Ali, Mennatullah and Hossam, Rania and Gomaa, Walid},
  journal={arXiv preprint arXiv:2406.18120},
  year={2024}
}

Model Card Authors