metadata
language:
  - de
  - bg
  - cs
  - da
  - el
  - en
  - es
  - et
  - fi
  - fr
  - ga
  - hr
  - hu
  - it
  - lt
  - lv
  - mt
  - nl
  - pl
  - pt
  - ro
  - sl
  - sv
  - sk
metrics:
  - accuracy
  - bleu
pipeline_tag: text-generation
library_name: transformers
base_model:
  - openGPT-X/Teuken-7B-base-v0.4

Model Card for Teuken-7B-instruct-v0.4

Teuken-7B-base-v0.4 is a 7B-parameter multilingual large language model (LLM) pre-trained on 4T tokens within the OpenGPT-X research project. Teuken-7B-instruct-v0.4 is an instruction-tuned version of Teuken-7B-base-v0.4.

Model Description

  • Developed by: Fraunhofer, Forschungszentrum Jülich, TU Dresden, DFKI
  • Funded by: German Federal Ministry for Economic Affairs and Climate Action (BMWK) in the context of the OpenGPT-X project
  • Model type: Transformer-based decoder-only model
  • Language(s) (NLP): bg, cs, da, de, el, en, es, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv
  • Shared by: OpenGPT-X

Uses

Teuken-7B-instruct-v0.4 is intended for commercial and research use in all 24 official languages of the European Union. Because it focuses on covering all 24 EU languages, it yields more stable results across these languages and better reflects European values in its answers than English-centric models. It is therefore specialized for multilingual tasks.

Disclaimer: Toxic Content

This large language model (LLM) may generate content that is inappropriate, offensive, or harmful. While the dataset has been heavily filtered to minimize such outputs, the model may still produce text that is biased or toxic due to the large scale and diverse nature of the data.

Out-of-Scope Use

The model is not intended for use in math and coding tasks.

Bias, Risks, and Limitations

Teuken-7B-instruct-v0.4, an instruction-tuned version of Teuken-7B-base-v0.4, is not completely free from biases and hallucinations.

How to Get Started with the Model

Usage

The model requires the transformers, sentencepiece, and torch libraries.

The prompt template for the fine-tuned model is defined as follows:

user = "Hi!"
lang_code = "DE"
# System message per language code; adjacent string literals are concatenated.
system_messages = {
    "EN": "A chat between a human and an artificial intelligence assistant."
    " The assistant gives helpful and polite answers to the human's questions.",
    "DE": "Ein Gespräch zwischen einem Menschen und einem Assistenten mit künstlicher Intelligenz."
    " Der Assistent gibt hilfreiche und höfliche Antworten auf die Fragen des Menschen.",
}

prompt = f"System: {system_messages[lang_code]}\nUser: {user}\nAssistant:<s>"

After installation, the model can be used as in the following example, which applies the chat template through the tokenizer:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name = "openGPT-X/Teuken-7B-instruct-v0.4"

# Load the model in bfloat16. flash_attention_2 requires the flash-attn
# package and a supported CUDA GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
model = model.to(device).eval()

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    use_fast=False,
    trust_remote_code=True,
)

# Build the prompt token IDs with the German ("DE") chat template.
messages = [{"role": "User", "content": "Wer bist du?"}]
prompt_ids = tokenizer.apply_chat_template(
    messages,
    chat_template="DE",
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
)

# Sample one completion; max_length counts prompt plus generated tokens.
prediction = model.generate(
    prompt_ids.to(model.device),
    max_length=512,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.7,
    num_return_sequences=1,
)

# Decode the full sequence (prompt followed by the generated answer).
prediction_text = tokenizer.decode(prediction[0])
print(prediction_text)

This example demonstrates how to load the model and tokenizer, prepare input, generate text, and print the result.
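
As a supplementary sketch (not part of the original card), the prompt template can be wrapped in a small helper, and the generated continuation can be separated from the prompt by slicing off the prompt tokens. The names build_prompt and strip_prompt are hypothetical, and a plain Python list with made-up token IDs stands in for the output tensor so that the snippet runs without downloading the model.

```python
# Illustrative helpers; build_prompt and strip_prompt are hypothetical names,
# not part of transformers or of the Teuken repository.

SYSTEM_MESSAGES = {
    "EN": "A chat between a human and an artificial intelligence assistant."
    " The assistant gives helpful and polite answers to the human's questions.",
    "DE": "Ein Gespräch zwischen einem Menschen und einem Assistenten mit künstlicher Intelligenz."
    " Der Assistent gibt hilfreiche und höfliche Antworten auf die Fragen des Menschen.",
}


def build_prompt(user: str, lang_code: str = "DE") -> str:
    """Fill the card's prompt template for a single user turn."""
    if lang_code not in SYSTEM_MESSAGES:
        raise KeyError(f"no system message defined for language code {lang_code!r}")
    return f"System: {SYSTEM_MESSAGES[lang_code]}\nUser: {user}\nAssistant:<s>"


def strip_prompt(output_ids, prompt_len: int):
    """Return only the token IDs generated after the prompt.

    For decoder-only models, generate() returns the prompt followed by the
    continuation, so the first prompt_len IDs are dropped. Slicing works the
    same way on Python lists and on 1-D tensors.
    """
    return output_ids[prompt_len:]


if __name__ == "__main__":
    print(build_prompt("Hi!", "DE"))
    # Made-up token IDs: three prompt tokens followed by two generated tokens.
    print(strip_prompt([11, 12, 13, 21, 22], prompt_len=3))
```

With the tensors from the example above, and assuming prompt_ids has shape (1, prompt_length), the equivalent call would be tokenizer.decode(strip_prompt(prediction[0], prompt_ids.shape[1]), skip_special_tokens=True).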

Training Details

Pre-Training Data

Teuken-7B-base-v0.4 was pre-trained on 4 trillion tokens of data from publicly available sources. The pre-training data has a cutoff of September 2023. More information is available in our preprint.

Instruction-Tuning Data

English

| Dataset file | Sample Count |
|---|---|
| en/bactrianx_EN_fastchat.jsonl | 66985 |
| en/code_alpaca_fastchat.jsonl | 19990 |
| en/evol_instruct_143k_fastchat.jsonl | 142968 |
| en/evol_instruct_70k_fastchat.jsonl | 69968 |
| en/lmsys_chat_1m_high_quality_train_en_fastchat.jsonl | 18651 |
| en/open_orca_fastchat_aa.jsonl | 599968 |
| en/open_orca_fastchat_ab.jsonl | 599968 |
| en/open_orca_fastchat_ac.jsonl | 599968 |
| en/open_orca_fastchat_ad.jsonl | 599968 |
| en/open_orca_fastchat_ag.jsonl | 599968 |
| en/open_orca_fastchat_ah.jsonl | 33891 |
| en/sharegpt_v3_unfiltered_fastchat.jsonl | 93880 |
| en/ultrachat_200k_fastchat.jsonl | 11525 |
| total | 3457698 |

German

| Dataset file | Sample Count |
|---|---|
| de/bactrianx_DE_fastchat.jsonl | 67017 |
| de/freedomintelligence_alpaca_gpt4_deutsch_fastchat.jsonl | 49969 |
| de/freedomintelligence_evol_instruct_deutsch_fastchat.jsonl | 59022 |
| de/freedomintelligence_sharegpt_deutsch_fastchat.jsonl | 6101 |
| de/german_poems_fastchat.jsonl | 400 |
| de/german_songs_fastchat.jsonl | 1000 |
| de/ultrachat_de_1k_fastchat.jsonl | 959 |
| total | 184468 |
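
The totals in the two tables can be re-derived from the per-file counts. The sketch below copies the counts from the tables (the dictionary keys are shortened labels chosen for illustration, not the real file names) and computes each language's share of the instruction-tuning samples.

```python
# Sample counts copied from the English and German tables above.
# Keys are shortened, illustrative labels for the dataset files.
english = {
    "bactrianx_EN": 66985,
    "code_alpaca": 19990,
    "evol_instruct_143k": 142968,
    "evol_instruct_70k": 69968,
    "lmsys_chat_1m_high_quality_train_en": 18651,
    "open_orca_aa": 599968,
    "open_orca_ab": 599968,
    "open_orca_ac": 599968,
    "open_orca_ad": 599968,
    "open_orca_ag": 599968,
    "open_orca_ah": 33891,
    "sharegpt_v3_unfiltered": 93880,
    "ultrachat_200k": 11525,
}
german = {
    "bactrianx_DE": 67017,
    "alpaca_gpt4_deutsch": 49969,
    "evol_instruct_deutsch": 59022,
    "sharegpt_deutsch": 6101,
    "german_poems": 400,
    "german_songs": 1000,
    "ultrachat_de_1k": 959,
}

en_total = sum(english.values())
de_total = sum(german.values())
overall = en_total + de_total

# Report each language's sample count and its share of all samples.
print(f"English: {en_total} samples ({en_total / overall:.1%})")
print(f"German:  {de_total} samples ({de_total / overall:.1%})")
```

The sums match the totals listed in the tables. Counted by samples, English makes up about 95% of the instruction-tuning data and German about 5%.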

Training Procedure

Instruction fine-tuned version of Teuken-7B-base-v0.4.

Training Hyperparameters

  • Training regime: bf16 mixed precision

Evaluation

Testing Data, Factors & Metrics

Testing Data

The model was evaluated in 21 languages on ARC, GSM8K, HellaSwag, TruthfulQA, Translation and MMLU. Results can be seen in the European LLM Leaderboard.

Technical Specifications

Model Architecture and Objective

| Hyper-Parameter | Value |
|---|---|
| Training Objective | CLM |
| Activation Function | SwiGLU |
| Seq Length | 4096 |
| Position Embeddings | Rotary |
| Num Layers | 32 |
| Hidden Size | 4096 |
| FFN Hidden Size | 13440 |
| Num Attention Heads | 32 |
| Head Dim | 128 |
| Group Query Attention | yes |
| Num Query Groups | 2 |
| Normalization | RMSNorm |
| Learning rate | 3e-4 |
| Min learning rate | 3e-5 |
| Disable bias in linear | yes |
| Hidden dropout | 0.0 |
| Attention dropout | 0.0 |
| Optimizer | AdamW |
| Beta1 | 0.9 |
| Beta2 | 0.95 |
| Sequence-parallelism | |
| Data-type | bf16 |
| Recompute-activations | yes |
| Distributed-optimizers | yes |
| Model Initialization | |
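
Several derived quantities follow from the values in this table. The sketch below recomputes the head dimension, the weight count of the 32 transformer blocks, and the bf16 key/value cache size for one full 4096-token sequence. It rests on assumptions that the card does not state: "Num Query Groups" is read as the number of key/value heads (as in Megatron-LM's grouped-query attention), the SwiGLU feed-forward block is taken to use three weight matrices, and each block is taken to have two RMSNorm weight vectors. Embedding and output layers are left out because the vocabulary size is not given here.

```python
# Values from the hyper-parameter table.
hidden = 4096
ffn_hidden = 13440
layers = 32
heads = 32
kv_heads = 2      # assumption: "Num Query Groups" = number of key/value heads
seq_len = 4096
bytes_bf16 = 2    # bf16 stores each value in 2 bytes

# Head dimension: hidden size split evenly across attention heads.
head_dim = hidden // heads

# Attention weights (no biases): Q and output projections are hidden x hidden,
# K and V projections are hidden x (kv_heads * head_dim).
attn = 2 * hidden * hidden + 2 * hidden * kv_heads * head_dim

# SwiGLU feed-forward (assumed layout): gate, up, and down projections.
mlp = 3 * hidden * ffn_hidden

# Two RMSNorm weight vectors per block (assumed: pre-attention and pre-MLP).
norms = 2 * hidden

per_layer = attn + mlp + norms
block_params = layers * per_layer

# KV cache: one key and one value vector per KV head, per layer, per token.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_bf16
kv_cache_mib = kv_bytes_per_token * seq_len / 2**20

print(f"head_dim            = {head_dim}")
print(f"params per block    = {per_layer:,}")
print(f"params in 32 blocks = {block_params:,}")
print(f"KV cache, 4096 tok  = {kv_cache_mib:.0f} MiB")
```

Under these assumptions the transformer blocks hold roughly 6.4 billion weights, which leaves the rest of the nominal 7B to the embedding and output layers. With only 2 key/value heads, the cache for a full-length sequence stays at 128 MiB, one sixteenth of what 32 key/value heads would need.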

Compute Infrastructure

We trained our models on JUWELS Booster, which consists of 936 compute nodes, each equipped with 4 NVIDIA A100 GPUs. The GPUs are hosted by AMD EPYC Rome CPUs. The compute nodes are connected with HDR-200 InfiniBand in a DragonFly+ topology.

Hardware

The configuration of JUWELS Booster compute nodes is the following:

  • CPU: AMD EPYC 7402 processor; 2 sockets, 24 cores per socket, SMT-2 (total: 2×24×2 = 96 threads) in NPS-4 configuration
  • Memory: 512 GB DDR4-3200 RAM (of which at least 20 GB is taken by the system software stack, including the file system); 256 GB per socket; 8 memory channels per socket (2 channels per NUMA domain)
  • GPU: 4 × NVIDIA A100 Tensor Core GPU with 40 GB; connected via NVLink3 to each other
  • Network: 4 × Mellanox HDR200 InfiniBand ConnectX 6 (200 Gbit/s each), HCA
  • Periphery: CPU, GPU, and network adapter are connected via 2 PCIe Gen 4 switches with 16 PCIe lanes going to each device (CPU socket: 2×16 lanes). PCIe switches are configured in synthetic mode.
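
The per-node figures above can be aggregated into system-wide numbers. The following sketch only restates arithmetic from this section. It describes the full JUWELS Booster system, not the share of it that was used for training, which is not stated here.

```python
# Figures taken from the compute infrastructure and hardware descriptions.
nodes = 936
gpus_per_node = 4
gpu_mem_gb = 40
sockets, cores_per_socket, smt = 2, 24, 2
nics_per_node, nic_gbit = 4, 200

threads_per_node = sockets * cores_per_socket * smt  # hardware threads per node
total_gpus = nodes * gpus_per_node                    # GPUs in the whole system
gpu_mem_per_node_gb = gpus_per_node * gpu_mem_gb      # GPU memory per node
total_gpu_mem_gb = total_gpus * gpu_mem_gb            # GPU memory in the whole system
node_bandwidth_gbit = nics_per_node * nic_gbit        # InfiniBand bandwidth per node

print(f"threads per node:      {threads_per_node}")
print(f"GPUs in system:        {total_gpus}")
print(f"GPU memory per node:   {gpu_mem_per_node_gb} GB")
print(f"GPU memory in system:  {total_gpu_mem_gb} GB")
print(f"InfiniBand per node:   {node_bandwidth_gbit} Gbit/s")
```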

Software

Megatron-LM

BibTeX:

If you find our model useful in your research, please consider citing our preprint:

@misc{ali2024teuken7bbaseteuken7binstructeuropean,
      title={Teuken-7B-Base \& Teuken-7B-Instruct: Towards European LLMs},
      author={Mehdi Ali and Michael Fromm and Klaudia Thellmann and Jan Ebert and Alexander Arno Weber and Richard Rutmann and Charvi Jain and Max Lübbering and Daniel Steinigen and Johannes Leveling and Katrin Klug and Jasper Schulze Buschhoff and Lena Jurkschat and Hammam Abdelwahab and Benny Jörg Stein and Karl-Heinz Sylla and Pavel Denisov and Nicolo' Brandizzi and Qasid Saleem and Anirban Bhowmick and Lennard Helmer and Chelsea John and Pedro Ortiz Suarez and Malte Ostendorff and Alex Jude and Lalith Manjunath and Samuel Weinbach and Carolin Penke and Oleg Filatov and Shima Asaadi and Fabio Barth and Rafet Sifa and Fabian Küch and Andreas Herten and René Jäkel and Georg Rehm and Stefan Kesselheim and Joachim Köhler and Nicolas Flores-Herr},
      year={2024},
      eprint={2410.03730},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.03730}, 
}

Team

Data Team

Anirban Bhowmick (IAIS), Nicolo Brandizzi (IAIS), Lennard Helmer (IAIS), Benny Jörg Stein (IAIS), Karl-Heinz Sylla (IAIS), Pavel Denisov (IAIS), Qasid Saleem (IAIS), Johannes Leveling (IAIS), Hammam Abdelwahab (IAIS), Luzian Hahn (IIS), Farzad Naderi (IIS), Md Saiful Islam (IIS), Alexander Schwirjow (IIS), Pedro Ortiz Suarez (ex. DFKI), Malte Ostendorff (ex. DFKI)

Model-Training Team

Core contributors

Mehdi Ali (IAIS), Michael Fromm (IAIS), Jan Ebert (FZJ), Chelsea John (FZJ), Lena Jurkschat (TUD), Alexander Weber (IAIS)

Contributors:

Richard Rutmann (IAIS), Daniel Steinigen (IAIS), Lalith Manjunath (TUD), Carolin Penke (FZJ)

Evaluation Team

Core contributors

Klaudia Thellmann (TUD), Alex Jude (IAIS), Jasper Buschhoff (IAIS)

Contributors:

Shima Asaadi (IIS), Fabio Barth (DFKI)

Management

Joachim Köhler (IAIS), Nicolas Flores-Herr (IAIS), Stefan Kesselheim (FZJ), Andreas Herten (FZJ), Georg Rehm (DFKI), René Jäkel (TUD), Fabian Küch (IIS), Nicole Hildebrandt (IAIS), Ines Wendler (IAIS)

Contact Information

You can reach out to the following model card contact: