---
license: other
license_name: nvidia-open-model-license
license_link: LICENSE
---

## Nemotron-4-340B-Instruct

[![Model architecture](https://img.shields.io/badge/Model%20Arch-Transformer%20Decoder-green)](#model-architecture)[![Model size](https://img.shields.io/badge/Params-340B-green)](#model-architecture)[![Language](https://img.shields.io/badge/Language-Multilingual-green)](#datasets)

### License

NVIDIA Open Model License

### Model Overview

Nemotron-4-340B-Instruct is a large language model (LLM) fine-tuned from the Nemotron-4-340B-Base base model and optimized for English single-turn and multi-turn chat use cases.

The base model was pre-trained on a corpus of 8 trillion tokens consisting of a diverse assortment of English-based texts, 40+ coding languages, and 50+ natural languages. Subsequently, the Nemotron-4-340B-Instruct model went through additional alignment steps, including:

- Supervised Fine-tuning (SFT)
- Direct Preference Optimization (DPO)
- Additional in-house alignment techniques

The result is a final model that is aligned with human chat preferences and shows improvements in mathematical reasoning, coding, and instruction following.

This model is ready for commercial use.

**Model Developer:** NVIDIA

**Model Input:** Text

**Input Format:** String

**Input Parameters:** One-Dimensional (1D)

**Model Output:** Text

**Output Format:** String

**Output Parameters:** 1D

**Model Dates:** Nemotron-4-340B-Instruct was trained between December 2023 and May 2024

**Data Freshness:** The pretraining data has a cutoff of June 2023

### Required Hardware

BF16 Inference:

- 8x H200 (1x H200 Node)
- 16x H100 (2x H100 Nodes)
- 16x A100 (2x A100 Nodes)

FP8 Inference:

- 8x H100 (1x H100 Node)

### Model Architecture

The base model, Nemotron-4-340B, was trained with a global batch size of 2304 and a sequence length of 4096 tokens. It uses Grouped-Query Attention (GQA) and RoPE positional embeddings.

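
As a quick sanity check on the training configuration above, the following minimal sketch multiplies the stated global batch size by the stated sequence length to get the number of tokens in one global batch. It assumes every sequence is packed to the full 4096 tokens, which the model card does not state; the helper name `tokens_per_global_batch` is purely illustrative and is not part of NeMo.

```python
# Values taken from the model card.
GLOBAL_BATCH_SIZE = 2304   # sequences per global batch
SEQUENCE_LENGTH = 4096     # tokens per sequence


def tokens_per_global_batch(batch_size: int, seq_len: int) -> int:
    """Tokens in one global batch, assuming fully packed sequences (an assumption)."""
    return batch_size * seq_len


tokens = tokens_per_global_batch(GLOBAL_BATCH_SIZE, SEQUENCE_LENGTH)
print(f"{tokens:,} tokens per global batch")  # prints: 9,437,184 tokens per global batch
```

Under that assumption, one global batch holds roughly 9.4 million tokens; with padded or shorter sequences the real figure would be lower.
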
**Architecture Type:** Transformer Decoder (auto-regressive language model)

### Software Integration

**Supported Hardware Architecture Compatibility:** NVIDIA H100, A100 80GB, A100 40GB

### Usage

1. We will spin up an inference server and then call it from a Python script. Let’s first define the Python script ``call_server.py``:

       import json
       import requests

       headers = {"Content-Type": "application/json"}

       def text_generation(data, ip='localhost', port=None):
           # Send the generation request to the inference server and return the parsed JSON reply.
           resp = requests.put(f'http://{ip}:{port}/generate', data=json.dumps(data), headers=headers)
           return resp.json()

       def get_generation(prompt, greedy, add_BOS, token_to_gen, min_tokens, temp, top_p, top_k, repetition, batch=False):
           # Build the request payload; `prompt` is a list of prompts when batch=True.
           data = {
               "sentences": [prompt] if not batch else prompt,
               "tokens_to_generate": int(token_to_gen),
               "temperature": temp,
               "add_BOS": add_BOS,
               "top_k": top_k,
               "top_p": top_p,
               "greedy": greedy,
               "all_probs": False,
               "repetition_penalty": repetition,
               "min_tokens_to_generate": int(min_tokens),
               "end_strings": ["<|endoftext|>", "", "\x11", "User"],
           }
           # The port must match WEB_PORT in nemo_inference.sh.
           sentences = text_generation(data, port=1424)['sentences']
           return sentences[0] if not batch else sentences

       PROMPT_TEMPLATE = """System
       User
       {prompt}
       Assistant
       """

       question = "Write a poem on NVIDIA in the style of Shakespeare"
       prompt = PROMPT_TEMPLATE.format(prompt=question)
       print(prompt)

       response = get_generation(prompt, greedy=True, add_BOS=False, token_to_gen=1024, min_tokens=1, temp=1.0, top_p=1.0, top_k=0, repetition=1.0, batch=False)
       print(response)

2. Given this Python script, we will create a bash script that spins up the inference server within the [NeMo container](https://github.com/NVIDIA/NeMo/blob/main/Dockerfile) and calls ``call_server.py``. The bash script ``nemo_inference.sh`` is as follows:

       WEB_PORT=1424

       # Block until the server at HOST:PORT accepts connections.
       depends_on () {
           HOST=$1
           PORT=$2
           STATUS=$(curl -X PUT http://$HOST:$PORT >/dev/null 2>/dev/null; echo $?)
           while [ $STATUS -ne 0 ]
           do
               echo "waiting for server ($HOST:$PORT) to be up"
               sleep 10
               STATUS=$(curl -X PUT http://$HOST:$PORT >/dev/null 2>/dev/null; echo $?)
           done
           echo "server ($HOST:$PORT) is up and running"
       }

       echo "output filename: $OUTPUT_FILENAME"

       # Start the inference server in the background.
       /usr/bin/python3 /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_eval.py \
               gpt_model_file=$NEMO_FILE \
               pipeline_model_parallel_split_rank=0 \
               server=True tensor_model_parallel_size=8 \
               trainer.precision=bf16 pipeline_model_parallel_size=4 \
               trainer.devices=8 \
               trainer.num_nodes=4 \
               web_server=False \
               port=${WEB_PORT} &
       SERVER_PID=$!

       readonly local_rank="${LOCAL_RANK:=${SLURM_LOCALID:=${OMPI_COMM_WORLD_LOCAL_RANK:-}}}"

       # Only the first rank on the first node queries the server.
       if [ $SLURM_NODEID -eq 0 ] && [ $local_rank -eq 0 ]; then
           depends_on "0.0.0.0" ${WEB_PORT}

           echo "start get json"
           sleep 5

           echo "SLURM_NODEID: $SLURM_NODEID"
           echo "local_rank: $local_rank"
           /usr/bin/python3 call_server.py

           echo "clean up daemons: $$"
           kill -9 $SERVER_PID
           pkill python
       fi
       wait

3. We can launch ``nemo_inference.sh`` with a Slurm script like the one below, which starts a 4-node job for model inference.

       #!/bin/bash
       #SBATCH -A SLURM-ACCOUNT
       #SBATCH -p SLURM-PARTITION
       #SBATCH -N 4                          # number of nodes
       #SBATCH -J generation
       #SBATCH --ntasks-per-node=8
       #SBATCH --gpus-per-node=8
       set -x

       read -r -d '' cmd <