Apex Error
We have already installed Apex, but we are still getting this import error. Could you please look into this and suggest a solution for this issue?
Error executing job with overrides: ['gpt_model_file=/home/new_env/Nemotron-4-340B-Instruct', 'pipeline_model_parallel_split_rank=0', 'server=True', 'tensor_model_parallel_size=8', 'trainer.precision=bf16', 'pipeline_model_parallel_size=2', 'trainer.devices=8', 'trainer.num_nodes=2', 'web_server=False', 'port=1424']
Traceback (most recent call last):
File "/home/new_env/NeMo/examples/nlp/language_modeling/megatron_gpt_eval.py", line 178, in main
strategy=NLPDDPStrategy(timeout=datetime.timedelta(seconds=18000)),
File "home/new_env/lib/python3.10/site-packages/nemo/collections/nlp/parts/nlp_overrides.py", line 172, in init
raise ImportError(
ImportError: Apex was not found. Please see the NeMo README for installation instructions: https://github.com/NVIDIA/NeMo#megatron-gpt.
Apex Details
Name: apex
Version: 0.1
Summary: PyTorch Extensions written by NVIDIA
Home-page: UNKNOWN
Author:
Author-email:
License: UNKNOWN
Location: /home/new_env/lib/python3.10/site-packages
Requires: packaging
Required-by:
Would be helpful to have some additional info:
- Are you using a Docker container? If yes, what is the Dockerfile? If not, which version of NeMo are you using?
- Can you manually run python, try the import statements below, and report which one(s) fail? (A quick way to do this is sketched after the list.)
import apex
from apex.transformer.pipeline_parallel.utils import get_num_microbatches
from nemo.core.optim.distributed_adam import MegatronDistributedFusedAdam
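For convenience, here is a tiny sketch (plain Python, nothing beyond the imports above assumed) that tries each import in turn and prints which ones fail and why:

for stmt in (
    "import apex",
    "from apex.transformer.pipeline_parallel.utils import get_num_microbatches",
    "from nemo.core.optim.distributed_adam import MegatronDistributedFusedAdam",
):
    try:
        # Execute the import statement in the current interpreter.
        exec(stmt)
        print("OK     :", stmt)
    except Exception as exc:  # ImportError, ModuleNotFoundError, ...
        print("FAILED :", stmt, "->", type(exc).__name__, ":", exc)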
We have tried the above import statements as mentioned, but we are still facing some issues. Could you please check?
NVIDIA-SMI 555.42.06 Driver Version: 555.42.06 CUDA Version: 12.5
python 3.10
open-clip-torch 2.24.0
pytorch-lightning 2.0.7
torch 2.3.1
torchdiffeq 0.2.4
torchmetrics 1.4.0.post0
torchsde 0.2.6
torchvision 0.18.1
This is our environment.
Traceback (most recent call last):
File "/home/Setup/new_env/NeMo/examples/nlp/language_modeling/megatron_gpt_eval.py", line 26, in
from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel
File "/home/Setup/new_env/lib/python3.10/site-packages/nemo/collections/nlp/init.py", line 15, in
from nemo.collections.nlp import data, losses, models, modules
File "/home/Setup/new_env/lib/python3.10/site-packages/nemo/collections/nlp/data/init.py", line 42, in
from nemo.collections.nlp.data.zero_shot_intent_recognition.zero_shot_intent_dataset import (
File "/home/Setup/new_env/lib/python3.10/site-packages/nemo/collections/nlp/data/zero_shot_intent_recognition/init.py", line 16, in
from nemo.collections.nlp.data.zero_shot_intent_recognition.zero_shot_intent_dataset import (
File "/home/Setup/new_env/lib/python3.10/site-packages/nemo/collections/nlp/data/zero_shot_intent_recognition/zero_shot_intent_dataset.py", line 30, in
from nemo.collections.nlp.parts.utils_funcs import tensor2list
File "/home/Setup/new_env/lib/python3.10/site-packages/nemo/collections/nlp/parts/init.py", line 17, in
from nemo.collections.nlp.parts.utils_funcs import list2str, tensor2list
File "/home/Setup/new_env/lib/python3.10/site-packages/nemo/collections/nlp/parts/utils_funcs.py", line 37, in
from nemo.collections.nlp.modules.common.megatron.utils import erf_gelu
File "/home/Setup/new_env/lib/python3.10/site-packages/nemo/collections/nlp/modules/init.py", line 16, in
from nemo.collections.nlp.modules.common import (
File "/home/Setup/new_env/lib/python3.10/site-packages/nemo/collections/nlp/modules/common/init.py", line 36, in
from nemo.collections.nlp.modules.common.tokenizer_utils import get_tokenizer, get_tokenizer_list
File "/home/Setup/new_env/lib/python3.10/site-packages/nemo/collections/nlp/modules/common/tokenizer_utils.py", line 29, in
from nemo.collections.nlp.parts.nlp_overrides import HAVE_MEGATRON_CORE
File "/home/Setup/new_env/lib/python3.10/site-packages/nemo/collections/nlp/parts/nlp_overrides.py", line 23, in
from nemo.core.optim.distributed_adam import MegatronDistributedFusedAdam
File "/home/Setup/new_env/lib/python3.10/site-packages/nemo/core/optim/distributed_adam.py", line 19, in
from apex.contrib.optimizers.distributed_fused_adam import (
File "/home/Setup/new_env/lib/python3.10/site-packages/apex/contrib/optimizers/distributed_fused_adam.py", line 31, in
import amp_C
ModuleNotFoundError: No module named 'amp_C'
Ok thanks, looks like your Apex install is broken somehow. Maybe try what's mentioned in https://github.com/NVIDIA/apex/issues/1757 (or search for more related issues).
It's highly recommended to use the container from the model card as manual setup can indeed be tricky.
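For reference, rebuilding Apex from source with its compiled extensions (amp_C is one of them) usually goes roughly like this; treat it as a sketch and adjust for your CUDA/PyTorch setup:

pip uninstall -y apex
git clone https://github.com/NVIDIA/apex
cd apex
# Build the C++/CUDA extensions; these provide amp_C, the fused optimizers, etc.
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation \
  --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./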
Could you please help us by sharing the link to the container from the model card?
The link is in the model card as far as I can tell (pull command: docker pull nvcr.io/nvidia/nemo:24.05) -- is that not sufficient?
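If it helps, starting an interactive shell in that container typically looks something like the following (the host path /path/to/checkpoints is only a placeholder for wherever your .nemo files live):

docker run --gpus all -it --rm \
  --shm-size=8g --ulimit memlock=-1 --ulimit stack=67108864 \
  -v /path/to/checkpoints:/workspace/checkpoints \
  nvcr.io/nvidia/nemo:24.05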
I have executed the following:
1. docker pull nvcr.io/nvidia/nemo:24.05
2. docker run --gpus all -it --rm -v :/NeMo --shm-size=8g -p 8888:8888 -p 6006:6006 --ulimit memlock=-1 --ulimit stack=67108864 --device=/dev/snd nvcr.io/nvidia/pytorch:23.10-py3
After that, how can I connect to my Nemotron-4-340B-Instruct model?
It's likely that you won't be able to run inference on a single node for this model (you'd need an FP8 checkpoint, which hasn't been released yet). This makes things a bit more complex, and is the reason why the model card "Usage" section relies on SLURM for two-node inference.
I'm not actually sure how to run two-node inference manually, but you'd need to execute megatron_gpt_eval.py on both nodes, something like:
/usr/bin/python3 /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_eval.py \
gpt_model_file=$NEMO_FILE \
pipeline_model_parallel_split_rank=0 \
server=True tensor_model_parallel_size=8 \
trainer.precision=bf16 pipeline_model_parallel_size=2 \
trainer.devices=8 \
trainer.num_nodes=2 \
web_server=False \
port=1424
(The part I'm not sure about is how to get the two nodes to know about each other -- the underlying NeMo code is based on PyTorch Lightning, so you may need to check its docs on how to do multi-node, maybe with torchrun?)
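If torchrun does turn out to be the right launcher, a hypothetical (unverified) two-node launch might look like the following, run once per node with NODE_RANK=0 on the first node, NODE_RANK=1 on the second, and MASTER_ADDR pointing at the first node:

# Sketch only -- I haven't verified that NeMo's Hydra/Lightning setup picks this up correctly.
torchrun --nnodes=2 --nproc_per_node=8 \
  --node_rank=$NODE_RANK --master_addr=$MASTER_ADDR --master_port=29500 \
  /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_eval.py \
  gpt_model_file=$NEMO_FILE \
  pipeline_model_parallel_split_rank=0 \
  server=True tensor_model_parallel_size=8 \
  trainer.precision=bf16 pipeline_model_parallel_size=2 \
  trainer.devices=8 trainer.num_nodes=2 \
  web_server=False port=1424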
Once you manage to launch the model server, it should be easy to call it by doing something like the call_server.py script in the model card.
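As a rough illustration of what such a call looks like (the /generate endpoint and field names are what the NeMo text generation server normally exposes, but double-check against the actual call_server.py from the model card; "localhost" assumes you are on the node running the server):

import json
import requests

# Query the server started by megatron_gpt_eval.py (server=True, port=1424).
data = {
    "sentences": ["Write a haiku about large language models."],
    "tokens_to_generate": 64,
    "temperature": 1.0,
    "top_k": 1,
    "top_p": 0.0,
    "greedy": True,
    "add_BOS": False,
}
resp = requests.put(
    "http://localhost:1424/generate",
    data=json.dumps(data),
    headers={"Content-Type": "application/json"},
)
print(resp.json())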