[2025-02-01 18:47:39,024][oumi][rank3][pid:11753][MainThread][INFO]][train.py:144] Resolved 'training.dataloader_num_workers=auto' to 'training.dataloader_num_workers=8'
[2025-02-01 18:47:39,326][oumi][rank3][pid:11753][MainThread][INFO]][models.py:180] Building model for distributed training (world_size: 4)...
[2025-02-01 18:47:39,326][oumi][rank3][pid:11753][MainThread][INFO]][models.py:185] Building model using device_map: cuda:3 (DeviceRankInfo(world_size=4, rank=3, local_world_size=4, local_rank=3))...
[2025-02-01 18:47:39,327][oumi][rank3][pid:11753][MainThread][INFO]][models.py:255] Using model class: <class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'> to instantiate model.
[2025-02-01 18:47:41,529][oumi][rank3][pid:11753][MainThread][INFO]][base_map_dataset.py:68] Creating map dataset (type: TextSftJsonLinesDataset) dataset_name: 'text_sft_jsonl', dataset_path: 'None'...
[2025-02-01 18:47:41,714][oumi][rank3][pid:11753][MainThread][INFO]][base_map_dataset.py:297] TextSftJsonLinesDataset: features=dict_keys(['input_ids', 'attention_mask'])
[2025-02-01 18:47:47,694][oumi][rank3][pid:11753][MainThread][INFO]][base_map_dataset.py:361] Finished transforming dataset (TextSftJsonLinesDataset)! Speed: 1672.23 examples/sec. Examples: 10000. Duration: 6.0 sec. Transform workers: 1.
[2025-02-01 18:47:47,964][oumi][rank3][pid:11753][MainThread][INFO]][torch_profiler_utils.py:150] PROF: Torch Profiler disabled!
[2025-02-01 18:47:48,076][oumi][rank3][pid:11753][MainThread][INFO]][device_utils.py:283] GPU Metrics Before Training: GPU runtime info: NVidiaGpuRuntimeInfo(device_index=0, device_count=4, used_memory_mb=7019.0, temperature=33, fan_speed=None, fan_speeds=None, power_usage_watts=70.637, power_limit_watts=400.0, gpu_utilization=0, memory_utilization=0, performance_state=0, clock_speed_graphics=1155, clock_speed_sm=1155, clock_speed_memory=1593).
[2025-02-01 18:47:48,078][oumi][rank3][pid:11753][MainThread][INFO]][train.py:312] Training init time: 10.795s
[2025-02-01 18:47:48,078][oumi][rank3][pid:11753][MainThread][INFO]][train.py:313] Starting training... (TrainerType.TRL_SFT, transformers: 4.45.2)
[2025-02-01 18:52:35,469][oumi][rank3][pid:11753][MainThread][INFO]][train.py:320] Training is Complete.
[2025-02-01 18:52:35,496][oumi][rank3][pid:11753][MainThread][INFO]][device_utils.py:283] GPU Metrics After Training: GPU runtime info: NVidiaGpuRuntimeInfo(device_index=0, device_count=4, used_memory_mb=21283.0, temperature=43, fan_speed=None, fan_speeds=None, power_usage_watts=181.852, power_limit_watts=400.0, gpu_utilization=54, memory_utilization=14, performance_state=0, clock_speed_graphics=1410, clock_speed_sm=1410, clock_speed_memory=1593).
[2025-02-01 18:52:35,497][oumi][rank3][pid:11753][MainThread][INFO]][torch_utils.py:117] Peak GPU memory usage: 16.56 GB
[2025-02-01 18:52:35,497][oumi][rank3][pid:11753][MainThread][INFO]][train.py:327] Saving final state...
[2025-02-01 18:52:35,504][oumi][rank3][pid:11753][MainThread][INFO]][train.py:332] Saving final model...
[2025-02-01 18:52:43,652][oumi][rank3][pid:11753][MainThread][INFO]][train.py:339]
» We're always looking for feedback. What's one thing we can improve? https://oumi.ai/feedback
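The wall-clock training time for this rank can be read off the `Starting training...` and `Training is Complete.` entries. Below is a minimal sketch that does this. It assumes every line starts with the `[YYYY-MM-DD HH:MM:SS,mmm]` prefix seen above. The helpers `parse_timestamp` and `training_wall_time` are hypothetical and are not part of oumi.

```python
import re
from datetime import datetime

# Matches the leading timestamp used by the log lines above, e.g.
# "[2025-02-01 18:47:48,078][oumi][rank3]..." (assumed format).
TS_PATTERN = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\]")


def parse_timestamp(line: str) -> datetime:
    """Extract the leading timestamp from a single log line."""
    match = TS_PATTERN.match(line)
    if match is None:
        raise ValueError(f"no timestamp in line: {line[:40]!r}")
    # %f right-pads the 3-digit millisecond field to microseconds.
    return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")


def training_wall_time(lines: list[str]) -> float:
    """Seconds between the 'Starting training...' and 'Training is Complete.' entries."""
    start = end = None
    for line in lines:
        if "Starting training..." in line:
            start = parse_timestamp(line)
        elif "Training is Complete." in line:
            end = parse_timestamp(line)
    if start is None or end is None:
        raise ValueError("start or end marker not found")
    return (end - start).total_seconds()


if __name__ == "__main__":
    log = [
        "[2025-02-01 18:47:48,078][oumi][rank3][pid:11753][MainThread][INFO]][train.py:313] "
        "Starting training... (TrainerType.TRL_SFT, transformers: 4.45.2)",
        "[2025-02-01 18:52:35,469][oumi][rank3][pid:11753][MainThread][INFO]][train.py:320] "
        "Training is Complete.",
    ]
    print(f"{training_wall_time(log):.3f} s")
```

Applied to these two lines, the sketch gives 287.391 s, about 4 min 47 s. The same arithmetic checks the dataset transform entry: 10000 examples at 1672.23 examples/sec takes about 5.98 s, which the log rounds to 6.0 sec.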