W1127 19:57:11.551000 371702 site-packages/torch/distributed/run.py:793]
W1127 19:57:11.551000 371702 site-packages/torch/distributed/run.py:793] *****************************************
W1127 19:57:11.551000 371702 site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1127 19:57:11.551000 371702 site-packages/torch/distributed/run.py:793] *****************************************
Trainer._get_train_sampler replaced with custom implementation.  (emitted once per rank, 8 ranks total)
[2025-11-27 19:57:17,989] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)  (emitted once per rank)
[2025-11-27 19:57:18,699] [INFO] [comm.py:658:init_distributed] cdb=None  (emitted once per rank)
[2025-11-27 19:57:18,953] [INFO] [comm.py:689:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
FlashAttention 3 is available  (emitted once per rank)
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.  (emitted once per rank)
Loading checkpoint shards:   0%| | 0/2 [00:00
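The launcher warning above says OMP_NUM_THREADS defaults to 1 per process. If CPU-side work (tokenization, data loading) is a bottleneck, the variable can be exported before launching. A minimal sketch; the value 8 is purely illustrative (a common heuristic is physical cores per node divided by processes per node), not taken from this run:

```shell
# Override torchrun's per-process default of OMP_NUM_THREADS=1.
# 8 is a hypothetical value -- tune to cores_per_node / nproc_per_node.
export OMP_NUM_THREADS=8
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```

The exported value is inherited by every worker process that torchrun spawns, so it only needs to be set once in the launching shell.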