NeoXArgs.from_ymls() ['../input/ft-pythia-160m.yml'] INFO:root:NeoXArgs.calculate_derived() Total number of GPUs determined to be: 1 -------------------- arguments -------------------- attention_config ................ ['flash', 'flash', 'flash', 'flash', 'flash', 'flash', 'flash', 'flash', 'flash', 'flash', 'flash', 'flash']updated attention_dropout ............... 0...........................updated batch_size ...................... 32..........................updated bias_gelu_fusion ................ True........................updated checkpoint_activations .......... True........................updated checkpoint_factor ............... 1000........................updated clip_grad ....................... 1.0.........................updated config_files .................... {'ft-pythia-160m.yml': '{\n # parallelism settings\n "pipe-parallel-size": 1,\n "model-parallel-size": 1,\n\n # model settings\n "num-layers": 12,\n "hidden-size": 768,\n "num-attention-heads": 12,\n "seq-length": 2048,\n "max-position-embeddings": 2048,\n "pos-emb": "rotary",\n "rotary-pct": 0.25,\n "no-weight-tying": true,\n "gpt-j-residual": true,\n "output-layer-parallelism": "column",\n \n "attention-config": [[["flash"], 12]],\n\n "scaled-upper-triang-masked-softmax-fusion": true,\n "bias-gelu-fusion": true,\n\n # init methods\n "init_method": "small_init",\n "output_layer_init_method": "wang_init",\n\n "optimizer": {\n "type": "Adam",\n "params": {\n "lr": 0.0006,\n "betas": [0.9, 0.95],\n "eps": 1.0e-8\n }\n },\n "min_lr": 0.00006,\n\n "zero_optimization": {\n "stage": 1,\n "allgather_partitions": true,\n "allgather_bucket_size": 500000000,\n "overlap_comm": true,\n "reduce_scatter": true,\n "reduce_bucket_size": 500000000,\n "contiguous_gradients": true,\n "cpu_offload": false\n },\n\n # batch size (trained on 32 gpus)\n "train_micro_batch_size_per_gpu": 32,\n "gradient_accumulation_steps": 32,\n "gas": 1,\n "data-impl": "mmap",\n "num_workers": 1,\n\n # activation checkpointing\n "checkpoint-activations": true,\n "checkpoint-num-layers": 1,\n "partition-activations": true,\n "synchronize-each-layer": true,\n\n # regularization\n "gradient_clipping": 1.0,\n "weight-decay": 0.1,\n "hidden-dropout": 0,\n "attention-dropout": 0,\n\n # precision settings\n "fp16": {\n "fp16": true,\n "enabled": true,\n "loss_scale": 0,\n "loss_scale_window": 1000,\n "initial_scale_power": 12,\n "hysteresis": 2,\n "min_loss_scale": 1\n },\n\n "train-iters": 143000,\n "lr-decay-iters": 143000,\n "distributed-backend": "nccl",\n "lr-decay-style": "cosine",\n "warmup": 0.01,\n "checkpoint-factor": 1000,\n "extra-save-iters": [0,1,2,4,8,16,32,64,128,256,512],\n "eval-interval": 40000,\n "eval-iters": 10,\n\n "log-interval": 10,\n "steps_per_print": 10,\n "wall_clock_breakdown": true,\n\n "train-data-paths": ["../input/pythia_mydata_idxmaps/mydata_left_text_document"],\n "valid-data-paths": ["../input/pythia_mydata_idxmaps/mydata_left_text_document"],\n "test-data-paths": ["../input/pythia_mydata_idxmaps/mydata_left_text_document"],\n\n "tokenizer-type": "HFTokenizer",\n "vocab-file": "../input/20B_tokenizer.json",\n\n "launcher": "slurm",\n "deepspeed_slurm": false,\n\n "save": "../checkpoints/ft-left-pythia160m",\n "load": "EleutherAI/pythia-160m",\n "checkpoint_validation_with_forward_pass": False,\n\n}\n'}updated data_impl ....................... mmap........................updated dynamic_loss_scale .............. True........................updated eval_interval ................... 
40000.......................updated eval_iters ...................... 10..........................updated extra_save_iters ................ [0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512]updated fp16 ............................ {'fp16': True, 'enabled': True, 'loss_scale': 0, 'loss_scale_window': 1000, 'initial_scale_power': 12, 'hysteresis': 2, 'min_loss_scale': 1}updated gas ............................. 32..........................updated global_num_gpus ................. 1...........................updated gpt_j_residual .................. True........................updated gradient_accumulation_steps ..... 32..........................updated gradient_clipping ............... 1.0.........................updated hidden_dropout .................. 0...........................updated hidden_size ..................... 768.........................updated init_method ..................... small_init..................updated is_pipe_parallel ................ True........................updated launcher ........................ slurm.......................updated load ............................ EleutherAI/pythia-160m......updated log_interval .................... 10..........................updated lr .............................. 0.0006......................updated lr_decay_iters .................. 143000......................updated lr_decay_style .................. cosine......................updated max_position_embeddings ......... 2048........................updated min_lr .......................... 6e-05.......................updated no_weight_tying ................. True........................updated num_attention_heads ............. 12..........................updated num_layers ...................... 12..........................updated num_workers ..................... 1...........................updated optimizer ....................... {'type': 'Adam', 'params': {'lr': 0.0006, 'betas': [0.9, 0.95], 'eps': 1e-08}}updated optimizer_type .................. Adam........................updated output_layer_init_method ........ wang_init...................updated output_layer_parallelism ........ column......................updated partition_activations ........... True........................updated pipe_parallel_size .............. 1...........................updated pos_emb ......................... rotary......................updated precision ....................... fp16........................updated rotary_pct ...................... 0.25........................updated save ............................ ../checkpoints/ft-left-pythia160mupdated save_iters ...................... 
[0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 11000, 12000, 13000, 14000, 15000, 16000, 17000, 18000, 19000, 20000, 21000, 22000, 23000, 24000, 25000, 26000, 27000, 28000, 29000, 30000, 31000, 32000, 33000, 34000, 35000, 36000, 37000, 38000, 39000, 40000, 41000, 42000, 43000, 44000, 45000, 46000, 47000, 48000, 49000, 50000, 51000, 52000, 53000, 54000, 55000, 56000, 57000, 58000, 59000, 60000, 61000, 62000, 63000, 64000, 65000, 66000, 67000, 68000, 69000, 70000, 71000, 72000, 73000, 74000, 75000, 76000, 77000, 78000, 79000, 80000, 81000, 82000, 83000, 84000, 85000, 86000, 87000, 88000, 89000, 90000, 91000, 92000, 93000, 94000, 95000, 96000, 97000, 98000, 99000, 100000, 101000, 102000, 103000, 104000, 105000, 106000, 107000, 108000, 109000, 110000, 111000, 112000, 113000, 114000, 115000, 116000, 117000, 118000, 119000, 120000, 121000, 122000, 123000, 124000, 125000, 126000, 127000, 128000, 129000, 130000, 131000, 132000, 133000, 134000, 135000, 136000, 137000, 138000, 139000, 140000, 141000, 142000]updated scaled_upper_triang_masked_softmax_fusion True...............updated seq_length ...................... 2048........................updated sparsity_config ................. {}..........................updated synchronize_each_layer .......... True........................updated test_data_paths ................. ['../input/pythia_mydata_idxmaps/mydata_left_text_document']updated test_data_weights ............... [1.0].......................updated text_gen_type ................... unconditional...............updated tokenizer_type .................. HFTokenizer.................updated train_batch_size ................ 1024........................updated train_data_paths ................ ['../input/pythia_mydata_idxmaps/mydata_left_text_document']updated train_data_weights .............. [1.0].......................updated train_iters ..................... 143000......................updated train_micro_batch_size_per_gpu .. 32..........................updated user_script ..................... train.py....................updated valid_data_paths ................ ['../input/pythia_mydata_idxmaps/mydata_left_text_document']updated valid_data_weights .............. [1.0].......................updated vocab_file ...................... ../input/20B_tokenizer.json.updated wall_clock_breakdown ............ True........................updated wandb_group ..................... H3hTSTVUCE7vUsqcaWJsVe_el97yuzzupdated weight_decay .................... 0.1.........................updated zero_allgather_bucket_size ...... 500000000...................updated zero_contiguous_gradients ....... True........................updated zero_optimization ............... {'stage': 1, 'allgather_partitions': True, 'allgather_bucket_size': 500000000, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 500000000, 'contiguous_gradients': True, 'cpu_offload': False}updated zero_reduce_bucket_size ......... 500000000...................updated zero_reduce_scatter ............. True........................updated zero_stage ...................... 1...........................updated activation ...................... gelu........................default adlr_autoresume ................. False.......................default adlr_autoresume_interval ........ 1000........................default amp ............................. None........................default apply_query_key_layer_scaling ... 
False.......................default attention_softmax_in_fp32 ....... False.......................default bias_dropout_fusion ............. False.......................default char_level_ppl .................. False.......................default checkpoint_in_cpu ............... False.......................default checkpoint_num_layers ........... 1...........................default checkpoint_scale ................ linear......................default checkpoint_validation_with_forward_pass False................default comment ......................... None........................default contiguous_checkpointing ........ False.......................default data_path ....................... None........................default deepscale ....................... False.......................default deepscale_config ................ None........................default deepspeed ....................... True........................default deepspeed_activation_checkpointing True......................default deepspeed_mpi ................... False.......................default deepspeed_slurm ................. False.......................default detect_nvlink_pairs ............. False.......................default distributed_backend ............. nccl........................default do_test ......................... None........................default do_train ........................ None........................default do_valid ........................ None........................default dump_state ...................... False.......................default eod_mask_loss ................... False.......................default eval_results_prefix ............. ............................default eval_tasks ...................... None........................default exclude ......................... None........................default exit_interval ................... None........................default finetune ........................ False.......................default flops_profiler .................. None........................default fp16_lm_cross_entropy ........... False.......................default fp32_allreduce .................. False.......................default git_hash ........................ 71df4d50....................default gmlp_attn_dim ................... 64..........................default gpt_j_tied ...................... False.......................default gradient_noise_scale_cpu_offload False.......................default gradient_noise_scale_n_batches .. 5...........................default gradient_predivide_factor ....... 1.0.........................default hostfile ........................ None........................default hysteresis ...................... 2...........................default include ......................... None........................default init_method_std ................. 0.02........................default iteration ....................... None........................default keep_last_n_checkpoints ......... None........................default layernorm_epsilon ............... 1e-05.......................default lazy_mpu_init ................... False.......................default local_rank ...................... None........................default log_dir ......................... None........................default log_grad_norm ................... False.......................default log_grad_pct_zeros .............. False.......................default log_gradient_noise_scale ........ 
False.......................default log_optimizer_states ............ False.......................default log_param_norm .................. False.......................default loss_scale ...................... None........................default loss_scale_window ............... 1000.0......................default make_vocab_size_divisible_by .... 128.........................default master_addr ..................... None........................default master_port ..................... 29500.......................default maximum_tokens .................. 64..........................default merge_file ...................... None........................default min_scale ....................... 1.0.........................default mmap_warmup ..................... False.......................default model_parallel_size ............. 1...........................default no_load_optim ................... False.......................default no_load_rng ..................... False.......................default no_save_optim ................... False.......................default no_save_rng ..................... False.......................default norm ............................ layernorm...................default num_gpus ........................ None........................default num_nodes ....................... -1..........................default num_samples ..................... 1...........................default num_unique_layers ............... None........................default onnx_safe ....................... False.......................default opt_pos_emb_offset .............. 0...........................default override_lr_scheduler ........... False.......................default padded_vocab_size ............... None........................default param_sharing_style ............. grouped.....................default pipe_partition_method ........... type:transformer|mlp........default prescale_gradients .............. False.......................default profile_backward ................ False.......................default prompt_end ...................... ...........................default rank ............................ None........................default recompute ....................... False.......................default return_logits ................... False.......................default rms_norm_epsilon ................ 1e-08.......................default rotary_emb_base ................. 10000.......................default rpe_max_distance ................ 128.........................default rpe_num_buckets ................. 32..........................default sample_input_file ............... None........................default sample_output_file .............. samples.txt.................default scaled_masked_softmax_fusion .... False.......................default scalenorm_epsilon ............... 1e-08.......................default scheduler ....................... None........................default seed ............................ 1234........................default short_seq_prob .................. 0.1.........................default soft_prompt_tuning .............. None........................default sparse_gradients ................ False.......................default split ........................... 969, 30, 1..................default steps_per_print ................. 10..........................default temperature ..................... 0.0.........................default tensorboard_dir ................. 
None........................default top_k ........................... 0...........................default top_p ........................... 0.0.........................default use_bnb_optimizer ............... False.......................default use_checkpoint_lr_scheduler ..... False.......................default use_cpu_initialization .......... False.......................default use_shared_fs ................... True........................default use_wandb ....................... None........................default wandb_host ...................... https://api.wandb.ai........default wandb_init_all_ranks ............ False.......................default wandb_project ................... neox........................default wandb_team ...................... None........................default warmup .......................... 0.01........................default weight_by_num_documents ......... False.......................default weighted_sampler_alpha .......... 0.3.........................default world_size ...................... None........................default zero_allow_untested_optimizer ... False.......................default ---------------- end of arguments ---------------- [2025-07-22 13:10:06,552] [WARNING] [runner.py:126:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only. [2025-07-22 13:10:06,552] [INFO] [runner.py:366:main] cmd = /usr/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 train.py --deepspeed_config {"train_batch_size": 1024, "train_micro_batch_size_per_gpu": 32, "gradient_accumulation_steps": 32, "optimizer": {"type": "Adam", "params": {"lr": 0.0006, "betas": [0.9, 0.95], "eps": 1e-08}}, "fp16": {"fp16": true, "enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 12, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 1, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": false}, "wall_clock_breakdown": true} --megatron_config {"launcher": "slurm", "train_batch_size": 1024, "train_micro_batch_size_per_gpu": 32, "gradient_accumulation_steps": 32, "optimizer": {"type": "Adam", "params": {"lr": 0.0006, "betas": [0.9, 0.95], "eps": 1e-08}}, "fp16": {"fp16": true, "enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 12, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 1, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": false}, "wall_clock_breakdown": true, "precision": "fp16", "num_layers": 12, "hidden_size": 768, "num_attention_heads": 12, "seq_length": 2048, "max_position_embeddings": 2048, "pos_emb": "rotary", "no_weight_tying": true, "attention_config": ["flash", "flash", "flash", "flash", "flash", "flash", "flash", "flash", "flash", "flash", "flash", "flash"], "sparsity_config": {}, "scaled_upper_triang_masked_softmax_fusion": true, "bias_gelu_fusion": true, "rotary_pct": 0.25, "init_method": "small_init", "output_layer_init_method": "wang_init", "gpt_j_residual": true, "output_layer_parallelism": "column", "lr_decay_style": "cosine", "lr_decay_iters": 143000, "min_lr": 6e-05, "optimizer_type": "Adam", "zero_stage": 1, 
"zero_reduce_scatter": true, "zero_contiguous_gradients": true, "zero_reduce_bucket_size": 500000000, "zero_allgather_bucket_size": 500000000, "lr": 0.0006, "tokenizer_type": "HFTokenizer", "train_data_paths": ["../input/pythia_mydata_idxmaps/mydata_left_text_document"], "test_data_paths": ["../input/pythia_mydata_idxmaps/mydata_left_text_document"], "valid_data_paths": ["../input/pythia_mydata_idxmaps/mydata_left_text_document"], "train_data_weights": [1.0], "valid_data_weights": [1.0], "test_data_weights": [1.0], "data_impl": "mmap", "save": "../checkpoints/ft-left-pythia160m", "config_files": {"ft-pythia-160m.yml": "{\n # parallelism settings\n \"pipe-parallel-size\": 1,\n \"model-parallel-size\": 1,\n\n # model settings\n \"num-layers\": 12,\n \"hidden-size\": 768,\n \"num-attention-heads\": 12,\n \"seq-length\": 2048,\n \"max-position-embeddings\": 2048,\n \"pos-emb\": \"rotary\",\n \"rotary-pct\": 0.25,\n \"no-weight-tying\": true,\n \"gpt-j-residual\": true,\n \"output-layer-parallelism\": \"column\",\n \n \"attention-config\": [[[\"flash\"], 12]],\n\n \"scaled-upper-triang-masked-softmax-fusion\": true,\n \"bias-gelu-fusion\": true,\n\n # init methods\n \"init_method\": \"small_init\",\n \"output_layer_init_method\": \"wang_init\",\n\n \"optimizer\": {\n \"type\": \"Adam\",\n \"params\": {\n \"lr\": 0.0006,\n \"betas\": [0.9, 0.95],\n \"eps\": 1.0e-8\n }\n },\n \"min_lr\": 0.00006,\n\n \"zero_optimization\": {\n \"stage\": 1,\n \"allgather_partitions\": true,\n \"allgather_bucket_size\": 500000000,\n \"overlap_comm\": true,\n \"reduce_scatter\": true,\n \"reduce_bucket_size\": 500000000,\n \"contiguous_gradients\": true,\n \"cpu_offload\": false\n },\n\n # batch size (trained on 32 gpus)\n \"train_micro_batch_size_per_gpu\": 32,\n \"gradient_accumulation_steps\": 32,\n \"gas\": 1,\n \"data-impl\": \"mmap\",\n \"num_workers\": 1,\n\n # activation checkpointing\n \"checkpoint-activations\": true,\n \"checkpoint-num-layers\": 1,\n \"partition-activations\": true,\n \"synchronize-each-layer\": true,\n\n # regularization\n \"gradient_clipping\": 1.0,\n \"weight-decay\": 0.1,\n \"hidden-dropout\": 0,\n \"attention-dropout\": 0,\n\n # precision settings\n \"fp16\": {\n \"fp16\": true,\n \"enabled\": true,\n \"loss_scale\": 0,\n \"loss_scale_window\": 1000,\n \"initial_scale_power\": 12,\n \"hysteresis\": 2,\n \"min_loss_scale\": 1\n },\n\n \"train-iters\": 143000,\n \"lr-decay-iters\": 143000,\n \"distributed-backend\": \"nccl\",\n \"lr-decay-style\": \"cosine\",\n \"warmup\": 0.01,\n \"checkpoint-factor\": 1000,\n \"extra-save-iters\": [0,1,2,4,8,16,32,64,128,256,512],\n \"eval-interval\": 40000,\n \"eval-iters\": 10,\n\n \"log-interval\": 10,\n \"steps_per_print\": 10,\n \"wall_clock_breakdown\": true,\n\n \"train-data-paths\": [\"../input/pythia_mydata_idxmaps/mydata_left_text_document\"],\n \"valid-data-paths\": [\"../input/pythia_mydata_idxmaps/mydata_left_text_document\"],\n \"test-data-paths\": [\"../input/pythia_mydata_idxmaps/mydata_left_text_document\"],\n\n \"tokenizer-type\": \"HFTokenizer\",\n \"vocab-file\": \"../input/20B_tokenizer.json\",\n\n \"launcher\": \"slurm\",\n \"deepspeed_slurm\": false,\n\n \"save\": \"../checkpoints/ft-left-pythia160m\",\n \"load\": \"EleutherAI/pythia-160m\",\n \"checkpoint_validation_with_forward_pass\": False,\n\n}\n"}, "load": "EleutherAI/pythia-160m", "checkpoint_factor": 1000, "extra_save_iters": [0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512], "batch_size": 32, "train_iters": 143000, "eval_iters": 10, "eval_interval": 40000, "vocab_file": 
"../input/20B_tokenizer.json", "num_workers": 1, "attention_dropout": 0, "hidden_dropout": 0, "weight_decay": 0.1, "checkpoint_activations": true, "synchronize_each_layer": true, "partition_activations": true, "gas": 32, "clip_grad": 1.0, "dynamic_loss_scale": true, "pipe_parallel_size": 1, "is_pipe_parallel": true, "wandb_group": "H3hTSTVUCE7vUsqcaWJsVe_el97yuzz", "log_interval": 10, "text_gen_type": "unconditional", "user_script": "train.py", "save_iters": [0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 11000, 12000, 13000, 14000, 15000, 16000, 17000, 18000, 19000, 20000, 21000, 22000, 23000, 24000, 25000, 26000, 27000, 28000, 29000, 30000, 31000, 32000, 33000, 34000, 35000, 36000, 37000, 38000, 39000, 40000, 41000, 42000, 43000, 44000, 45000, 46000, 47000, 48000, 49000, 50000, 51000, 52000, 53000, 54000, 55000, 56000, 57000, 58000, 59000, 60000, 61000, 62000, 63000, 64000, 65000, 66000, 67000, 68000, 69000, 70000, 71000, 72000, 73000, 74000, 75000, 76000, 77000, 78000, 79000, 80000, 81000, 82000, 83000, 84000, 85000, 86000, 87000, 88000, 89000, 90000, 91000, 92000, 93000, 94000, 95000, 96000, 97000, 98000, 99000, 100000, 101000, 102000, 103000, 104000, 105000, 106000, 107000, 108000, 109000, 110000, 111000, 112000, 113000, 114000, 115000, 116000, 117000, 118000, 119000, 120000, 121000, 122000, 123000, 124000, 125000, 126000, 127000, 128000, 129000, 130000, 131000, 132000, 133000, 134000, 135000, 136000, 137000, 138000, 139000, 140000, 141000, 142000], "global_num_gpus": 1} [2025-07-22 13:10:07,266] [INFO] [launch.py:82:main] WORLD INFO DICT: {'localhost': [0]} [2025-07-22 13:10:07,266] [INFO] [launch.py:88:main] nnodes=1, num_local_procs=1, node_rank=0 [2025-07-22 13:10:07,266] [INFO] [launch.py:103:main] global_rank_mapping=defaultdict(, {'localhost': [0]}) [2025-07-22 13:10:07,266] [INFO] [launch.py:104:main] dist_world_size=1 [2025-07-22 13:10:07,266] [INFO] [launch.py:112:main] Setting CUDA_VISIBLE_DEVICES=0 NeoXArgs.configure_distributed_args() using world size: 1 and model-parallel size: 1 > building HFTokenizer tokenizer ... > padded vocab (size: 50277) with 27 dummy tokens (new size: 50304) > initializing torch distributed ... [2025-07-22 13:10:10,724] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl > initializing model parallel with size 1 MPU DP: [0] MPU PP: [0] MPU MP: [0] > setting random seeds to 1234 ... [2025-07-22 13:10:10,727] [INFO] [checkpointing.py:223:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234 make: Entering directory '/data/dusi/pythia-retrain/gpt-neox/megatron/data' make: Nothing to be done for 'default'. make: Leaving directory '/data/dusi/pythia-retrain/gpt-neox/megatron/data' building GPT2 model ... 
SEED_LAYERS=False BASE_SEED=1234 SEED_FN=None
Using topology: {ProcessCoord(pipe=0, data=0, model=0): 0}
[2025-07-22 13:10:15,613] [INFO] [module.py:363:_partition_layers] Partitioning pipeline stages with method type:transformer|mlp
stage=0 layers=17
     0: EmbeddingPipe
     1: _pre_transformer_block
     2: ParallelTransformerLayerPipe
     3: ParallelTransformerLayerPipe
     4: ParallelTransformerLayerPipe
     5: ParallelTransformerLayerPipe
     6: ParallelTransformerLayerPipe
     7: ParallelTransformerLayerPipe
     8: ParallelTransformerLayerPipe
     9: ParallelTransformerLayerPipe
    10: ParallelTransformerLayerPipe
    11: ParallelTransformerLayerPipe
    12: ParallelTransformerLayerPipe
    13: ParallelTransformerLayerPipe
    14: _post_transformer_block
    15: NormPipe
    16: ParallelLinearPipe
  loss: partial
Configuring Optimizer type: Adam with params: {'lr': 0.0006, 'betas': [0.9, 0.95], 'eps': 1e-08}
> learning rate decay style: cosine
DeepSpeed is enabled.
[2025-07-22 13:10:15,778] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.15+eb7f5cf, git-hash=eb7f5cf, git-branch=HEAD
[2025-07-22 13:10:15,779] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
[2025-07-22 13:10:15,871] [INFO] [engine.py:654:_configure_optimizer] Removing param_group that has no 'params' in the client Optimizer
[2025-07-22 13:10:15,871] [INFO] [engine.py:659:_configure_optimizer] Using client Optimizer as basic optimizer
[2025-07-22 13:10:15,871] [INFO] [engine.py:668:_configure_optimizer] DeepSpeed Basic Optimizer = FusedAdam
Checking ZeRO support for optimizer=FusedAdam type=
[2025-07-22 13:10:15,871] [INFO] [logging.py:60:log_dist] [Rank 0] Creating fp16 ZeRO stage 1 optimizer
Using /root/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.505185604095459 seconds [2025-07-22 13:10:16,379] [INFO] [stage1.py:160:__init__] ZeRO Elastic Checkpoint = True [2025-07-22 13:10:16,379] [INFO] [logging.py:60:log_dist] [Rank 0] Using default max_elements_per_comm 500000000 [2025-07-22 13:10:16,379] [INFO] [logging.py:60:log_dist] [Rank 0] Total number of elements in model: 162201600, max elements per com: 500000000 [2025-07-22 13:10:16,379] [INFO] [logging.py:60:log_dist] [Rank 0] sub_partition_count: 1, sub_partition_size: 162201600, padding: 0 [2025-07-22 13:10:16,379] [INFO] [logging.py:60:log_dist] [Rank 0] number of elements with padding: 162201600 + 0 = 162201600 [2025-07-22 13:10:16,382] [INFO] [stage1.py:375:get_data_parallel_sub_partitions] **** partition info: [2025-07-22 13:10:16,382] [INFO] [stage1.py:376:get_data_parallel_sub_partitions] total_num_elements=162201600 [2025-07-22 13:10:16,383] [INFO] [stage1.py:377:get_data_parallel_sub_partitions] world_size=1 [2025-07-22 13:10:16,383] [INFO] [stage1.py:378:get_data_parallel_sub_partitions] max_elements_per_comm=162201600 [2025-07-22 13:10:16,383] [INFO] [stage1.py:379:get_data_parallel_sub_partitions] sub_partition_size=162201600 [2025-07-22 13:10:16,384] [INFO] [stage1.py:380:get_data_parallel_sub_partitions] num_sub_partitions=1 [2025-07-22 13:10:16,384] [INFO] [stage1.py:381:get_data_parallel_sub_partitions] num_comm_intervals=1 [2025-07-22 13:10:16,384] [INFO] [stage1.py:382:get_data_parallel_sub_partitions] **** [2025-07-22 13:10:16,387] [INFO] [logging.py:60:log_dist] [Rank 0] Using default max_elements_per_comm 500000000 [2025-07-22 13:10:16,387] [INFO] [logging.py:60:log_dist] [Rank 0] Total number of elements in model: 121344, max elements per com: 500000000 [2025-07-22 13:10:16,387] [INFO] [logging.py:60:log_dist] [Rank 0] sub_partition_count: 1, sub_partition_size: 121344, padding: 0 [2025-07-22 13:10:16,387] [INFO] [logging.py:60:log_dist] [Rank 0] number of elements with padding: 121344 + 0 = 121344 [2025-07-22 13:10:16,388] [INFO] [stage1.py:375:get_data_parallel_sub_partitions] **** partition info: [2025-07-22 13:10:16,388] [INFO] [stage1.py:376:get_data_parallel_sub_partitions] total_num_elements=121344 [2025-07-22 13:10:16,388] [INFO] [stage1.py:377:get_data_parallel_sub_partitions] world_size=1 [2025-07-22 13:10:16,389] [INFO] [stage1.py:378:get_data_parallel_sub_partitions] max_elements_per_comm=121344 [2025-07-22 13:10:16,389] [INFO] [stage1.py:379:get_data_parallel_sub_partitions] sub_partition_size=121344 [2025-07-22 13:10:16,389] [INFO] [stage1.py:380:get_data_parallel_sub_partitions] num_sub_partitions=1 [2025-07-22 13:10:16,389] [INFO] [stage1.py:381:get_data_parallel_sub_partitions] num_comm_intervals=1 [2025-07-22 13:10:16,389] [INFO] [stage1.py:382:get_data_parallel_sub_partitions] **** [2025-07-22 13:10:16,679] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam [2025-07-22 13:10:16,679] [INFO] [engine.py:498:_configure_lr_scheduler] DeepSpeed using client LR scheduler [2025-07-22 13:10:16,679] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = [2025-07-22 13:10:16,679] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[[0.9, 0.95], [0.9, 0.95]] [2025-07-22 13:10:16,679] [INFO] [config.py:759:print] DeepSpeedEngine configuration: [2025-07-22 13:10:16,680] [INFO] [config.py:763:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, 
"synchronize_checkpoint_boundary": false, "profile": false } [2025-07-22 13:10:16,680] [INFO] [config.py:763:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True} [2025-07-22 13:10:16,680] [INFO] [config.py:763:print] allreduce_always_fp32 ........ False [2025-07-22 13:10:16,680] [INFO] [config.py:763:print] amp_enabled .................. False [2025-07-22 13:10:16,680] [INFO] [config.py:763:print] amp_params ................... False [2025-07-22 13:10:16,680] [INFO] [config.py:763:print] checkpoint_tag_validation_enabled True [2025-07-22 13:10:16,680] [INFO] [config.py:763:print] checkpoint_tag_validation_fail False [2025-07-22 13:10:16,680] [INFO] [config.py:763:print] disable_allgather ............ False [2025-07-22 13:10:16,680] [INFO] [config.py:763:print] dump_state ................... False [2025-07-22 13:10:16,680] [INFO] [config.py:763:print] dynamic_loss_scale_args ...... {'init_scale': 4096, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1} [2025-07-22 13:10:16,680] [INFO] [config.py:763:print] elasticity_enabled ........... False [2025-07-22 13:10:16,680] [INFO] [config.py:763:print] flops_profiler_config ........ { "enabled": false, "profile_step": 1, "module_depth": -1, "top_modules": 3, "detailed": true } [2025-07-22 13:10:16,680] [INFO] [config.py:763:print] fp16_enabled ................. True [2025-07-22 13:10:16,680] [INFO] [config.py:763:print] fp16_type .................... fp16 [2025-07-22 13:10:16,680] [INFO] [config.py:763:print] global_rank .................. 0 [2025-07-22 13:10:16,681] [INFO] [config.py:763:print] gradient_accumulation_steps .. 32 [2025-07-22 13:10:16,681] [INFO] [config.py:763:print] gradient_clipping ............ 1.0 [2025-07-22 13:10:16,681] [INFO] [config.py:763:print] gradient_predivide_factor .... 1.0 [2025-07-22 13:10:16,681] [INFO] [config.py:763:print] initial_dynamic_scale ........ 4096 [2025-07-22 13:10:16,681] [INFO] [config.py:763:print] loss_scale ................... 0 [2025-07-22 13:10:16,681] [INFO] [config.py:763:print] memory_breakdown ............. False [2025-07-22 13:10:16,681] [INFO] [config.py:763:print] optimizer_legacy_fusion ...... False [2025-07-22 13:10:16,681] [INFO] [config.py:763:print] optimizer_name ............... adam [2025-07-22 13:10:16,681] [INFO] [config.py:763:print] optimizer_params ............. {'lr': 0.0006, 'betas': [0.9, 0.95], 'eps': 1e-08} [2025-07-22 13:10:16,681] [INFO] [config.py:763:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0} [2025-07-22 13:10:16,681] [INFO] [config.py:763:print] pld_enabled .................. False [2025-07-22 13:10:16,681] [INFO] [config.py:763:print] pld_params ................... False [2025-07-22 13:10:16,681] [INFO] [config.py:763:print] precision .................... torch.float16 [2025-07-22 13:10:16,681] [INFO] [config.py:763:print] prescale_gradients ........... False [2025-07-22 13:10:16,681] [INFO] [config.py:763:print] scheduler_name ............... None [2025-07-22 13:10:16,681] [INFO] [config.py:763:print] scheduler_params ............. None [2025-07-22 13:10:16,681] [INFO] [config.py:763:print] sparse_attention ............. None [2025-07-22 13:10:16,681] [INFO] [config.py:763:print] sparse_gradients_enabled ..... False [2025-07-22 13:10:16,681] [INFO] [config.py:763:print] steps_per_print .............. 
10 [2025-07-22 13:10:16,681] [INFO] [config.py:763:print] tensorboard_enabled .......... False [2025-07-22 13:10:16,681] [INFO] [config.py:763:print] tensorboard_job_name ......... DeepSpeedJobName [2025-07-22 13:10:16,681] [INFO] [config.py:763:print] tensorboard_output_path ...... [2025-07-22 13:10:16,681] [INFO] [config.py:763:print] train_batch_size ............. 1024 [2025-07-22 13:10:16,681] [INFO] [config.py:763:print] train_micro_batch_size_per_gpu 32 [2025-07-22 13:10:16,681] [INFO] [config.py:763:print] wall_clock_breakdown ......... True [2025-07-22 13:10:16,681] [INFO] [config.py:763:print] world_size ................... 1 [2025-07-22 13:10:16,681] [INFO] [config.py:763:print] zero_allow_untested_optimizer False [2025-07-22 13:10:16,681] [INFO] [config.py:763:print] zero_config .................. { "stage": 1, "contiguous_gradients": true, "reduce_scatter": true, "reduce_bucket_size": 5.000000e+08, "allgather_partitions": true, "allgather_bucket_size": 5.000000e+08, "overlap_comm": true, "load_from_fp32_weights": true, "elastic_checkpoint": true, "offload_param": null, "offload_optimizer": null, "sub_group_size": 1.000000e+12, "prefetch_bucket_size": 5.000000e+07, "param_persistence_threshold": 1.000000e+05, "max_live_parameters": 1.000000e+09, "max_reuse_distance": 1.000000e+09, "gather_fp16_weights_on_model_save": false } [2025-07-22 13:10:16,681] [INFO] [config.py:763:print] zero_enabled ................. True [2025-07-22 13:10:16,681] [INFO] [config.py:763:print] zero_optimization_stage ...... 1 [2025-07-22 13:10:16,682] [INFO] [config.py:765:print] json = { "train_batch_size": 1.024000e+03, "train_micro_batch_size_per_gpu": 32, "gradient_accumulation_steps": 32, "optimizer": { "type": "Adam", "params": { "lr": 0.0006, "betas": [0.9, 0.95], "eps": 1e-08 } }, "fp16": { "fp16": true, "enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 12, "hysteresis": 2, "min_loss_scale": 1 }, "gradient_clipping": 1.0, "zero_optimization": { "stage": 1, "allgather_partitions": true, "allgather_bucket_size": 5.000000e+08, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 5.000000e+08, "contiguous_gradients": true, "cpu_offload": false }, "wall_clock_breakdown": true } Using /root/.cache/torch_extensions as PyTorch extensions root... No modifications detected for re-loaded extension module utils, skipping build step... Loading extension module utils... Time to load utils op: 0.0006787776947021484 seconds [2025-07-22 13:10:16,683] [INFO] [engine.py:84:__init__] CONFIG: micro_batches=32 micro_batch_size=32 [2025-07-22 13:10:16,722] [INFO] [engine.py:141:__init__] RANK=0 STAGE=0 LAYERS=17 [0, 17) STAGE_PARAMS=162322944 (162.323M) TOTAL_PARAMS=162322944 (162.323M) UNIQUE_PARAMS=162322944 (162.323M) > number of parameters on model parallel rank 0: 162322944 > total params: 162,322,944 [2025-07-22 13:10:16,761] [WARNING] [engine.py:1519:load_checkpoint] Unable to find latest file at EleutherAI/pythia-160m/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. Unable to load checkpoint. Loading checkpoint and starting from iteration 0 > building train, validation, and test datasets ... reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... train_0: no. 
of documents:281298 > loading doc-idx mapping from ../input/pythia_mydata_idxmaps/mydata_left_text_document_train_0_indexmap_147164160ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from ../input/pythia_mydata_idxmaps/mydata_left_text_document_train_0_indexmap_147164160ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from ../input/pythia_mydata_idxmaps/mydata_left_text_document_train_0_indexmap_147164160ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.177 seconds total number of samples: 147236903 total number of epochs: 1203 WARNING: shuffle index length (147236901) is not equal to sample index length (147236902) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... valid_0: no. of documents:281298 > loading doc-idx mapping from ../input/pythia_mydata_idxmaps/mydata_left_text_document_valid_0_indexmap_41165ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from ../input/pythia_mydata_idxmaps/mydata_left_text_document_valid_0_indexmap_41165ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from ../input/pythia_mydata_idxmaps/mydata_left_text_document_valid_0_indexmap_41165ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.110 seconds total number of samples: 122392 total number of epochs: 1 WARNING: shuffle index length (122390) is not equal to sample index length (122391) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... test_0: no. of documents:281298 > loading doc-idx mapping from ../input/pythia_mydata_idxmaps/mydata_left_text_document_test_0_indexmap_10292ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from ../input/pythia_mydata_idxmaps/mydata_left_text_document_test_0_indexmap_10292ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from ../input/pythia_mydata_idxmaps/mydata_left_text_document_test_0_indexmap_10292ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.115 seconds total number of samples: 122392 total number of epochs: 1 WARNING: shuffle index length (122390) is not equal to sample index length (122391) > building indices for blendable datasets ... > sample ratios: dataset 0, input: 1, achieved: 1 > RANK 0 elapsed time for building blendable dataset indices: 0.73 (sec) > building indices for blendable datasets ... > sample ratios: dataset 0, input: 1, achieved: 1 > RANK 0 elapsed time for building blendable dataset indices: 0.00 (sec) > building indices for blendable datasets ... > sample ratios: dataset 0, input: 1, achieved: 1 > RANK 0 elapsed time for building blendable dataset indices: 0.00 (sec) setting training data start iteration to 0 setting validation data start iteration to 0 done with setups ... time (ms) | model and optimizer: 1155.94 | train/valid/test data iterators: 1937.87 training ... 
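The run begins with the dense early checkpoints that follow (global_step0, 1, 2, 4, 8, ...). That schedule is the save_iters list printed in the arguments above; a minimal sketch of how it follows from checkpoint-factor, extra-save-iters, and train-iters in the YAML (an illustrative reconstruction, not the GPT-NeoX source):

    # Illustrative reconstruction of the save_iters schedule from the config values above:
    # a checkpoint every `checkpoint-factor` steps plus the log-spaced `extra-save-iters`.
    checkpoint_factor = 1000
    train_iters = 143000
    extra_save_iters = [0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512]

    save_iters = sorted(set(extra_save_iters) | set(range(checkpoint_factor, train_iters, checkpoint_factor)))

    assert save_iters[:13] == [0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1000, 2000]
    assert save_iters[-1] == 142000  # matches the list printed in the arguments dump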
[2025-07-22 13:10:23,008] [INFO] [logging.py:60:log_dist] [Rank 0] Saving model checkpoint: ../checkpoints/ft-left-pythia160m/global_step0/mp_rank_00_model_states.pt [2025-07-22 13:10:34,061] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script ../checkpoints/ft-left-pythia160m/zero_to_fp32.py [2025-07-22 13:10:34,065] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved ../checkpoints/ft-left-pythia160m/global_step0/zero_pp_rank_0_mp_rank_00_optim_states.pt [2025-07-22 13:10:34,147] [INFO] [checkpointing.py:405:forward] Activation Checkpointing Information [2025-07-22 13:10:34,147] [INFO] [checkpointing.py:406:forward] ----Partition Activations True, CPU CHECKPOINTING False [2025-07-22 13:10:34,147] [INFO] [checkpointing.py:409:forward] ----contiguous Memory Checkpointing False with 12 total layers [2025-07-22 13:10:34,148] [INFO] [checkpointing.py:412:forward] ----Synchronization True [2025-07-22 13:10:34,148] [INFO] [checkpointing.py:413:forward] ----Profiling False [2025-07-22 13:11:21,712] [INFO] [logging.py:60:log_dist] [Rank 0] Saving model checkpoint: ../checkpoints/ft-left-pythia160m/global_step1/mp_rank_00_model_states.pt [2025-07-22 13:11:37,921] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script ../checkpoints/ft-left-pythia160m/zero_to_fp32.py [2025-07-22 13:11:37,935] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved ../checkpoints/ft-left-pythia160m/global_step1/zero_pp_rank_0_mp_rank_00_optim_states.pt [2025-07-22 13:12:12,028] [INFO] [logging.py:60:log_dist] [Rank 0] Saving model checkpoint: ../checkpoints/ft-left-pythia160m/global_step2/mp_rank_00_model_states.pt [2025-07-22 13:12:26,563] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script ../checkpoints/ft-left-pythia160m/zero_to_fp32.py [2025-07-22 13:12:26,567] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved ../checkpoints/ft-left-pythia160m/global_step2/zero_pp_rank_0_mp_rank_00_optim_states.pt [2025-07-22 13:13:33,027] [INFO] [logging.py:60:log_dist] [Rank 0] Saving model checkpoint: ../checkpoints/ft-left-pythia160m/global_step4/mp_rank_00_model_states.pt [2025-07-22 13:13:47,411] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script ../checkpoints/ft-left-pythia160m/zero_to_fp32.py [2025-07-22 13:13:47,415] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved ../checkpoints/ft-left-pythia160m/global_step4/zero_pp_rank_0_mp_rank_00_optim_states.pt [2025-07-22 13:16:02,980] [INFO] [logging.py:60:log_dist] [Rank 0] Saving model checkpoint: ../checkpoints/ft-left-pythia160m/global_step8/mp_rank_00_model_states.pt [2025-07-22 13:16:17,519] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script ../checkpoints/ft-left-pythia160m/zero_to_fp32.py [2025-07-22 13:16:17,528] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved ../checkpoints/ft-left-pythia160m/global_step8/zero_pp_rank_0_mp_rank_00_optim_states.pt [2025-07-22 13:17:20,377] [INFO] [logging.py:60:log_dist] [Rank 0] step=10, skipped=0, lr=[4.195804195804195e-06, 4.195804195804195e-06], mom=[[0.9, 0.95], [0.9, 0.95]] steps: 10 loss: 10.5169 iter time (s): 33.021 samples/sec: 31.010 %comms: 0.0029202485893560707 %optimizer_step 0.08740666622226845 %forward: 19.407368884071147 %backward: 60.109740077045515 [2025-07-22 13:17:20,377] [INFO] [logging.py:60:log_dist] [Rank 0] rank=0 time (ms) | train_batch: 0.00 | batch_input: 39406.22 | forward: 64085.90 | backward_microstep: 198502.35 | backward: 
198490.95 | backward_inner_microstep: 198468.66 | backward_inner: 198460.84 | backward_allreduce_microstep: 11.27 | backward_allreduce: 4.00 | reduce_tied_grads: 0.42 | comms: 9.64 | reduce_grads: 0.29 | step: 288.63 | _step_clipping: 0.14 | _step_step: 286.47 | _step_zero_grad: 0.63 | _step_check_overflow: 0.63 samples/sec: 25.203 | iteration 10/ 143000 | elapsed time per iteration (ms): 40630.4 | learning rate: 4.196E-06 | approx flops per GPU: 108.7TFLOPS | lm_loss: 1.089116E+01 | loss scale: 4096.0 | number of skipped iterations: 0 | number of nan iterations: 0 | after 10 iterations memory (MB) | allocated: 14757.37890625 | max allocated: 35286.4248046875 | reserved: 37280.0 | max reserved: 37280.0 time (ms) [2025-07-22 13:20:28,551] [INFO] [logging.py:60:log_dist] [Rank 0] Saving model checkpoint: ../checkpoints/ft-left-pythia160m/global_step16/mp_rank_00_model_states.pt [2025-07-22 13:20:43,071] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script ../checkpoints/ft-left-pythia160m/zero_to_fp32.py [2025-07-22 13:20:43,078] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved ../checkpoints/ft-left-pythia160m/global_step16/zero_pp_rank_0_mp_rank_00_optim_states.pt [2025-07-22 13:22:47,039] [INFO] [logging.py:60:log_dist] [Rank 0] step=20, skipped=0, lr=[8.39160839160839e-06, 8.39160839160839e-06], mom=[[0.9, 0.95], [0.9, 0.95]] steps: 20 loss: 9.7564 iter time (s): 30.885 samples/sec: 33.155 %comms: 0.003041775796182248 %optimizer_step 0.09716823248374855 %forward: 20.261740623314466 %backward: 63.97170341054094 [2025-07-22 13:22:47,039] [INFO] [logging.py:60:log_dist] [Rank 0] rank=0 time (ms) | train_batch: 0.00 | batch_input: 20514.58 | forward: 62579.19 | backward_microstep: 197588.43 | backward: 197579.15 | backward_inner_microstep: 197557.73 | backward_inner: 197550.11 | backward_allreduce_microstep: 10.82 | backward_allreduce: 3.84 | reduce_tied_grads: 0.48 | comms: 9.39 | reduce_grads: 0.34 | step: 300.11 | _step_clipping: 0.16 | _step_step: 298.09 | _step_zero_grad: 0.52 | _step_check_overflow: 0.54 samples/sec: 31.347 | iteration 20/ 143000 | elapsed time per iteration (ms): 32666.2 | learning rate: 8.392E-06 | approx flops per GPU: 135.2TFLOPS | lm_loss: 1.004962E+01 | loss scale: 4096.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) [2025-07-22 13:28:03,221] [INFO] [logging.py:60:log_dist] [Rank 0] step=30, skipped=0, lr=[1.2587412587412587e-05, 1.2587412587412587e-05], mom=[[0.9, 0.95], [0.9, 0.95]] steps: 30 loss: 9.4209 iter time (s): 31.618 samples/sec: 32.387 %comms: 0.005141703418819091 %optimizer_step 0.1011653109091823 %forward: 21.04496404181317 %backward: 62.5296576207062 [2025-07-22 13:28:03,222] [INFO] [logging.py:60:log_dist] [Rank 0] rank=0 time (ms) | train_batch: 0.00 | batch_input: 23696.62 | forward: 66539.03 | backward_microstep: 197713.69 | backward: 197703.49 | backward_inner_microstep: 197683.18 | backward_inner: 197675.80 | backward_allreduce_microstep: 10.13 | backward_allreduce: 3.64 | reduce_tied_grads: 0.49 | comms: 16.26 | reduce_grads: 0.26 | step: 319.86 | _step_clipping: 0.12 | _step_step: 317.66 | _step_zero_grad: 0.65 | _step_check_overflow: 0.74 samples/sec: 32.386 | iteration 30/ 143000 | elapsed time per iteration (ms): 31618.3 | learning rate: 1.259E-05 | approx flops per GPU: 139.7TFLOPS | lm_loss: 9.543085E+00 | loss scale: 4096.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) [2025-07-22 13:29:08,968] [INFO] [logging.py:60:log_dist] [Rank 0] 
Saving model checkpoint: ../checkpoints/ft-left-pythia160m/global_step32/mp_rank_00_model_states.pt [2025-07-22 13:29:24,282] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script ../checkpoints/ft-left-pythia160m/zero_to_fp32.py [2025-07-22 13:29:24,292] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved ../checkpoints/ft-left-pythia160m/global_step32/zero_pp_rank_0_mp_rank_00_optim_states.pt [2025-07-22 13:33:34,590] [INFO] [logging.py:60:log_dist] [Rank 0] step=40, skipped=0, lr=[1.678321678321678e-05, 1.678321678321678e-05], mom=[[0.9, 0.95], [0.9, 0.95]] steps: 40 loss: 9.1687 iter time (s): 31.260 samples/sec: 32.757 %comms: 0.003061277997044414 %optimizer_step 0.09679032201786736 %forward: 20.146086309342977 %backward: 63.47026190024935 [2025-07-22 13:33:34,591] [INFO] [logging.py:60:log_dist] [Rank 0] rank=0 time (ms) | train_batch: 0.00 | batch_input: 22972.91 | forward: 62977.26 | backward_microstep: 198425.00 | backward: 198409.90 | backward_inner_microstep: 198387.65 | backward_inner: 198378.64 | backward_allreduce_microstep: 10.48 | backward_allreduce: 3.77 | reduce_tied_grads: 0.57 | comms: 9.57 | reduce_grads: 0.25 | step: 302.57 | _step_clipping: 0.14 | _step_step: 300.63 | _step_zero_grad: 0.54 | _step_check_overflow: 0.59 samples/sec: 30.902 | iteration 40/ 143000 | elapsed time per iteration (ms): 33136.9 | learning rate: 1.678E-05 | approx flops per GPU: 133.3TFLOPS | lm_loss: 9.285201E+00 | loss scale: 4096.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) [2025-07-22 13:38:51,868] [INFO] [logging.py:60:log_dist] [Rank 0] step=50, skipped=0, lr=[2.0979020979020977e-05, 2.0979020979020977e-05], mom=[[0.9, 0.95], [0.9, 0.95]] steps: 50 loss: 8.8793 iter time (s): 31.727 samples/sec: 32.275 %comms: 0.0029475459423756348 %optimizer_step 0.08865843063385459 %forward: 19.867652226415608 %backward: 62.337850903629224 [2025-07-22 13:38:51,869] [INFO] [logging.py:60:log_dist] [Rank 0] rank=0 time (ms) | train_batch: 0.00 | batch_input: 28244.20 | forward: 63034.45 | backward_microstep: 197792.29 | backward: 197780.39 | backward_inner_microstep: 197756.62 | backward_inner: 197748.61 | backward_allreduce_microstep: 11.83 | backward_allreduce: 4.35 | reduce_tied_grads: 0.40 | comms: 9.35 | reduce_grads: 0.22 | step: 281.29 | _step_clipping: 0.11 | _step_step: 279.40 | _step_zero_grad: 0.54 | _step_check_overflow: 0.60 samples/sec: 32.275 | iteration 50/ 143000 | elapsed time per iteration (ms): 31727.8 | learning rate: 2.098E-05 | approx flops per GPU: 139.2TFLOPS | lm_loss: 9.011149E+00 | loss scale: 4096.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) [2025-07-22 13:44:06,660] [INFO] [logging.py:60:log_dist] [Rank 0] step=60, skipped=0, lr=[2.5174825174825174e-05, 2.5174825174825174e-05], mom=[[0.9, 0.95], [0.9, 0.95]] steps: 60 loss: 8.5616 iter time (s): 31.479 samples/sec: 32.530 %comms: 0.002975815196947126 %optimizer_step 0.08685585813075757 %forward: 19.7865227432266 %backward: 62.82009812687369 [2025-07-22 13:44:06,661] [INFO] [logging.py:60:log_dist] [Rank 0] rank=0 time (ms) | train_batch: 0.00 | batch_input: 26570.13 | forward: 62285.31 | backward_microstep: 197763.54 | backward: 197749.22 | backward_inner_microstep: 197728.71 | backward_inner: 197721.09 | backward_allreduce_microstep: 10.23 | backward_allreduce: 3.64 | reduce_tied_grads: 0.47 | comms: 9.37 | reduce_grads: 0.26 | step: 273.41 | _step_clipping: 0.18 | _step_step: 271.37 | _step_zero_grad: 0.51 | 
_step_check_overflow: 0.65
samples/sec: 32.529 | iteration 60/ 143000 | elapsed time per iteration (ms): 31479.3 | learning rate: 2.517E-05 | approx flops per GPU: 140.3TFLOPS | lm_loss: 8.707314E+00 | loss scale: 4096.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms)
[2025-07-22 13:46:13,673] [INFO] [logging.py:60:log_dist] [Rank 0] Saving model checkpoint: ../checkpoints/ft-left-pythia160m/global_step64/mp_rank_00_model_states.pt
[2025-07-22 13:46:27,738] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script ../checkpoints/ft-left-pythia160m/zero_to_fp32.py
[2025-07-22 13:46:27,744] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved ../checkpoints/ft-left-pythia160m/global_step64/zero_pp_rank_0_mp_rank_00_optim_states.pt
[2025-07-22 13:49:44,577] [INFO] [logging.py:60:log_dist] [Rank 0] step=70, skipped=0, lr=[2.937062937062937e-05, 2.937062937062937e-05], mom=[[0.9, 0.95], [0.9, 0.95]]
steps: 70 loss: 8.2687 iter time (s): 32.059 samples/sec: 31.941 %comms: 0.0029918048204810943 %optimizer_step 0.09108438928429649 %forward: 19.59232384942212 %backward: 61.604361563416546
[2025-07-22 13:49:44,578] [INFO] [logging.py:60:log_dist] [Rank 0] rank=0 time (ms) | train_batch: 0.00 | batch_input: 32122.27 | forward: 62810.47 | backward_microstep: 197505.99 | backward: 197495.65 | backward_inner_microstep: 197474.68 | backward_inner: 197466.99 | backward_allreduce_microstep: 10.60 | backward_allreduce: 3.78 | reduce_tied_grads: 0.49 | comms: 9.59 | reduce_grads: 0.29 | step: 292.00 | _step_clipping: 0.15 | _step_step: 289.75 | _step_zero_grad: 0.68 | _step_check_overflow: 0.65
samples/sec: 30.303 | iteration 70/ 143000 | elapsed time per iteration (ms): 33791.7 | learning rate: 2.937E-05 | approx flops per GPU: 130.7TFLOPS | lm_loss: 8.397779E+00 | loss scale: 4096.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms)
[2025-07-22 13:55:06,344] [INFO] [logging.py:60:log_dist] [Rank 0] step=80, skipped=0, lr=[3.356643356643356e-05, 3.356643356643356e-05], mom=[[0.9, 0.95], [0.9, 0.95]]
steps: 80 loss: 7.9830 iter time (s): 32.176 samples/sec: 31.825 %comms: 0.0029565817913203727 %optimizer_step 0.08664005041594007 %forward: 19.60686961530985 %backward: 61.50932281535838
[2025-07-22 13:55:06,346] [INFO] [logging.py:60:log_dist] [Rank 0] rank=0 time (ms) | train_batch: 0.00 | batch_input: 32556.53 | forward: 63087.34 | backward_microstep: 197928.38 | backward: 197913.27 | backward_inner_microstep: 197891.41 | backward_inner: 197883.38 | backward_allreduce_microstep: 10.69 | backward_allreduce: 3.85 | reduce_tied_grads: 0.44 | comms: 9.51 | reduce_grads: 0.24 | step: 278.77 | _step_clipping: 0.12 | _step_step: 276.50 | _step_zero_grad: 0.56 | _step_check_overflow: 0.86
samples/sec: 31.824 | iteration 80/ 143000 | elapsed time per iteration (ms): 32176.8 | learning rate: 3.357E-05 | approx flops per GPU: 137.3TFLOPS | lm_loss: 8.102886E+00 | loss scale: 4096.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms)
[2025-07-22 14:00:19,626] [INFO] [logging.py:60:log_dist] [Rank 0] step=90, skipped=0, lr=[3.7762237762237757e-05, 3.7762237762237757e-05], mom=[[0.9, 0.95], [0.9, 0.95]]
steps: 90 loss: 7.7203 iter time (s): 31.327 samples/sec: 32.687 %comms: 0.0032620422704162156 %optimizer_step 0.093225213636618 %forward: 20.110852327481123 %backward: 63.30882982900465
[2025-07-22 14:00:19,628] [INFO] [logging.py:60:log_dist] [Rank 0] rank=0 time (ms) | train_batch: 0.00 | batch_input: 23677.98 | forward: 63001.87 | backward_microstep: 198346.84 | backward: 198329.46 | backward_inner_microstep: 198306.53 | backward_inner: 198297.45 | backward_allreduce_microstep: 10.95 | backward_allreduce: 3.94 | reduce_tied_grads: 0.54 | comms: 10.22 | reduce_grads: 0.28 | step: 292.05 | _step_clipping: 0.15 | _step_step: 288.85 | _step_zero_grad: 0.72 | _step_check_overflow: 1.47
samples/sec: 32.686 | iteration 90/ 143000 | elapsed time per iteration (ms): 31328.2 | learning rate: 3.776E-05 | approx flops per GPU: 141.0TFLOPS | lm_loss: 7.833335E+00 | loss scale: 4096.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms)
[2025-07-22 14:05:33,398] [INFO] [logging.py:60:log_dist] [Rank 0] step=100, skipped=0, lr=[4.1958041958041954e-05, 4.1958041958041954e-05], mom=[[0.9, 0.95], [0.9, 0.95]]
steps: 100 loss: 7.4749 iter time (s): 31.376 samples/sec: 32.636 %comms: 0.0030718526758605398 %optimizer_step 0.0902252051366097 %forward: 19.919019866263042 %backward: 63.10597794544276
[2025-07-22 14:05:33,399] [INFO] [logging.py:60:log_dist] [Rank 0] rank=0 time (ms) | train_batch: 0.00 | batch_input: 25031.85 | forward: 62498.34 | backward_microstep: 198017.65 | backward: 198002.64 | backward_inner_microstep: 197980.13 | backward_inner: 197971.77 | backward_allreduce_microstep: 11.18 | backward_allreduce: 3.95 | reduce_tied_grads: 0.50 | comms: 9.64 | reduce_grads: 0.26 | step: 283.09 | _step_clipping: 0.15 | _step_step: 280.92 | _step_zero_grad: 0.60 | _step_check_overflow: 0.63
samples/sec: 32.635 | iteration 100/ 143000 | elapsed time per iteration (ms): 31377.1 | learning rate: 4.196E-05 | approx flops per GPU: 140.8TFLOPS | lm_loss: 7.582018E+00 | loss scale: 4096.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms)
[2025-07-22 14:10:49,596] [INFO] [logging.py:60:log_dist] [Rank 0] step=110, skipped=0, lr=[4.6153846153846145e-05, 4.6153846153846145e-05], mom=[[0.9, 0.95], [0.9, 0.95]]
steps: 110 loss: 7.2539 iter time (s): 31.619 samples/sec: 32.385 %comms: 0.002959952938384529 %optimizer_step 0.08988906934870611 %forward: 19.865516975653204 %backward: 62.4885572049885
[2025-07-22 14:10:49,597] [INFO] [logging.py:60:log_dist] [Rank 0] rank=0 time (ms) | train_batch: 0.00 | batch_input: 27623.41 | forward: 62813.09 | backward_microstep: 197596.50 | backward: 197583.54 | backward_inner_microstep: 197562.23 | backward_inner: 197554.49 | backward_allreduce_microstep: 10.49 | backward_allreduce: 3.79 | reduce_tied_grads: 0.39 | comms: 9.36 | reduce_grads: 0.23 | step: 284.22 | _step_clipping: 0.12 | _step_step: 281.91 | _step_zero_grad: 0.61 | _step_check_overflow: 0.89
samples/sec: 32.385 | iteration 110/ 143000 | elapsed time per iteration (ms): 31619.9 | learning rate: 4.615E-05 | approx flops per GPU: 139.7TFLOPS | lm_loss: 7.349136E+00 | loss scale: 4096.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms)
[2025-07-22 14:16:04,937] [INFO] [logging.py:60:log_dist] [Rank 0] step=120, skipped=0, lr=[5.034965034965035e-05, 5.034965034965035e-05], mom=[[0.9, 0.95], [0.9, 0.95]]
steps: 120 loss: 7.0663 iter time (s): 31.533 samples/sec: 32.474 %comms: 0.0030185219236215856 %optimizer_step 0.10792091046430072 %forward: 19.932997447797106 %backward: 62.68394419933894
[2025-07-22 14:16:04,939] [INFO] [logging.py:60:log_dist] [Rank 0] rank=0 time (ms) | train_batch: 0.00 | batch_input: 26556.17 | forward: 62855.25 | backward_microstep: 197674.73 | backward: 197662.94 | backward_inner_microstep: 197642.02 | backward_inner: 197634.13 | backward_allreduce_microstep: 10.24 | backward_allreduce: 3.67 | reduce_tied_grads: 0.42 | comms: 9.52 | reduce_grads: 0.25 | step: 340.31 | _step_clipping: 0.13 | _step_step: 337.82 | _step_zero_grad: 0.60 | _step_check_overflow: 1.05
samples/sec: 32.473 | iteration 120/ 143000 | elapsed time per iteration (ms): 31534.1 | learning rate: 5.035E-05 | approx flops per GPU: 140.1TFLOPS | lm_loss: 7.138102E+00 | loss scale: 4096.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms)
[2025-07-22 14:20:26,392] [INFO] [logging.py:60:log_dist] [Rank 0] Saving model checkpoint: ../checkpoints/ft-left-pythia160m/global_step128/mp_rank_00_model_states.pt
[2025-07-22 14:20:39,292] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script ../checkpoints/ft-left-pythia160m/zero_to_fp32.py
[2025-07-22 14:20:39,298] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved ../checkpoints/ft-left-pythia160m/global_step128/zero_pp_rank_0_mp_rank_00_optim_states.pt
[2025-07-22 14:21:42,818] [INFO] [logging.py:60:log_dist] [Rank 0] step=130, skipped=0, lr=[5.4545454545454546e-05, 5.4545454545454546e-05], mom=[[0.9, 0.95], [0.9, 0.95]]
steps: 130 loss: 6.8695 iter time (s): 32.226 samples/sec: 31.776 %comms: 0.0029527807011189592 %optimizer_step 0.08389586040010442 %forward: 19.49197526041939 %backward: 61.32795156183104
[2025-07-22 14:21:42,819] [INFO] [logging.py:60:log_dist] [Rank 0] rank=0 time (ms) | train_batch: 0.00 | batch_input: 33620.33 | forward: 62814.13 | backward_microstep: 197647.44 | backward: 197633.23 | backward_inner_microstep: 197610.95 | backward_inner: 197603.07 | backward_allreduce_microstep: 11.07 | backward_allreduce: 3.99 | reduce_tied_grads: 0.45 | comms: 9.52 | reduce_grads: 0.28 | step: 270.36 | _step_clipping: 0.13 | _step_step: 268.33 | _step_zero_grad: 0.54 | _step_check_overflow: 0.67
samples/sec: 30.307 | iteration 130/ 143000 | elapsed time per iteration (ms): 33788.0 | learning rate: 5.455E-05 | approx flops per GPU: 130.7TFLOPS | lm_loss: 6.949750E+00 | loss scale: 4096.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms)
[2025-07-22 14:27:00,823] [INFO] [logging.py:60:log_dist] [Rank 0] step=140, skipped=0, lr=[5.874125874125874e-05, 5.874125874125874e-05], mom=[[0.9, 0.95], [0.9, 0.95]]
steps: 140 loss: 6.7054 iter time (s): 31.800 samples/sec: 32.202 %comms: 0.0029615164700150004 %optimizer_step 0.09375193965270955 %forward: 19.68324297611628 %backward: 62.179419347206824
[2025-07-22 14:27:00,824] [INFO] [logging.py:60:log_dist] [Rank 0] rank=0 time (ms) | train_batch: 0.00 | batch_input: 29448.48 | forward: 62592.12 | backward_microstep: 197741.85 | backward: 197728.70 | backward_inner_microstep: 197706.36 | backward_inner: 197698.36 | backward_allreduce_microstep: 11.19 | backward_allreduce: 3.97 | reduce_tied_grads: 0.45 | comms: 9.42 | reduce_grads: 0.30 | step: 298.13 | _step_clipping: 0.15 | _step_step: 295.74 | _step_zero_grad: 0.70 | _step_check_overflow: 0.72
samples/sec: 32.201 | iteration 140/ 143000 | elapsed time per iteration (ms): 31800.4 | learning rate: 5.874E-05 | approx flops per GPU: 138.9TFLOPS | lm_loss: 6.773657E+00 | loss scale: 4096.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms)
[2025-07-22 14:32:15,534] [INFO] [logging.py:60:log_dist] [Rank 0] step=150, skipped=0, lr=[6.293706293706293e-05, 6.293706293706293e-05], mom=[[0.9, 0.95], [0.9, 0.95]]
steps: 150 loss: 6.5431 iter time (s): 31.470 samples/sec: 32.539 %comms: 0.003080310973177262 %optimizer_step 0.09104126171816311 %forward: 20.05337664735878 %backward: 63.137387660470154
[2025-07-22 14:32:15,535] [INFO] [logging.py:60:log_dist] [Rank 0] rank=0 time (ms) | train_batch: 0.00 | batch_input: 24580.14 | forward: 63108.77 | backward_microstep: 198716.11 | backward: 198695.87 | backward_inner_microstep: 198670.25 | backward_inner: 198661.09 | backward_allreduce_microstep: 11.23 | backward_allreduce: 3.98 | reduce_tied_grads: 0.48 | comms: 9.69 | reduce_grads: 0.29 | step: 286.51 | _step_clipping: 0.16 | _step_step: 283.96 | _step_zero_grad: 0.63 | _step_check_overflow: 0.94
samples/sec: 32.538 | iteration 150/ 143000 | elapsed time per iteration (ms): 31471.1 | learning rate: 6.294E-05 | approx flops per GPU: 140.4TFLOPS | lm_loss: 6.615549E+00 | loss scale: 4096.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms)
[2025-07-22 14:37:29,441] [INFO] [logging.py:60:log_dist] [Rank 0] step=160, skipped=0, lr=[6.713286713286712e-05, 6.713286713286712e-05], mom=[[0.9, 0.95], [0.9, 0.95]]
steps: 160 loss: 6.4128 iter time (s): 31.390 samples/sec: 32.622 %comms: 0.002949356010268444 %optimizer_step 0.08301680290827784 %forward: 19.803081986114215 %backward: 62.88335286953507
[2025-07-22 14:37:29,442] [INFO] [logging.py:60:log_dist] [Rank 0] rank=0 time (ms) | train_batch: 0.00 | batch_input: 26224.10 | forward: 62161.90 | backward_microstep: 197401.73 | backward: 197390.92 | backward_inner_microstep: 197368.88 | backward_inner: 197361.18 | backward_allreduce_microstep: 11.21 | backward_allreduce: 3.96 | reduce_tied_grads: 0.41 | comms: 9.26 | reduce_grads: 0.24 | step: 260.59 | _step_clipping: 0.12 | _step_step: 258.62 | _step_zero_grad: 0.53 | _step_check_overflow: 0.67
samples/sec: 32.621 | iteration 160/ 143000 | elapsed time per iteration (ms): 31390.7 | learning rate: 6.713E-05 | approx flops per GPU: 140.7TFLOPS | lm_loss: 6.472316E+00 | loss scale: 4096.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms)
[2025-07-22 14:42:47,996] [INFO] [logging.py:60:log_dist] [Rank 0] step=170, skipped=0, lr=[7.132867132867133e-05, 7.132867132867133e-05], mom=[[0.9, 0.95], [0.9, 0.95]]
steps: 170 loss: 6.2937 iter time (s): 31.855 samples/sec: 32.146 %comms: 0.002945319324194505 %optimizer_step 0.09419281189115704 %forward: 19.763638588275366 %backward: 62.01306181010894
[2025-07-22 14:42:47,997] [INFO] [logging.py:60:log_dist] [Rank 0] rank=0 time (ms) | train_batch: 0.00 | batch_input: 29884.72 | forward: 62956.62 | backward_microstep: 197553.06 | backward: 197541.20 | backward_inner_microstep: 197519.10 | backward_inner: 197511.16 | backward_allreduce_microstep: 11.17 | backward_allreduce: 3.95 | reduce_tied_grads: 0.40 | comms: 9.38 | reduce_grads: 0.25 | step: 300.05 | _step_clipping: 0.13 | _step_step: 298.05 | _step_zero_grad: 0.59 | _step_check_overflow: 0.57
samples/sec: 32.145 | iteration 170/ 143000 | elapsed time per iteration (ms): 31855.5 | learning rate: 7.133E-05 | approx flops per GPU: 138.7TFLOPS | lm_loss: 6.339196E+00 | loss scale: 4096.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms)
[2025-07-22 14:48:02,644] [INFO] [logging.py:60:log_dist] [Rank 0] step=180, skipped=0, lr=[7.552447552447551e-05, 7.552447552447551e-05], mom=[[0.9, 0.95], [0.9, 0.95]]
steps: 180 loss: 6.1772 iter time (s): 31.464 samples/sec: 32.545 %comms: 0.0029692279040413982 %optimizer_step 0.08902728052945316 %forward: 19.788613690458888 %backward: 62.73919161526472
[2025-07-22 14:48:02,645] [INFO] [logging.py:60:log_dist] [Rank 0] rank=0 time (ms) | train_batch: 0.00 | batch_input: 26794.51 | forward: 62263.25 | backward_microstep: 197414.64 | backward: 197403.72 | backward_inner_microstep: 197382.18 | backward_inner: 197374.45 | backward_allreduce_microstep: 10.90 | backward_allreduce: 3.85 | reduce_tied_grads: 0.43 | comms: 9.34 | reduce_grads: 0.26 | step: 280.12 | _step_clipping: 0.12 | _step_step: 278.17 | _step_zero_grad: 0.53 | _step_check_overflow: 0.63
samples/sec: 32.544 | iteration 180/ 143000 | elapsed time per iteration (ms): 31464.8 | learning rate: 7.552E-05 | approx flops per GPU: 140.4TFLOPS | lm_loss: 6.226228E+00 | loss scale: 4096.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms)
[2025-07-22 14:53:20,296] [INFO] [logging.py:60:log_dist] [Rank 0] step=190, skipped=0, lr=[7.972027972027971e-05, 7.972027972027971e-05], mom=[[0.9, 0.95], [0.9, 0.95]]
steps: 190 loss: 6.0809 iter time (s): 31.764 samples/sec: 32.237 %comms: 0.002966380726209152 %optimizer_step 0.08907308526215107 %forward: 19.697569194769123 %backward: 62.227146489786456
[2025-07-22 14:53:20,297] [INFO] [logging.py:60:log_dist] [Rank 0] rank=0 time (ms) | train_batch: 0.00 | batch_input: 29184.54 | forward: 62568.21 | backward_microstep: 197674.65 | backward: 197661.01 | backward_inner_microstep: 197638.34 | backward_inner: 197629.95 | backward_allreduce_microstep: 11.31 | backward_allreduce: 4.05 | reduce_tied_grads: 0.45 | comms: 9.42 | reduce_grads: 0.26 | step: 282.94 | _step_clipping: 0.14 | _step_step: 280.66 | _step_zero_grad: 0.72 | _step_check_overflow: 0.66
samples/sec: 32.237 | iteration 190/ 143000 | elapsed time per iteration (ms): 31765.2 | learning rate: 7.972E-05 | approx flops per GPU: 139.1TFLOPS | lm_loss: 6.123369E+00 | loss scale: 4096.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms)
[2025-07-22 14:58:42,422] [INFO] [logging.py:60:log_dist] [Rank 0] step=200, skipped=0, lr=[8.391608391608391e-05, 8.391608391608391e-05], mom=[[0.9, 0.95], [0.9, 0.95]]
steps: 200 loss: 5.9852 iter time (s): 32.212 samples/sec: 31.790 %comms: 0.0030288758637654033 %optimizer_step 0.09202343942535417 %forward: 19.667126779422844 %backward: 61.46910983719454
[2025-07-22 14:58:42,423] [INFO] [logging.py:60:log_dist] [Rank 0] rank=0 time (ms) | train_batch: 0.00 | batch_input: 32517.52 | forward: 63351.43 | backward_microstep: 198019.08 | backward: 198003.29 | backward_inner_microstep: 197979.88 | backward_inner: 197971.18 | backward_allreduce_microstep: 11.57 | backward_allreduce: 4.13 | reduce_tied_grads: 0.44 | comms: 9.76 | reduce_grads: 0.25 | step: 296.42 | _step_clipping: 0.13 | _step_step: 294.04 | _step_zero_grad: 0.67 | _step_check_overflow: 0.77
samples/sec: 31.789 | iteration 200/ 143000 | elapsed time per iteration (ms): 32212.6 | learning rate: 8.392E-05 | approx flops per GPU: 137.1TFLOPS | lm_loss: 6.031245E+00 | loss scale: 4096.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms)
[2025-07-22 15:04:00,859] [INFO] [logging.py:60:log_dist] [Rank 0] step=210, skipped=0, lr=[8.811188811188812e-05, 8.811188811188812e-05], mom=[[0.9, 0.95], [0.9, 0.95]]
steps: 210 loss: 5.9209 iter time (s): 31.843 samples/sec: 32.158 %comms: 0.0030011359123102094 %optimizer_step 0.10631297291241648 %forward: 19.917893344938175 %backward: 62.0841471639291
[2025-07-22 15:04:00,860] [INFO] [logging.py:60:log_dist] [Rank 0] rank=0 time (ms) | train_batch: 0.00 | batch_input: 29078.88 | forward: 63424.65 | backward_microstep: 197707.24 | backward: 197694.86 | backward_inner_microstep: 197671.75 | backward_inner: 197663.33 | backward_allreduce_microstep: 11.59 | backward_allreduce: 4.15 | reduce_tied_grads: 0.49 | comms: 9.56 | reduce_grads: 0.36 | step: 338.53 | _step_clipping: 0.16 | _step_step: 336.34 | _step_zero_grad: 0.60 | _step_check_overflow: 0.65
samples/sec: 32.157 | iteration 210/ 143000 | elapsed time per iteration (ms): 31843.7 | learning rate: 8.811E-05 | approx flops per GPU: 138.7TFLOPS | lm_loss: 5.949513E+00 | loss scale: 4096.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms)
[2025-07-22 15:09:29,446] [INFO] [logging.py:60:log_dist] [Rank 0] step=220, skipped=0, lr=[9.230769230769229e-05, 9.230769230769229e-05], mom=[[0.9, 0.95], [0.9, 0.95]]
steps: 220 loss: 5.8661 iter time (s): 32.858 samples/sec: 31.164 %comms: 0.002811416580656313 %optimizer_step 0.08703868844080195 %forward: 19.65704974095919 %backward: 60.21434470462042
[2025-07-22 15:09:29,447] [INFO] [logging.py:60:log_dist] [Rank 0] rank=0 time (ms) | train_batch: 0.00 | batch_input: 37966.46 | forward: 64589.23 | backward_microstep: 197865.90 | backward: 197852.59 | backward_inner_microstep: 197829.73 | backward_inner: 197821.49 | backward_allreduce_microstep: 11.23 | backward_allreduce: 4.02 | reduce_tied_grads: 0.38 | comms: 9.24 | reduce_grads: 0.25 | step: 285.99 | _step_clipping: 0.12 | _step_step: 283.85 | _step_zero_grad: 0.61 | _step_check_overflow: 0.72
samples/sec: 31.164 | iteration 220/ 143000 | elapsed time per iteration (ms): 32858.7 | learning rate: 9.231E-05 | approx flops per GPU: 134.4TFLOPS | lm_loss: 5.879067E+00 | loss scale: 4096.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms)
[2025-07-22 15:14:43,189] [INFO] [logging.py:60:log_dist] [Rank 0] step=230, skipped=0, lr=[9.650349650349649e-05, 9.650349650349649e-05], mom=[[0.9, 0.95], [0.9, 0.95]]
steps: 230 loss: 5.7841 iter time (s): 31.374 samples/sec: 32.639 %comms: 0.002968596633937884 %optimizer_step 0.08333508244246683 %forward: 19.90188371682342 %backward: 63.006612742024714
[2025-07-22 15:14:43,190] [INFO] [logging.py:60:log_dist] [Rank 0] rank=0 time (ms) | train_batch: 0.00 | batch_input: 25464.29 | forward: 62439.56 | backward_microstep: 197686.68 | backward: 197675.00 | backward_inner_microstep: 197654.31 | backward_inner: 197646.64 | backward_allreduce_microstep: 10.29 | backward_allreduce: 3.64 | reduce_tied_grads: 0.42 | comms: 9.31 | reduce_grads: 0.25 | step: 261.45 | _step_clipping: 0.12 | _step_step: 259.62 | _step_zero_grad: 0.50 | _step_check_overflow: 0.56
samples/sec: 32.638 | iteration 230/ 143000 | elapsed time per iteration (ms): 31374.3 | learning rate: 9.650E-05 | approx flops per GPU: 140.8TFLOPS | lm_loss: 5.813722E+00 | loss scale: 4096.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms)
[2025-07-22 15:19:56,462] [INFO] [logging.py:60:log_dist] [Rank 0] step=240, skipped=0, lr=[0.0001006993006993007, 0.0001006993006993007], mom=[[0.9, 0.95], [0.9, 0.95]]
steps: 240 loss: 5.7510 iter time (s): 31.327 samples/sec: 32.688 %comms: 0.0029713780219708663 %optimizer_step 0.08363597279950541 %forward: 19.853572162391792 %backward: 62.980888806186385
[2025-07-22 15:19:56,463] [INFO] [logging.py:60:log_dist] [Rank 0] rank=0 time (ms) | train_batch: 0.00 | batch_input: 25626.42 | forward: 62194.63 | backward_microstep: 197308.70 | backward: 197298.16 | backward_inner_microstep: 197277.91 | backward_inner: 197270.52 | backward_allreduce_microstep: 10.19 | backward_allreduce: 3.66 | reduce_tied_grads: 0.41 | comms: 9.31 | reduce_grads: 0.27 | step: 262.00 | _step_clipping: 0.12 | _step_step: 260.13 | _step_zero_grad: 0.55 | _step_check_overflow: 0.54
samples/sec: 32.687 | iteration 240/ 143000 | elapsed time per iteration (ms): 31327.3 | learning rate: 1.007E-04 | approx flops per GPU: 141.0TFLOPS | lm_loss: 5.761162E+00 | loss scale: 4096.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms)
[2025-07-22 15:25:15,689] [INFO] [logging.py:60:log_dist] [Rank 0] step=250, skipped=0, lr=[0.0001048951048951049, 0.0001048951048951049], mom=[[0.9, 0.95], [0.9, 0.95]]
steps: 250 loss: 5.7047 iter time (s): 31.922 samples/sec: 32.078 %comms: 0.0028756195594488994 %optimizer_step 0.08289771416447038 %forward: 19.606282989289152 %backward: 61.72873315284612
[2025-07-22 15:25:15,690] [INFO] [logging.py:60:log_dist] [Rank 0] rank=0 time (ms) | train_batch: 0.00 | batch_input: 31456.32 | forward: 62587.45 | backward_microstep: 197060.06 | backward: 197051.31 | backward_inner_microstep: 197033.00 | backward_inner: 197026.27 | backward_allreduce_microstep: 9.15 | backward_allreduce: 3.29 | reduce_tied_grads: 0.38 | comms: 9.18 | reduce_grads: 0.22 | step: 264.63 | _step_clipping: 0.12 | _step_step: 262.85 | _step_zero_grad: 0.52 | _step_check_overflow: 0.52
samples/sec: 32.077 | iteration 250/ 143000 | elapsed time per iteration (ms): 31922.7 | learning rate: 1.049E-04 | approx flops per GPU: 138.4TFLOPS | lm_loss: 5.711086E+00 | loss scale: 4096.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms)
[2025-07-22 15:28:26,876] [INFO] [logging.py:60:log_dist] [Rank 0] Saving model checkpoint: ../checkpoints/ft-left-pythia160m/global_step256/mp_rank_00_model_states.pt
[2025-07-22 15:28:40,016] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script ../checkpoints/ft-left-pythia160m/zero_to_fp32.py
[2025-07-22 15:28:40,025] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved ../checkpoints/ft-left-pythia160m/global_step256/zero_pp_rank_0_mp_rank_00_optim_states.pt
[2025-07-22 15:30:50,262] [INFO] [logging.py:60:log_dist] [Rank 0] step=260, skipped=0, lr=[0.00010909090909090909, 0.00010909090909090909], mom=[[0.9, 0.95], [0.9, 0.95]]
steps: 260 loss: 5.6691 iter time (s): 31.803 samples/sec: 32.198 %comms: 0.004088966348019861 %optimizer_step 0.08394952875727361 %forward: 19.84244477390803 %backward: 62.01140836604374
[2025-07-22 15:30:50,262] [INFO] [logging.py:60:log_dist] [Rank 0] rank=0 time (ms) | train_batch: 0.00 | batch_input: 29608.28 | forward: 63105.72 | backward_microstep: 197226.22 | backward: 197217.36 | backward_inner_microstep: 197198.20 | backward_inner: 197191.23 | backward_allreduce_microstep: 9.62 | backward_allreduce: 3.39 | reduce_tied_grads: 0.42 | comms: 13.00 | reduce_grads: 0.23 | step: 266.99 | _step_clipping: 0.13 | _step_step: 265.10 | _step_zero_grad: 0.55 | _step_check_overflow: 0.55
samples/sec: 30.606 | iteration 260/ 143000 | elapsed time per iteration (ms): 33457.3 | learning rate: 1.091E-04 | approx flops per GPU: 132.0TFLOPS | lm_loss: 5.671705E+00 | loss scale: 4096.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms)
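
Read as a series, the summary lines above show lm_loss dropping steadily from 8.707314E+00 at iteration 60 to 5.671705E+00 at iteration 260 with no skipped or NaN iterations, throughput holding at roughly 30-33 samples/sec (about 130-141 TFLOPS per GPU), and the learning rate rising linearly by about 4.2e-7 per step (2.517E-05 at iteration 60, 1.091E-04 at iteration 260), i.e. the run is still in its warmup phase. For eyeballing those trends it is convenient to tabulate the per-interval summary lines; the short Python sketch below is one illustrative way to do that. It is not part of the training run itself, and the file name train.log is a hypothetical stand-in for wherever this console output was saved.

import re

# Minimal sketch (assumption: this console output was saved to "train.log").
# Pulls (iteration, lm_loss, learning rate, samples/sec) out of the
# "samples/sec: ... | iteration N/ 143000 | ..." summary lines shown above.
SUMMARY = re.compile(
    r"samples/sec: (?P<sps>[\d.]+) \| iteration\s+(?P<step>\d+)/\s*\d+ \|"
    r".*?learning rate: (?P<lr>[\d.Ee+-]+)"
    r".*?lm_loss: (?P<loss>[\d.Ee+-]+)"
)

def parse_summaries(path="train.log"):
    """Return (step, lm_loss, lr, samples_per_sec) tuples in log order."""
    with open(path) as fh:
        text = fh.read()
    return [
        (int(m["step"]), float(m["loss"]), float(m["lr"]), float(m["sps"]))
        for m in SUMMARY.finditer(text)
    ]

if __name__ == "__main__":
    for step, loss, lr, sps in parse_summaries():
        # e.g. the iteration-240 line above parses to (240, 5.761162, 1.007e-04, 32.687)
        print(f"step {step:>6}  lm_loss {loss:.4f}  lr {lr:.3e}  {sps:.1f} samples/s")

Run over the excerpt above, this would print one row per logged interval (every 10 steps here), which can then be fed to any plotting tool to visualize the warmup ramp and the loss curve.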