
kubeflow / trainer / build 24222426643
Coverage: 58%
Default branch: master
Ran: 10 Apr 2026 01:57AM UTC
Jobs: 1 · Files: 40 · Run time: 1min

10 Apr 2026 01:53AM UTC coverage: 58.057%. Remained the same
Build 24222426643 · push via github (web-flow)
feat: add Megatron-Core GPT Tensor Parallelism example notebook (#3201)

* feat: add Megatron-Core GPT Tensor Parallelism example notebook

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>

* added Megatron notebook to the e2e GPU test

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>

* feat: Parameterize Megatron-Core GPT notebook's tensor parallelism and GPU count, and update the e2e test workflow to pass these parameters.

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
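The parameterization this commit describes could look like a papermill-style parameters cell; the variable names below are illustrative sketches, not the notebook's actual identifiers. The e2e workflow would then override the defaults, e.g. `papermill -p tensor_parallel_size 2 -p num_gpus 2`.

```python
# Hypothetical parameters cell (papermill overrides these defaults from the
# e2e test workflow; names are illustrative).
tensor_parallel_size = 2  # Megatron tensor-model-parallel degree
num_gpus = 2              # total GPUs requested for the TrainJob

# Megatron requires the world size to be divisible by the TP degree,
# so fail fast on an inconsistent parameter combination.
assert num_gpus % tensor_parallel_size == 0, (
    f"num_gpus={num_gpus} must be divisible by "
    f"tensor_parallel_size={tensor_parallel_size}"
)
```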

* change the number of GPUs to 2

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>

* docs: update Megatron GPT tensor parallelism example notebook.

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>

* fix: correct minor typos and improve code readability in Megatron-Core GPT notebook

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>

* fix: update tensor model parallel size retrieval and improve code clarity in Megatron-Core GPT notebook

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>

* fix: correct wording in training function description for clarity in Megatron-Core GPT notebook

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>

* feat: add verification step for TrainJob completion in Megatron-Core GPT notebook

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
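A verification step like the one this commit adds could be sketched as a polling loop; `get_status` here is a hypothetical callable standing in for however the Kubeflow Trainer SDK reports TrainJob state, not an actual SDK API.

```python
import time


def wait_for_completion(get_status, timeout_s: float = 600, poll_s: float = 5) -> str:
    """Poll a status callable until the TrainJob reports a terminal state.

    `get_status` is a stand-in for an SDK call returning the job's condition
    (here assumed to be "Complete" or "Failed" when terminal).
    """
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        status = get_status()
        if status in ("Complete", "Failed"):
            return status
        time.sleep(poll_s)
    raise TimeoutError("TrainJob did not reach a terminal state within timeout")
```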

* fix: use multi-node setup for Megatron TP to work with GPU time-slicing

GPU time-slicing (replicas=2) advertises 2 GPUs to Kubernetes but
exposes only 1 CUDA device inside the container. This caused torchrun
to launch 1 worker (WORLD_SIZE=1), failing Megatron's TP requirement
of world_size >= 2.

Switch from 1 node with 2 GPUs to 2 nodes with 1 GPU each. This
creates 2 pods, each getting 1 time-sliced GPU, giving WORLD_SIZE=2
for tensor parallelism across pods.

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
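The arithmetic behind this fix is simple: torchrun launches one worker per visible CUDA device on each node, so the world size is nodes × visible devices, not nodes × advertised GPUs. A minimal sketch (the function name is ours, not Megatron's):

```python
def expected_world_size(num_nodes: int, visible_gpus_per_node: int) -> int:
    """torchrun launches one worker per visible CUDA device on each node."""
    return num_nodes * visible_gpus_per_node


# Time-slicing advertises 2 GPUs to Kubernetes, but each container only
# sees 1 CUDA device, so the old 1-node layout produced WORLD_SIZE=1,
# failing Megatron's TP requirement of world_size >= 2.
assert expected_world_size(1, 1) == 1

# Two nodes (pods) with one time-sliced GPU each gives WORLD_SIZE=2,
# satisfying tensor parallelism across pods.
assert expected_world_size(2, 1) == 2
```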

* fix: use PyTorch devel image for Megatron compile_helpers() support

Megatron-Core's compile_helpers() requires `make` and `gcc` to build
C dataset helpers. The default runtime image d... (continued)
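The toolchain requirement can be made explicit with a preflight check before calling `compile_helpers()`; `can_compile_helpers` below is our own illustrative helper, not a Megatron-Core API.

```python
import shutil


def can_compile_helpers() -> bool:
    """Check for the build toolchain Megatron-Core's compile_helpers() needs.

    compile_helpers() shells out to `make`/`gcc` to build the C dataset
    helper extension; a slim runtime image without a toolchain fails here,
    which is why the commit switches to a PyTorch devel base image.
    """
    return shutil.which("make") is not None and shutil.which("gcc") is not None
```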

2032 of 3500 relevant lines covered (58.06%)

0.67 hits per line

Jobs
ID | Job ID        | Ran                     | Files | Coverage
1  | 24222426643.1 | 10 Apr 2026 01:57AM UTC | 40    | 58.06
GitHub Action Run
Source Files on build 24222426643
40 source files · 0 changed · coverage unchanged
  • 253aad27 on github
  • Prev Build on master (#24217645634)
  • Next Build on master (#24245349018)

© 2026 Coveralls, Inc