kubeflow / trainer
58%
master: 58%

LAST BUILD BRANCH: megatron
DEFAULT BRANCH: master
Repo Added 20 Mar 2025 01:49PM UTC
Build 3010 Last
Files 40

04 Apr 2026 03:17PM UTC coverage: 58.057%. Remained the same
23981685932

Pull #3201

github

XploY04
fix: remove dist_checkpointing from Megatron notebook

Megatron-Core dist_checkpointing uses multiprocessing.spawn internally
to create a Manager queue for async writes. The Kubeflow SDK generates
training scripts without an if __name__ == '__main__' guard, so the
spawned child re-imports the script and re-executes the training
function, causing a RuntimeError. Remove the checkpoint step since
training (Steps 1-5) is the core TP demonstration. Also remove unused
imports and fix the GPU prerequisites text.

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
Pull Request #3201: feat: add Megatron-Core GPT Tensor Parallelism example notebook
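
The failure mode described in the commit message can be illustrated with a minimal sketch. It does not use Megatron-Core or the Kubeflow SDK; the two script bodies and the `calls_on_reimport` helper are hypothetical names made up for this illustration. It relies on the fact that CPython's spawn start method re-runs the parent's main script in the child under the module name `__mp_main__`, which is imitated here with `runpy.run_path` so that no real child process is needed.

```python
import runpy
import tempfile
import textwrap
from pathlib import Path

# Hypothetical script without a guard: train() is called at import time.
UNGUARDED = textwrap.dedent("""
    calls = []
    def train():
        calls.append("train")
    train()  # runs on every import, including a spawned child's re-import
""")

# Hypothetical script with a guard: train() runs only as the entry point.
GUARDED = textwrap.dedent("""
    calls = []
    def train():
        calls.append("train")
    if __name__ == "__main__":
        train()
""")


def calls_on_reimport(source: str) -> list:
    """Re-run a script the way a spawned child would and return its `calls` list."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "script.py"
        path.write_text(source)
        # The child sees the script under the name "__mp_main__", not "__main__".
        namespace = runpy.run_path(str(path), run_name="__mp_main__")
    return namespace["calls"]


if __name__ == "__main__":
    print("unguarded:", calls_on_reimport(UNGUARDED))  # ['train']
    print("guarded:  ", calls_on_reimport(GUARDED))    # []
```

In the unguarded case the training function runs again on re-import, which is the re-execution the commit message describes; with the guard in place the re-import only defines the function.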

2032 of 3500 relevant lines covered (58.06%)

0.67 hits per line
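
As a quick check of the figures above, the coverage percentage can be reproduced from the two line counts. This is a minimal sketch; `coverage_percent` is a hypothetical helper name, not part of any Coveralls API.

```python
def coverage_percent(covered: int, relevant: int) -> float:
    """Return covered lines as a percentage of relevant lines."""
    if relevant <= 0:
        raise ValueError("relevant must be positive")
    return covered / relevant * 100


if __name__ == "__main__":
    pct = coverage_percent(2032, 3500)
    print(f"{pct:.3f}%")  # 58.057%
    print(f"{pct:.2f}%")  # 58.06%
```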

Source Files on megatron
  • Files: 40
  • Changed: 0
  • Source Changed: 0
  • Coverage Changed: 0

Recent builds

| Build | Branch | Commit | Type | Ran | Committer | Via | Coverage (%) |
|---|---|---|---|---|---|---|---|
| 23981685932 | megatron | fix: remove dist_checkpointing from Megatron notebook Megatron-Core dist_checkpointing uses multiprocessing.spawn internally to create a Manager queue for async writes. The Kubeflow SDK generates training scripts without an if __name__ == '__main... | Pull #3201 | 04 Apr 2026 03:21PM UTC | XploY04 | github | 58.06 |
| 23981021700 | megatron | fix: remove dist_checkpointing from Megatron notebook Megatron-Core dist_checkpointing uses multiprocessing.spawn internally to create a Manager queue for async writes. The Kubeflow SDK generates training scripts without an if __name__ == '__main... | Pull #3201 | 04 Apr 2026 02:42PM UTC | XploY04 | github | 58.14 |
| 23979976051 | megatron | fix: mount /dev/shm as emptyDir to fix NCCL shared memory exhaustion NCCL proxy service allocates ~33MB per communicator in /dev/shm. The default Kubernetes /dev/shm is 64MB (Docker default), which is insufficient for workloads that create multip... | Pull #3201 | 04 Apr 2026 01:39PM UTC | XploY04 | github | 58.14 |
| 23972072192 | megatron | debug: add NCCL_DEBUG=INFO to diagnose /dev/shm failure on node-1 Signed-off-by: XploY04 <2004agarwalyash@gmail.com> | Pull #3201 | 04 Apr 2026 05:19AM UTC | XploY04 | github | 58.06 |
| 23529197530 | megatron | fix: disable NCCL shared memory to avoid /dev/shm limit in Kubernetes Kubernetes pods get 64MB /dev/shm by default. NCCL needs more for its communication buffers, causing checkpoint save to fail with "Error while creating shared memory segment". ... | Pull #3201 | 25 Mar 2026 07:11AM UTC | XploY04 | github | 58.06 |
| 23523506891 | megatron | fix: use runtime image with apt-get for build tools instead of devel image The devel image (3.5GB) is too slow to pull, causing PAPERMILL_TIMEOUT. Switch back to the runtime image and install only make and g++ via apt-get (~50MB). This provides e... | Pull #3201 | 25 Mar 2026 03:33AM UTC | XploY04 | github | 58.06 |
| 23497195914 | megatron | fix: increase wait_for_job_status timeout to 1800s The megatron training job takes longer than 900s due to pip install, C++ compilation, and multi-node training setup on time-sliced GPUs. Increase timeout to 1800s to match PAPERMILL_TIMEOUT. Sig... | Pull #3201 | 24 Mar 2026 03:22PM UTC | XploY04 | github | 58.06 |
| 23485235636 | megatron | fix: use correct null tokenizer library name for pip megatron-core The pip release of megatron-core uses "null" as the tokenizer library name, while the GitHub main branch uses "null-text". Change to "null" to match the pip version. Signed-off-b... | Pull #3201 | 24 Mar 2026 10:43AM UTC | XploY04 | github | 58.14 |
| 23481258188 | megatron | fix: download Megatron C++ source files for compile_helpers() The megatron-core pip package excludes the Makefile and helpers.cpp needed by compile_helpers(). Download these two files from GitHub at runtime before compilation. Also add pybind11 a... | Pull #3201 | 24 Mar 2026 09:04AM UTC | XploY04 | github | 58.06 |
| 23479733760 | megatron | fix: install megatron-core from zip archive instead of git The devel image doesn't have git installed, so pip install from git+https fails. Use the GitHub zip archive URL instead, which pip can download and extract directly without needing git. ... | Pull #3201 | 24 Mar 2026 08:23AM UTC | XploY04 | github | 58.06 |
See All Builds (2691)
