kubeflow / trainer
58%
master: 58%

LAST BUILD BRANCH: feat/trainer-multi-slice-tpu
DEFAULT BRANCH: master
Repo Added 20 Mar 2025 01:49PM UTC
Build 3006 Last
Files 40

LAST BUILD ON BRANCH feat/trainer-multi-slice-tpu

03 Apr 2026 08:20PM UTC coverage: 57.792% (-0.3%) from 58.057%
23960834318

Pull #3408

github

krishdef7
feat(operator): support multi-slice TPU training via trainer replicas

For multi-slice TPU, JobSet models each TPU slice as a ReplicatedJob
replica, with parallelism = hosts per slice and replicas = slice count.
The operator previously blocked this with two hard constraints:

1. builder.go unconditionally set trainer Replicas = 1, destroying any
   value from the runtime template.
2. trainingruntime_webhook.go rejected replicas != 1 for all ancestors
   including trainer.

Changes:
- builder.go: nil-guard for trainer Replicas, preserving the value from
  the runtime template instead of unconditional overwrite.
- jobset.go: in Build(), compute perSlice = numNodes / replicas for the
  trainer ancestor so each slice runs the correct number of hosts.
- trainingruntime_webhook.go: allow trainer ancestor replicas > 1 to
  enable multi-slice configurations to pass admission.
- trainingruntime_webhook_test.go: update invalid_replicas test to
  reflect that trainer replicas > 1 is now valid.
- trainingruntime_test.go: add test case for 4-slice x 8 hosts
  (NumNodes=32), verifying Parallelism=8 per slice and MinMember=34.
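
The defaulting and admission changes listed above can be sketched as follows. This is a minimal Python illustration, not the operator's Go code: the function names, the `"trainer"` ancestor string, the `"dataset-initializer"` example, and the lower bound of 1 for trainer replicas are assumptions made for the example. The commit message only states that trainer replicas > 1 are now admitted and that the template value is preserved when it is set.

```python
from typing import Optional

TRAINER = "trainer"  # ancestor name assumed for illustration


def default_trainer_replicas(template_replicas: Optional[int]) -> int:
    """Nil-guard: keep the runtime template's value; fall back to 1 only when unset."""
    return 1 if template_replicas is None else template_replicas


def validate_replicas(ancestor: str, replicas: int) -> Optional[str]:
    """Return an error message, or None when the value passes admission."""
    if ancestor == TRAINER:
        # Trainer may use replicas > 1 (one replica per TPU slice).
        # The >= 1 lower bound is an assumption of this sketch.
        return None if replicas >= 1 else "trainer replicas must be >= 1"
    # Assumed here: every other ancestor keeps the original replicas == 1 rule.
    return None if replicas == 1 else f"{ancestor} replicas must be 1"


if __name__ == "__main__":
    print(default_trainer_replicas(None))
    print(default_trainer_replicas(4))
    print(validate_replicas(TRAINER, 4))
    print(validate_replicas("dataset-initializer", 2))
```

The point of the nil-guard is that an explicit value in the runtime template survives the build step, while templates that leave the field unset behave as before.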

Semantics: numNodes = total hosts across all slices.
Per-slice hosts = numNodes / replicas.
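
As a worked example of these semantics, the sketch below computes hosts per slice for the 4-slice x 8-host case named in the test description. The function name is hypothetical, and rejecting a `num_nodes` value that does not divide evenly is an assumption of this sketch; the commit message does not say how that case is handled. The extra 2 in MinMember=34 is also not explained in the commit message, so it is not modeled here.

```python
def hosts_per_slice(num_nodes: int, replicas: int) -> int:
    """Per-slice hosts = numNodes / replicas, where numNodes is the total across slices."""
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    if num_nodes % replicas != 0:
        # Assumption of this sketch: refuse uneven splits instead of truncating.
        raise ValueError("num_nodes must be a multiple of replicas")
    return num_nodes // replicas


if __name__ == "__main__":
    # 4 slices, 32 hosts in total, as in the test case described above.
    print(hosts_per_slice(32, 4))  # 8
```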

REF: https://github.com/kubeflow/trainer/issues/3407
Signed-off-by: krishdef7 <gargkrish06@gmail.com>
Pull Request #3408: feat(operator): support multi-slice TPU by enabling trainer replicas > 1

6 of 32 new or added lines in 4 files covered. (18.75%)

2036 of 3523 relevant lines covered (57.79%)
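
The percentages reported on this page follow from the line counts. The short check below reproduces them, assuming coverage is simply covered lines divided by relevant lines; the helper name is made up for the example.

```python
def coverage_pct(covered: int, relevant: int, digits: int) -> float:
    """Coverage as a percentage of relevant lines, rounded to `digits` places."""
    return round(100 * covered / relevant, digits)


if __name__ == "__main__":
    overall = coverage_pct(2036, 3523, 3)   # whole repository for this build
    new_lines = coverage_pct(6, 32, 2)      # lines added or changed by the PR
    delta = round(overall - 58.057, 1)      # change relative to the previous figure
    print(overall, new_lines, delta)
```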

0.67 hits per line

Source Files on master
  • Files: 40
  • Changed: 3
  • Source Changed: 0
  • Coverage Changed: 3

Recent builds

Builds | Branch | Commit | Type | Ran | Committer | Via | Coverage
23960834318 | feat/trainer-multi-slice-tpu | feat(operator): support multi-slice TPU training via trainer replicas For multi-slice TPU, JobSet models each TPU slice as a ReplicatedJob replica, with parallelism = hosts per slice and replicas = slice count. The operator previously blocked thi... | Pull #3408 | 03 Apr 2026 08:24PM UTC | krishdef7 | github | 57.79
23960692171 | feat/trainer-multi-slice-tpu | feat(operator): support multi-slice TPU training via trainer replicas For multi-slice TPU, JobSet models each TPU slice as a ReplicatedJob replica, with parallelism = hosts per slice and replicas = slice count. The operator previously blocked thi... | Pull #3408 | 03 Apr 2026 08:19PM UTC | krishdef7 | github | 57.89
23811025952 | test-statusserver-helpers | fix(statusserver): improve bearer token parsing and add helper tests Signed-off-by: Skolli <tanusuch@gmail.com> | Pull #3405 | 03 Apr 2026 06:24PM UTC | suchirkolli | github | 58.34
23957046905 | dependabot/docker/cmd/trainers/torchtune/pytorch/pytorch-2.11.0-cuda12.8-cudnn9-runtime | chore(deps): bump pytorch/pytorch in /cmd/trainers/torchtune Bumps pytorch/pytorch from 2.9.1-cuda12.8-cudnn9-runtime to 2.11.0-cuda12.8-cudnn9-runtime. --- updated-dependencies: - dependency-name: pytorch/pytorch dependency-version: 2.11.0-cu... | Pull #3381 | 03 Apr 2026 06:23PM UTC | web-flow | github | 58.06
23957029240 | dependabot/docker/cmd/runtimes/deepspeed/nvidia/cuda-13.2.0-devel-ubuntu22.04 | chore(deps): bump nvidia/cuda in /cmd/runtimes/deepspeed Bumps nvidia/cuda from 13.1.1-devel-ubuntu22.04 to 13.2.0-devel-ubuntu22.04. --- updated-dependencies: - dependency-name: nvidia/cuda dependency-version: 13.2.0-devel-ubuntu22.04 depen... | Pull #3380 | 03 Apr 2026 06:22PM UTC | web-flow | github | 58.14
23956912610 | dependabot/pip/cmd/runtimes/deepspeed/deepspeed-0.18.9 | chore(deps): bump deepspeed in /cmd/runtimes/deepspeed Bumps [deepspeed](https://github.com/deepspeedai/DeepSpeed) from 0.18.7 to 0.18.9. - [Release notes](https://github.com/deepspeedai/DeepSpeed/releases) - [Commits](https://github.com/deepspee... | Pull #3402 | 03 Apr 2026 06:19PM UTC | web-flow | github | 58.06
23956913849 | dependabot/pip/cmd/runtimes/deepspeed/deepspeed-0.18.9 | chore(deps): bump deepspeed in /cmd/runtimes/deepspeed Bumps [deepspeed](https://github.com/deepspeedai/DeepSpeed) from 0.18.7 to 0.18.9. - [Release notes](https://github.com/deepspeedai/DeepSpeed/releases) - [Commits](https://github.com/deepspee... | Pull #3402 | 03 Apr 2026 06:18PM UTC | web-flow | github | 58.06
23956914515 | dependabot/pip/cmd/runtimes/deepspeed/datasets-4.8.4 | chore(deps): bump datasets in /cmd/runtimes/deepspeed Bumps [datasets](https://github.com/huggingface/datasets) from 4.7.0 to 4.8.4. - [Release notes](https://github.com/huggingface/datasets/releases) - [Commits](https://github.com/huggingface/da... | Pull #3384 | 03 Apr 2026 06:18PM UTC | web-flow | github | 58.06
23956910435 | master | chore(deps): bump github.com/go-jose/go-jose/v4 from 4.1.3 to 4.1.4 (#3406) Bumps [github.com/go-jose/go-jose/v4](https://github.com/go-jose/go-jose) from 4.1.3 to 4.1.4. - [Release notes](https://github.com/go-jose/go-jose/releases) - [Commits](... | push | 03 Apr 2026 06:18PM UTC | web-flow | github | 58.06
23956878291 | master | chore(deps): bump transformers from 5.3.0 to 5.4.0 in /cmd/runtimes/deepspeed (#3401) Bumps [transformers](https://github.com/huggingface/transformers) from 5.3.0 to 5.4.0. - [Release notes](https://github.com/huggingface/transformers/releases) -... | push | 03 Apr 2026 06:17PM UTC | web-flow | github | 58.14
See All Builds (2687)

© 2026 Coveralls, Inc