JobSet: a Kubernetes-native API for distributed ML training and HPC workloads
JobSet is a Kubernetes-native API for managing a group of Kubernetes Jobs as a unit. It aims to offer a unified API for deploying HPC (e.g., MPI) and AI/ML training workloads (PyTorch, JAX, TensorFlow, etc.) on Kubernetes.
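As a rough illustration of what "a group of Jobs as a unit" can look like, the sketch below builds a minimal JobSet manifest as a Python dictionary and prints it as JSON, which kubectl accepts alongside YAML. The helper name `make_jobset`, the object and container names, and the image URL are hypothetical. The `jobset.x-k8s.io/v1alpha2` API version and the `replicatedJobs` layout are assumptions that should be checked against the JobSet version installed in your cluster.

```python
import json


def make_jobset(name, image, workers):
    """Build a minimal JobSet manifest as a plain dict (hypothetical helper)."""
    return {
        # Assumed API group/version; verify against the installed JobSet CRD.
        "apiVersion": "jobset.x-k8s.io/v1alpha2",
        "kind": "JobSet",
        "metadata": {"name": name},
        "spec": {
            # Each entry describes one Job template, stamped out `replicas` times.
            "replicatedJobs": [
                {
                    "name": "workers",
                    "replicas": 1,
                    # An ordinary batch/v1 Job template.
                    "template": {
                        "spec": {
                            "parallelism": workers,
                            "completions": workers,
                            "template": {
                                "spec": {
                                    "restartPolicy": "Never",
                                    "containers": [
                                        {"name": "trainer", "image": image}
                                    ],
                                }
                            },
                        }
                    },
                }
            ]
        },
    }


# Placeholder name and image, for illustration only.
manifest = make_jobset("demo-training", "example.com/trainer:latest", workers=4)
print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and applied with `kubectl apply -f`, assuming the JobSet controller and its CRD are already installed in the cluster.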
Related content:
HPCng is an open community of people and organizations interested in the broad modernization of HPC capabilities across a wide range of use cases, from traditional HPC to enterprise and hyperscale workloads.
Apptainer is an open-source container platform designed to be simple, fast, and secure. Many container platforms are available, but Apptainer is designed for ease of use on shared systems and in high-performance computing (HPC) environments.