Kubeflow and ML Automation: Part 1

With the growing maturity of the machine learning (ML) ecosystem and the deeper integration of ML algorithms into production software, managing the development, testing, and deployment of ML models has become a complex task.


Training deep neural-network models in a cloud environment requires a highly customized system that links together different components, such as compute, storage, and networking, allowing you to manage and orchestrate an ML pipeline in a consistent way. To create a functional ML pipeline, ML practitioners need to be able to set up an ML development environment, provision and scale compute power for training their models, create the models’ API, serve the models, and manage their lifecycle. But handling these tasks manually is an error-prone and time-consuming process. Moreover, not all ML practitioners have the DevOps expertise required to go from development to production and manage the AI/ML pipeline.


Automation can help address many of these challenges and is an integral part of the MLOps methodology, which aims to streamline ML workflows throughout the application lifecycle. In general, automation can provide the following benefits for the ML workflow in a distributed compute environment:


  • Lifecycle management of ML training jobs, including automatic scaling
  • Composing, linking, and orchestrating different components of the ML pipeline
  • Ensuring ML jobs have high availability and fault tolerance via automatic health checking and recovery
  • Reproducibility of ML experiments and support for iterative ML practices, e.g., automatic retraining based on incoming data
  • Automatic provisioning of compute resources

MLOps Pipeline (Image by Kaskada.com)

This blog post is Part 1 of the “Kubeflow and ML Automation” series, which describes how Kubeflow helps automate Machine Learning on Kubernetes.


In this post, we introduce readers to Kubeflow, an open-source Kubernetes-based tool for automating your ML workflow. We show how the Kubeflow Pipelines platform and Kubeflow components allow you to automate and manage different stages of the ML workflow, including data preparation, model experimentation, training, and deployment. We also discuss how Kubeflow can be used as part of a cloud-based managed Kubernetes service.

In Part 2, we’ll walk readers through some practical examples of using Kubeflow for model training, ML model optimization, serving, metadata retrieval and processing, and creating composable and reusable ML pipelines.

Why ML Automation?

The traditional process of ML research and development is based on multiple manual practices such as data pre-processing, model selection and testing, model optimization, and deployment. This process requires a lot of specialized knowledge in mathematics, statistics, and programming.


Although advanced ML practitioners are equipped with the knowledge necessary to develop ML models, the process of testing, deploying, and training ML models requires specialized expertise in compute and storage infrastructure, as well as networking, to serve and deploy the ML models. This complicates the process of deploying models to production and prevents many companies without specialized expertise in ML and DevOps from adopting AI/ML. Also, traditional ML processes lack reproducibility and repeatability and do not allow effective collaboration between different IT teams.

Automation, which is ubiquitous in computer programming and IT, is the natural solution to these ML challenges. It provides many benefits for companies seeking to adopt ML algorithms in their applications, including:

  • Faster TTM (time to market). ML automation allows you to streamline ML model training, testing, and deployment, which results in a faster transition from development to production.
  • Enabling MLOps. Automation helps integrate various components of the ML workflow into a coherent pipeline that can be easily upgraded, tested, maintained, and deployed. Integrating automated testing, model builds, and deployments into the ML workflow aligns ML processes with existing CI/CD tools and approaches.
  • Better collaboration. Reproducibility of automated ML experiments and automated metadata and artifact management leads to a clear understanding by different teams of the model development timeline. This ensures more efficient collaboration across teams and different projects.
  • Reduction of human error. Subtle human errors can lead to a drastic deterioration in ML model performance, which is hard to debug due to the complexity of neural architecture and the “black-box” nature of model layers and parameters. Automation helps reduce human errors in ML models that are due to manual practices.
  • Improved model accuracy and performance. AutoML algorithms can improve model accuracy and performance much faster than manual trial-and-error tuning, and can achieve better results on real-world production data.


ML automation provides numerous benefits for ML researchers and practitioners as well. Thanks to automated MLOps processes and AutoML algorithms, they can develop, train, and optimize their models faster by focusing on the research part of their work, such as experimentation, and not having to worry as much about provisioning compute resources, implementing distributed training, and configuring training environments.

Why Is Kubeflow the Answer to the Challenge of Automation?

Kubeflow is an open-source ML platform designed to train and serve ML models on Kubernetes. The main purpose of Kubeflow is to enable MLOps — a methodology for the end-to-end management of ML workflows that facilitates fast model development, training, and rollout/upgrade cycles. To achieve this, Kubeflow leverages Kubernetes API resources and provides a set of tools to automate various stages of the ML workflow, from development and testing to deployment.

Kubeflow allows you to automate ML workflows in many different ways. Let’s discuss the most important of them.

Automated Containerization of ML Code

Training ML models in a distributed containerized environment like Kubernetes requires packaging ML code into containers configured with all the necessary executables, libraries, configuration files, and other data. Packaging containers whenever the model is updated takes time and may require a complex CI/CD pipeline. Kubeflow Fairing is a Kubeflow component that can automatically build container images from Jupyter Notebooks and Python files. Using Kubeflow Fairing, practitioners with less experience in Docker can easily run containers on Kubernetes.

Automation of the Application Lifecycle in Kubernetes

Distributed environments like Kubernetes clusters are highly volatile and dynamic, which makes it hard to ensure the uninterrupted operation of ML applications and training jobs. With hundreds of containers and pods running in one cluster, manual redeployment of failed applications and pods can be difficult, error-prone, and time-consuming.

Kubeflow leverages the Kubernetes control plane and its operators to manage the lifecycle of training jobs. Kubeflow controllers such as the TFOperator or the PyTorch Operator interact with Kubernetes controllers and schedulers to perform automated health checks, ensuring that the ML model is running as expected. In turn, Kubernetes automatically restarts pods on failure and maintains the desired number of replicas in Deployments, StatefulSets, and Kubeflow custom API resources.

Autoscaling with Kubeflow

Autoscaling ensures the high availability of served ML models. The failure to dynamically scale ML inference servers based on inbound user requests can lead to downtimes and poor ML user experience. Kubeflow provides several tools for autoscaling served models on Kubernetes.

For example, autoscaling is supported for ML models deployed with the KFServing framework, which is a part of the Kubeflow default installation. Under the hood, KFServing uses Knative to autoscale deployments based on the average number of concurrent requests per pod. If the concurrency target is set to three, for example, nine concurrent requests will cause Knative to scale the service to three inference pods. The concurrency target can be easily set using the KFServing InferenceService custom resource. Similarly, flexible autoscaling is supported by Seldon Core, which is not part of Kubeflow but integrates well with it.
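
The arithmetic behind concurrency-based autoscaling can be sketched as follows. This is a back-of-the-envelope approximation of what Knative's autoscaler converges to, not its actual implementation (the real autoscaler averages concurrency over a time window and applies panic-mode heuristics):

```python
import math

def required_pods(concurrent_requests: int, concurrency_target: int) -> int:
    """Approximate steady-state pod count: enough pods so that each
    handles at most `concurrency_target` concurrent requests,
    with a floor of one pod."""
    return max(1, math.ceil(concurrent_requests / concurrency_target))

# With a concurrency target of 3, nine concurrent requests need
# three pods, and 100 concurrent requests need 34.
print(required_pods(9, 3))    # 3
print(required_pods(100, 3))  # 34
```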

Automated Distributed Training

Distributed training is required for fast and scalable training of large ML models. It’s a natural solution in a distributed environment like Kubernetes where multiple nodes and CPUs/GPUs are available on-demand. However, it may be hard to manually configure the interaction and coordination between ML training jobs and workers, including the exchange and update of weights, computation of aggregate losses, etc.


Kubeflow ships with several tools that allow you to automate the distributed training of ML models on Kubernetes. For example, TFJob can be used to configure distributed training for TensorFlow models. The TFOperator that manages these training jobs provides three abstractions that represent different agents in the distributed training: Chiefs are responsible for the training orchestration and model checkpointing, Parameter Servers perform weight updates and model loss calculations, and Workers run the training code. These abstractions can be used to implement both synchronous and asynchronous distributed training patterns supported by the TF distributed training modules.
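
As a sketch of what this looks like in practice, the TFJob manifest below requests one Chief, two Parameter Servers, and four Workers. The job name and container image are hypothetical, and field details may vary slightly across training-operator versions; note that TFJob conventionally expects the primary container to be named `tensorflow`:

```yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: mnist-distributed          # hypothetical job name
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        spec:
          containers:
            - name: tensorflow
              image: example-registry/mnist-train:latest  # hypothetical image
    PS:
      replicas: 2
      template:
        spec:
          containers:
            - name: tensorflow
              image: example-registry/mnist-train:latest
    Worker:
      replicas: 4
      template:
        spec:
          containers:
            - name: tensorflow
              image: example-registry/mnist-train:latest
```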

To enable distributed training in your Kubeflow cluster, you can also use MPI Operator, which allows for allreduce-style distributed training of your ML models on Kubernetes. MPI Operator is the Kubeflow implementation of the Message Passing Interface (MPI), a multi-network protocol for efficient communication and coordination of nodes in compute clusters.
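
An MPIJob resource follows a similar shape, with a Launcher that runs `mpirun` and a set of Workers it coordinates. The sketch below follows the v1 MPI Operator API; the image and training-script path are hypothetical, and field names may differ across operator versions:

```yaml
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: allreduce-training        # hypothetical job name
spec:
  slotsPerWorker: 1
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - name: mpi-launcher
              image: example-registry/horovod-train:latest  # hypothetical image
              command: ["mpirun", "python", "/train.py"]    # hypothetical script
    Worker:
      replicas: 4
      template:
        spec:
          containers:
            - name: mpi-worker
              image: example-registry/horovod-train:latest
```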

Automation of Model Optimization

ML model optimization is an important component of the ML workflow and aims to improve the performance and accuracy of trained ML models. Methods such as hyperparameter optimization and model architecture search have traditionally relied on repetitive trial and error experiments that take much time and are hard to test and reproduce. AutoML algorithms were developed to enable faster optimization of ML models for better accuracy and performance.

The Kubeflow Katib component provides AutoML tools for hyperparameter optimization and neural architecture search (NAS). Instead of testing hyperparameters manually, ML developers can use Katib to define the hyperparameter search space and let Katib perform the search using a specified AutoML algorithm. Katib supports algorithms such as Bayesian optimization, Tree of Parzen Estimators, and Random Search, among others.

This optimization can be performed on several hyperparameters in parallel. For example, you can simultaneously optimize the learning rate and regularization parameter. Also, Katib supports early-stopping algorithms to prevent model overfitting and NAS to select the optimal combination of a neural network’s layers and modules.
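
A Katib Experiment expressing this kind of parallel search might look like the sketch below. It defines a random search over a learning rate and a regularization strength; the metric name, ranges, and trial counts are hypothetical, and the `trialTemplate` section, which defines the training job launched for each trial, is omitted for brevity:

```yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-search-example     # hypothetical experiment name
spec:
  objective:
    type: maximize
    goal: 0.95
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: random
  maxTrialCount: 12
  parallelTrialCount: 3
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.001"
        max: "0.1"
    - name: weight_decay          # regularization strength
      parameterType: double
      feasibleSpace:
        min: "0.0001"
        max: "0.01"
  # trialTemplate: ...            # per-trial training job, omitted here
```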

Automation of ML Metadata and Log Management

The conventional metadata and log management process used by ML practitioners typically involves the following steps:

  • Generating artifacts (graphs, tables), logs, and metrics
  • Watching them while the training job is running
  • Recording the most important metrics for model performance analysis


If used repeatedly over a prolonged period of time, this manual process can lead to losing track of past ML experiments. It’s difficult to monitor the history of the model and collect insights from the disparate and unmanaged logs and metadata generated by different experiments.

Kubeflow Metadata is a tool that helps you automate ML metadata management and address the above challenges. It ships with a Python SDK that lets ML practitioners record metadata directly from their ML model scripts. The SDK provides a number of useful functions to retrieve and organize the metadata on model training, datasets, metrics, and experiment runs and ship this data to Kubeflow Artifact Store for use by other components. Automatic retrieval of the metadata via Kubeflow enables better observability of ML experiments and lets you audit past experiments, performance metrics, datasets, and frameworks. A better understanding of past experiments enabled by Metadata can result in faster ML development cycles and better coordination across ML teams.
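
The core idea, recording each run's metadata in a structured store so past experiments can be queried later, can be illustrated with a toy stand-in. This is not the real Kubeflow Metadata SDK (that lives in the `kubeflow-metadata` Python package); all names here are hypothetical:

```python
import datetime

class RunStore:
    """Toy stand-in for a metadata store: keeps one record per
    training run so past experiments can be audited and compared."""

    def __init__(self):
        self.runs = []

    def log_run(self, model_name, dataset, framework, metrics):
        # Record what was trained, on which data, and how it scored.
        record = {
            "model": model_name,
            "dataset": dataset,
            "framework": framework,
            "metrics": metrics,
            "logged_at": datetime.datetime.now().isoformat(),
        }
        self.runs.append(record)
        return record

    def best_run(self, metric):
        # Audit past experiments: find the run with the highest metric.
        return max(self.runs, key=lambda r: r["metrics"][metric])

store = RunStore()
store.log_run("mnist-cnn", "mnist-v1", "tensorflow", {"accuracy": 0.97})
store.log_run("mnist-cnn", "mnist-v2", "tensorflow", {"accuracy": 0.98})
print(store.best_run("accuracy")["dataset"])  # mnist-v2
```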

Automating ML Workflow with Kubeflow Pipelines

The Kubeflow automation tools we’ve discussed until now help automate specific parts of the ML workflow, such as training or hyperparameter optimization. The goal of Kubeflow Pipelines is to automate the entire ML workflow and transform it into a composable ML Pipeline.

Kubeflow Pipelines consist of multiple components that represent various stages of the ML process in the form of a graph. For example, the pipeline can start with the data pre-processing job that consumes data from cloud storage and passes it to the training job. In turn, a training job component can be connected to the serving module that saves the trained model and launches the inference service to expose it to the Internet. This pipeline can be created using Pipeline DSL, saved as a separate file, and viewed as a graph in the Kubeflow dashboard.
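
The component-graph idea can be illustrated with a toy sketch. This is deliberately not the actual `kfp` Pipeline DSL; it only shows the shape of the approach, with each stage an independent component whose output feeds the next:

```python
# Each stage is an independent, testable component.
def preprocess(raw):
    # e.g. normalize records pulled from cloud storage
    return [x / max(raw) for x in raw]

def train(data):
    # stand-in "model": just the mean of the training data
    return sum(data) / len(data)

def serve(model):
    # wrap the trained model in a callable "inference service"
    return lambda x: x * model

# Wire the components into a pipeline: preprocess -> train -> serve
raw_data = [2, 4, 6, 8]
model = train(preprocess(raw_data))
predict = serve(model)
print(predict(10))  # 6.25
```

In the real DSL, each component typically runs as its own container and the connections are declared in Python, then compiled into a pipeline definition that Kubeflow executes on the cluster.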

Organizing the ML pipeline as a set of independent and connected components provides enormous benefits for AI/ML teams enabling the ML process in line with the MLOps methodology. Such an approach allows for modularity of the ML pipeline, similar to a microservices architecture. Each component of the pipeline can be developed, tested, and upgraded independently by different teams. This modularity leads to better reusability of ML components, as now, each of them can be used as components of other ML pipelines. Since components are abstracted and isolated from the underlying environment, they can be used in any pipeline that follows the same approach.

Also, Kubeflow pipelines enable easy repeatability and reproducibility of ML experiments. ML practitioners can run Kubeflow Pipelines knowing the execution order of different scripts and components, leading to improved coordination across teams and a better understanding of the ML workflow.

Advantages of Using Kubeflow in the Cloud

Running Kubeflow in the cloud provides additional benefits for companies seeking to automate their ML workflow and make it more efficient. Cloud providers like Google Cloud offer managed Kubernetes services and cloud-based ML platforms that can be integrated with the Kubeflow installation. Most importantly, when running Kubeflow in the cloud, you get access to highly performant on-demand CPUs/GPUs and storage.

Companies running Kubeflow in the cloud can take advantage of a mature ecosystem of cloud tools, including Big Data tools and databases, block storage, serverless cloud functions, monitoring, logging, tracing, auditing, security, and add-on ML services, all of which can be integrated into their ML process as helpers or as parts of the pipeline.

For example, when running Kubeflow in Google Cloud, you get access to Google Tensor Processing Units (TPUs), specialized hardware optimized for the matrix computations and weight updates typically performed by neural networks.


Kubeflow is one of the first tools to bring full automation of AI/ML pipelines to containerized and distributed environments like Kubernetes. It leverages Kubernetes API resources and orchestration services to ensure high availability and fault tolerance of containerized ML applications and provides its own set of tools to automate various parts of the ML workflow. Kubeflow’s feature set enables automation of ML model training, hyperparameter optimization, feature engineering, data pre-processing, model serving, and ML model containerization.

In addition, Kubeflow Pipelines helps create composable ML workflows in line with the MLOps methodology. Pipelines make ML workflows reproducible and extendable as well, dramatically improving the quality of ML collaboration across teams and enabling faster experimentation and better observability of model training.

Built-in automation offered by Kubeflow makes it a great tool for companies looking for a fast and efficient way to deploy ML models to production. The platform can dramatically reduce time to market for ML products and facilitate efficient CI/CD processes in line with MLOps methodology.

If you want to learn more about using Kubeflow for the automation of ML on Kubernetes, you can read Part 2 of our series, which provides many practical examples of leveraging Kubeflow components to create efficient ML workflows.