The increasing adoption of machine learning (ML) in production requires scalable, reliable, and automated deployment strategies. Enterprises need a way to serve models at scale, track experiments, and automate model selection. Kubernetes provides the infrastructure foundation, but the combination of MLflow, KServe, and AutoML elevates this to a fully managed, production-grade AI/ML inference platform.
This article explains how to set up scalable and reliable inference on Kubernetes using MLflow for experiment tracking and model registry, KServe for model serving, and AutoML for automated model building. We’ll also walk through code examples that tie these technologies together.
Why Kubernetes For AI/ML Inference
Kubernetes is the de facto standard for container orchestration. It offers:
- Scalability: Horizontal Pod Autoscaling (HPA) automatically adds or removes pods based on demand.
- Reliability: Built-in failover, health checks, and rolling updates ensure minimal downtime.
- Portability: Kubernetes abstracts the underlying infrastructure, whether cloud or on-premises.
These benefits make Kubernetes ideal for serving ML models that may need to handle unpredictable workloads.
Key Components Of The Stack
Before diving into implementation, let’s clarify the roles of each tool:
- MLflow: A platform for tracking ML experiments, logging metrics, and registering models in a central registry.
- KServe: A Kubernetes-native model inference platform (formerly known as KFServing) that supports autoscaling, GPU acceleration, and multi-framework model serving.
- AutoML: An approach or library (such as H2O AutoML, Auto-sklearn, or cloud-native AutoML solutions) that automatically searches for the best-performing models.
Combining these tools results in an end-to-end pipeline: AutoML discovers the best model, MLflow tracks and stores it, and KServe deploys it for scalable inference.
Setting Up The Kubernetes Cluster
You can use any Kubernetes cluster—local (via Minikube, kind, or k3s) or cloud-managed (GKE, EKS, AKS). For example, with Minikube:
minikube start --cpus=4 --memory=8192
Ensure kubectl is configured:
kubectl get nodes
You should see at least one node in the Ready state. Install Helm as well for package management:
brew install helm # or your OS equivalent
Installing MLflow For Experiment Tracking
MLflow can run inside Kubernetes as a deployment with a backend database and an artifact store.
Deploy MLflow with a PostgreSQL backend:
Create an mlflow-deployment.yaml file:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mlflow
  template:
    metadata:
      labels:
        app: mlflow
    spec:
      containers:
        - name: mlflow
          image: cr.flyte.org/unionai/mlflow:latest
          env:
            - name: BACKEND_STORE_URI
              value: postgresql://user:password@postgres:5432/mlflow
            - name: ARTIFACT_ROOT
              value: s3://your-s3-bucket
          ports:
            - containerPort: 5000
Apply it:
kubectl apply -f mlflow-deployment.yaml
Expose it via a Kubernetes service:
apiVersion: v1
kind: Service
metadata:
  name: mlflow-service
spec:
  selector:
    app: mlflow
  ports:
    - port: 5000
      targetPort: 5000
  type: LoadBalancer
Now MLflow is accessible, ready to track experiments and store models.
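With the tracking server running, training code only needs to be pointed at it before logging anything. Here is a minimal sketch in Python, assuming the in-cluster DNS name mlflow-service on port 5000 (use the LoadBalancer’s external IP or a port-forward when connecting from outside the cluster):
import mlflow

# Point the MLflow client at the tracking server exposed by mlflow-service.
# From outside the cluster, substitute the LoadBalancer's external IP.
mlflow.set_tracking_uri("http://mlflow-service:5000")

# Experiments, metrics, and model artifacts logged after this call are
# recorded on that server and in the configured artifact store.
mlflow.set_experiment("digits-automl")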
Using AutoML To Build And Register Models
AutoML simplifies model selection and hyperparameter tuning. Here’s an example using Python’s auto-sklearn to train and log a model to MLflow:
import autosklearn.classification
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the digits dataset and split it into train and test sets
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Let auto-sklearn search for the best pipeline within a 5-minute budget
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=300,
    per_run_time_limit=30,
)
automl.fit(X_train, y_train)

# Evaluate the best pipeline on the held-out test set
y_pred = automl.predict(X_test)
acc = accuracy_score(y_test, y_pred)

# Log the metric and the trained model to MLflow
mlflow.set_experiment("digits-automl")
with mlflow.start_run():
    mlflow.log_metric("accuracy", acc)
    mlflow.sklearn.log_model(automl, "model")
This code lets auto-sklearn select the best model and logs it to MLflow, with the model artifact stored in the configured artifact store.
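To promote the model through the Model Registry later on, you can also register the logged artifact under a named registry entry. A minimal sketch, where the run ID and the registered model name digits-automl-model are placeholders:
import mlflow

# Run ID of the training run above (placeholder; use your actual run ID).
run_id = "abcdef123456"

# Register the logged artifact as a new version of a named registry entry.
result = mlflow.register_model(
    model_uri=f"runs:/{run_id}/model",
    name="digits-automl-model",
)
print(result.name, result.version)
Alternatively, passing registered_model_name="digits-automl-model" to mlflow.sklearn.log_model logs and registers the model in a single call.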
Deploying Models With KServe
KServe provides serverless inferencing on Kubernetes. It supports frameworks like TensorFlow, PyTorch, XGBoost, and even custom containers.
Install KServe via Helm:
helm repo add kserve https://kserve.github.io/helm-charts
helm install kserve kserve/kserve
Once installed, create an InferenceService YAML manifest to deploy a model tracked in MLflow.
Suppose MLflow stored a scikit-learn model in an S3 bucket. Your inference-service.yaml might look like this:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: digits-model
spec:
  predictor:
    sklearn:
      storageUri: s3://your-s3-bucket/mlflow/0/abcdef123456/artifacts/model
      resources:
        limits:
          cpu: 1
          memory: 2Gi
Apply it:
kubectl apply -f inference-service.yaml
KServe automatically creates an endpoint such as:
http://digits-model.default.example.com/v1/models/digits-model:predict
This endpoint scales up or down depending on traffic.
Autoscaling And Reliability
KServe integrates with Knative Serving and Kubernetes Horizontal Pod Autoscaler (HPA). This allows:
- Automatic Scaling: Scale to zero when idle, scale up under heavy load.
- Rolling Updates: Deploy new model versions without downtime.
- Health Checks: Liveness and readiness probes restart unhealthy pods.
For example, to enable scaling between 1 and 10 replicas:
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 10
This ensures reliability even during traffic spikes.
Securing The Inference Endpoint
For production, secure the inference service with authentication and TLS:
- Ingress Gateway: Use Istio or NGINX ingress with TLS certificates.
- Authentication: Integrate with OAuth2 proxies or service meshes.
- Network Policies: Restrict access to specific namespaces or IP ranges.
For example, an Istio gateway can route traffic securely:
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: kserve-gateway
spec:
  selector:
    istio: ingressgateway
  servers:
    - port:
        number: 443
        name: https
        protocol: HTTPS
      tls:
        mode: SIMPLE
        credentialName: kserve-cert
      hosts:
        - "ml.example.com"
Continuous Deployment With MLflow Model Registry
To update models seamlessly:
- Promote a model version to “Production” in MLflow’s Model Registry.
- Update the storageUri in the KServe InferenceService manifest to point to the new model.
- Apply the manifest. KServe performs a rolling update with zero downtime.
This makes CI/CD pipelines easy to implement with tools like GitHub Actions or Argo CD.
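As one possible sketch of such a pipeline step, the snippet below uses the MLflow client to look up the version currently promoted to “Production” and prints a kubectl patch command that points the InferenceService at its artifact location; the registered model name digits-automl-model and the tracking URI are placeholders carried over from the earlier examples:
import json
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://mlflow-service:5000")

# Fetch the model version currently in the "Production" stage
# ("digits-automl-model" is a placeholder registered model name).
prod = client.get_latest_versions("digits-automl-model", stages=["Production"])[0]
print(f"Production version {prod.version} stored at {prod.source}")

# Build a kubectl patch that points the InferenceService at the new artifact.
patch = json.dumps({"spec": {"predictor": {"sklearn": {"storageUri": prod.source}}}})
print(f"kubectl patch inferenceservice digits-model --type=merge -p '{patch}'")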
Example: Querying The Model
Once deployed, you can send inference requests using curl or Python.
Sample JSON Input:
{
"instances": [[0.0, 0.1, 0.2, 0.3, ...]]
}
cURL Command:
curl -v \
-H "Host: digits-model.default.example.com" \
-H "Content-Type: application/json" \
-d @input.json \
http://<INGRESS_IP>/v1/models/digits-model:predict
The service responds with prediction outputs in JSON.
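The same request can be sent from Python with the requests library; here is a minimal sketch that mirrors the cURL call above, with the ingress IP left as a placeholder:
import json
import requests

# External IP of the ingress gateway (placeholder).
ingress_ip = "<INGRESS_IP>"
url = f"http://{ingress_ip}/v1/models/digits-model:predict"

# The Host header routes the request to the digits-model InferenceService.
headers = {"Host": "digits-model.default.example.com"}

with open("input.json") as f:
    payload = json.load(f)

response = requests.post(url, headers=headers, json=payload)
print(response.json())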
Observability And Monitoring
A production inference pipeline needs visibility:
- Logging: Use tools like Fluentd or Loki for centralized log aggregation.
- Metrics: KServe exposes Prometheus metrics for request count, latency, and errors.
- Tracing: Jaeger or OpenTelemetry can trace requests across microservices.
These observability practices ensure you can troubleshoot performance issues quickly.
Conclusion
Running scalable and reliable AI/ML inference on Kubernetes is no longer a complex undertaking when combining MLflow, KServe, and AutoML. AutoML streamlines model training and selection, MLflow provides robust experiment tracking and a model registry, and KServe enables serverless, autoscaled model serving with built-in support for multiple ML frameworks.
By integrating these tools:
- Data scientists can focus on model development while leaving infrastructure concerns to Kubernetes.
- Engineers can automate deployments and rollbacks through the MLflow registry and KServe’s declarative manifests.
- Operations teams can maintain high availability and monitor performance using Kubernetes-native tools.
This architecture ensures that as demand grows, your inference services scale seamlessly while maintaining reliability and security. With the described setup and code examples, you can confidently move from prototype to production, delivering machine learning predictions at scale without sacrificing manageability or speed.