MLOps in Practice: Building Machine Learning Systems That Last
The machine learning lifecycle has a seductive structure in its early stages. You gather data, build features, train a model, evaluate it, and point to a number — F1 score, AUC, RMSE, mean average precision — that looks good. You deploy it. Success.
Six months later, the model’s predictions have degraded. The data it receives in production no longer looks like the data it was trained on. A feature it relied on has changed semantics. The ground truth labels are now collected differently. The world has moved; the model has not.
This is the problem that MLOps — Machine Learning Operations — exists to address. Not the training of models, but the maintenance of systems that deliver reliable predictions continuously, in production, under changing real-world conditions.
The Fundamental Asymmetry
Software systems degrade when code changes. ML systems degrade even when nothing changes — because the world that provides their inputs is not static. This model drift is the defining challenge of production ML.
Two primary forms:
Data drift (covariate shift): The statistical distribution of input features changes. A fraud detection model trained on pre-pandemic transaction patterns encounters post-pandemic transaction patterns. The model hasn’t changed. The world it was trained on has. Predictions degrade.
Concept drift: The relationship between inputs and outputs changes. A model predicting house prices trained on 2020 data operates in the 2022 market with different price determinants. The features are the same; what they imply has changed.
Detecting these drifts before they cause business harm — and automating response to detected drift — is the core operational challenge of deployed ML.
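A common lightweight way to quantify data drift on a single numeric feature is the Population Stability Index (PSI), which compares binned frequencies of a production sample against a training baseline. The sketch below is illustrative, not the metric Model Monitor itself computes; the 0.1/0.2 alarm thresholds are conventional rules of thumb.

```python
import numpy as np

def population_stability_index(baseline, current, bins=10):
    """PSI between two samples of one feature.

    Rule of thumb: < 0.1 stable, 0.1-0.2 moderate shift, > 0.2 significant drift.
    """
    # Bin edges from baseline quantiles (robust to skewed features)
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range production values
    b_frac = np.histogram(baseline, edges)[0] / len(baseline)
    c_frac = np.histogram(current, edges)[0] / len(current)
    # Floor the fractions to avoid log(0) on empty bins
    b_frac = np.clip(b_frac, 1e-6, None)
    c_frac = np.clip(c_frac, 1e-6, None)
    return float(np.sum((c_frac - b_frac) * np.log(c_frac / b_frac)))

rng = np.random.default_rng(0)
same = population_stability_index(rng.normal(0, 1, 10_000), rng.normal(0, 1, 10_000))
shifted = population_stability_index(rng.normal(0, 1, 10_000), rng.normal(0.5, 1, 10_000))
```

Run on a schedule per feature, a PSI sweep gives a cheap first-pass drift signal even before a full monitoring platform is in place.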
A Production-Ready ML Stack on AWS
The AWS machine learning ecosystem offers a relatively complete stack for production MLOps. Understanding how the components fit together is the prerequisite for avoiding the common mistake of using powerful tools for problems they aren’t suited to.
SageMaker: The Operational Center
SageMaker is AWS’s managed ML platform. It spans the full lifecycle: data preparation, training, evaluation, deployment, and monitoring. For organizations already on AWS, it provides the tightest integration with the rest of the AWS ecosystem. Its principal components for MLOps:
SageMaker Pipelines orchestrates multi-step ML workflows — data preprocessing, model training, evaluation, conditional registration, deployment — as explicit directed acyclic graphs that are versioned, reproducible, and executable on demand or schedule. This is the foundation of automated retraining.
SageMaker Model Registry is a model catalog that maintains versioned model artifacts, their associated metadata (training metrics, data used, lineage), and approval status. Models progress through the registry: PendingManualApproval → Approved → deployed. Every model in production has a registry entry. Every training run produces a candidate.
SageMaker Endpoints are the serving infrastructure. Endpoint configurations specify instance types, autoscaling policies, and data capture settings. Multi-variant endpoints allow A/B testing between model versions. Serverless inference provides automatic scaling to zero for intermittent workloads.
SageMaker Model Monitor provides continuous monitoring of production endpoints — comparing the statistical distribution of live inference requests against a baseline captured from training data, and raising alerts when distributions drift beyond thresholds.
The Model Monitor Setup
Setting up Model Monitor correctly requires discipline:
```python
from sagemaker.model_monitor import DataCaptureConfig, DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

# 1. Enable data capture on the endpoint (pass this config when deploying)
data_capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=20,  # Capture 20% of inference traffic
    destination_s3_uri=f"s3://{bucket}/model-monitor/data-capture",
    capture_options=["REQUEST", "RESPONSE"],
)

# 2. Create a baseline from training data
monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)
monitor.suggest_baseline(
    baseline_dataset=f"s3://{bucket}/train/baseline.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri=f"s3://{bucket}/model-monitor/baseline",
)

# 3. Schedule monitoring
monitor.create_monitoring_schedule(
    monitor_schedule_name="fraud-detection-monitor",
    endpoint_input=predictor.endpoint_name,
    output_s3_uri=f"s3://{bucket}/model-monitor/reports",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression="cron(0 * ? * * *)",  # Hourly
)
```
The monitor compares the distribution of each feature in recent inference traffic against the baseline. Violations — features outside expected ranges, new categories, changed distributions — trigger CloudWatch alarms that can initiate automated response workflows.
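When Model Monitor finds violations, it writes a `constraint_violations.json` report to the output S3 prefix. A response workflow typically fetches and summarizes that report; the sketch below parses the report structure with a hand-written sample that mirrors its shape (the feature names and descriptions here are illustrative, not real output).

```python
import json

def summarize_violations(report_json):
    """Group Model Monitor violations by constraint check type.

    `report_json` is the parsed constraint_violations.json; real reports
    carry additional fields beyond the three used here.
    """
    summary = {}
    for v in report_json.get("violations", []):
        summary.setdefault(v["constraint_check_type"], []).append(v["feature_name"])
    return summary

# Hand-written sample mirroring the report's structure
sample = {
    "violations": [
        {"feature_name": "transaction_amount",
         "constraint_check_type": "baseline_drift_check",
         "description": "Drift distance exceeds the baseline threshold"},
        {"feature_name": "merchant_category",
         "constraint_check_type": "data_type_check",
         "description": "Observed value outside the baseline data type"},
    ]
}
summary = summarize_violations(sample)
```

A summary like this makes a drift alert actionable: an alarm that names the drifting features is far easier to triage than a bare "violations detected" signal.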
Feature Stores
Features are one of the most expensive and error-prone aspects of ML production. The same feature — “customer’s average purchase value in the last 30 days” — may be computed independently in training pipelines, batch inference pipelines, and real-time serving, with subtle differences that cause training-serving skew: the model receives different feature values in production than it was trained on.
SageMaker Feature Store addresses this by centralizing feature computation and storage. Features are written once and read by both training and serving pipelines, eliminating skew at the source.
Online store: Low-latency key-value storage for real-time feature retrieval. Millisecond reads. Used by real-time inference endpoints.
Offline store: S3-backed columnar storage for training data and historical analysis. Queryable via Athena or SageMaker Processing.
The investment in a feature store pays for itself the first time a training-serving skew incident is avoided. These incidents are difficult to diagnose precisely because the model works correctly on training data — the problem is in the production data pathway, which is harder to observe.
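Writing to the online store goes through the feature store runtime's `PutRecord` API, where every feature value is a named string. A minimal sketch, assuming a hypothetical feature group whose definitions include `customer_id`, `avg_purchase_value_30d`, and `event_time` (these names are illustrative and must match your feature group's schema):

```python
from datetime import datetime, timezone

def build_record(customer_id, avg_purchase_30d):
    """Shape one record for FeatureGroup ingestion via PutRecord.

    Feature names are illustrative; they must match the feature group's
    registered feature definitions exactly.
    """
    return [
        {"FeatureName": "customer_id", "ValueAsString": str(customer_id)},
        {"FeatureName": "avg_purchase_value_30d", "ValueAsString": f"{avg_purchase_30d:.2f}"},
        {"FeatureName": "event_time",
         "ValueAsString": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")},
    ]

def put_record(feature_group_name, record):
    # Lazy import so the record-building logic is usable/testable without AWS
    import boto3
    client = boto3.client("sagemaker-featurestore-runtime")
    client.put_record(FeatureGroupName=feature_group_name, Record=record)
```

Because both training pipelines (via the offline store) and real-time endpoints (via `GetRecord` against the online store) read the same ingested values, the skew-prone duplicate computation disappears.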
The Automated Retraining Pipeline
The response to detected drift is retraining. But retraining should not be manual. A mature MLOps practice has an automated pipeline that:
- Detects drift via Model Monitor violations surfaced as CloudWatch alarms
- Triggers a Lambda function or Step Functions workflow
- Pulls fresh training data from the feature store or data lake
- Runs a SageMaker Pipeline: preprocess → train → evaluate → conditional register
- If evaluation metrics exceed thresholds, registers the new model in the registry
- Initiates a deployment workflow (blue-green or canary)
- Monitors the new deployment for regression
- Completes rollout or rolls back based on performance
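The glue between steps one and four is small. A minimal Lambda handler sketch, assuming the drift alarm is delivered as an EventBridge "CloudWatch Alarm State Change" event and that a SageMaker Pipeline named `fraud-detection-retrain` exists (both names are illustrative):

```python
def should_retrain(event):
    """True when an EventBridge CloudWatch alarm event reports the ALARM state."""
    detail = event.get("detail", {})
    return detail.get("state", {}).get("value") == "ALARM"

def lambda_handler(event, context):
    if not should_retrain(event):
        return {"started": False}
    # Lazy import so the decision logic above is testable without AWS
    import boto3
    sm = boto3.client("sagemaker")
    resp = sm.start_pipeline_execution(
        PipelineName="fraud-detection-retrain",  # illustrative pipeline name
        PipelineParameters=[
            {"Name": "TriggerSource", "Value": event["detail"]["alarmName"]},
        ],
    )
    return {"started": True, "execution_arn": resp["PipelineExecutionArn"]}
```

Everything downstream of `start_pipeline_execution` — preprocessing, training, evaluation, conditional registration — lives in the pipeline definition itself, so the trigger stays trivially simple.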
The conditional registration step is critical. Retraining does not guarantee improvement. The pipeline must compare new model metrics against the currently deployed model and register the new version only if it genuinely improves on the defined objectives.
```python
# Conditional model registration in SageMaker Pipelines
from sagemaker.workflow.conditions import ConditionGreaterThanOrEqualTo
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.functions import JsonGet
from sagemaker.workflow.step_collections import RegisterModel

cond_register = ConditionGreaterThanOrEqualTo(
    left=JsonGet(
        step_name=evaluation_step.name,
        property_file=evaluation_report,
        json_path="binary_classification_metrics.accuracy.value",
    ),
    right=0.90,  # Only register if accuracy >= 90%
)

register_step = RegisterModel(
    name="RegisterFraudModel",
    estimator=xgb_estimator,
    model_data=training_step.properties.ModelArtifacts.S3ModelArtifacts,
    content_types=["text/csv"],
    response_types=["text/csv"],
    inference_instances=["ml.m5.xlarge"],
    transform_instances=["ml.m5.xlarge"],
    model_package_group_name="FraudDetectionModels",
)

condition_step = ConditionStep(
    name="CheckAccuracy",
    conditions=[cond_register],
    if_steps=[register_step],
    else_steps=[],
)
```
Deployment Strategies
Moving a new model into production carries risk. Even a model that outperforms the current version on holdout data may behave unexpectedly on live traffic. Several deployment strategies mitigate this.
Shadow mode deployment: The new model receives a copy of production traffic and generates predictions, but those predictions are not served to users. Only the existing model’s predictions count. Monitor the new model’s metrics against the production model’s for a period before committing to the transition.
Canary deployment: Route a small percentage of traffic (5-10%) to the new model. Compare outcomes between the old and new model on this traffic split. Gradually increase the new model’s traffic share as confidence grows.
Blue-green deployment: Maintain two fully provisioned endpoints — the current (blue) and the new (green). Switch traffic at the load balancer level. The old endpoint remains live for rapid rollback if needed.
SageMaker’s endpoint update mechanism supports blue-green deployment natively, and canary weights can be applied via multi-variant endpoints.
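Canary traffic splitting on a multi-variant endpoint comes down to variant weights, which SageMaker treats as relative shares. A sketch using the `UpdateEndpointWeightsAndCapacities` API — the variant names `current-model` and `candidate-model` are hypothetical and must match your endpoint configuration:

```python
def canary_weights(canary_fraction):
    """Relative weights for a two-variant endpoint; expressed as fractions summing to 1."""
    assert 0.0 <= canary_fraction <= 1.0
    return [
        {"VariantName": "current-model", "DesiredWeight": 1.0 - canary_fraction},
        {"VariantName": "candidate-model", "DesiredWeight": canary_fraction},
    ]

def shift_traffic(endpoint_name, canary_fraction):
    # Lazy import keeps the weight math testable without AWS
    import boto3
    boto3.client("sagemaker").update_endpoint_weights_and_capacities(
        EndpointName=endpoint_name,
        DesiredWeightsAndCapacities=canary_weights(canary_fraction),
    )
```

Ramping the canary is then a sequence of `shift_traffic` calls — 0.05, 0.10, 0.25, … — each gated on the candidate's metrics holding up at the previous share.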
What Good Looks Like
A mature MLOps practice is not just tooling — it is culture and discipline.
Every model in production has an owner. Not just the team that trained it; a specific person who is responsible for monitoring its performance and accountable when it degrades.
Every deployment has a rollback plan. The previous model version is pinned and available. The rollback procedure is documented and tested. The decision criteria for initiating a rollback are explicit.
Drift detection operates on a schedule. Not manually, not reactively after users complain, but continuously, with defined thresholds that trigger automatic alerts.
Model performance is a tracked business metric. Not just technical metrics in a monitoring dashboard, but business outcomes — fraud caught / not caught, recommendations clicked / not clicked, churn predicted / not prevented — tracked alongside the P&L consequences.
Retroactive model evaluation. For models predicting future events (churn, fraud, demand), labels arrive after predictions. Ground truth evaluation pipelines continuously evaluate historical predictions against realized outcomes, providing ongoing production accuracy estimates.
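The core of a ground truth evaluation pipeline is a join between logged predictions and later-arriving labels, keyed on a shared ID. A minimal sketch with hypothetical record shapes (`id`, `prediction`, `label`); a real pipeline would read these from data capture logs and a labels table:

```python
def realized_accuracy(predictions, outcomes):
    """Join historical predictions with realized outcomes by ID and
    compute production accuracy over the matched subset only."""
    truth = {o["id"]: o["label"] for o in outcomes}
    matched = [(p["prediction"], truth[p["id"]]) for p in predictions if p["id"] in truth]
    if not matched:
        return None  # no labels have arrived yet
    return sum(pred == label for pred, label in matched) / len(matched)

preds = [{"id": 1, "prediction": 1}, {"id": 2, "prediction": 0}, {"id": 3, "prediction": 1}]
labels = [{"id": 1, "label": 1}, {"id": 2, "label": 1}]  # id 3's outcome not realized yet
acc = realized_accuracy(preds, labels)
```

Restricting the metric to the matched subset matters: labels arrive with lag, and naively scoring unmatched predictions as wrong would make accuracy appear to collapse at the end of every evaluation window.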
The gap between “we have a model in production” and “we have a reliable ML system” is large and systematically underestimated. MLOps is not the overhead you add after the model works. It is the practice that makes the model reliably work — over months and years, as data drifts and the world moves and the business changes its questions.
The model is not the product. The reliable delivery of predictions is the product. That delivery requires everything described here.