Service Foundry
Young Gyu Kim <credemol@gmail.com>

KServe Transformer: Bridging Application Payloads and Model Inference Protocols


Overview

In a production ML serving environment, the data format a model server expects rarely matches the format that client applications naturally produce. Servers for models trained with scikit-learn or PyTorch typically require a strict, schema-enforced payload, such as the KServe V2 Open Inference Protocol, while client applications tend to produce simpler, more expressive JSON objects. Forcing clients to conform to the V2 protocol leaks infrastructure concerns into application code, creating tight coupling and brittleness.

KServe addresses this problem with the Transformer component: a lightweight service that sits between the client and the model predictor. The Transformer intercepts every request and response, allowing developers to centralize data marshalling, feature engineering, and business logic in one place, independent of the model itself.

This article walks through the full lifecycle of building a custom KServe Transformer for an Iris flower classification service. You will learn how the Transformer integrates with an InferenceService, how to implement the preprocess() and postprocess() hooks, how to package the Transformer as a Docker image, and how to connect a Streamlit client application to the end-to-end pipeline.

Prerequisites

Before following this guide, ensure the following are in place:

  • Kubernetes cluster with KServe installed — A functioning cluster running KServe v0.13 or later. A previous article in this series covers cluster setup and KServe installation.

  • kubectl configured — The kubectl command-line tool must be installed and pointed at your cluster via a valid kubeconfig.

  • Docker — Required for building and pushing the custom Transformer container image.

  • Container registry access — A registry such as Docker Hub, Amazon ECR, or Google Artifact Registry where you can push the Transformer image. The cluster must be able to pull from this registry.

  • Python 3.11+ — The Transformer is written in Python. A local Python environment is needed for development and testing.

  • Familiarity with the Iris Classifier deployment — This article builds on the Iris Classifier InferenceService introduced in the previous article. The predictor component is reused here unchanged.

Transformer

What Is a Transformer in KServe?

A Transformer is a specialized KServe component that acts as a protocol and data adapter between clients and model predictors. Architecturally, it is declared in the same InferenceService as the predictor but, when defined under the transformer section, runs as its own workload with its own pods. KServe automatically routes all inbound requests through it before they reach the model.

The Transformer exposes the same HTTP endpoints as the predictor, making it completely transparent to the client. Internally, every request passes through two developer-defined hooks:

  • preprocess() — Receives the raw client request and transforms it into the format the model predictor expects.

  • postprocess() — Receives the model’s raw response and transforms it into the format the client expects.

This design cleanly separates two distinct concerns: ML inference (owned by the model) and data adaptation (owned by the Transformer). The model image remains a pure ML artifact, while all protocol translation, feature engineering, and business logic live in the Transformer.
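
The order in which the hooks run can be sketched in a few lines of plain Python. This is a simplified illustration, not KServe's actual dispatch code: run_pipeline and the three fake_* functions are hypothetical stand-ins, and the real predictor call happens over the network.

```python
from typing import Any, Callable, Dict

Payload = Dict[str, Any]


def run_pipeline(
    request: Payload,
    preprocess: Callable[[Payload], Payload],
    predict: Callable[[Payload], Payload],
    postprocess: Callable[[Payload], Payload],
) -> Payload:
    """Apply the three stages in the order a Transformer applies them."""
    model_request = preprocess(request)      # client format -> model format
    model_response = predict(model_request)  # stand-in for the call to the predictor
    return postprocess(model_response)       # model format -> client format


# Hypothetical stand-ins, used only to make the sketch runnable.
def fake_preprocess(request: Payload) -> Payload:
    return {"values": [request["x"], request["y"]]}


def fake_predict(model_request: Payload) -> Payload:
    return {"score": sum(model_request["values"])}


def fake_postprocess(model_response: Payload) -> Payload:
    return {"result": f"score={model_response['score']}"}


pipeline_result = run_pipeline(
    {"x": 1, "y": 2}, fake_preprocess, fake_predict, fake_postprocess
)
print(pipeline_result)  # {'result': 'score=3'}
```

Because each stage only sees the output of the previous one, the client format and the model format can change independently as long as the two hooks are updated to match.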

Why Use a Transformer?

Without a Transformer, every client must independently construct payloads that conform to the model server’s protocol. This creates tight coupling: the moment the model’s expected format changes, every client must be updated. It also forces client developers to understand ML infrastructure details that are irrelevant to their domain.

A Transformer solves this by acting as a stable contract boundary. The client sends data in whatever format is natural for the application; the Transformer handles all translation internally. This enables the following patterns:

  • Format Conversion — Translate application-specific JSON structures into the V2 Open Inference Protocol format required by the predictor, and back again.

  • Feature Engineering — Compute derived features on the fly, such as ratios, log-transforms, or bucketed values, without modifying the model or the client.

  • Data Enrichment — Augment sparse requests with additional context retrieved from a feature store, database, or cache before forwarding them to the model.

  • Output Processing — Convert raw numeric predictions (e.g., class indices) into human-readable labels, confidence scores, or structured business objects.

  • Security and Compliance — Scrub personally identifiable information (PII), enforce authentication tokens, or validate schema before requests reach the model.
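
As a rough illustration of the Feature Engineering pattern, the sketch below derives two extra values from a request before it would be handed to a model. The function name add_derived_features and the feature names petal_ratio and log_sepal_length are hypothetical; the Iris model used later in this article accepts only the four raw measurements, so this is not part of its pipeline.

```python
import math
from typing import Dict


def add_derived_features(measurements: Dict[str, float]) -> Dict[str, float]:
    """Return a copy of the measurements with two illustrative derived features."""
    enriched = dict(measurements)  # copy, so the caller's dict is left untouched
    # Ratio feature; petal_width is assumed to be strictly positive.
    enriched["petal_ratio"] = measurements["petal_length"] / measurements["petal_width"]
    # Log-transform; log1p(x) computes log(1 + x).
    enriched["log_sepal_length"] = math.log1p(measurements["sepal_length"])
    return enriched


sample = {"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2}
enriched_sample = add_derived_features(sample)
```

A function like this would typically be called from preprocess(), so neither the client nor the model image needs to change when a derived feature is added.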

V1 and V2 Inference Endpoints

KServe exposes two HTTP endpoint families, each with different validation semantics. Understanding the distinction is important when choosing where to bind your Transformer.

Endpoint | Validation | Passed to preprocess()
/v2/models/{name}/infer | Strict: the body must conform to the Open Inference Protocol InferenceRequest schema; FastAPI validates it before preprocess() is called. | InferenceRequest object (already validated)
/v1/models/{name}:predict | None: raw JSON is accepted as-is with no schema enforcement. | Raw Python dict

/v2/models/{name}/infer

This endpoint implements the Open Inference Protocol (OIP), a vendor-neutral standard developed collaboratively by NVIDIA, Microsoft, and the KServe project. Strict schema enforcement ensures that any V2-compatible client (NVIDIA Triton, Seldon Core, BentoML, MLflow serving, etc.) can communicate with any V2-compatible server without custom adapters. Validation on this path cannot be bypassed, and that is intentional: skipping it would break cross-system compatibility.

Use this endpoint when your clients already produce V2-compliant payloads, or when interoperability with other inference platforms is a requirement.

/v1/models/{name}:predict

This endpoint accepts any well-formed JSON object with no schema validation. The raw dictionary is passed directly to preprocess(), giving you complete flexibility over the input format.

Use this endpoint when you control the client and want to send a simpler, more expressive payload without conforming to the V2 schema. This is the pattern used throughout this article.
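
To make the difference between the two payload styles concrete, the following sketch performs a rough structural check on a request body. It only approximates the shape of an Open Inference Protocol request (a non-empty inputs list whose entries carry name, shape, datatype, and data) and is not the validation KServe itself performs; looks_like_v2_request is a hypothetical helper name.

```python
from typing import Any

REQUIRED_TENSOR_KEYS = {"name", "shape", "datatype", "data"}


def looks_like_v2_request(body: Any) -> bool:
    """Rough structural check: does the body resemble a V2 inference request?"""
    if not isinstance(body, dict):
        return False
    inputs = body.get("inputs")
    if not isinstance(inputs, list) or not inputs:
        return False
    # Every tensor descriptor must be a dict carrying the four required keys.
    return all(
        isinstance(tensor, dict) and REQUIRED_TENSOR_KEYS.issubset(tensor)
        for tensor in inputs
    )


v2_body = {
    "inputs": [
        {"name": "input-0", "shape": [1, 4], "datatype": "FP32", "data": [[5.1, 3.5, 1.4, 0.2]]}
    ]
}
app_body = {"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2}

print(looks_like_v2_request(v2_body))   # True
print(looks_like_v2_request(app_body))  # False
```

A body like app_body would be rejected on the V2 path before preprocess() runs, which is why this article sends it to the V1 path instead.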

Implementation: Format Conversion Using a Transformer

In the previous article, we deployed an Iris Classifier InferenceService and a Streamlit client application. The client constructed a V2-compliant InferenceRequest payload manually before sending it to the predictor — coupling the client to the model’s protocol.

In this article, we introduce a Transformer that takes over that responsibility. The client now sends a compact, field-named JSON object, and the Transformer handles all conversion to and from the V2 protocol transparently.

The request and response payloads are defined as follows:

Request fields (client → Transformer):

  • sepal_length: Sepal length in centimetres (float)

  • sepal_width: Sepal width in centimetres (float)

  • petal_length: Petal length in centimetres (float)

  • petal_width: Petal width in centimetres (float)

Response fields (Transformer → client):

  • predictions: A list of predicted species names (strings)

The complete data flow through the pipeline is illustrated below.

graph TD
    A[Client] -->|JSON App Payload| B[Transformer /v1/models/iris-classifier:predict ]
    B -->|V2 InferenceRequest| C[Model Server /v2/models/iris-classifier/infer]
    C -->|V2 InferenceResponse| D[Transformer /v1/models/iris-classifier:predict]
    D -->|JSON App Payload| E[Client]

Transformer Input (Custom JSON)

The client sends a simple, field-named JSON object. The field names correspond directly to the four botanical measurements of the Iris flower, making the payload self-documenting and easy to construct without any knowledge of ML inference protocols.

{
  "sepal_length": 5.1,
  "sepal_width": 3.5,
  "petal_length": 1.4,
  "petal_width": 0.2
}

This format is intentionally human-readable and decoupled from the predictor’s internal requirements. The Transformer’s preprocess() method converts it into a V2-compliant request before forwarding it to the model.

Model Input (V2 Inference Request)

The Transformer converts the application payload into the format required by the KServe V2 Open Inference Protocol. The key structural differences are:

  • All feature values are packed into a single row (a list of four numbers) and wrapped inside an inputs tensor descriptor.

  • The tensor descriptor declares its shape ([1, 4] — one sample with four features), datatype (FP32 — 32-bit floating point), and a tensor name (input-0).

{
  "inputs": [
    {
      "name": "input-0",
      "shape": [1, 4],
      "datatype": "FP32",
      "data": [
        [5.1, 3.5, 1.4, 0.2]
      ]
    }
  ]
}

This is the payload that the Transformer’s predict() method forwards to the predictor over the internal cluster network.
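
The conversion described in this section can be sketched as a standalone function, which is convenient for unit testing outside the cluster. This is a minimal sketch rather than the article's implementation: the name to_v2_request, the FEATURE_ORDER constant, and the explicit check for missing fields are assumptions added for illustration.

```python
from typing import Any, Dict, List

# Order in which the four measurements are placed in the tensor row.
FEATURE_ORDER: List[str] = ["sepal_length", "sepal_width", "petal_length", "petal_width"]


def to_v2_request(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Convert the application payload into a V2 request with one [1, 4] tensor."""
    missing = [field for field in FEATURE_ORDER if field not in payload]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    # float() also accepts ints and numeric strings such as "3.5".
    row = [float(payload[field]) for field in FEATURE_ORDER]
    return {
        "inputs": [
            {"name": "input-0", "shape": [1, len(row)], "datatype": "FP32", "data": [row]}
        ]
    }


v2_request = to_v2_request(
    {"sepal_length": 5.1, "sepal_width": "3.5", "petal_length": 1.4, "petal_width": 0.2}
)
```

Failing early with a descriptive error for a missing field gives the client a clearer signal than a bare KeyError raised from inside the hook.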

Model Output (V2 Inference Response)

The predictor returns a V2-compliant response containing the predicted class index — an integer identifying which Iris species the model selected. The raw index is not meaningful to end users without a lookup table, which is exactly the gap the postprocess() method fills.

{
  "model_name": "iris-classifier",
  "outputs": [
    {
      "name": "output-0",
      "datatype": "INT64",
      "shape": [1],
      "data": [0]
    }
  ]
}

The data array contains one element — the integer class index. The mapping from index to species name is:

Index | Species Name
0 | Iris-Setosa
1 | Iris-Versicolor
2 | Iris-Virginica

Transformer Output (Application Response)

After postprocess() resolves the index to a species name, the Transformer returns a clean, friendly JSON response to the client. This format is stable regardless of how the model’s internal output representation evolves.

{
  "predictions": [
    "Iris-Setosa"
  ]
}
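
The reverse mapping can be sketched in the same standalone style. The function name to_app_response is hypothetical; the lookup table and the Unknown(...) fallback mirror the behaviour described for postprocess() in this article, and the example uses a batch of three indices to show that the mapping is applied per sample.

```python
from typing import Any, Dict, List

IRIS_CLASSES = {0: "Iris-Setosa", 1: "Iris-Versicolor", 2: "Iris-Virginica"}


def to_app_response(v2_response: Dict[str, Any]) -> Dict[str, List[str]]:
    """Map the class indices of the first output tensor to species names."""
    outputs = v2_response.get("outputs") or []
    data = outputs[0].get("data", []) if outputs else []
    # Out-of-range indices are kept visible instead of being dropped.
    labels = [IRIS_CLASSES.get(int(index), f"Unknown({index})") for index in data]
    return {"predictions": labels}


batch_response = {
    "model_name": "iris-classifier",
    "outputs": [{"name": "output-0", "datatype": "INT64", "shape": [3], "data": [0, 2, 7]}],
}
print(to_app_response(batch_response))
# {'predictions': ['Iris-Setosa', 'Iris-Virginica', 'Unknown(7)']}
```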

iris-transformer.py

The Transformer is implemented as a Python class that extends KServe’s Model base class. The three core lifecycle methods — preprocess(), predict(), and postprocess() — define the complete data transformation pipeline.

services/kserve-transformer/iris-transformer.py
import logging
from typing import Dict, Any

import httpx
import kserve
from kserve import Model, ModelServer

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__) (1)

# Map class index -> species name
IRIS_CLASSES = { (2)
    0: "Iris-Setosa",
    1: "Iris-Versicolor",
    2: "Iris-Virginica",
}


class IrisTransformer(Model): (3)
    """KServe Transformer for the Iris classifier.

    Pre-processing  : Converts a simple dict of flower measurements into the
                      V2 KServe inference protocol format expected by the model.
    Post-processing : Converts the model's INT64 class-index output back into
                      a human-readable species name.

    The `predict()` method is overridden to call the predictor directly over
    HTTP so that we are not dependent on the `PredictorConfig` context variable,
    which is not propagated to uvicorn worker subprocesses.
    """

    def __init__(self, name: str, predictor_host: str, protocol: str = "v2"): (4)
        super().__init__(name)
        self.predictor_host = predictor_host
        self.protocol = protocol
        self.ready = True (5)

    # ------------------------------------------------------------------
    # Pre-process: simple JSON  →  V2 inference request
    # ------------------------------------------------------------------
    def preprocess(self, inputs: Dict[str, Any], headers: Dict[str, str] = None) -> Dict[str, Any]: (6)
        """Transform incoming user-friendly input into the V2 inference protocol.

        Expected input:
            {
                "sepal_length": 5.1,
                "sepal_width":  3.5,
                "petal_length": 1.4,
                "petal_width":  0.2
            }

        V2 output sent to the predictor:
            {
                "inputs": [{
                    "name": "input-0",
                    "shape": [1, 4],
                    "datatype": "FP32",
                    "data": [[5.1, 3.5, 1.4, 0.2]]
                }]
            }
        """
        logger.info("preprocess input: %s", inputs)

        sepal_length = float(inputs["sepal_length"]) (7)
        sepal_width  = float(inputs["sepal_width"])
        petal_length = float(inputs["petal_length"])
        petal_width  = float(inputs["petal_width"])

        v2_request = {
            "inputs": [
                {
                    "name": "input-0",  (8)
                    "shape": [1, 4],    (9)
                    "datatype": "FP32", (10)
                    "data": [[sepal_length, sepal_width, petal_length, petal_width]], (11)
                }
            ]
        }

        logger.info("preprocess output (V2 request): %s", v2_request)
        return v2_request

    # ------------------------------------------------------------------
    # predict: forward V2 request to the predictor over HTTP directly
    # ------------------------------------------------------------------
    async def predict(self, payload: Dict[str, Any], headers: Dict[str, str] = None, response_headers: Dict[str, str] = None) -> Dict[str, Any]: (12)
        """Forward the pre-processed V2 request to the predictor.

        We override `predict()` to call the predictor directly via httpx.
        This avoids the `PredictorConfig` context variable, which is not
        propagated to uvicorn worker subprocesses in KServe 0.16.
        """
        predictor_url = f"http://{self.predictor_host}/v2/models/{self.name}/infer" (13)
        logger.info("Forwarding V2 request to predictor: %s", predictor_url)

        async with httpx.AsyncClient() as client: (14)
            response = await client.post(
                predictor_url,
                json=payload,
                headers={"Content-Type": "application/json"},
                timeout=60.0, (15)
            )
            response.raise_for_status() (16)
            result = response.json()

        logger.info("Predictor response: %s", result)
        return result

    # ------------------------------------------------------------------
    # Post-process: V2 inference response  →  friendly JSON
    # ------------------------------------------------------------------
    def postprocess(self, response: Dict[str, Any], headers: Dict[str, str] = None) -> Dict[str, Any]: (17)
        """Convert the V2 model response into a user-friendly prediction dict.

        Model output example:
            {
                "model_name": "iris-classifier",
                "outputs": [{
                    "name": "output-0",
                    "datatype": "INT64",
                    "shape": [1],
                    "data": [0]
                }]
            }

        Transformer output:
            {"predictions": ["Iris-Setosa"]}
        """
        logger.info("postprocess input (V2 response): %s", response)

        outputs = response.get("outputs", [])
        if not outputs: (18)
            logger.warning("No outputs found in model response")
            return {"predictions": []}

        data = outputs[0].get("data", []) (19)
        predictions = []
        for class_index in data:
            label = IRIS_CLASSES.get(int(class_index), f"Unknown({class_index})") (20)
            predictions.append(label)

        result = {"predictions": predictions}
        logger.info("postprocess output: %s", result)
        return result


if __name__ == "__main__":
    # kserve.model_server.parser already defines:
    #   --model_name, --predictor_host, --predictor_protocol, etc.
    # Adding them again causes argparse.ArgumentError: conflicting option string.
    args, _ = kserve.model_server.parser.parse_known_args() (21)

    transformer = IrisTransformer(
        name=args.model_name,
        predictor_host=args.predictor_host,
        protocol=args.predictor_protocol,
    )

    server = ModelServer()
    server.start(models=[transformer]) (22)
1 Structured logging is configured at module level. Using __name__ as the logger name scopes log lines to this module, which simplifies filtering in aggregated log pipelines such as Loki or CloudWatch Logs.
2 IRIS_CLASSES is a module-level constant that maps the model’s integer output indices to their corresponding species names. Keeping this mapping outside the class avoids recreating it on every request.
3 IrisTransformer extends KServe’s Model base class, which provides the HTTP server infrastructure, endpoint routing, and lifecycle hooks. Subclassing Model is the standard extension mechanism for custom Transformers and Predictors alike.
4 The constructor accepts the model name, the predictor_host (the internal cluster hostname of the model server), and an optional protocol string. KServe injects predictor_host automatically at runtime as a command-line argument.
5 Setting self.ready = True immediately signals to KServe’s readiness probe that this Transformer is healthy and ready to accept traffic. In more complex scenarios, you might set this to False initially and flip it to True only after loading external resources (e.g., embeddings or a feature store client).
6 preprocess() is the first hook in the request pipeline. It receives the client’s raw payload (a Python dict when using the V1 endpoint) and must return a payload in the format the predictor expects.
7 Each measurement is explicitly cast to float. This guards against clients sending integer values or string-encoded numbers, both of which would cause a type mismatch when the model server validates the tensor datatype.
8 name is the tensor name declared in the model’s input signature. The scikit-learn server expects a tensor named input-0. This value must match what the model was trained with.
9 shape declares the tensor dimensions: [1, 4] means one sample (batch size of 1) with four features. The model server uses this to interpret the data array as a matrix of that shape.
10 datatype specifies the numeric precision. FP32 (32-bit floating point) matches the dtype the scikit-learn model expects. Using the wrong datatype here would cause a server-side validation error.
11 The four scalar values are packed into a nested list to represent a single row in a 2-D batch tensor. The outer list corresponds to the batch dimension; the inner list contains the four feature values.
12 predict() is declared async because it performs a non-blocking HTTP call to the predictor. KServe’s server is built on an async framework (uvicorn + asyncio), so using async I/O here avoids blocking the event loop during network latency.
13 The predictor URL is constructed from self.predictor_host and self.name, both of which are provided by KServe at runtime. The Transformer always calls the predictor’s V2 endpoint regardless of which endpoint the client used.
14 httpx.AsyncClient is used instead of the synchronous requests library to perform a non-blocking HTTP POST. The async with context manager ensures the connection is properly closed after the request completes.
15 A 60-second timeout is set to prevent the Transformer from hanging indefinitely if the predictor becomes unresponsive. Adjust this value based on your model’s expected inference latency.
16 raise_for_status() converts any 4xx or 5xx HTTP response from the predictor into a Python exception, which KServe will propagate back to the client as an appropriate error response.
17 postprocess() is the final hook in the response pipeline. It receives the predictor’s raw V2 response and must return the payload that will be sent back to the client.
18 Defensive check for an empty outputs array. This situation can occur if the predictor returned an unexpected response structure, and surfacing it as an empty list is safer than raising an unhandled IndexError.
19 The first output tensor’s data array is extracted. For the Iris model, this is a flat list of integer class indices — one per sample in the batch.
20 Each integer index is resolved to its species name using the IRIS_CLASSES lookup table. The fallback string f"Unknown({class_index})" preserves observability if the model ever returns an out-of-range index.
21 kserve.model_server.parser is KServe’s shared argument parser, which already defines --model_name, --predictor_host, and --predictor_protocol as standard flags. Reusing it avoids re-declaring those flags, which would raise an argparse.ArgumentError, and calling parse_known_args() rather than parse_args() tolerates any additional flags that the parser does not recognize.
22 ModelServer().start() launches the uvicorn HTTP server and registers the Transformer instance as the handler for all inference requests. KServe’s routing layer then dispatches each request to the appropriate lifecycle method.
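
The argparse behaviour behind callout 21 can be reproduced with the standard library alone. In this sketch, shared_parser is a hypothetical stand-in for a parser that already declares --model_name; it is not KServe's parser, and --some_other_flag is an invented flag used only to show how unknown arguments are handled.

```python
import argparse

# Stand-in for a shared parser that already declares --model_name.
shared_parser = argparse.ArgumentParser(add_help=False)
shared_parser.add_argument("--model_name", default="model")

# Declaring the same option a second time raises argparse.ArgumentError.
try:
    shared_parser.add_argument("--model_name")
    redeclare_error = None
except argparse.ArgumentError as exc:
    redeclare_error = str(exc)

# parse_known_args() returns the recognized options plus a list of leftovers,
# whereas parse_args() would exit with an error on the unknown flag.
args, unknown = shared_parser.parse_known_args(
    ["--model_name", "iris-classifier", "--some_other_flag", "value"]
)

print(redeclare_error)  # mentions a conflicting option string
print(args.model_name)  # iris-classifier
print(unknown)          # ['--some_other_flag', 'value']
```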

Deploy Transformer

Directory Structure

The Transformer service lives in its own directory alongside the InferenceService manifest and test utilities. Keeping it self-contained makes it straightforward to version, build, and deploy independently of the rest of the application.

$ tree services/kserve-transformer/
services/kserve-transformer/
├── Dockerfile
├── README.md
├── iris-inference-with-transformer.json
├── iris-inference-with-transformer.yaml
├── iris-transformer.py
└── test-iris-transformer.py

Dockerfile

The Transformer is packaged as a minimal Docker image based on the official Python 3.11 slim base image. Keeping the image lean reduces pull latency, attack surface, and cold-start time in the cluster.

services/kserve-transformer/Dockerfile
FROM python:3.11-slim (1)

WORKDIR /app

# Prevent Python from buffering stdout/stderr
ENV PYTHONUNBUFFERED=1 (2)

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    && rm -rf /var/lib/apt/lists/* (3)

# Install KServe SDK (uvicorn[standard] is pulled in automatically by kserve)
RUN pip install --no-cache-dir kserve==0.16.0 (4)

# Copy the transformer script
COPY services/kserve-transformer/iris-transformer.py .

# Expose the default KServe model server port
EXPOSE 8080 (5)

# Start the transformer; --predictor_host is injected by KServe at runtime
ENTRYPOINT ["python", "iris-transformer.py"] (6)
1 The python:3.11-slim base image provides a minimal Debian environment with Python 3.11 pre-installed. The slim variant omits development headers and documentation, significantly reducing the final image size.
2 Setting PYTHONUNBUFFERED=1 forces Python to write output directly to stdout and stderr without internal buffering. This ensures that log lines appear immediately in the pod’s log stream, which is critical for real-time observability in Kubernetes.
3 curl is installed for use in health check scripts and ad-hoc debugging inside the pod. The --no-install-recommends flag skips optional recommended packages, and the subsequent rm -rf /var/lib/apt/lists/* removes the cached package lists, keeping the image layer as small as possible.
4 The kserve Python SDK is pinned to version 0.16.0 for reproducibility. Installing it also pulls in uvicorn[standard] and httpx as transitive dependencies, so no additional packages need to be listed explicitly.
5 Port 8080 is the default port that KServe model servers listen on. EXPOSE does not publish the port by itself; it documents the intent for people and tooling that inspect the image.
6 The ENTRYPOINT launches the Transformer directly. When KServe creates the pod, it injects --model_name, --predictor_host, and other standard flags as command-line arguments, which kserve.model_server.parser parses at startup.

After building the image, push it to a container registry accessible by your Kubernetes cluster. The image referenced throughout this guide is published to Docker Hub as credemol/mlops-iris-kserve-transformer:1.0.2.

$ docker build -t credemol/mlops-iris-kserve-transformer:1.0.2 \
    -f services/kserve-transformer/Dockerfile .
$ docker push credemol/mlops-iris-kserve-transformer:1.0.2

Deploy InferenceService with Transformer

With the Transformer image available in the registry, the InferenceService manifest is extended with a transformer section. KServe deploys the Transformer as its own workload alongside the predictor and automatically wires the routing between the two.

iris-inference-with-transformer.yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "iris-classifier"
  namespace: kserve
spec:
  (1)
  predictor:
    serviceAccountName: sa-s3-access
    model:
      modelFormat:
        name: "sklearn"
      storageUri: "s3://nsa2-sf-ml-models/mlflow/2/models/m-6db63970926b4ad496bf8aaf4cda2c9e/artifacts"
      protocolVersion: v2
      runtime: kserve-sklearnserver

  (2)
  transformer:
    containers:
      - image: "credemol/mlops-iris-kserve-transformer:1.0.2"
        name: mlops-iris-kserve-transformer
1 The predictor section is identical to the one used in the previous article. The sa-s3-access service account grants the predictor pod permission to pull the model artifact from the S3 bucket via IRSA (IAM Roles for Service Accounts). The kserve-sklearnserver runtime loads the scikit-learn model and serves it over the V2 protocol.
2 The transformer section introduces the custom container. KServe automatically injects the --predictor_host argument at runtime, pointing the Transformer at the internal predictor service. No manual service discovery or environment variable configuration is needed.

Deploy the InferenceService to the cluster:

$ kubectl apply -f services/kserve-transformer/iris-inference-with-transformer.yaml

Verify that the InferenceService reaches the Ready state, which confirms that both the predictor and the Transformer have started successfully and passed their readiness probes:

$ kubectl get inferenceservice iris-classifier -n kserve

You should see a READY status of True and a populated URL field within a minute or two of applying the manifest.

Test the InferenceService with curl

Once the InferenceService is ready, you can validate the end-to-end pipeline using curl. The request uses the simple application payload — not the V2 protocol format — demonstrating that the Transformer is handling the conversion transparently.

Save the test payload to a file:

simple_input_1.json
{
    "sepal_length": 5.1,
    "sepal_width": 3.5,
    "petal_length": 1.4,
    "petal_width": 0.2
}

Send the request to the V1 predict endpoint, which bypasses V2 schema validation and routes the raw JSON to preprocess():

$ SERVICE_HOSTNAME="iris-classifier-transformer-kserve.servicefoundry.org" (1)

curl -s -X POST \
  -H "Host: ${SERVICE_HOSTNAME}" \        (2)
  -H "Content-Type: application/json" \
  -d @./simple_input_1.json \
  https://${SERVICE_HOSTNAME}/v1/models/iris-classifier:predict | jq  (3)

# Expected output:
{
  "predictions": [
    "Iris-Setosa"
  ]
}
1 The SERVICE_HOSTNAME is the external hostname assigned to the InferenceService by the Ingress or Gateway controller. It follows the pattern {name}-{namespace}.{domain} by convention.
2 The Host header is included so that the Ingress or Istio Gateway can route the request to the correct InferenceService virtual service. When testing against a domain that resolves through DNS, this header is set automatically.
3 The response is piped through jq for pretty-printed JSON output. The Iris-Setosa label confirms that the full pipeline — preprocess → predictor → postprocess — completed successfully.
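
The same request can be expressed in Python with only the standard library, which is handy when curl is not available. This is a sketch under the assumptions of the curl example above (same hostname, model name, and payload); build_predict_request is a hypothetical helper that constructs the request without sending it.

```python
import json
import urllib.request
from typing import Any, Dict


def build_predict_request(
    hostname: str, model_name: str, payload: Dict[str, Any]
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the V1 predict endpoint."""
    url = f"https://{hostname}/v1/models/{model_name}:predict"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_predict_request(
    "iris-classifier-transformer-kserve.servicefoundry.org",
    "iris-classifier",
    {"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2},
)
# Sending is left to the caller, for example:
#   with urllib.request.urlopen(request, timeout=30) as response:
#       print(json.load(response))
```

Separating request construction from sending keeps the URL and payload logic testable without network access.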

KServe Transformer Client Application

With the end-to-end pipeline verified, we can build a Streamlit web application that exposes the Iris classification service through an intuitive user interface. The application sends the same simple JSON payload to the Transformer endpoint, keeping all V2 protocol concerns contained within the Transformer layer and completely invisible to the UI code.

service/kserve-transformer-client/app.py
import os
import logging

import requests
import streamlit as st

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# -----------------------------------------------------------------------
# Configuration from environment variables
# -----------------------------------------------------------------------
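# NOTE: most shells and Kubernetes pods already set HOSTNAME (to the machine or pod name),
# so this default is rarely used; set HOSTNAME explicitly or use a dedicated variable name.
HOSTNAME = os.getenv("HOSTNAME", "iris-classifier-kserve.servicefoundry.org") (1)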
MODEL_NAME = os.getenv("MODEL_NAME", "iris-classifier")
FULL_URL = f"https://{HOSTNAME}/v1/models/{MODEL_NAME}:predict"


# -----------------------------------------------------------------------
# Helper functions
# -----------------------------------------------------------------------

def build_payload(sepal_length: float, sepal_width: float,
                  petal_length: float, petal_width: float) -> dict: (2)
    """Build the simple JSON payload that the KServe Transformer accepts."""
    return {
        "sepal_length": sepal_length,
        "sepal_width": sepal_width,
        "petal_length": petal_length,
        "petal_width": petal_width,
    }


def get_prediction(payload: dict, url: str = FULL_URL) -> dict | None: (3)
    """
    POST the payload to the KServe Transformer endpoint.

    Returns the parsed JSON response dict, or None on error.
    """
    try:
        logger.info("POST %s  payload=%s", url, payload)
        response = requests.post(
            url,
            json=payload,
            headers={"Content-Type": "application/json"},
            timeout=30, (4)
        )
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as exc:
        logger.error("Request failed: %s", exc)
        st.error(f"Request failed: {exc}") (5)
        return None


def parse_prediction(response: dict) -> str | None: (6)
    """Extract the first prediction label from the transformer response."""
    predictions = response.get("predictions", [])
    if predictions:
        return predictions[0]
    return None

# -----------------------------------------------------------------------
# Streamlit UI
# -----------------------------------------------------------------------

def main():
    st.set_page_config(
        page_title="Iris Classifier — Transformer",
        page_icon="🌸",
        layout="centered",
    )

    st.title("🌸 Iris Flower Classification")
    st.caption(f"Powered by KServe Transformer · `{FULL_URL}`")

    st.divider()

    # ── Sidebar: feature inputs ──────────────────────────────────────────
    st.sidebar.header("🌿 Flower Measurements") (7)

    sepal_length = st.sidebar.slider("Sepal Length (cm)", 4.0, 8.0, 5.1, 0.1)
    sepal_width  = st.sidebar.slider("Sepal Width (cm)",  2.0, 4.5, 3.5, 0.1)
    petal_length = st.sidebar.slider("Petal Length (cm)", 1.0, 7.0, 1.4, 0.1)
    petal_width  = st.sidebar.slider("Petal Width (cm)",  0.1, 2.5, 0.2, 0.1) (8)

    # ── Main panel: input summary ────────────────────────────────────────
    st.subheader("Input Parameters")
    col1, col2 = st.columns(2)
    col1.metric("Sepal Length", f"{sepal_length} cm")
    col1.metric("Sepal Width",  f"{sepal_width} cm")
    col2.metric("Petal Length", f"{petal_length} cm")
    col2.metric("Petal Width",  f"{petal_width} cm")

    st.divider()

    # ── Classify button ──────────────────────────────────────────────────
    if st.button("🔍 Classify", use_container_width=True, type="primary"): (9)
        payload = build_payload(sepal_length, sepal_width, petal_length, petal_width)

        with st.spinner("Calling KServe Transformer…"):
            result = get_prediction(payload)

        if result is not None:
            label = parse_prediction(result)

            if label:
                st.success(f"### ✅ Predicted Species: **{label}**") (10)
            else:
                st.warning("The transformer returned an empty predictions list.")

            with st.expander("📄 Raw JSON Response"):
                st.json(result)

            with st.expander("📤 Request Payload Sent"):
                st.json(payload)


if __name__ == "__main__":
    main()
1 HOSTNAME and MODEL_NAME are read from environment variables with sensible defaults. Externalising configuration via environment variables allows the same application image to target different environments (local, staging, production) without rebuilding.
2 build_payload() constructs the lightweight JSON object expected by the Transformer’s V1 endpoint. The function signature uses named parameters with explicit float type hints, providing a clear API contract for callers.
3 get_prediction() encapsulates all HTTP communication with the Transformer endpoint. Returning None on error rather than raising an exception allows the UI to display a user-friendly error message instead of an unhandled traceback.
4 A 30-second timeout is applied to the HTTP request. This prevents the UI from becoming unresponsive if the Transformer or predictor pod is slow or temporarily unavailable.
5 Errors are surfaced both through Python’s logging framework (for server-side observability) and through Streamlit’s st.error() widget (for the end user). This dual reporting makes failures visible to operators and to the person using the app.
6 parse_prediction() safely extracts the first element from the predictions list. The get() call with an empty-list default prevents a KeyError if the predictions key is missing from the response, and the emptiness check avoids an IndexError on an empty list.
7 Sliders in the sidebar provide an intuitive input mechanism for the four Iris measurements. The ranges and default values are set to realistic botanical values that reflect the distribution of the Iris dataset.
8 The slider ranges and step sizes are chosen to cover the observed range of each feature in the Iris dataset (4.3–7.9 cm sepal length, 2.0–4.4 cm sepal width, etc.). The bounds are rounded slightly outward (for example, 4.0–8.0 cm for sepal length), so test inputs stay within or very close to the model’s training distribution.
9 Streamlit re-runs the script from top to bottom on every interaction. st.button() returns True only on the rerun triggered by a click, so build_payload() and get_prediction() are called sequentially inside the button block once per click.
10 st.success() renders a green banner with the predicted species name. The expandable sections below it let power users inspect both the raw JSON response from the Transformer and the payload that was sent, which is useful for debugging and demonstration purposes.

Run the Client Application

The application can be run locally for development and testing:

$ streamlit run services/kserve-transformer-client/app.py \
  --server.port 8501 \
  --server.address 0.0.0.0

For production use, the application can be containerised and deployed as a standalone workload on Kubernetes, managed by ArgoCD alongside the rest of the MLOps platform.

Figure 1. KServe Transformer Client Application

Use Cases for KServe Transformer

The Iris classifier example demonstrates a simple format-conversion pattern, but the Transformer’s extensible hook architecture supports a wide range of production ML serving concerns.

Pre-Processing

The preprocess() hook is the ideal location for any logic that must run before the model sees the data:

  • Feature engineering — Compute derived features on the fly (e.g., BMI from height and weight, log-transforms, polynomial interactions, or temporal bucketing) without modifying the model or retraining.

  • Vocabulary / tokenization — Tokenize raw text input, apply Byte-Pair Encoding (BPE), and pad or truncate sequences to the model’s maximum context length for NLP workloads.

  • Image pre-processing — Resize and centre-crop images, normalise pixel values, and convert JPEG or PNG bytes into floating-point tensors for computer vision models.

  • Data validation — Reject malformed, out-of-range, or missing inputs with an HTTP 400 error before they consume GPU resources or corrupt model outputs.

  • Feature store lookup — Enrich a sparse request (e.g., a user_id) with a full feature vector fetched from an online feature store such as Redis, Feast, or Tecton.

  • Multi-modal fusion — Combine heterogeneous inputs (e.g., an image URL and a text prompt) into a single unified tensor before dispatching to a vision-language model.

  • A/B request routing — Inspect a request header or a user segment flag and route traffic to different model versions for controlled experimentation.

  • PII scrubbing — Strip or mask sensitive fields (SSNs, email addresses, phone numbers) before they reach the model log or downstream storage systems.
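
As a hedged illustration of the data validation and feature engineering items above, the following sketch shows plain helper functions that a preprocess() hook could call. The field names, the bounds in FEATURE_BOUNDS, the InvalidInputError class, and the derived petal_area feature are all assumptions made for this example and are not taken from the article's IrisTransformer. A derived feature is only useful if the model was trained with it. How a validation failure becomes an HTTP 400 response depends on the serving framework and is not shown here.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical field names and bounds (assumptions for this sketch).
FEATURE_BOUNDS: Dict[str, Tuple[float, float]] = {
    "sepal_length": (0.0, 10.0),
    "sepal_width": (0.0, 10.0),
    "petal_length": (0.0, 10.0),
    "petal_width": (0.0, 10.0),
}

# Column order the (hypothetical) model expects, including one derived feature.
FEATURE_ORDER: List[str] = [
    "sepal_length", "sepal_width", "petal_length", "petal_width", "petal_area",
]


class InvalidInputError(ValueError):
    """Raised when a request fails validation (hypothetical error type)."""


def validate_features(payload: Dict[str, Any]) -> Dict[str, float]:
    """Return the required fields as floats, or raise InvalidInputError."""
    cleaned: Dict[str, float] = {}
    for name, (low, high) in FEATURE_BOUNDS.items():
        if name not in payload:
            raise InvalidInputError(f"missing field: {name}")
        value = payload[name]
        # bool is a subclass of int, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise InvalidInputError(f"{name} must be a number")
        # NaN fails this comparison and is therefore rejected as well.
        if not low <= value <= high:
            raise InvalidInputError(f"{name}={value} is outside [{low}, {high}]")
        cleaned[name] = float(value)
    return cleaned


def add_derived_features(features: Dict[str, float]) -> Dict[str, float]:
    """Add a derived feature computed on the fly (illustrative only)."""
    enriched = dict(features)
    enriched["petal_area"] = features["petal_length"] * features["petal_width"]
    return enriched


def to_feature_row(features: Dict[str, float], order: List[str]) -> List[float]:
    """Flatten the named features into the positional row a model consumes."""
    return [features[name] for name in order]
```

A preprocess() implementation could call validate_features(), then add_derived_features(), and finally place the result of to_feature_row() into the tensor payload it sends to the predictor.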

Post-Processing

The postprocess() hook executes after the model returns its raw output, making it ideal for humanising and enriching the response:

  • Class index to label mapping — Convert integer class indices into human-readable category names, exactly as demonstrated in this article.

  • Threshold and top-K filtering — Discard softmax probabilities below a confidence threshold, or return only the top K most likely predictions rather than the full distribution.

  • Ensemble aggregation — Fan out the request to multiple predictors in parallel, then combine their responses using majority voting, probability averaging, or stacking.

  • Calibration — Apply post-hoc calibration techniques such as Platt scaling or temperature scaling to produce better-calibrated probability estimates from overconfident models.

  • Response enrichment — Attach operational metadata to the response — model version, inference latency, feature importance scores, or confidence intervals — without modifying the model itself.

  • Embedding decoding — Convert raw embedding vectors into nearest-neighbour labels retrieved from a vector database such as Pinecone, Weaviate, or pgvector.

  • Audit logging — Persist input/output pairs to an append-only time-series store for drift monitoring, regulatory compliance, or retraining dataset construction.

  • Currency and unit formatting — Round pricing predictions to two decimal places, apply currency symbols, or convert raw values to locale-appropriate representations.
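
The calibration, threshold, top-K, and label-mapping items above can be combined in a few lines. The sketch below assumes a predictor that returns raw logits, which differs from the Iris predictor in this article (it returns a class index). The function names, the default threshold, and the temperature value are illustrative assumptions; in practice the temperature would be fitted on a held-out validation set.

```python
import math
from typing import List, Sequence, Tuple

# Class order used by scikit-learn's Iris dataset.
IRIS_LABELS: List[str] = ["setosa", "versicolor", "virginica"]


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax; temperature > 1 flattens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_labels(
    probs: Sequence[float],
    labels: Sequence[str],
    k: int = 2,
    threshold: float = 0.05,
) -> List[Tuple[str, float]]:
    """Return up to k (label, probability) pairs at or above the threshold."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return [(label, prob) for label, prob in ranked[:k] if prob >= threshold]


if __name__ == "__main__":
    probabilities = softmax([2.0, 1.0, 0.1], temperature=1.5)
    for label, prob in top_k_labels(probabilities, IRIS_LABELS):
        print(f"{label}: {prob:.3f}")
```

Keeping these steps in postprocess() means the confidence threshold or K can be tuned without touching the model artefact.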

Cross-Cutting Patterns

Beyond pre- and post-processing, the Transformer can implement infrastructure patterns that span the entire request lifecycle:

  • Response caching — Cache frequent or identical requests in Redis and short-circuit the predictor entirely on cache hits, dramatically reducing latency and GPU utilisation for repetitive workloads.

  • Rate limiting — Track per-client or per-user request counts and return HTTP 429 responses when quotas are exceeded, protecting the predictor from overload.

  • Shadow mode (dark launch) — Forward each request to both the production model and a candidate model simultaneously, log any discrepancies, and return only the production model’s response — enabling safe evaluation of new models under real traffic without any user impact.

  • Canary gating — Mirror a configurable percentage of traffic (e.g., 5%) to a new model version while still returning the stable model’s response to all users. This allows continuous monitoring of the canary’s behaviour under real traffic before a full rollout.

  • Retrieval-Augmented Generation (RAG) — In the preprocess() hook, embed the incoming query, retrieve the top-K relevant documents from a vector store, and construct an augmented prompt. In postprocess(), strip internal chain-of-thought tokens before returning the final answer to the client.
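
The response caching pattern above can be sketched as follows. This is a minimal, single-process illustration under stated assumptions: an in-memory dictionary stands in for Redis, and TTLResponseCache, cached_predict, and the predict callable are hypothetical names rather than KServe APIs. Where such a wrapper hooks into a real Transformer is not shown.

```python
import hashlib
import json
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLResponseCache:
    """In-memory cache keyed on the canonical JSON form of a request."""

    def __init__(self, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, Any]] = {}

    @staticmethod
    def key_for(payload: Dict[str, Any]) -> str:
        # Sorting keys makes logically identical payloads hash to the same key.
        canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def get(self, payload: Dict[str, Any]) -> Optional[Any]:
        key = self.key_for(payload)
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # drop the stale entry
            return None
        return value

    def put(self, payload: Dict[str, Any], response: Any) -> None:
        self._entries[self.key_for(payload)] = (self._clock() + self._ttl, response)


def cached_predict(payload: Dict[str, Any],
                   predict: Callable[[Dict[str, Any]], Any],
                   cache: TTLResponseCache) -> Any:
    """Return a cached response when available; otherwise call the predictor."""
    hit = cache.get(payload)
    if hit is not None:
        return hit
    response = predict(payload)
    cache.put(payload, response)
    return response
```

A shared store such as Redis would be needed once the Transformer runs with more than one replica, since each pod would otherwise hold its own private cache.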

Why a Transformer Instead of the Client or the Model?

It may be tempting to embed transformation logic in the client application or in a custom model serving script. The Transformer pattern offers distinct advantages over both alternatives:

  • Separation of concerns: The model image remains a pure ML artefact. All business logic, protocol translation, and data engineering live in the Transformer, a completely separate codebase with its own release cycle.

  • Protocol independence: Clients can send data in any JSON shape that is natural for the application. The Transformer absorbs all V2 protocol complexity, so client developers never need to understand ML serving internals.

  • Reusability: A single Transformer can front multiple model versions. When the model is retrained or updated, the Transformer requires no changes as long as the input/output semantics remain the same.

  • Independent scaling: Transformer pods and predictor pods scale independently. CPU-intensive feature engineering (e.g., tokenization, image decoding) does not compete with GPU-bound inference for the same compute resources.

Conclusion

The KServe Transformer provides a principled, production-ready mechanism for decoupling application data formats from ML inference protocols. By centralising pre-processing, post-processing, and cross-cutting concerns in a single, independently deployable component, teams can evolve their models and client applications at their own pace without breaking the contract between them.

In this article, we built a complete Transformer pipeline for the Iris classification service: a custom Python IrisTransformer class that converts a simple botanical JSON payload into the V2 Open Inference Protocol format, forwards it to the predictor, and maps the predicted class index back into a human-readable species name. We packaged the Transformer as a Docker image, extended the InferenceService manifest to include it, validated the pipeline with curl, and connected a Streamlit client application to demonstrate the full end-to-end user experience.

The use-case catalogue above illustrates how the same pattern scales to sophisticated production scenarios, from feature store enrichment and ensemble aggregation to RAG pipelines and canary deployments, all without modifying a single line of model code.