Service Foundry
Young Gyu Kim <credemol@gmail.com>

Running Spark Applications on Kubernetes with Spark Operator and Airflow 3.0


Introduction

This guide explains how to run Apache Spark applications on Kubernetes using the Spark Operator, and how to orchestrate those jobs with Apache Airflow 3.0. You’ll learn how to install the Spark Operator, configure your cluster, submit Spark jobs declaratively, and integrate with Airflow using the SparkKubernetesOperator.

What is the Spark Operator?

Apache Spark is a widely-used distributed computing engine for big data processing. The Spark Operator simplifies deploying and managing Spark applications on Kubernetes using Kubernetes-native tools. It introduces custom resource definitions (CRDs) so you can create, monitor, and delete Spark applications just like any other Kubernetes resource.
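Once the operator and its CRDs are installed (covered in the next section), a Spark job becomes an ordinary Kubernetes object. The commands below are a small sketch of what that looks like, using the spark-pi application and spark-jobs namespace defined later in this guide:

# Manage Spark applications with kubectl, like any other Kubernetes resource
$ kubectl -n spark-jobs get sparkapplications
$ kubectl -n spark-jobs describe sparkapplication spark-pi
$ kubectl -n spark-jobs delete sparkapplication spark-pi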

Installing the Spark Operator with Helm

Add the Spark Operator Helm repository

Add the Kubeflow version of the Spark Operator Helm repo:

add new Helm repository
$ helm repo add spark-operator https://kubeflow.github.io/spark-operator

$ helm repo update spark-operator

$ helm repo list | grep spark-operator

spark-operator          https://kubeflow.github.io/spark-operator

Download the Helm chart and values file

pull the spark-operator chart
$ helm pull spark-operator/spark-operator

download the values.yaml file
# As of October 2025, the latest chart version is 2.3.0
$ helm show values --version 2.3.0 spark-operator/spark-operator > values-2.3.0.yaml

Customize the values file

custom-values.yaml
spark:
  # -- List of namespaces where to run spark jobs.
  jobNamespaces:
  - spark-jobs
  - airflow
  - default

Use spark.jobNamespaces to list the namespaces in which the operator is allowed to create and manage Spark applications.

If the namespaces don’t already exist, you’ll need to create them before applying any resources. Run the following commands to verify and create them if necessary:

$ kubectl get namespace spark-jobs || kubectl create namespace spark-jobs
$ kubectl get namespace airflow || kubectl create namespace airflow

Install the Spark Operator

Run the following command to install the Spark Operator using Helm with the custom values file.

$ helm install spark-operator spark-operator/spark-operator --namespace spark-operator --create-namespace -f custom-values.yaml
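To confirm the release installed cleanly, check its status and make sure the operator pods are running. Pod names vary with the chart version, but they live in the spark-operator namespace created above:

$ helm status spark-operator -n spark-operator
$ kubectl get pods -n spark-operator
$ kubectl get crds | grep sparkoperator.k8s.io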

Uninstall the Spark Operator

To uninstall the operator:

$ helm uninstall spark-operator -n spark-operator
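Note that helm uninstall typically leaves the operator's CRDs in place (they ship in the chart's crds/ directory). If you also want to remove them, delete the CRDs manually; the names below are the ones registered by the current chart, so double-check with kubectl get crds first.

# Deleting the CRDs also removes any remaining SparkApplication resources
$ kubectl delete crd sparkapplications.sparkoperator.k8s.io
$ kubectl delete crd scheduledsparkapplications.sparkoperator.k8s.io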

Create a Sample Spark Application

You can find examples in the official GitHub repo.

Here is a basic example:

examples/spark-pi.yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  #namespace: default
  namespace: spark-jobs
spec:
  type: Scala
  mode: cluster
  image: docker.io/library/spark:4.0.0
  imagePullPolicy: IfNotPresent
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples.jar
  arguments:
  - "5000"
  sparkVersion: 4.0.0
  driver:
    labels:
      version: 4.0.0
    cores: 1
    memory: 512m
    serviceAccount: spark-operator-spark
    securityContext:
      capabilities:
        drop:
        - ALL
      runAsGroup: 185
      runAsUser: 185
      runAsNonRoot: true
      allowPrivilegeEscalation: false
      seccompProfile:
        type: RuntimeDefault
  executor:
    labels:
      version: 4.0.0
    instances: 1
    cores: 1
    memory: 512m
    securityContext:
      capabilities:
        drop:
        - ALL
      runAsGroup: 185
      runAsUser: 185
      runAsNonRoot: true
      allowPrivilegeEscalation: false
      seccompProfile:
        type: RuntimeDefault

In this example the namespace has been changed from default to spark-jobs to match one of the jobNamespaces configured earlier; adjust it if you chose different namespaces.

# Create an example Spark application in the spark-jobs namespace
$ kubectl apply -f examples/spark-pi.yaml

To verify execution:

$ kubectl -n spark-jobs get pods
$ kubectl -n spark-jobs get sparkapplications
$ kubectl -n spark-jobs logs -f spark-pi-driver
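The SparkApplication object itself tracks the job's lifecycle, so you can also query its status directly; the field used below is part of the v1beta2 status schema:

# Prints the current state, e.g. SUBMITTED, RUNNING, COMPLETED, or FAILED
$ kubectl -n spark-jobs get sparkapplication spark-pi -o jsonpath='{.status.applicationState.state}'

# Recent events are useful when the driver pod never starts
$ kubectl -n spark-jobs get events --sort-by=.lastTimestamp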

Integrating Spark Operator with Airflow 3.0

To learn how to install Airflow 3.0 on Kubernetes, refer to:

Organizing Spark Applications in Airflow Git Repo

To submit Spark jobs via Airflow, organize your repository like this:

file structure in Airflow Git Repository
$ tree dags --dirsfirst
dags
├── spark-apps
│   └── spark-pi.yaml
├── hello_world_dag.py
└── spark-py-example.py

File descriptions:

  • spark-py-example.py: An example DAG that uses SparkKubernetesOperator to submit a Spark application.

  • spark-apps/spark-pi.yaml: The Spark application YAML file used in the DAG.

The spark-apps/ folder must live under dags/ so that the relative path used in the DAG (spark-apps/spark-pi.yaml) resolves against the DAG folder. You can check that the files have been synced as shown below.
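A quick way to verify the synced DAGs from inside the cluster; the pod name is a placeholder, and the exact layout depends on how your Airflow Helm chart and git-sync are configured:

# Replace <airflow-scheduler-pod> with the name of your scheduler pod
$ kubectl -n airflow get pods
$ kubectl -n airflow exec <airflow-scheduler-pod> -- airflow dags list
$ kubectl -n airflow exec <airflow-scheduler-pod> -- airflow dags list-import-errors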

Example DAG with SparkKubernetesOperator

This Airflow DAG uses SparkKubernetesOperator to submit the Spark app:

spark-py-example.py
from datetime import timedelta, datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import SparkKubernetesOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime.now() - timedelta(days=1),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
}


def startBatch():
    print('##### startBatch #####')

def done():
    print('##### done #####')

with DAG(
    dag_id='spark_pi',
    start_date=datetime.now() - timedelta(days=1),
    default_args=default_args,
    schedule=None,
    max_active_runs=1,
    tags=['example']
) as dag:
    spark_pi_task = SparkKubernetesOperator(
        task_id='spark_example',
        # Must match metadata.namespace in spark-apps/spark-pi.yaml
        namespace='airflow',
        application_file='spark-apps/spark-pi.yaml',
        kubernetes_conn_id='kubernetes_default',
    )

    start_batch_task = PythonOperator(
        task_id='startBatch',
        python_callable=startBatch
    )
    done_task = PythonOperator(
        task_id='done',
        python_callable=done
    )


    start_batch_task >> spark_pi_task >> done_task
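The DAG references kubernetes_conn_id='kubernetes_default'. If your Airflow deployment does not already define that connection, one way to create it is with the Airflow CLI, assuming the workers should talk to the cluster they run in (in-cluster configuration):

# Create an in-cluster Kubernetes connection named kubernetes_default
$ airflow connections add kubernetes_default \
    --conn-type kubernetes \
    --conn-extra '{"in_cluster": true}'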

The spark-pi.yaml file below is the same example application shipped with the Spark Operator; the only change is its namespace, which must match the namespace used in the Airflow DAG:

spark-apps/spark-pi.yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  # Make sure the namespace matches the one used in the Airflow DAG
  namespace: airflow
spec:
  type: Scala
  mode: cluster
  image: docker.io/library/spark:4.0.0
  imagePullPolicy: IfNotPresent
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples.jar
  arguments:
    - "5000"
  sparkVersion: 4.0.0
  driver:
    labels:
      version: 4.0.0
    cores: 1
    memory: 512m
    serviceAccount: spark-operator-spark
    securityContext:
      capabilities:
        drop:
          - ALL
      runAsGroup: 185
      runAsUser: 185
      runAsNonRoot: true
      allowPrivilegeEscalation: false
      seccompProfile:
        type: RuntimeDefault
  executor:
    labels:
      version: 4.0.0
    instances: 1
    cores: 1
    memory: 512m
    securityContext:
      capabilities:
        drop:
          - ALL
      runAsGroup: 185
      runAsUser: 185
      runAsNonRoot: true
      allowPrivilegeEscalation: false
      seccompProfile:
        type: RuntimeDefault

RBAC for SparkKubernetesOperator

Airflow tasks run under the worker service account (airflow-worker in the official Airflow Helm chart; adjust the name if yours differs), which needs permission to create and manage SparkApplication resources. Apply the following Role and RoleBinding:

spark-rbac.yaml
# spark-rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-application-role
  namespace: airflow
rules:
  - apiGroups: ["sparkoperator.k8s.io"]
    resources:
      - "sparkapplications"
      - "sparkapplications/status"
      - "sparkapplications/finalizers"
    verbs:
      - create
      - get
      - list
      - watch
      - update
      - patch
      - delete

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-application-rolebinding
  namespace: airflow
subjects:
  - kind: ServiceAccount
    name: airflow-worker
    namespace: airflow
roleRef:
  kind: Role
  name: spark-application-role
  apiGroup: rbac.authorization.k8s.io

Apply the RBAC configuration:

$ kubectl apply -f spark-rbac.yaml
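You can confirm that the binding took effect by impersonating the worker service account with kubectl auth can-i:

# Should print "yes" once the Role and RoleBinding are in place
$ kubectl auth can-i create sparkapplications.sparkoperator.k8s.io \
    -n airflow --as=system:serviceaccount:airflow:airflow-worker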

Results of Executing the DAG

Once you trigger the DAG in Airflow, the SparkKubernetesOperator task creates a SparkApplication resource in the airflow namespace:

$ kubectl get sparkapplications -n airflow
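The driver and executor pods for this run are created in the airflow namespace as well, so the same kubectl checks used earlier apply here:

$ kubectl -n airflow get pods -w
$ kubectl -n airflow logs -f spark-pi-driver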

You can track the DAG progress in the Airflow UI:

Figure 1. Airflow DAG Execution

Conclusion

This guide walked you through running Spark jobs on Kubernetes using the Spark Operator, and integrating them into Apache Airflow 3.0 with SparkKubernetesOperator. With this setup, you get the scalability of Kubernetes, the orchestration power of Airflow, and the simplicity of Spark Operator’s declarative job submission.

📘 View the web version:

Thanks for reading!