Running Spark Applications on Kubernetes with Spark Operator and Airflow 3.0
Introduction
This guide explains how to run Apache Spark applications on Kubernetes using the Spark Operator, and how to orchestrate those jobs with Apache Airflow 3.0. You’ll learn how to install the Spark Operator, configure your cluster, submit Spark jobs declaratively, and integrate with Airflow using the SparkKubernetesOperator.
What is the Spark Operator?
Apache Spark is a widely used distributed computing engine for big data processing. The Spark Operator simplifies deploying and managing Spark applications on Kubernetes using Kubernetes-native tooling. It introduces custom resource definitions (CRDs) so you can create, monitor, and delete Spark applications just like any other Kubernetes resource.
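Because a SparkApplication is just a custom resource, you can work with it through any Kubernetes client, not only kubectl. As a minimal sketch (assuming the official kubernetes Python client, and that the operator and its CRDs are already installed as described in the next section), listing Spark applications and their states looks like this:
# A minimal sketch: list SparkApplication custom resources with the official
# kubernetes Python client (pip install kubernetes). The spark-jobs namespace
# is the one used by the examples later in this guide.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

api = client.CustomObjectsApi()
apps = api.list_namespaced_custom_object(
    group="sparkoperator.k8s.io",
    version="v1beta2",
    namespace="spark-jobs",
    plural="sparkapplications",
)

for app in apps.get("items", []):
    name = app["metadata"]["name"]
    # status.applicationState.state is filled in by the operator once it picks the app up
    state = app.get("status", {}).get("applicationState", {}).get("state", "UNKNOWN")
    print(f"{name}: {state}")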
Installing the Spark Operator with Helm
Add the Spark Operator Helm repository
Add the Kubeflow version of the Spark Operator Helm repo:
$ helm repo add spark-operator https://kubeflow.github.io/spark-operator
$ helm repo update spark-operator
$ helm repo list | grep spark-operator
spark-operator https://kubeflow.github.io/spark-operator
Download the Helm chart and values file
$ helm pull spark-operator/spark-operator
# as of October 2025, the latest version is 2.3.0
$ helm show values --version 2.3.0 spark-operator/spark-operator > values-2.3.0.yaml
Customize the values file
spark:
  # -- List of namespaces where to run spark jobs.
  jobNamespaces:
  - spark-jobs
  - airflow
  - default
For spark.jobNamespaces, specify the namespaces where Spark applications will be created and run.
If the namespaces don’t already exist, you’ll need to create them before applying any resources. Run the following commands to verify and create them if necessary:
$ kubectl get namespace spark-jobs || kubectl create namespace spark-jobs
$ kubectl get namespace airflow || kubectl create namespace airflow
Install the Spark Operator
Run the following command to install the Spark Operator using Helm with the custom values file.
$ helm install spark-operator spark-operator/spark-operator --namespace spark-operator --create-namespace -f values-2.3.0.yaml
Uninstall the Spark Operator
To uninstall the operator:
$ helm uninstall spark-operator -n spark-operator
Create a Sample Spark Application
You can find example applications in the official kubeflow/spark-operator GitHub repository.
Here is a basic example:
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  #namespace: default
  namespace: spark-jobs
spec:
  type: Scala
  mode: cluster
  image: docker.io/library/spark:4.0.0
  imagePullPolicy: IfNotPresent
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples.jar
  arguments:
  - "5000"
  sparkVersion: 4.0.0
  driver:
    labels:
      version: 4.0.0
    cores: 1
    memory: 512m
    serviceAccount: spark-operator-spark
    securityContext:
      capabilities:
        drop:
        - ALL
      runAsGroup: 185
      runAsUser: 185
      runAsNonRoot: true
      allowPrivilegeEscalation: false
      seccompProfile:
        type: RuntimeDefault
  executor:
    labels:
      version: 4.0.0
    instances: 1
    cores: 1
    memory: 512m
    securityContext:
      capabilities:
        drop:
        - ALL
      runAsGroup: 185
      runAsUser: 185
      runAsNonRoot: true
      allowPrivilegeEscalation: false
      seccompProfile:
        type: RuntimeDefault
In this example, the namespace has been changed from default to spark-jobs; update it if you run your Spark jobs in a different namespace.
# Create an example Spark application in the spark-jobs namespace
$ kubectl apply -f examples/spark-pi.yaml
To verify execution:
$ kubectl -n spark-jobs get pods
$ kubectl -n spark-jobs get sparkapplications
$ kubectl -n spark-jobs logs -f spark-pi-driver
Integrating Spark Operator with Airflow 3.0
To learn how to install Airflow 3.0 on Kubernetes, refer to:
Organizing Spark Applications in Airflow Git Repo
To submit Spark jobs via Airflow, organize your repository like this:
$ tree dags --dirsfirst
dags
├── spark-apps
│   └── spark-pi.yaml
├── hello_world_dag.py
└── spark-py-example.py
File descriptions:
- spark-py-example.py: An example DAG that uses SparkKubernetesOperator to submit a Spark application.
- spark-apps/spark-pi.yaml: The Spark application YAML file used in the DAG.
Note: The spark-apps/ folder must be located under the dags/ folder.
Example DAG with SparkKubernetesOperator
This Airflow DAG uses the SparkKubernetesOperator, which ships with the apache-airflow-providers-cncf-kubernetes provider package, to submit the Spark application:
from datetime import timedelta, datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import SparkKubernetesOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime.now() - timedelta(days=1),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
}


def startBatch():
    print('##### startBatch #####')


def done():
    print('##### done #####')


with DAG(
    dag_id='spark_pi',
    start_date=datetime.now() - timedelta(days=1),
    default_args=default_args,
    schedule=None,
    max_active_runs=1,
    tags=['example']
) as dag:
    # Submit the SparkApplication defined in dags/spark-apps/spark-pi.yaml
    spark_pi_task = SparkKubernetesOperator(
        task_id='spark_example',
        namespace='airflow',
        application_file='spark-apps/spark-pi.yaml',
        kubernetes_conn_id='kubernetes_default',
    )

    start_batch_task = PythonOperator(
        task_id='startBatch',
        python_callable=startBatch
    )

    done_task = PythonOperator(
        task_id='done',
        python_callable=done
    )

    start_batch_task >> spark_pi_task >> done_task
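Depending on your provider version, SparkKubernetesOperator may only submit the application without waiting for it to finish. If you want the DAG to block until the Spark job completes, the same provider package also ships a SparkKubernetesSensor. A hedged sketch, assuming the sensor is available in your provider version and that the application keeps the name spark-pi from the YAML below; the tasks go inside the same with DAG block as above:
from airflow.providers.cncf.kubernetes.sensors.spark_kubernetes import SparkKubernetesSensor

# ... inside the with DAG(...) block shown above:
spark_pi_monitor = SparkKubernetesSensor(
    task_id='spark_pi_monitor',
    namespace='airflow',
    application_name='spark-pi',   # must match metadata.name in spark-pi.yaml
    kubernetes_conn_id='kubernetes_default',
    attach_log=True,               # stream the driver log into the Airflow task log
)

start_batch_task >> spark_pi_task >> spark_pi_monitor >> done_task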
Make sure the spark-pi.yaml file references the same namespace used in the DAG:
spark-apps/spark-pi.yaml
This is the example Spark application provided by the Spark Operator, with the namespace set to airflow to match the one used in the Airflow DAG.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  # Make sure the namespace matches the one used in the Airflow DAG
  namespace: airflow
spec:
  type: Scala
  mode: cluster
  image: docker.io/library/spark:4.0.0
  imagePullPolicy: IfNotPresent
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples.jar
  arguments:
  - "5000"
  sparkVersion: 4.0.0
  driver:
    labels:
      version: 4.0.0
    cores: 1
    memory: 512m
    serviceAccount: spark-operator-spark
    securityContext:
      capabilities:
        drop:
        - ALL
      runAsGroup: 185
      runAsUser: 185
      runAsNonRoot: true
      allowPrivilegeEscalation: false
      seccompProfile:
        type: RuntimeDefault
  executor:
    labels:
      version: 4.0.0
    instances: 1
    cores: 1
    memory: 512m
    securityContext:
      capabilities:
        drop:
        - ALL
      runAsGroup: 185
      runAsUser: 185
      runAsNonRoot: true
      allowPrivilegeEscalation: false
      seccompProfile:
        type: RuntimeDefault
RBAC for SparkKubernetesOperator
Airflow needs permission to create SparkApplication resources. Apply the following RBAC:
# spark-rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-application-role
  namespace: airflow
rules:
- apiGroups: ["sparkoperator.k8s.io"]
  resources:
  - "sparkapplications"
  - "sparkapplications/status"
  - "sparkapplications/finalizers"
  verbs:
  - create
  - get
  - list
  - watch
  - update
  - patch
  - delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-application-rolebinding
  namespace: airflow
subjects:
- kind: ServiceAccount
  name: airflow-worker
  namespace: airflow
roleRef:
  kind: Role
  name: spark-application-role
  apiGroup: rbac.authorization.k8s.io
Apply the RBAC configuration:
$ kubectl apply -f spark-rbac.yaml
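To confirm the binding grants what the DAG needs, you can ask the API server whether the worker's ServiceAccount may create SparkApplication resources. A minimal sketch with the kubernetes Python client, assuming it runs inside an Airflow worker pod (so the request is made as the airflow-worker ServiceAccount):
# A minimal sketch: run inside an Airflow worker pod to check whether its
# ServiceAccount is allowed to create SparkApplication resources.
from kubernetes import client, config

config.load_incluster_config()  # authenticates as the pod's ServiceAccount

review = client.V1SelfSubjectAccessReview(
    spec=client.V1SelfSubjectAccessReviewSpec(
        resource_attributes=client.V1ResourceAttributes(
            group="sparkoperator.k8s.io",
            resource="sparkapplications",
            verb="create",
            namespace="airflow",
        )
    )
)

result = client.AuthorizationV1Api().create_self_subject_access_review(review)
print("allowed:", result.status.allowed)
The same check can be done from your workstation with kubectl auth can-i create sparkapplications -n airflow --as=system:serviceaccount:airflow:airflow-worker.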
The result of executing the DAG
Once you trigger the DAG in Airflow, it creates a SparkApplication resource:
$ kubectl get sparkapplications -n airflow
You can track the DAG's progress in the Airflow UI.
Conclusion
This guide walked you through running Spark jobs on Kubernetes using the Spark Operator, and integrating them into Apache Airflow 3.0 with SparkKubernetesOperator. With this setup, you get the scalability of Kubernetes, the orchestration power of Airflow, and the simplicity of Spark Operator’s declarative job submission.
Thanks for reading!
- Young Gyu Kim (credemol@gmail.com)