Examples
The examples below share the following spec fields:
- version: the current version is "1.0"
- sparkImage: the Docker image that is used by the job, driver and executor pods. This can be provided by the user.
- mode: only cluster is currently supported
- mainApplicationFile: the artifact (Java, Scala or Python) that forms the basis of the Spark job.
- args: the arguments passed directly to the application. In the examples below this is, for instance, the input path for part of the public New York taxi dataset.
- sparkConf: Spark configuration settings that are passed directly to spark-submit and are best defined explicitly by the user. Because the SparkApplication "knows" that there is an external dependency (the S3 bucket where the data and/or the application is located) and how that dependency should be treated (i.e. what type of credential checks are required, if any), it is better to declare these settings together.
- volumes: any volumes needed by the SparkApplication, in this case an underlying PersistentVolumeClaim.
- driver: driver-specific settings, including any volume mounts.
- executor: executor-specific settings, including any volume mounts.
Job-specific settings are annotated below.
PySpark: externally located dataset, artifact available via PVC/volume mount
---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: example-sparkapp-image
  namespace: default
spec:
  image: oci.stackable.tech/stackable/ny-tlc-report:0.2.0 (1)
  sparkImage:
    productVersion: 3.5.6
  mode: cluster
  mainApplicationFile: local:///stackable/spark/jobs/ny_tlc_report.py (2)
  args:
    - "--input 's3a://nyc-tlc/trip data/yellow_tripdata_2021-07.csv'" (3)
  deps:
    requirements:
      - tabulate==0.8.9 (4)
  sparkConf: (5)
    "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
  job:
    config:
      resources:
        cpu:
          min: "1"
          max: "1"
        memory:
          limit: "1Gi"
  driver:
    config:
      resources:
        cpu:
          min: "1"
          max: "1500m"
        memory:
          limit: "1Gi"
  executor:
    replicas: 3
    config:
      resources:
        cpu:
          min: "1"
          max: "4"
        memory:
          limit: "2Gi"
| 1 | Job image: this contains the job artifact that is retrieved from the volume mount backed by the PVC |
| 2 | Job Python artifact (local) |
| 3 | Job argument (external) |
| 4 | List of Python job requirements: these are installed in the Pods via pip. |
| 5 | Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in an S3 store) |
JVM (Scala): externally located artifact and dataset
---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: example-sparkapp-pvc
  namespace: default
spec:
  sparkImage:
    productVersion: 3.5.6
  mode: cluster
  mainApplicationFile: s3a://stackable-spark-k8s-jars/jobs/ny-tlc-report-1.0-SNAPSHOT.jar (1)
  mainClass: org.example.App (2)
  args:
    - "'s3a://nyc-tlc/trip data/yellow_tripdata_2021-07.csv'"
  sparkConf: (3)
    "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
    "spark.driver.extraClassPath": "/dependencies/jars/*"
    "spark.executor.extraClassPath": "/dependencies/jars/*"
  volumes:
    - name: job-deps (4)
      persistentVolumeClaim:
        claimName: pvc-ksv
  driver:
    config:
      volumeMounts:
        - name: job-deps
          mountPath: /dependencies (5)
  executor:
    replicas: 3
    config:
      volumeMounts:
        - name: job-deps
          mountPath: /dependencies (5)
| 1 | Job artifact located on S3. |
| 2 | Job main class |
| 3 | Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in an S3 store, accessed without credentials) |
| 4 | The name of the volume backed by a PersistentVolumeClaim, which must already exist |
| 5 | The mount path of the volume: this is referenced in the sparkConf section, where the extra class path is defined for the driver and executors |
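Because the claim referenced in callout 4 has to exist before the job is submitted, a minimal sketch of such a PersistentVolumeClaim may be useful. Only the name pvc-ksv and the default namespace come from the example above; the access mode and storage size are assumptions that depend on the cluster's storage. The claim also has to be populated separately so that the required JARs end up in a jars directory on the volume (visible to the pods as /dependencies/jars).

```yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-ksv            # must match claimName in the SparkApplication
  namespace: default       # same namespace as the SparkApplication
spec:
  accessModes:
    - ReadWriteOnce        # assumption: if driver and executors run on different
                           # nodes, storage offering ReadWriteMany or ReadOnlyMany
                           # is likely needed instead
  resources:
    requests:
      storage: 1Gi         # assumption: sized for a handful of dependency JARs
```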
JVM (Scala): externally located artifact accessed with credentials
---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: example-sparkapp-s3-private
spec:
  sparkImage:
    productVersion: 3.5.6
  mode: cluster
  mainApplicationFile: s3a://my-bucket/spark-examples.jar (1)
  mainClass: org.apache.spark.examples.SparkPi (2)
  s3connection: (3)
    inline:
      host: test-minio
      port: 9000
      accessStyle: Path
      credentials: (4)
        secretClass: s3-credentials-class
  sparkConf: (5)
    spark.hadoop.fs.s3a.aws.credentials.provider: "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider" (6)
    spark.driver.extraClassPath: "/dependencies/jars/hadoop-aws-3.2.0.jar:/dependencies/jars/aws-java-sdk-bundle-1.11.375.jar"
    spark.executor.extraClassPath: "/dependencies/jars/hadoop-aws-3.2.0.jar:/dependencies/jars/aws-java-sdk-bundle-1.11.375.jar"
  executor:
    replicas: 3
| 1 | Job artifact (a JAR located in an S3 store) |
| 2 | Artifact class |
| 3 | S3 section, specifying the existing secret and S3 endpoint (in this case, MinIO) |
| 4 | Credentials referencing a secretClass (not shown in this example) |
| 5 | Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources… |
| 6 | …in this case, in an S3 store, accessed with the credentials defined in the secret |
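The secretClass referenced in callout 4 is not part of the example. As a rough sketch, assuming the Stackable secret operator is installed and its k8sSearch backend is used, the class and a matching Secret could look like the following. The Secret name s3-credentials, the key names accessKey and secretKey, and the placeholder values are assumptions for illustration and should be checked against the secret operator documentation for the release in use.

```yaml
---
apiVersion: secrets.stackable.tech/v1alpha1
kind: SecretClass
metadata:
  name: s3-credentials-class   # matches secretClass in the SparkApplication
spec:
  backend:
    k8sSearch:
      searchNamespace:
        pod: {}                # look for Secrets in the namespace of the requesting pod
---
apiVersion: v1
kind: Secret
metadata:
  name: s3-credentials         # hypothetical name
  labels:
    secrets.stackable.tech/class: s3-credentials-class  # ties the Secret to the class
stringData:
  accessKey: YOUR_ACCESS_KEY   # placeholder, replace with the real access key
  secretKey: YOUR_SECRET_KEY   # placeholder, replace with the real secret key
```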
JVM (Scala): externally located artifact accessed with job arguments provided via configuration map
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: cm-job-arguments (1)
data:
  job-args.txt: |
    s3a://nyc-tlc/trip data/yellow_tripdata_2021-07.csv (2)
---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: ny-tlc-report-configmap
  namespace: default
spec:
  sparkImage:
    productVersion: 3.5.6
  mode: cluster
  mainApplicationFile: s3a://stackable-spark-k8s-jars/jobs/ny-tlc-report-1.1.0.jar (3)
  mainClass: tech.stackable.demo.spark.NYTLCReport
  volumes:
    - name: cm-job-arguments
      configMap:
        name: cm-job-arguments (4)
  args:
    - "--input /arguments/job-args.txt" (5)
  sparkConf:
    "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
  driver:
    config:
      volumeMounts:
        - name: cm-job-arguments (6)
          mountPath: /arguments  (7)
  executor:
    replicas: 3
    config:
      volumeMounts:
        - name: cm-job-arguments (6)
          mountPath: /arguments (7)
| 1 | Name of the configuration map |
| 2 | Argument required by the job |
| 3 | Job Scala artifact that requires an input argument |
| 4 | The volume backed by the configuration map |
| 5 | The expected job argument, accessed via the mounted configuration map file |
| 6 | The name of the volume backed by the configuration map that is mounted to the driver/executor |
| 7 | The mount location of the volume (this contains a file /arguments/job-args.txt) |