Introducing Amazon EMR on EKS job submission with Spark Operator and spark-submit

Amazon EMR on EKS provides a deployment option for Amazon EMR that allows organizations to run open-source big data frameworks on Amazon Elastic Kubernetes Service (Amazon EKS). With EMR on EKS, Spark applications run on the Amazon EMR runtime for Apache Spark. This performance-optimized runtime offered by Amazon EMR makes your Spark jobs run fast and cost-effectively. The EMR runtime provides up to 5.37 times better performance and 76.8% cost savings when compared to using open-source Apache Spark on Amazon EKS.

Building on the success of Amazon EMR on EKS, customers have been running and managing jobs using the emr-containers API, creating EMR virtual clusters, and submitting jobs to the EKS cluster, either through the AWS Command Line Interface (AWS CLI) or the Apache Airflow scheduler. However, other customers running Spark applications have chosen Spark Operator or native spark-submit to define and run Apache Spark jobs on Amazon EKS, but without taking advantage of the performance gains from running Spark on the optimized EMR runtime. In response to this need, starting with EMR 6.10 we have introduced a new feature that lets you use the optimized EMR runtime while submitting and managing Spark jobs through either Spark Operator or spark-submit. This means that anyone running Spark workloads on EKS can take advantage of EMR's optimized runtime.

In this post, we walk through the process of setting up and running Spark jobs using both Spark Operator and spark-submit, integrated with the EMR runtime feature. We provide step-by-step instructions to assist you in setting up the infrastructure and submitting a job with either method. Additionally, you can use the Data on EKS blueprint to deploy the entire infrastructure using Terraform templates.

Infrastructure overview

In this post, we walk through the process of deploying a comprehensive solution using eksctl, Helm, and the AWS CLI. Our deployment includes the following resources:

  • A VPC, EKS cluster, and managed node group, set up with the eksctl tool
  • Essential Amazon EKS managed add-ons, such as the VPC CNI, CoreDNS, and kube-proxy, set up with the eksctl tool
  • Cluster Autoscaler and Spark Operator add-ons, set up using Helm
  • A Spark job execution AWS Identity and Access Management (IAM) role, an IAM policy for Amazon Simple Storage Service (Amazon S3) bucket access, a service account, and role-based access control, set up using the AWS CLI and eksctl

Prerequisites

Verify that the following prerequisites are installed on your machine:

  • AWS CLI
  • eksctl
  • kubectl
  • Helm

Set up AWS credentials

Before proceeding to the next step and running the eksctl command, you need to set up your local AWS credentials profile. For instructions, refer to Configuration and credential file settings.
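
For example, a minimal sketch of setting up and verifying a named profile might look like the following (the profile name emr-eks is an arbitrary example):

# Configure a named profile interactively (prompts for access key, secret key, Region, and output format)
aws configure --profile emr-eks

# Use the profile for the subsequent commands in this shell
export AWS_PROFILE=emr-eks

# Confirm that the credentials resolve to the expected account and principal
aws sts get-caller-identity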

Deploy the VPC, EKS cluster, and managed add-ons

The following configuration uses us-west-1 as the default Region. To run in a different Region, update the region and availabilityZones fields accordingly. Also, verify that the same Region is used in the subsequent steps throughout the post.

Enter the following code snippet into the terminal where your AWS credentials are set up. Make sure to update the publicAccessCIDRs field with your IP before you run the command below. This will create a file named eks-cluster.yaml:

cat <<EOF >eks-cluster.yaml
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: emr-spark-operator
  region: us-west-1 # replace with your region
  version: "1.25"
vpc:
  clusterEndpoints:
    publicAccess: true
    privateAccess: true
  publicAccessCIDRs: ["YOUR-IP/32"]
availabilityZones: ["us-west-1a","us-west-1b"] # replace with your region's Availability Zones
iam:
  withOIDC: true
  serviceAccounts:
  - metadata:
      name: cluster-autoscaler
      namespace: kube-system
    wellKnownPolicies:
      autoScaler: true
    roleName: eksctl-cluster-autoscaler-role
managedNodeGroups:
  - name: m5x
    instanceType: m5.xlarge
    availabilityZones: ["us-west-1a"]
    volumeSize: 100
    volumeType: gp3
    minSize: 2
    desiredCapacity: 2
    maxSize: 10
    tags:
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/eks-nvme: "owned"
addons:
  - name: vpc-cni
    version: latest
  - name: coredns
    version: latest
  - name: kube-proxy
    version: latest
cloudWatch:
  clusterLogging:
    enableTypes: ["*"]
EOF

Use the following command to create the EKS cluster: eksctl create cluster -f eks-cluster.yaml
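
After the cluster is created, you can optionally run a quick sanity check to confirm that your kubeconfig points to the new cluster, the worker nodes are ready, and the managed add-ons are active; for example:

# Point kubectl at the new cluster (eksctl normally updates your kubeconfig automatically)
aws eks update-kubeconfig --name emr-spark-operator --region us-west-1

# Both managed nodes should report a Ready status
kubectl get nodes -o wide

# The vpc-cni, coredns, and kube-proxy add-ons should be listed
aws eks list-addons --cluster-name emr-spark-operator --region us-west-1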

Deploy Cluster Autoscaler

Cluster Autoscaler is essential for automatically adjusting the size of your Kubernetes cluster based on the current resource demands, optimizing resource utilization and cost. Create an autoscaler-helm-values.yaml file and install the Cluster Autoscaler using Helm:

cat <<EOF >autoscaler-helm-values.yaml
---
autoDiscovery:
    clusterName: emr-spark-operator
    tags:
      - k8s.io/cluster-autoscaler/enabled
      - k8s.io/cluster-autoscaler/{{ .Values.autoDiscovery.clusterName }}
awsRegion: us-west-1 # Make sure the region is the same as the EKS cluster's
rbac:
  serviceAccount:
    create: false
    name: cluster-autoscaler
EOF

helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm install nodescaler autoscaler/cluster-autoscaler \
--namespace kube-system \
--values autoscaler-helm-values.yaml --debug
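
You can confirm that the Cluster Autoscaler deployment is healthy before moving on; the following sketch assumes the Helm release name nodescaler used above (label names can vary between chart versions):

# The Cluster Autoscaler pod should be Running in kube-system
kubectl get pods -n kube-system -l app.kubernetes.io/instance=nodescaler

# Tail the logs to confirm it discovered the managed node group through the auto-discovery tags
kubectl logs -n kube-system -l app.kubernetes.io/instance=nodescaler --tail=20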

You can also set up Karpenter as a cluster autoscaler to automatically launch the right compute resources to handle your EKS cluster's applications. You can follow this blog on how to set up and configure Karpenter.

Deploy Spark Operator

Spark Operator is an open-source Kubernetes operator specifically designed to manage and monitor Spark applications running on Kubernetes. It streamlines the process of deploying and managing Spark jobs by providing a Kubernetes custom resource to define, configure, and run Spark applications, as well as manage the job life cycle through the Kubernetes API. Some customers prefer using Spark Operator to manage Spark jobs because it enables them to manage Spark applications just like other Kubernetes resources.

Currently, customers are building their open-source Spark images and using S3A committers as part of job submissions with Spark Operator or spark-submit. However, with the new job submission option, you can now benefit from the EMR runtime in conjunction with EMRFS. Starting with Amazon EMR 6.10 and for each upcoming version of the EMR runtime, we will release the Spark Operator and its Helm chart to use the EMR runtime.

In this section, we show you how to deploy a Spark Operator Helm chart from an Amazon Elastic Container Registry (Amazon ECR) repository and submit jobs using EMR runtime images, benefiting from the performance enhancements provided by the EMR runtime.

Install the Spark Operator with Helm from Amazon ECR

The Spark Operator Helm chart is stored in an ECR repository. To install the Spark Operator, you first need to authenticate your Helm client with the ECR repository. The charts are stored under the following path: ECR_URI/spark-operator.

Authenticate your Helm client and install the Spark Operator:

aws ecr get-login-password \
--region us-west-1 | helm registry login \
--username AWS \
--password-stdin 608033475327.dkr.ecr.us-west-1.amazonaws.com

You can authenticate to other EMR on EKS supported Regions by obtaining the AWS account ID for the corresponding Region. For more information, refer to how to select a base image URI.

Install Spark Operator

You can now install the Spark Operator using the following command:

helm install spark-operator-demo \
oci://608033475327.dkr.ecr.us-west-1.amazonaws.com/spark-operator \
--set emrContainers.awsRegion=us-west-1 \
--version 1.1.26-amzn-0 \
--set serviceAccounts.spark.create=false \
--namespace spark-operator \
--create-namespace

To verify that the operator has been installed correctly, run the following command:

helm list --namespace spark-operator -o yaml
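
You can also check that the operator pod is running and that the SparkApplication custom resource definitions were registered; for example:

# The Spark Operator pod should be Running in the spark-operator namespace
kubectl get pods -n spark-operator

# The sparkapplications and scheduledsparkapplications CRDs should be present
kubectl get crd | grep sparkoperator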

Set up the Spark job execution role and service account

In this step, we create a Spark job execution IAM role and a service account, which will be used in the Spark Operator and spark-submit job submission examples.

First, we create an IAM policy that will be used by IAM Roles for Service Accounts (IRSA). This policy enables the driver and executor pods to access the AWS services specified in the policy. Complete the following steps:

  1. As a prerequisite, either create an S3 bucket (aws s3api create-bucket --bucket <ENTER-S3-BUCKET> --create-bucket-configuration LocationConstraint=us-west-1 --region us-west-1) or use an existing S3 bucket. Replace <ENTER-S3-BUCKET> in the following code with the bucket name.
  2. Create a policy file that allows read and write access to an S3 bucket:
    cat >job-execution-policy.json <<EOL
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:ListBucket",
                    "s3:PutObject",
                    "s3:DeleteObject",
                    "s3:AbortMultipartUpload",
                    "s3:ListMultipartUploadParts"
                ],
                "Resource": [
                    "arn:aws:s3:::<ENTER-S3-BUCKET>",
                    "arn:aws:s3:::<ENTER-S3-BUCKET>/*",
                    "arn:aws:s3:::aws-data-lake-workshop/*",
                    "arn:aws:s3:::nyc-tlc",
                    "arn:aws:s3:::nyc-tlc/*"
                ]
            }
        ]
    }
    EOL

  3. Create the IAM policy with the following command:
    aws iam create-policy --policy-name emr-job-execution-policy --policy-document file://job-execution-policy.json

  4. Next, create a service account named emr-job-execution-sa as well as the IAM role. The following eksctl command creates a service account scoped to the namespace and service account defined to be used by the executor and driver. Make sure to replace <ENTER_YOUR_ACCOUNT_ID> with your account ID before running the command:
    eksctl create iamserviceaccount \
    --cluster=emr-spark-operator \
    --region us-west-1 \
    --name=emr-job-execution-sa \
    --attach-policy-arn=arn:aws:iam::<ENTER_YOUR_ACCOUNT_ID>:policy/emr-job-execution-policy \
    --role-name=emr-job-execution-irsa \
    --namespace=data-team-a \
    --approve

  5. Create an S3 bucket policy to allow only the execution role created in step 4 to write to and read from the S3 bucket created in step 1. Make sure to replace <ENTER_YOUR_ACCOUNT_ID> with your account ID before running the command:
    cat > bucketpolicy.json<<EOL
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:ListBucket",
                    "s3:PutObject",
                    "s3:DeleteObject",
                    "s3:AbortMultipartUpload",
                    "s3:ListMultipartUploadParts"
                ],
                "Principal": {
                    "AWS": "arn:aws:iam::<ENTER_YOUR_ACCOUNT_ID>:role/emr-job-execution-irsa"
                },
                "Resource": [
                    "arn:aws:s3:::<ENTER-S3-BUCKET>",
                    "arn:aws:s3:::<ENTER-S3-BUCKET>/*"
                ]
            }
        ]
    }
    EOL
    
    aws s3api put-bucket-policy --bucket <ENTER-S3-BUCKET> --policy file://bucketpolicy.json

  6. Create a Kubernetes role and role binding required for the service account used in the Spark job run:
    cat <<EOF >emr-job-execution-rbac.yaml
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: emr-job-execution-sa-role
      namespace: data-team-a
    rules:
      - apiGroups: ["", "batch","extensions"]
        resources: ["configmaps","serviceaccounts","events","pods","pods/exec","pods/log","pods/portforward","secrets","services","persistentvolumeclaims"]
        verbs: ["create","delete","get","list","patch","update","watch"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: emr-job-execution-sa-rb
      namespace: data-team-a
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: Role
      name: emr-job-execution-sa-role
    subjects:
      - kind: ServiceAccount
        name: emr-job-execution-sa
        namespace: data-team-a
    EOF

  7. Apply the Kubernetes role and role binding definition with the following command (you can verify the setup with the commands shown after this list):
kubectl apply -f emr-job-execution-rbac.yaml
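
If you want to confirm the wiring between the service account, the IAM role, and the RBAC objects, a quick check such as the following can help (the eks.amazonaws.com/role-arn annotation should point to the emr-job-execution-irsa role):

# The service account should carry the IAM role annotation added by eksctl
kubectl describe serviceaccount emr-job-execution-sa -n data-team-a

# The Role and RoleBinding created above should exist in the data-team-a namespace
kubectl get role,rolebinding -n data-team-a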

So far, we have completed the infrastructure setup, including the Spark job execution roles. In the following steps, we run sample Spark jobs using both Spark Operator and spark-submit with the EMR runtime.

Configure the Spark Operator job with the EMR runtime

In this section, we present a sample Spark job that reads data from public datasets stored in S3 buckets, processes them, and writes the results to your own S3 bucket. Make sure that you update the S3 bucket in the following configuration by replacing <ENTER_S3_BUCKET> with the URI of your own S3 bucket referenced in step 2 of the "Set up the Spark job execution role and service account" section. Also, note that we are using data-team-a as the namespace and emr-job-execution-sa as the service account, which we created in the previous step. These are necessary to run the Spark job pods in the dedicated namespace, and the IAM role linked to the service account is used to access the S3 bucket for reading and writing data.

Most importantly, notice the image field with the EMR optimized runtime Docker image, which is currently set to emr-6.10.0. You can change this to a newer version when it's released by the Amazon EMR team. Also, when configuring your jobs, make sure that you include the sparkConf and hadoopConf settings as defined in the following manifest. These configurations enable you to benefit from EMR runtime performance, AWS Glue Data Catalog integration, and the EMRFS optimized connector.

  1. Create the file (emr-spark-operator-example.yaml) locally and update the S3 bucket location so that you can submit the job as part of the next step:
    cat <<EOF >emr-spark-operator-example.yaml
    ---
    apiVersion: "sparkoperator.k8s.io/v1beta2"
    kind: SparkApplication
    metadata:
      name: taxi-example
      namespace: data-team-a
    spec:
      type: Scala
      mode: cluster
      # EMR optimized runtime image
      image: "608033475327.dkr.ecr.us-west-1.amazonaws.com/spark/emr-6.10.0:latest"
      imagePullPolicy: Always
      mainClass: ValueZones
      mainApplicationFile: s3://aws-data-lake-workshop/spark-eks/spark-eks-assembly-3.3.0.jar
      arguments:
        - s3://nyc-tlc/csv_backup
        - "2017"
        - s3://nyc-tlc/misc/taxi_zone_lookup.csv
        - s3://<ENTER_S3_BUCKET>/emr-eks-results
        - emr_eks_demo
      hadoopConf:
        # EMRFS filesystem config
        fs.s3.customAWSCredentialsProvider: com.amazonaws.auth.WebIdentityTokenCredentialsProvider
        fs.s3.impl: com.amazon.ws.emr.hadoop.fs.EmrFileSystem
        fs.AbstractFileSystem.s3.impl: org.apache.hadoop.fs.s3.EMRFSDelegate
        fs.s3.buffer.dir: /mnt/s3
        fs.s3.getObject.initialSocketTimeoutMilliseconds: "2000"
        mapreduce.fileoutputcommitter.algorithm.version.emr_internal_use_only.EmrFileSystem: "2"
        mapreduce.fileoutputcommitter.cleanup-failures.ignored.emr_internal_use_only.EmrFileSystem: "true"
      sparkConf:
        spark.eventLog.enabled: "true"
        spark.eventLog.dir: "s3://<ENTER_S3_BUCKET>/"
        spark.kubernetes.driver.pod.name: driver-nyc-taxi-etl
        # Required for EMR Runtime and Glue Data Catalog
        spark.driver.extraClassPath: /usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/home/hadoop/extrajars/*
        spark.driver.extraLibraryPath: /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native
        spark.executor.extraClassPath: /usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/home/hadoop/extrajars/*
        spark.executor.extraLibraryPath: /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native
        # EMRFS committer
        spark.sql.parquet.output.committer.class: com.amazon.emr.committer.EmrOptimizedSparkSqlParquetOutputCommitter
        spark.sql.parquet.fs.optimized.committer.optimization-enabled: "true"
        spark.sql.emr.internal.extensions: com.amazonaws.emr.spark.EmrSparkSessionExtensions
        spark.executor.defaultJavaOptions: -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseParallelGC -XX:InitiatingHeapOccupancyPercent=70 -XX:OnOutOfMemoryError="kill -9 %p"
        spark.driver.defaultJavaOptions: -XX:OnOutOfMemoryError="kill -9 %p" -XX:+UseParallelGC -XX:InitiatingHeapOccupancyPercent=70
      sparkVersion: "3.3.1"
      restartPolicy:
        type: Never
      driver:
        cores: 1
        memory: "4g"
        serviceAccount: emr-job-execution-sa
      executor:
        cores: 1
        instances: 4
        memory: "4g"
        serviceAccount: emr-job-execution-sa
    EOF

  2. Run the following command to submit the job to the EKS cluster:
    kubectl apply -f emr-spark-operator-example.yaml

    The job may take 4–5 minutes to complete, and you can verify the success message in the driver pod logs.

  3. Verify the job by running the following command (see the monitoring sketch after this list for additional checks):
kubectl get pods -n data-team-a
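
Beyond listing the pods, you can also inspect the SparkApplication object and stream the driver logs to follow the job's progress; a minimal sketch:

# The application state should move from SUBMITTED to RUNNING and finally COMPLETED
kubectl get sparkapplication taxi-example -n data-team-a

# Stream the driver logs; the driver pod name is fixed to driver-nyc-taxi-etl in sparkConf above
kubectl logs -f driver-nyc-taxi-etl -n data-team-a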

Enable access to the Spark UI

The Spark UI is an important tool for data engineers because it allows you to track the progress of tasks, view detailed job and stage information, and analyze resource utilization to identify bottlenecks and optimize your code. For Spark jobs running on Kubernetes, the Spark UI is hosted on the driver pod and its access is restricted to the internal network of Kubernetes. To access it, we need to forward the traffic to the pod with kubectl. The following steps take you through how to set it up.

Run the following command to forward traffic to the driver pod (the driver runs in the data-team-a namespace):

kubectl port-forward -n data-team-a <driver-pod-name> 4040:4040

You should see text similar to the following:

Forwarding from 127.0.0.1:4040 -> 4040
Forwarding from [::1]:4040 -> 4040

If you didn't specify the driver pod name at the submission of the SparkApplication, you can get it with the following command:

kubectl get pods -n data-team-a -l spark-role=driver,spark-app-name=<your-spark-app-name> -o jsonpath="{.items[0].metadata.name}"

Open a browser and enter http://localhost:4040 in the address bar. You should be able to connect to the Spark UI.
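
Because the driver pod runs in the data-team-a namespace, it can be convenient to capture its name in a shell variable and port-forward in one go; for example, for the taxi-example application:

# Look up the driver pod for the taxi-example application
DRIVER_POD=$(kubectl get pods -n data-team-a \
  -l spark-role=driver,spark-app-name=taxi-example \
  -o jsonpath='{.items[0].metadata.name}')

# Forward local port 4040 to the Spark UI served by the driver pod
kubectl port-forward -n data-team-a "$DRIVER_POD" 4040:4040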

[Screenshot: Spark UI]

Spark History Server

If you want to explore your job after its run, you can view it through the Spark History Server. The preceding SparkApplication definition has the event log enabled and stores the events in an S3 bucket with the following path: s3://YOUR-S3-BUCKET/. For instructions on setting up the Spark History Server and exploring the logs, refer to Launching the Spark history server and viewing the Spark UI using Docker.

spark-submit

spark-submit is a command line interface for running Apache Spark applications on a cluster or locally. It allows you to submit applications to Spark clusters. The tool enables simple configuration of application properties, resource allocation, and custom libraries, streamlining the deployment and management of Spark jobs.

Beginning with Amazon EMR 6.10, spark-submit is supported as a job submission method. This method currently only supports cluster mode as the submission mechanism. To submit jobs using the spark-submit method, we reuse the IAM role for the service account we set up earlier. We also use the S3 bucket used for the Spark Operator method. The steps in this section take you through how to configure and submit jobs with spark-submit and benefit from EMR runtime improvements.

  1. In order to submit a job, you need to use the Spark version that matches the one available in Amazon EMR. For Amazon EMR 6.10, you need to download the Spark 3.3 version.
  2. You also need to make sure you have Java installed in your environment.
  3. Unzip the file and navigate to the root of the Spark directory.
  4. In the following code, replace the EKS endpoint as well as the S3 bucket, then run the script (you can retrieve the cluster endpoint with the command shown after this list):
./bin/spark-submit \
--class ValueZones \
--master k8s://EKS-ENDPOINT \
--conf spark.kubernetes.namespace=data-team-a \
--conf spark.kubernetes.container.image=608033475327.dkr.ecr.us-west-1.amazonaws.com/spark/emr-6.10.0:latest \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=emr-job-execution-sa \
--conf spark.kubernetes.authenticate.executor.serviceAccountName=emr-job-execution-sa \
--conf spark.driver.extraClassPath="/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/home/hadoop/extrajars/*" \
--conf spark.driver.extraLibraryPath="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native" \
--conf spark.executor.extraClassPath="/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/home/hadoop/extrajars/*" \
--conf spark.executor.extraLibraryPath="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native" \
--conf spark.hadoop.fs.s3.customAWSCredentialsProvider=com.amazonaws.auth.WebIdentityTokenCredentialsProvider \
--conf spark.hadoop.fs.s3.impl=com.amazon.ws.emr.hadoop.fs.EmrFileSystem \
--conf spark.hadoop.fs.AbstractFileSystem.s3.impl=org.apache.hadoop.fs.s3.EMRFSDelegate \
--conf spark.hadoop.fs.s3.buffer.dir=/mnt/s3 \
--conf spark.hadoop.fs.s3n.impl=com.amazon.ws.emr.hadoop.fs.EmrFileSystem \
--deploy-mode cluster \
s3://aws-data-lake-workshop/spark-eks/spark-eks-assembly-3.3.0.jar s3://nyc-tlc/csv_backup 2017 s3://nyc-tlc/misc/taxi_zone_lookup.csv s3://S3_BUCKET/emr-eks-results emr_eks_demo
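
The EKS-ENDPOINT placeholder is the Kubernetes API server endpoint of your cluster; one way to retrieve it is the following (the value already includes the https:// prefix):

# Retrieve the API server endpoint for the cluster created earlier
aws eks describe-cluster \
  --name emr-spark-operator \
  --region us-west-1 \
  --query "cluster.endpoint" \
  --output text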

The job takes about 7 minutes to complete with two executors of one core and 1 GB of memory each.

Using custom Kubernetes schedulers

Customers running a large volume of jobs concurrently might face challenges related to providing fair access to compute capacity that they aren't able to solve with the standard scheduling and resource utilization management Kubernetes offers. In addition, customers that are migrating from Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2) and are managing their scheduling with YARN queues will not be able to transpose them to Kubernetes scheduling capabilities.

To overcome this issue, you can use custom schedulers like Apache YuniKorn or Volcano. Spark Operator natively supports these schedulers, and with them you can schedule Spark applications based on factors such as priority, resource requirements, and fairness policies, while Spark Operator simplifies application deployment and management. To set up YuniKorn with gang scheduling and use it in Spark applications submitted through Spark Operator, refer to Spark Operator with YuniKorn.
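
As an illustration only, the following sketch shows one way to install YuniKorn with Helm and point a SparkApplication at it through the batchScheduler field; the repository URL, chart name, and namespace are assumptions to validate against the YuniKorn documentation for your version:

# Install YuniKorn from its Helm repository (assumed repo URL and chart name)
helm repo add yunikorn https://apache.github.io/yunikorn-release
helm repo update
helm install yunikorn yunikorn/yunikorn --namespace yunikorn --create-namespace

# Then, in the SparkApplication spec, request YuniKorn as the batch scheduler, for example:
#   spec:
#     batchScheduler: yunikorn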

Clean up

To avoid unwanted charges to your AWS account, delete all the AWS resources created during this deployment:

eksctl delete cluster -f eks-cluster.yaml
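
The eksctl command removes the cluster and the resources eksctl created, but the IAM policy and the S3 bucket you created manually are not deleted by it. A sketch of the additional cleanup (account ID and bucket name are the placeholders used earlier; the bucket removal is irreversible):

# Delete the job execution IAM policy (detach it first if it is still attached to a role)
aws iam delete-policy \
  --policy-arn arn:aws:iam::<ENTER_YOUR_ACCOUNT_ID>:policy/emr-job-execution-policy

# Empty and remove the S3 bucket used for job output and event logs
aws s3 rm s3://<ENTER-S3-BUCKET> --recursive
aws s3 rb s3://<ENTER-S3-BUCKET>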

Conclusion

In this post, we introduced the EMR runtime feature for Spark Operator and spark-submit, and explored the advantages of using this feature on an EKS cluster. With the optimized EMR runtime, you can significantly enhance the performance of your Spark applications while optimizing costs. We demonstrated the deployment of the cluster using the eksctl tool; you can also use the Data on EKS blueprints to deploy a production-ready EKS cluster for EMR on EKS and take advantage of these new deployment methods in addition to the EMR on EKS API job submission method. By running your applications on the optimized EMR runtime, you can further enhance your Spark application workflows and drive innovation in your data processing pipelines.


About the Authors

Lotfi Mouhib is a Senior Solutions Architect working for the Public Sector team with Amazon Web Services. He helps public sector customers across EMEA realize their ideas, build new services, and innovate for citizens. In his spare time, Lotfi enjoys cycling and running.

Vara Bonthu is a dedicated technology professional and Worldwide Tech Leader for Data on EKS, specializing in assisting AWS customers ranging from strategic accounts to diverse organizations. He is passionate about open-source technologies, data analytics, AI/ML, and Kubernetes, and has an extensive background in development, DevOps, and architecture. Vara's primary focus is on building highly scalable data and AI/ML solutions on Kubernetes platforms, helping customers harness the full potential of cutting-edge technology for their data-driven pursuits.
