To improve a Spark application's efficiency, it's essential to monitor its performance and behavior. In this post, we demonstrate how to publish detailed Spark metrics from Amazon EMR to Amazon CloudWatch. This gives you the ability to identify bottlenecks while optimizing resource utilization.
CloudWatch provides a robust, scalable, and cost-effective monitoring solution for AWS resources and applications, with powerful customization options and seamless integration with other AWS services. By default, Amazon EMR sends basic metrics to CloudWatch to track the activity and health of a cluster. Spark's configurable metrics system allows metrics to be collected in a variety of sinks, including HTTP, JMX, and CSV files, but additional configuration is required to enable Spark to publish metrics to CloudWatch.
Solution overview
This solution includes Spark configuration to send metrics to a custom sink. The custom sink collects only the metrics defined in a Metricfilter.json file. It uses the CloudWatch agent to publish the metrics to a custom CloudWatch namespace. The included bootstrap action script is responsible for installing and configuring the CloudWatch agent and the metric library on the Amazon Elastic Compute Cloud (Amazon EC2) EMR instances. A CloudWatch dashboard can provide instant insight into the performance of an application.
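Spark's metrics system is driven by a metrics.properties file, so a bootstrap action like installer.sh typically writes one that points at the custom sink. The following is a minimal sketch of that idea; the sink class name `com.amazonaws.emr.metrics.CloudWatchSink` is a hypothetical placeholder, since the actual class ships inside the emr-custom-cw-sink-0.0.1.jar library.

```shell
# Sketch only: write a metrics.properties that routes all Spark metric
# instances to a custom sink. The sink class name is a hypothetical
# placeholder; the real one is defined in emr-custom-cw-sink-0.0.1.jar.
cat > /tmp/metrics.properties <<'EOF'
# Send every metric instance (master, worker, driver, executor) to the sink
*.sink.cloudwatch.class=com.amazonaws.emr.metrics.CloudWatchSink
# Enable the JVM source so garbage collection and memory metrics are emitted
*.source.jvm.class=org.apache.spark.metrics.source.JvmSource
EOF
echo "wrote /tmp/metrics.properties"
```

The `*.` prefix applies the sink to all metric instances; you could scope it to `executor.` or `driver.` instead to reduce volume.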
The following diagram illustrates the solution architecture and workflow.
The workflow includes the following steps:
- Users start a Spark EMR job, creating a step on the EMR cluster. With Apache Spark, the workload is distributed across the different nodes of the EMR cluster.
- In each node (EC2 instance) of the cluster, a Spark library captures and pushes metric data to a CloudWatch agent, which aggregates the metric data before pushing it to CloudWatch every 30 seconds.
- Users can view the metrics by accessing the custom namespace on the CloudWatch console.
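The 30-second aggregation in step 2 is a property of the CloudWatch agent's configuration. The sketch below writes an illustrative agent config file and validates it as JSON; the assumption that the metric library hands metrics to the agent over its StatsD listener is ours for illustration, and the namespace value is a placeholder.

```shell
# Illustrative CloudWatch agent configuration. The statsd transport and
# the namespace value are assumptions for this sketch, not confirmed
# details of the metric library.
cat > /tmp/amazon-cloudwatch-agent.json <<'EOF'
{
  "metrics": {
    "namespace": "EMRCustomSparkCloudWatchSink",
    "metrics_collected": {
      "statsd": {
        "service_address": ":8125",
        "metrics_collection_interval": 30,
        "metrics_aggregation_interval": 30
      }
    }
  }
}
EOF
python3 -m json.tool /tmp/amazon-cloudwatch-agent.json > /dev/null && echo "config is valid JSON"
```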
We provide an AWS CloudFormation template in this post as a general guide. The template demonstrates how to configure a CloudWatch agent on Amazon EMR to push Spark metrics to CloudWatch. You can review and customize it as needed to include your Amazon EMR security configurations. As a best practice, we recommend including your Amazon EMR security configurations in the template to encrypt data in transit.
You should also be aware that some of the resources deployed by this stack incur costs while they remain in use. Additionally, EMR metrics don't incur CloudWatch charges. However, custom metrics incur charges based on CloudWatch metrics pricing. For more information, see Amazon CloudWatch Pricing.
In the following sections, we go through these steps:
- Create and upload the metrics library, installation script, and filter definition to an Amazon Simple Storage Service (Amazon S3) bucket.
- Use the CloudFormation template to create the solution resources (an IAM role, an instance profile, an EMR cluster, and a CloudWatch dashboard).
- Monitor the Spark metrics on the CloudWatch console.
Prerequisites
This post assumes that you have the following:
- An AWS account.
- An S3 bucket for storing the bootstrap script, library, and metric filter definition.
- A VPC created in Amazon Virtual Private Cloud (Amazon VPC), where your EMR cluster will be launched.
- Default IAM service roles for Amazon EMR permissions to AWS services and resources. You can create these roles with the aws emr create-default-roles command in the AWS Command Line Interface (AWS CLI).
- An optional EC2 key pair, if you plan to connect to your cluster through SSH rather than Session Manager, a capability of AWS Systems Manager.
Define the required metrics
To avoid sending unnecessary data to CloudWatch, our solution implements a metric filter. Review the Spark documentation to get familiar with the namespaces and their associated metrics. Determine which metrics are relevant to your specific application and performance goals. Different applications may require different metrics to monitor, depending on the workload, data processing requirements, and optimization objectives. The metric names you'd like to monitor should be defined in the Metricfilter.json file, along with their associated namespaces.
We have created an example Metricfilter.json definition, which includes capturing metrics related to data I/O, garbage collection, memory and CPU pressure, and Spark job, stage, and task metrics.
Note that certain metrics aren't available in all Spark release versions (for example, appStatus was introduced in Spark 3.0).
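To make this concrete, here is a hypothetical Metricfilter.json in the spirit described above. The exact schema the library expects isn't reproduced in this post, so treat the field names as assumptions; the metric and namespace names themselves come from Spark's documented metrics system.

```shell
# Hypothetical Metricfilter.json sketch; the "filters"/"namespace"/"metrics"
# field names are illustrative, not the library's confirmed schema. The
# metric names follow Spark's documented sources (jvm, executor, and
# appStatus, the last of which requires Spark 3.0+).
cat > /tmp/Metricfilter.json <<'EOF'
{
  "filters": [
    { "namespace": "jvm",       "metrics": ["heap.used", "total.used"] },
    { "namespace": "executor",  "metrics": ["filesystem.hdfs.read_bytes", "filesystem.hdfs.write_bytes"] },
    { "namespace": "appStatus", "metrics": ["jobs.succeededJobs", "stages.failedStages", "tasks.completedTasks"] }
  ]
}
EOF
python3 -m json.tool /tmp/Metricfilter.json > /dev/null && echo "filter is valid JSON"
```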
Create and upload the required files to an S3 bucket
For more information, see Uploading objects and Installing and running the CloudWatch agent on your servers.
To create and upload the bootstrap script, complete the following steps:
- On the Amazon S3 console, choose your S3 bucket.
- On the Objects tab, choose Upload.
- Choose Add files, then choose the Metricfilter.json, installer.sh, and examplejob.sh files.
- Additionally, add the emr-custom-cw-sink-0.0.1.jar metrics library file that corresponds to the Amazon EMR release version you will be using.
- Choose Upload, and note the S3 URIs for the files.
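If you prefer the AWS CLI to the console, the uploads can be scripted. The sketch below only prints the equivalent commands for review (the bucket name is a placeholder); run them directly once your AWS CLI is configured with your own bucket.

```shell
# Print (do not execute) the equivalent AWS CLI upload commands.
# 'my-emr-artifacts-bucket' is a placeholder bucket name.
BUCKET=my-emr-artifacts-bucket
for f in Metricfilter.json installer.sh examplejob.sh emr-custom-cw-sink-0.0.1.jar; do
  echo "aws s3 cp $f s3://$BUCKET/$f"
done > /tmp/upload-commands.sh
cat /tmp/upload-commands.sh
```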
Provision resources with the CloudFormation template
Choose Launch Stack to launch a CloudFormation stack in your account and deploy the template:
This template creates an IAM role, IAM instance profile, EMR cluster, and CloudWatch dashboard. The cluster starts a basic Spark example application. You will be billed for the AWS resources used if you create a stack from this template.
The CloudFormation wizard will ask you to modify or provide these parameters:
- InstanceType – The type of instance for all instance groups. The default is m5.2xlarge.
- InstanceCountCore – The number of instances in the core instance group. The default is 4.
- EMRReleaseLabel – The Amazon EMR release label you want to use. The default is emr-6.9.0.
- BootstrapScriptPath – The S3 path of the installer.sh installation bootstrap script that you copied earlier.
- MetricFilterPath – The S3 path of your Metricfilter.json definition that you copied earlier.
- MetricsLibraryPath – The S3 path of your CloudWatch emr-custom-cw-sink-0.0.1.jar library that you copied earlier.
- CloudWatchNamespace – The name of the custom CloudWatch namespace to be used.
- SparkDemoApplicationPath – The S3 path of your examplejob.sh script that you copied earlier.
- Subnet – The EC2 subnet where the cluster launches. You must provide this parameter.
- EC2KeyPairName – An optional EC2 key pair for connecting to cluster nodes, as an alternative to Session Manager.
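For repeatable deployments, the same stack can be created from the AWS CLI. This sketch only writes the command to a file and checks its syntax; the template file name, subnet ID, and S3 URIs are placeholders you must replace with your own values.

```shell
# Placeholder values throughout; substitute your own template path,
# bucket, and subnet before running the generated script.
cat > /tmp/create-stack.sh <<'EOF'
aws cloudformation create-stack \
  --stack-name EMR-CloudWatch-Demo \
  --template-body file://emr-cloudwatch-demo.yaml \
  --capabilities CAPABILITY_IAM \
  --parameters \
    ParameterKey=InstanceType,ParameterValue=m5.2xlarge \
    ParameterKey=InstanceCountCore,ParameterValue=4 \
    ParameterKey=EMRReleaseLabel,ParameterValue=emr-6.9.0 \
    ParameterKey=BootstrapScriptPath,ParameterValue=s3://my-emr-artifacts-bucket/installer.sh \
    ParameterKey=MetricFilterPath,ParameterValue=s3://my-emr-artifacts-bucket/Metricfilter.json \
    ParameterKey=MetricsLibraryPath,ParameterValue=s3://my-emr-artifacts-bucket/emr-custom-cw-sink-0.0.1.jar \
    ParameterKey=CloudWatchNamespace,ParameterValue=EMRCustomSparkCloudWatchSink \
    ParameterKey=SparkDemoApplicationPath,ParameterValue=s3://my-emr-artifacts-bucket/examplejob.sh \
    ParameterKey=Subnet,ParameterValue=subnet-0123456789abcdef0
EOF
bash -n /tmp/create-stack.sh && echo "stack command syntax OK"
```

The --capabilities CAPABILITY_IAM flag is required because the template creates IAM resources.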
View the metrics
After the CloudFormation stack deploys successfully, the example job starts automatically and takes approximately 15 minutes to complete. On the CloudWatch console, choose Dashboards in the navigation pane. Then filter the list by the prefix SparkMonitoring.
The example dashboard includes information on the cluster and an overview of the Spark jobs, stages, and tasks. Metrics are also available under a custom namespace starting with EMRCustomSparkCloudWatchSink.
Memory, CPU, I/O, and more task distribution metrics are also included.
Finally, detailed Java garbage collection metrics are available per executor.
Clean up
To avoid future charges in your account, delete the resources you created in this walkthrough. The EMR cluster will incur charges as long as the cluster is active, so stop it when you're done. Complete the following steps:
- On the CloudFormation console, in the navigation pane, choose Stacks.
- Choose the stack you launched (EMR-CloudWatch-Demo), then choose Delete.
- Empty the S3 bucket you created.
- Delete the S3 bucket you created.
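The cleanup steps above can also be scripted. As before, this sketch only prints the commands with a placeholder bucket name, so nothing is deleted until you run them yourself.

```shell
# Print (do not execute) the cleanup commands; the bucket is a placeholder.
BUCKET=my-emr-artifacts-bucket
{
  echo "aws cloudformation delete-stack --stack-name EMR-CloudWatch-Demo"
  echo "aws s3 rm s3://$BUCKET --recursive"   # empty the bucket
  echo "aws s3 rb s3://$BUCKET"               # delete the (now empty) bucket
} > /tmp/cleanup-commands.sh
cat /tmp/cleanup-commands.sh
```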
Conclusion
Now that you have completed the steps in this walkthrough, the CloudWatch agent is running on your cluster hosts and configured to push Spark metrics to CloudWatch. With this capability, you can effectively monitor the health and performance of your Spark jobs running on Amazon EMR, detecting critical issues in real time and identifying root causes quickly.
You can package and deploy this solution through a CloudFormation template like this example template, which creates the IAM instance profile role, CloudWatch dashboard, and EMR cluster. The source code for the library is available on GitHub for customization.
To take this further, consider using these metrics in CloudWatch alarms. You could collect them with other alarms into a composite alarm, or configure alarm actions such as sending Amazon Simple Notification Service (Amazon SNS) notifications to trigger event-driven processes such as AWS Lambda functions.
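As a sketch of that idea, the following writes an illustrative put-metric-alarm command to a file and checks its syntax; the metric name, namespace, and SNS topic ARN are all placeholder assumptions, not values produced by this stack.

```shell
# Illustrative alarm on a custom Spark metric. Every name and ARN below
# is a placeholder; the command is only written to a file for review.
cat > /tmp/create-alarm.sh <<'EOF'
aws cloudwatch put-metric-alarm \
  --alarm-name spark-failed-stages \
  --namespace EMRCustomSparkCloudWatchSink \
  --metric-name stages.failedStages \
  --statistic Maximum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 0 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:spark-alerts
EOF
bash -n /tmp/create-alarm.sh && echo "alarm command syntax OK"
```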
About the Author
Le Clue Lubbe is a Principal Engineer at AWS. He works with our largest enterprise customers to solve some of their most complex technical problems. He drives broad solutions through innovation to impact and improve the life of our customers.