Amazon CloudWatch metrics for Amazon OpenSearch Service storage and shard skew health

lohitnath.453

August 22, 2023

Amazon OpenSearch Service is a managed service that makes it easy to deploy, operate, and scale OpenSearch clusters in AWS to perform interactive log analytics, real-time application monitoring, website search, and more. OpenSearch is an open source, distributed search and analytics suite.

When working with OpenSearch Service, shard strategy is important. Shards distribute your workload across the data nodes of your cluster. When creating an index, you tell OpenSearch Service how many primary shards to create and how many replicas to create of each shard. The primary shards are independent partitions of the full dataset. OpenSearch Service automatically distributes your data across the primary shards in an index. Our recommendation is to use two replicas for your index. For example, if you set your index’s shard count to three primary shards and two replicas, you’ll have a total of nine shards. Properly configured indexes can help boost overall domain performance, whereas a misconfigured index will lead to storage and performance skew.
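The shard arithmetic in the example above can be written out as a minimal sketch. The function name total_shards is hypothetical and only illustrates the calculation; it is not part of the solution's code.

```python
def total_shards(primary_shards: int, replicas: int) -> int:
    """Total shard copies for one index: every primary plus its replicas."""
    if primary_shards < 1 or replicas < 0:
        raise ValueError("need at least one primary and a non-negative replica count")
    return primary_shards * (1 + replicas)


# Three primaries with two replicas each, as in the example above.
print(total_shards(3, 2))  # 9
```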

OpenSearch Service distributes the shards in your indexes to the data nodes in your domain, ensuring that no primary shard and its replicas are placed on the same node. The data for the shards is stored in the node’s storage. If your indexes (and therefore their shards) are very different sizes, the storage used on the data nodes in the domain will be unequal, or skewed. Storage skew leads to uneven memory and CPU utilization, intermittent and uneven latency, and uneven queueing and rejecting of requests. Therefore, it’s important to configure and maintain indexes such that shards can be distributed evenly across the data nodes of your cluster.

In this post, we explore how to deploy Amazon CloudWatch metrics using an AWS CloudFormation template to monitor an OpenSearch Service domain’s storage and shard skew. This solution uses an AWS Lambda function to extract storage and shard distribution metadata from your OpenSearch Service domain, calculates the level of skew, and then pushes this information to CloudWatch metrics so that you can easily monitor, alert, and respond.

Solution overview

The solution and associated resources are available for you to deploy into your own AWS account as a CloudFormation template. The template deploys the following resources:

An AWS Identity and Access Management (IAM) role for the Lambda function called OpensearchSkewMetricsLambdaRole. This allows write access to CloudWatch metrics and access to the CloudWatch log group and OpenSearch APIs.
An AWS Lambda function called Opensearch-SkewMetricsPublisher-py.
An Amazon CloudWatch log group for the Lambda function called /aws/lambda/Opensearch-skewmetrics-publisher-py.
An Amazon EventBridge rule for the Lambda function called EventRuleForOSSkew.
The following CloudWatch metrics for the Lambda function:
- aws_/<region-name>/<MetricIdentifier>/_storagemetric
- aws_/<region-name>/<MetricIdentifier>/_shardmetric

Prerequisites

For this walkthrough, you should have the following prerequisites:

An AWS account.
An OpenSearch Service domain.
This post requires you to add a Lambda role to the OpenSearch Service domain’s security configuration access policy. If your domain is using fine-grained access control, then after deploying the CloudFormation template you need to follow the steps described in the section Mapping roles to users to give the newly deployed Lambda execution role access to the domain.

Deploy the CloudFormation template

To deploy the CloudFormation template, complete the following steps:

Log in to your AWS account.
Choose the Region where you’re running your OpenSearch Service domain.
To launch your CloudFormation stack, choose Launch Stack.
For Stack name, enter a name for the stack (maximum length 30 characters).
For MetricIdentifier, enter a unique identifier that will help you identify the custom CloudWatch metrics for your domain.
For OpensearchDomainURL, enter the endpoint of the domain that you’re monitoring.
Choose Next.
Select I acknowledge that AWS CloudFormation might create IAM resources, then choose Create stack.
Wait for the stack creation to complete.
On the Lambda console, choose Functions in the navigation pane.
Choose the Lambda function called Opensearch-SkewMetricsPublisher-py-<stackname>.
In the Code section, choose Test.
Keep the default values for the test event and run a quick test.

Make sure to grant the Lambda execution role permission in the OpenSearch Service domain’s resource-based policy, if you’re using one. If fine-grained access control is enabled on the domain, then follow the steps in Mapping roles to users (as mentioned in the prerequisites) to give the Lambda function read-only access to the domain.

The Lambda function that sends OpenSearch domain metrics to CloudWatch is set to a default frequency of 1 day. You can change this configuration to monitor the domain at the required granularity by updating the event schedule for the rule deployed by the CloudFormation stack on the EventBridge console. Note that if the frequency is set to 1 minute, this will trigger the Lambda function every minute and will increase the Lambda cost.
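If you prefer to script the schedule change rather than use the console, the following is a minimal sketch. It assumes the deployed rule uses a rate-based schedule and that you pass in the rule's actual name; the helper names rate_expression, invocations_per_day, and update_schedule are hypothetical. The boto3 put_rule call replaces the rule's properties, so any argument you omit (for example, a description) is not kept, and editing the rule outside CloudFormation causes the stack to drift from its template.

```python
MINUTES_PER_UNIT = {"minute": 1, "hour": 60, "day": 1440}


def rate_expression(value: int, unit: str) -> str:
    """Build an EventBridge rate expression such as 'rate(1 day)'."""
    if unit not in MINUTES_PER_UNIT:
        raise ValueError("unit must be 'minute', 'hour', or 'day'")
    if value < 1:
        raise ValueError("value must be a positive integer")
    # EventBridge expects the singular unit for 1 and the plural otherwise.
    suffix = "" if value == 1 else "s"
    return f"rate({value} {unit}{suffix})"


def invocations_per_day(value: int, unit: str) -> float:
    """How often the Lambda function would run per day at this frequency."""
    return 1440 / (MINUTES_PER_UNIT[unit] * value)


def update_schedule(rule_name: str, value: int, unit: str) -> None:
    """Apply the new schedule to an existing rule (requires AWS credentials)."""
    import boto3  # imported here so the helpers above work without boto3

    events = boto3.client("events")
    events.put_rule(Name=rule_name, ScheduleExpression=rate_expression(value, unit))


print(rate_expression(1, "day"), invocations_per_day(1, "day"))
print(rate_expression(1, "minute"), invocations_per_day(1, "minute"))
```

Comparing the two printed invocation counts shows why a 1-minute frequency costs noticeably more than the daily default.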

This solution uses the cat/allocation API, which provides the number of data nodes in the domain along with each data node’s number of shards and storage usage attributes. For further details on domain storage and shard skew, refer to Node shard and storage skew. The Lambda function computes each data node’s storage and shard skew relative to the average value and sorts the results. Any data node whose skew is more than 10% from the average is generally considered to be significantly skewed. This will start to impact CPU, network, and disk bandwidth usage because the nodes with the highest storage utilization tend to be the resource-strained nodes, whereas nodes with less than 10% utilization represent underutilized capacity.
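The skew calculation described above can be sketched as follows. This is not the Lambda function's actual code. It assumes rows shaped like the JSON output of _cat/allocation requested with format=json and bytes=b, so that shards and disk.indices arrive as numeric strings; the function names and the sample values are made up for illustration.

```python
from statistics import mean


def skew_from_average(rows, field):
    """Return (node, percent deviation from the mean) pairs, largest first."""
    values = {}
    for row in rows:
        raw = row.get(field)
        # Skip the UNASSIGNED pseudo-row and rows without a value.
        if row.get("node") == "UNASSIGNED" or raw is None:
            continue
        values[row["node"]] = float(raw)
    if not values:
        return []
    avg = mean(values.values())
    if avg == 0:
        return [(node, 0.0) for node in sorted(values)]
    skews = [(node, (v - avg) * 100.0 / avg) for node, v in values.items()]
    return sorted(skews, key=lambda pair: pair[1], reverse=True)


def significantly_skewed(skews, threshold=10.0):
    """Nodes whose deviation from the average exceeds the threshold (in %)."""
    return [node for node, pct in skews if abs(pct) > threshold]


# Made-up sample rows; real values come from the domain's _cat/allocation API.
sample = [
    {"node": "node-a", "shards": "12", "disk.indices": "130"},
    {"node": "node-b", "shards": "10", "disk.indices": "100"},
    {"node": "node-c", "shards": "8", "disk.indices": "70"},
    {"node": "UNASSIGNED", "shards": "2", "disk.indices": None},
]

shard_skew = skew_from_average(sample, "shards")
storage_skew = skew_from_average(sample, "disk.indices")
print(shard_skew, significantly_skewed(shard_skew))
print(storage_skew, significantly_skewed(storage_skew))
```

The sketch treats deviation in either direction as skew, so both overloaded and underused nodes are flagged against the 10% guideline.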

Refer to Demystifying Elasticsearch shard allocation for details related to shard size and shard count strategy. In general, we recommend keeping shard sizes between 10–30 GB for workloads where search latency is a key performance objective and 30–50 GB for write-heavy workloads. For shard count, we recommend maintaining index shard counts that are divisible by the data node count. For additional details, refer to Sizing Amazon OpenSearch Service domains and Shard strategy.
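As a rough illustration of these guidelines, the sketch below checks a planned index against the size ranges and the divisibility recommendation. The function check_shard_plan and its parameters are hypothetical, and it assumes that "shard count" means the total number of shard copies (primaries plus replicas); adjust the check if you apply the rule to primaries only.

```python
def check_shard_plan(index_size_gb, primary_shards, replicas, data_nodes,
                     search_heavy=True):
    """Compare a planned index layout with the sizing guidelines above."""
    # 10-30 GB per shard for search-latency workloads, 30-50 GB for write-heavy.
    low, high = (10, 30) if search_heavy else (30, 50)
    shard_size_gb = index_size_gb / primary_shards
    total = primary_shards * (1 + replicas)
    return {
        "shard_size_gb": shard_size_gb,
        "size_in_range": low <= shard_size_gb <= high,
        "total_shards": total,
        "divisible_by_nodes": total % data_nodes == 0,
    }


# A 90 GB index with three primaries and two replicas on three data nodes.
print(check_shard_plan(90, 3, 2, 3))
```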

View skew metrics in CloudWatch

After you run this solution in your account, it will create two CloudWatch metrics for monitoring. To access these CloudWatch metrics, use the following steps:

On the CloudWatch console, under Metrics in the navigation pane, choose All metrics.
Choose Browse and select Custom namespaces. You should see two custom metrics ending with _storageworkspace and _shardworkspace, respectively.
Choose either of the custom metrics and then select NodeID.
In the list of node IDs, select all the nodes displayed, and the graph will be plotted automatically.

You can hover the mouse over the plotted lines to see the node skew information.
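To make the per-node view concrete, here is a minimal sketch of how skew values could be shaped into CloudWatch metric data with a NodeID dimension. The metric name StorageSkew, the helper names, and the sample values are assumptions for illustration, not the names the deployed Lambda function uses. The put_metric_data call itself is the standard boto3 CloudWatch API.

```python
def build_metric_data(skew_by_node, metric_name):
    """Turn {node_id: skew_percent} into a PutMetricData payload."""
    return [
        {
            "MetricName": metric_name,  # hypothetical metric name
            "Dimensions": [{"Name": "NodeID", "Value": node_id}],
            "Value": float(skew),
            "Unit": "Percent",
        }
        for node_id, skew in skew_by_node.items()
    ]


def publish(namespace, metric_data, batch_size=20):
    """Send the payload to CloudWatch in small batches (needs AWS credentials)."""
    import boto3  # imported here so build_metric_data works without boto3

    cloudwatch = boto3.client("cloudwatch")
    for start in range(0, len(metric_data), batch_size):
        cloudwatch.put_metric_data(
            Namespace=namespace,
            MetricData=metric_data[start:start + batch_size],
        )


# Made-up skew values for two nodes.
payload = build_metric_data({"node-a": 30.0, "node-b": -30.0}, "StorageSkew")
print(payload)
```

Publishing one data point per node under a shared dimension name is what lets the console plot each node as its own line.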

The following screenshots show examples of how the CloudWatch metrics will appear on the console.

The storage skew metrics will be similar to the following screenshot. Storage skew metrics show the domain storage skew. If you hover over the graph, it shows the node list with available nodes in the domain. This list is sorted by storage size (largest to smallest). The Lambda function will periodically post the latest storage skew results.

The shard skew metrics will be similar to the following screenshot. Shard skew metrics show the domain shard skew. If you hover over the graph, it shows the node list with available nodes in the domain. This list is sorted by the number of shards (largest to smallest). The Lambda function will periodically post the latest shard skew results.

Storage skew occurs when one or more nodes within the domain have significantly more storage than other nodes. The CloudWatch metric will show higher deviation of storage usage for these nodes vs. other nodes. Similarly, shard skew occurs when one or more nodes have significantly more shards than other nodes. The CloudWatch metric will show higher deviation for these nodes vs. other nodes in the domain. When domain storage or shard skew is detected, you can raise a support case to work with the AWS team on remediation actions. See How do I rebalance the uneven shard distribution in my Amazon OpenSearch Service cluster for information on how to take remediation actions and configure your domain shard strategy for optimal performance.

Costs

The cost associated with using this solution should be minimal, around a few cents per month, because it generates CloudWatch metrics. The solution also runs Lambda code, and in this case the Lambda function makes API calls. For pricing details, refer to Amazon CloudWatch Pricing and AWS Lambda Pricing.

Clean up

If you decide that you no longer want to keep the Lambda function and associated resources, you can navigate to the AWS CloudFormation console, choose the stack, and choose Delete.

If you want to add the CloudWatch skew monitoring metrics back at any point, you can create the stack again from the CloudFormation template.

Conclusion

You can use this solution to get a better understanding of your OpenSearch Service domain’s storage and shard skew to improve its performance and possibly lower the cost of operating your domain. See Use Elasticsearch’s _rollover API For efficient storage distribution for more details related to shard allocation and efficient storage distribution strategy.

About the authors

Nikhil Agarwal is a Sr. Technical Manager with Amazon Web Services. He is passionate about helping customers achieve operational excellence in their cloud journey and working actively on technical solutions. He is also an AI/ML enthusiast and deep dives into customers’ ML-specific use cases. Outside of work, he enjoys traveling with family and exploring different gadgets.

Karthik Chemudupati is a Principal Technical Account Manager (TAM) with AWS, focused on helping customers achieve cost optimization and operational excellence. He has more than 19 years of IT experience in software engineering, cloud operations, and automation. Karthik joined AWS in 2016 as a TAM and has worked with more than a dozen Enterprise Customers across US-West. Outside of work, he enjoys spending time with his family.

Gene Alpert is a Senior Analytics Specialist with AWS Enterprise Support. He has been focused on our Amazon OpenSearch Service customers and ecosystem for the past three years. Gene joined AWS in 2017. Outside of work he enjoys mountain biking, traveling, and playing Population:One in VR.