Suggestions for Configuring Monitors and Alarms

Last updated: 2024-12-03 17:58:22
ES provides monitoring metrics for running ES clusters and lets you configure alarms on key metrics, so that you can identify cluster problems and address them promptly. For more information, see Viewing Monitoring Metrics and Configuring Alarms. This document describes the metrics that require special attention when you use an ES cluster, along with the recommended alarm configuration for each:
Metric: Cluster health status
Suggested alarm configuration: The statistical period is 1 minute. If this value is >= 1 for 5 consecutive periods, an alarm is triggered once every 30 minutes.
Description: Value range:
0: Green, which indicates that all primary and replica shards are available and the cluster is in its healthiest state.
1: Yellow, which indicates that all primary shards are available, but some replica shards are unavailable. In this case, search results are still complete; however, the high availability of the cluster is reduced, and the risk of data loss is elevated.
2: Red, which indicates that at least one primary shard and all of its replicas are unavailable. When the cluster health status changes to red, some data has become unavailable, searches can return only partial data, and requests routed to a lost shard return an exception.
The cluster health status is the most direct indicator of the current health of the cluster. If it changes to yellow or red, troubleshoot and repair the problem promptly to prevent data loss or service unavailability.

Metric: Avg disk utilization
Suggested alarm configuration: The statistical period is 1 minute. If this value is > 80% for 5 consecutive periods, an alarm is triggered once every 30 minutes.
Description: The avg disk utilization is the average of the disk utilization values of all nodes in the cluster. If the disk utilization of a node is too high, the node does not have sufficient disk capacity for the shards allocated to it, which causes basic operations such as creating indices and adding documents to fail. We recommend that you promptly clear data or scale out your cluster when this value is above 75%.

Metric: Avg JVM memory utilization
Suggested alarm configuration: The statistical period is 1 minute. If this value is > 85% for 5 consecutive periods, an alarm is triggered once every 30 minutes.
Description: The avg JVM memory utilization is the average of the JVM memory utilization values of all nodes in the cluster. Excessively high JVM memory utilization can lead to rejected read and write operations, frequent garbage collection (GC), or even out-of-memory (OOM) errors. When this value exceeds the threshold, we recommend that you upgrade the node specification through vertical scaling.

Metric: Avg CPU utilization
Suggested alarm configuration: The statistical period is 1 minute. If this value is > 90% for 5 consecutive periods, an alarm is triggered once every 30 minutes.
Description: The avg CPU utilization is the average of the CPU utilization values of all nodes in the cluster. Excessively high average CPU utilization can reduce the processing capability of the cluster nodes or even cause downtime. If this value is too high, upgrade the node specification or reduce the number of requests, depending on the current node configuration of your cluster and your business needs.

Metric: Bulk rejection rate
Suggested alarm configuration: The statistical period is 1 minute. If this value is > 0% in one period, an alarm is triggered once every 30 minutes.
Description: The bulk rejection rate is the percentage of bulk operations that your cluster rejected out of all bulk operations it performed during a single period. A value greater than 0% means that one or more bulk rejections have occurred: either your cluster has reached the upper limit of its capability to process bulk operations, or an exception has occurred. In this case, troubleshoot and repair the problem promptly; otherwise, bulk operations will be affected, and data loss may occur.

Metric: Query rejection rate
Suggested alarm configuration: The statistical period is 1 minute. If this value is > 0% in one period, an alarm is triggered once every 30 minutes.
Description: The query rejection rate is the percentage of query operations that your cluster rejected out of all query operations it performed during a single period. A value greater than 0% means that one or more query rejections have occurred: either your cluster has reached the upper limit of its capability to process query operations, or an exception has occurred. In this case, troubleshoot and repair the problem promptly; otherwise, query operations will be affected.
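
The suggested configurations above all follow the same pattern: a value is sampled once per 1-minute statistical period, and an alarm fires when a condition holds for a given number of consecutive periods. The following Python sketch illustrates that pattern only; it is not how ES or its alarm service is implemented. The metric identifiers (such as `avg_disk_utilization`), the `AlarmRule` class, and the helper functions are hypothetical names chosen for illustration. The thresholds and period counts are taken from the suggestions in this document.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class AlarmRule:
    """One suggested alarm rule. All names here are illustrative, not an ES API."""
    metric: str                        # hypothetical metric identifier
    breached: Callable[[float], bool]  # condition checked once per 1-minute period
    consecutive_periods: int           # periods the condition must hold in a row
    repeat_minutes: int = 30           # suggested repeat interval (recorded, not enforced here)


# Thresholds and period counts as suggested in this document.
RULES = [
    AlarmRule("cluster_health_status", lambda v: v >= 1, 5),   # 1 = yellow, 2 = red
    AlarmRule("avg_disk_utilization", lambda v: v > 80, 5),    # percent
    AlarmRule("avg_jvm_memory_utilization", lambda v: v > 85, 5),
    AlarmRule("avg_cpu_utilization", lambda v: v > 90, 5),
    AlarmRule("bulk_rejection_rate", lambda v: v > 0, 1),
    AlarmRule("query_rejection_rate", lambda v: v > 0, 1),
]


def rejection_rate(rejected: int, total: int) -> float:
    """Percentage of rejected operations among all operations in one period."""
    return 100.0 * rejected / total if total else 0.0


def should_alarm(rule: AlarmRule, samples: Sequence[float]) -> bool:
    """Return True if the most recent `consecutive_periods` samples all breach the rule.

    `samples` holds one value per 1-minute period, oldest first.
    """
    n = rule.consecutive_periods
    if len(samples) < n:
        return False
    return all(rule.breached(v) for v in samples[-n:])


if __name__ == "__main__":
    rules = {r.metric: r for r in RULES}
    disk_samples = [70, 81, 82, 83, 84, 85]  # last five periods are all above 80%
    print(should_alarm(rules["avg_disk_utilization"], disk_samples))
    print(should_alarm(rules["bulk_rejection_rate"], [0, 0, rejection_rate(3, 200)]))
```

Because the rejection-rate rules use a single period, even one rejected operation in a minute is enough to satisfy the condition, whereas the utilization rules tolerate short spikes that last fewer than five periods.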
