How do I fix an abnormal health status (RED or YELLOW) in an ES cluster?

When facing an abnormal health status (RED or YELLOW) in an Elasticsearch (ES) cluster, follow these steps to troubleshoot and resolve the issue:

1. Understand the Health Status

RED: At least one primary shard is unassigned. Data in those shards is unavailable, so searches may return partial results and writes to those shards fail.
YELLOW: All primary shards are assigned, but at least one replica shard is unassigned. The cluster is fully functional, but redundancy is reduced, so losing another node could cause data loss.
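
The rule behind the two colours can be written as a small function. This is a simplified sketch, not Elasticsearch's own implementation; the function name and its parameters are hypothetical and exist only for illustration.

```python
def classify_health(unassigned_primaries: int, unassigned_replicas: int) -> str:
    """Map unassigned shard counts to a health colour (simplified model)."""
    if unassigned_primaries > 0:
        return "red"      # at least one primary has no active copy
    if unassigned_replicas > 0:
        return "yellow"   # all primaries are active, some replicas are missing
    return "green"        # every primary and replica is allocated
```

For example, a single-node cluster with one replica per index would be modelled as `classify_health(0, n)` with `n > 0`, which is YELLOW.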

2. Check Cluster Health

Use the following command to get detailed cluster health information:

GET /_cluster/health?pretty

This will show the overall health status and the number of unassigned shards.
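
If you prefer to check this from a script, the same request can be made with Python's standard library. The sketch below assumes an unsecured cluster reachable at http://localhost:9200 (an assumption; add authentication and TLS handling for a real deployment). The helper names are hypothetical; the response fields they read are part of the cluster health response.

```python
import json
import urllib.request


def fetch_cluster_health(base_url: str = "http://localhost:9200") -> dict:
    """Call GET /_cluster/health and return the parsed JSON body."""
    with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=10) as resp:
        return json.load(resp)


def summarize_health(health: dict) -> str:
    """Reduce a cluster health response to a single log-friendly line."""
    return (
        f"status={health['status']} "
        f"nodes={health['number_of_nodes']} "
        f"unassigned={health['unassigned_shards']} "
        f"active={health['active_shards_percent_as_number']:.1f}%"
    )
```

Calling `summarize_health(fetch_cluster_health())` prints the fields that matter most during an incident on one line.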

3. Identify Unassigned Shards

To identify which shards are unassigned, run:

GET /_cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state

This command lists all shards, their state, and the reason for unassignment if applicable.
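
On a cluster with many shards, the listing is easier to read when grouped. The sketch below assumes you requested the same columns with `format=json` added to the query string, so each row arrives as a dictionary keyed by column name. The function name is hypothetical.

```python
from collections import Counter


def unassigned_by_reason(shards: list) -> Counter:
    """Count unassigned shards per (shard type, reason) pair."""
    counts = Counter()
    for row in shards:
        if row.get("state") != "UNASSIGNED":
            continue  # skip STARTED, INITIALIZING and RELOCATING shards
        kind = "primary" if row.get("prirep") == "p" else "replica"
        reason = row.get("unassigned.reason") or "unknown"
        counts[(kind, reason)] += 1
    return counts
```

Any key that starts with `"primary"` explains a RED status and should be handled first; keys that start with `"replica"` explain a YELLOW status.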

4. Common Causes and Solutions

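Insufficient Nodes: Elasticsearch never places two copies of the same shard on one node. If the cluster has fewer data nodes than the number of copies of a shard (the primary plus its replicas), some replicas remain unassigned and the cluster stays YELLOW.
- Solution: Add more nodes to the cluster or reduce the number of replicas temporarily.
Disk Space Issues: Once a node's disk usage crosses the low disk watermark (85% by default), Elasticsearch stops allocating new shards to it.
- Solution: Free up disk space or add more storage to the affected nodes.
Misconfiguration: Incorrect shard allocation settings, such as leftover allocation filtering rules or a restrictive cluster.routing.allocation.enable value, can block allocation.
- Solution: Review and adjust shard allocation settings in elasticsearch.yml, in the dynamic cluster settings (GET /_cluster/settings), and in the settings of the affected index.
Node Failures: If a node fails, the shard copies it held become unassigned. Replicas on other nodes are promoted to primary, but a primary with no surviving copy stays unassigned.
- Solution: Recover the failed node. If Elasticsearch gave up on a shard after repeated failed allocation attempts, fix the underlying problem and then retry the allocation using: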
```
POST /_cluster/reroute?retry_failed=true
```
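
Two of the causes above can be checked with simple arithmetic before you change anything. The sketch below assumes the default disk watermarks (85%, 90%, 95%) have not been overridden on your cluster; all function names are hypothetical. The dictionary returned by `replica_settings_body` is the body you would send with `PUT /your_index/_settings`.

```python
# Default disk watermark percentages (assumed not overridden in cluster settings).
LOW_WATERMARK = 85.0
HIGH_WATERMARK = 90.0
FLOOD_STAGE = 95.0


def max_allocatable_replicas(data_nodes: int) -> int:
    """Copies of one shard never share a node, so replicas <= data nodes - 1."""
    return max(data_nodes - 1, 0)


def replica_settings_body(replicas: int) -> dict:
    """Body for PUT /<index>/_settings that changes the replica count."""
    return {"index": {"number_of_replicas": replicas}}


def disk_watermark_level(used_percent: float) -> str:
    """Name the highest default watermark that the given disk usage has crossed."""
    if used_percent >= FLOOD_STAGE:
        return "flood_stage"  # indices with shards on this node become read-only
    if used_percent >= HIGH_WATERMARK:
        return "high"         # Elasticsearch tries to move shards off this node
    if used_percent >= LOW_WATERMARK:
        return "low"          # no new shards are allocated to this node
    return "ok"
```

On a single-node cluster, `max_allocatable_replicas(1)` is 0, so `replica_settings_body(0)` is the setting that lets the index reach GREEN without adding hardware.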

5. Reassign Unassigned Shards

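If a primary shard stays unassigned because no up-to-date copy is available and the node that held it cannot be recovered, you can, as a last resort, force Elasticsearch to promote a stale copy on another node to primary: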

POST /_cluster/reroute
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "your_index",
        "shard": 0,
        "node": "target_node",
        "accept_data_loss": true
      }
    }
  ]
}

Note: allocate_stale_primary only runs with accept_data_loss: true. Set it only when you accept losing any writes that the stale copy never received.
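
Because this command can discard data, a script that issues it should make the operator opt in explicitly. The following is a minimal sketch of such a guard; the function name and the error message are hypothetical, and the returned dictionary mirrors the request body shown above.

```python
def stale_primary_command(index: str, shard: int, node: str,
                          accept_data_loss: bool = False) -> dict:
    """Build the body for POST /_cluster/reroute with allocate_stale_primary."""
    if not accept_data_loss:
        # Refuse to build the command unless the caller opts in explicitly.
        raise ValueError(
            "allocate_stale_primary may lose data; pass accept_data_loss=True to confirm"
        )
    return {
        "commands": [
            {
                "allocate_stale_primary": {
                    "index": index,
                    "shard": shard,
                    "node": node,
                    "accept_data_loss": True,
                }
            }
        ]
    }
```

Calling `stale_primary_command("your_index", 0, "target_node")` without the flag raises an error instead of producing a destructive request.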

6. Monitor and Prevent Future Issues

Regularly monitor cluster health using tools like Kibana or custom scripts.
Set up alerts for critical health changes.
Ensure proper resource allocation and scaling strategies.
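
As one possible shape for a custom monitoring script, the sketch below polls a status source and notifies only when the status gets worse, so a cluster that stays YELLOW does not trigger repeated alerts. All names are hypothetical. `fetch_status` is assumed to be a callable you provide that returns "green", "yellow", or "red" (for example, by reading the status field of the cluster health response), and `notify` is whatever alerting hook you use.

```python
import time

SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def should_alert(previous: str, current: str) -> bool:
    """Alert only when the status moves to a more severe colour."""
    return SEVERITY[current] > SEVERITY[previous]


def watch(fetch_status, notify, interval_seconds: float = 60.0, max_polls=None) -> None:
    """Poll fetch_status() and call notify(previous, current) on each degradation."""
    previous = "green"
    polls = 0
    while max_polls is None or polls < max_polls:
        current = fetch_status()
        if should_alert(previous, current):
            notify(previous, current)
        previous = current
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(interval_seconds)  # wait before the next poll
```

Passing `max_polls` makes the loop finite, which is useful for testing; leaving it as `None` runs the watcher until the process is stopped.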

Example Scenario

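If your cluster is YELLOW because replica shards are unassigned, check the logs and the allocation explain API (GET /_cluster/allocation/explain) for the reason. Suppose the issue is caused by insufficient nodes, for example a single-node cluster whose indices are configured with one replica each. Add a new node to the cluster and verify shard reassignment: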

GET /_cluster/health?pretty

If the status changes to GREEN, the issue is resolved.
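
Rather than re-running the request by hand, you can ask the health API to block until the target status is reached by adding the wait_for_status and timeout parameters. The sketch below only builds the URL and interprets the response; the function names and the localhost address are assumptions. Some Elasticsearch versions report a timeout as an HTTP 408 error rather than in the response body, which a real client would also need to handle.

```python
from urllib.parse import urlencode


def wait_for_status_url(base_url: str, status: str = "green", timeout: str = "30s") -> str:
    """Build a cluster health URL that blocks until `status` is reached or `timeout` expires."""
    query = urlencode({"wait_for_status": status, "timeout": timeout})
    return f"{base_url}/_cluster/health?{query}"


def recovered(health: dict) -> bool:
    """True if the response reports GREEN and the wait did not time out."""
    return health.get("status") == "green" and not health.get("timed_out", False)
```

With the defaults, `wait_for_status_url("http://localhost:9200")` produces a request that waits up to 30 seconds for the cluster to turn GREEN.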

For enhanced monitoring and management of your ES cluster, consider using Tencent Cloud's Elasticsearch Service, which provides automated health monitoring, auto-scaling, and easy cluster management.