
Exception Diagnosis
Last updated: 2025-07-09 14:52:25
The exception diagnosis feature provides real-time performance monitoring, health inspections, and failure diagnosis, so that you can see the real-time operation status of database instances at a glance and locate newly emerging performance exceptions as they occur.

Overview





Viewing Diagnosis Information

1. Log in to the DBbrain Console.
2. In the left sidebar, choose Performance Optimization.
3. Select the corresponding database type and instance ID at the top, and select the Exception Diagnosis tab.
4. On the right side of the page, select to view real-time or historical diagnosis information.

5. View the health score trend chart, diagnosed exception events, and instance architecture diagram within the selected timeline.
View health score trend chart
Click any time point on the trend chart to display the health score.



View diagnosis event bar chart
Hover over the diagnosis event bar chart to display information such as risk level, overview, and start/end time. Click the bar chart to enter the Event Details page to view information including event details, on-site descriptions, intelligent analysis, and optimization suggestions. For more information on viewing event details, see Exception Alarms.
Hover over the diagnosis event timeline and scroll up or down to zoom the timeline range in or out.



View health score, replica set instance architecture diagram, or real-time SQL trend chart and real-time slow SQL trend chart of sharded instances
Note:
MongoDB replica set instance: Displays the health score and real-time data of the instance architecture diagram.
MongoDB sharded instance: Displays the health score, real-time SQL trend chart, and real-time slow SQL trend chart.
Health score
Displays the real-time health score. Click Details under the health score to open the Health Report page, where you can view the health score, score details, and health report.



Replica set instance architecture diagram
Displays the proxy and node architecture of the instance and the location of nodes with alarms. Hover over a node or proxy to display the average value of its metrics.
Real-time SQL trend chart and real-time slow SQL trend chart of sharded instances
Real-time SQL trend chart: Displays the number of requests for aggregate, command, count, delete, getmore, insert, read, and update.
Real-time slow SQL trend chart: Displays the number of requests that take longer than 100 ms and the maximum CPU utilization of the cluster.


Viewing Diagnosis Prompts

Diagnosis events are displayed in the following risk levels: Healthy, Notice, Warning, Severe, and Critical. DBbrain performs a health inspection on the instance once every ten minutes.
1. Log in to the DBbrain Console.
2. In the left sidebar, choose Performance Optimization.
3. Select the corresponding instance ID at the top, and select the Exception Diagnosis tab.
4. On the right side of the page, select to view real-time or historical diagnosis information.
Real-Time: Select Real-Time to display the risk distribution and diagnosis details for the last three hours.
Historical: Select Historical to display the risk distribution and diagnosis details for the selected time period.
5. View the diagnosis prompts for the selected time range.



View diagnosis event details
In the Diagnosis Details, click the row of a specific event alarm or hover over the event alarm and click View to enter the Event Details page and view the event details.
The Event Details page mainly includes event details, on-site descriptions, intelligent analysis, and optimization suggestions. The content displayed varies with the diagnosis type; refer to the console for what is actually shown.
Event Details: Includes the diagnosis item, time range, risk level, and overview.
AI Insight: Displays the insight results for each node.
Description: Includes problem snapshots and performance trends of the exception or health inspection event.
Optimization Suggestion: Provides optimization suggestions for the exception diagnosis event.

Ignore/Unignore alarms
In the Diagnosis Details, hover over the event alarm and click Ignore to select Ignore this item or Ignore this type, and click OK. You can also ignore alarms on the Event Details page.
Note:
Only diagnosis item alarms that are not generated by health inspections can be ignored or unignored.
Ignore This: Only this alarm is ignored.
Ignore This Type: All exception alarms generated from the same root cause are ignored.
Ignored diagnosis events are grayed out. To restore an ignored event, click Unignore.

Detailed Description of Diagnosis Items

Diagnosis items related to intelligent diagnosis are categorized into four types: performance, availability, reliability, and maintainability. Each diagnosis item belongs to one category only.
Each diagnosis item is listed below with its type, description, and risk level classification.

Connectivity
Type: Availability
Description: The database connection is abnormal, and the database instance cannot be reached.
Risk level: Critical

High Read Queue
Type: Performance
Description: When read operations are performed, many requests are waiting to access the database.
Risk levels:
Notice: Read waiting queue is greater than or equal to 64, and the duration is greater than or equal to 1 minute.
Warning: Read waiting queue is greater than or equal to 64, and the duration is greater than or equal to 10 minutes.
Severe: Read waiting queue is greater than or equal to 64, and the duration is greater than or equal to 30 minutes.
Critical: Read waiting queue is greater than or equal to 64, and the duration is greater than or equal to 60 minutes.

High Write Queue
Type: Performance
Description: When write operations are performed, many requests are waiting to access the database.
Risk levels:
Notice: Write waiting queue is greater than or equal to 64, and the duration is greater than or equal to 1 minute.
Warning: Write waiting queue is greater than or equal to 64, and the duration is greater than or equal to 10 minutes.
Severe: Write waiting queue is greater than or equal to 64, and the duration is greater than or equal to 30 minutes.
Critical: Write waiting queue is greater than or equal to 64, and the duration is greater than or equal to 60 minutes.

High Connections
Type: Availability
Description: The database has too many connections.
Risk levels:
Notice: Connection utilization is greater than or equal to 60% and less than 70%.
Warning: Connection utilization is greater than or equal to 70% and less than 80%.
Severe: Connection utilization is greater than or equal to 80% and less than 90%.
Critical: Connection utilization is greater than or equal to 90%.

Slave Delay
Type: Maintainability
Description: The data synchronization latency between the primary and secondary nodes is too high.
Risk levels:
Notice: Primary/secondary latency is greater than or equal to 1 minute and less than 10 minutes.
Warning: Primary/secondary latency is greater than or equal to 10 minutes and less than 30 minutes.
Severe: Primary/secondary latency is greater than or equal to 30 minutes and less than 60 minutes.
Critical: Primary/secondary latency is greater than or equal to 60 minutes.

Low OpLog save Time
Type: Maintainability
Description: The oplog retention period is too short.
Risk levels:
Notice: Oplog retention period is greater than or equal to 120 minutes and less than 480 minutes.
Warning: Oplog retention period is greater than or equal to 60 minutes and less than 120 minutes.
Severe: Oplog retention period is greater than or equal to 30 minutes and less than 60 minutes.
Critical: Oplog retention period is less than 30 minutes.

High Cache Used
Type: Performance
Description: The memory cache utilization in the database is high.
Risk levels:
Notice: WT cache utilization exceeds 95%, and the duration is 1 minute.
Warning: WT cache utilization exceeds 95%, and the duration is 5 minutes.
Severe: WT cache utilization exceeds 95%, and the duration is 10 minutes.
Critical: WT cache utilization exceeds 95%, and the duration is 30 minutes.

High Cache Dirty
Type: Performance
Description: A large amount of data in memory has not been written to the disk.
Risk levels:
Notice: Cache Dirty exceeds 20%, and the duration is 1 minute.
Warning: Cache Dirty exceeds 20%, and the duration is 5 minutes.
Severe: Cache Dirty exceeds 20%, and the duration is 10 minutes.
Critical: Cache Dirty exceeds 20%, and the duration is 30 minutes.

High Inflow Traffic
Type: Performance
Description: The database is receiving requests or data traffic that exceeds its processing capacity.
Risk levels:
Notice: Node inbound traffic is greater than or equal to 800 MB and less than 1,000 MB.
Warning: Node inbound traffic is greater than or equal to 1,000 MB and less than 1,200 MB.
Severe: Node inbound traffic is greater than or equal to 1,200 MB and less than 1,500 MB.
Critical: Node inbound traffic is greater than or equal to 1,500 MB.

High Outflow Traffic
Type: Performance
Description: A node (such as the primary or secondary node) is sending excessive outbound data traffic.
Risk levels:
Notice: Node outbound traffic is greater than or equal to 800 MB and less than 1,000 MB.
Warning: Node outbound traffic is greater than or equal to 1,000 MB and less than 1,200 MB.
Severe: Node outbound traffic is greater than or equal to 1,200 MB and less than 1,500 MB.
Critical: Node outbound traffic is greater than or equal to 1,500 MB.

High Disk Utilization
Type: Availability
Description: The disk utilization of the database instance is close to or has reached its maximum capacity.
Risk levels:
Notice: Disk utilization is greater than or equal to 60% and less than 80%.
Warning: Disk utilization is greater than or equal to 80% and less than 90%.
Severe: Disk utilization is greater than or equal to 90% and less than 95%.
Critical: Disk utilization is greater than or equal to 95%.

High Memory Utilization
Type: Availability
Description: The memory utilization of the database instance is close to or has reached its maximum capacity.
Risk levels:
Notice: Memory utilization is greater than or equal to 70% and less than 80%.
Warning: Memory utilization is greater than or equal to 80% and less than 90%.
Severe: Memory utilization is greater than or equal to 90% and less than 95%.
Critical: Memory utilization is greater than or equal to 95%.

High Cpu Utilization
Type: Availability
Description: The CPU utilization of the database instance is close to or has reached its maximum capacity.
Risk levels:
Notice: CPU utilization is greater than or equal to 60% and less than 80%.
Warning: CPU utilization is greater than or equal to 80% and less than 90%.
Severe: CPU utilization is greater than or equal to 90% and less than 95%.
Critical: CPU utilization is greater than or equal to 95%.

Node OOM
Type: Availability
Description: The memory usage of a MongoDB instance or node has exceeded its configured limit.
Risk level: Critical

Slow Query
Type: Performance
Description: Queries with long execution times may affect database performance and response time.
Risk levels:
Notice: Slow SQL occurs, and CPU utilization is less than or equal to 40%.
Warning: Slow SQL occurs, and CPU utilization is greater than 40% and less than or equal to 60%.
Severe: Slow SQL occurs, and CPU utilization is greater than 60% and less than or equal to 80%.
Critical: Slow SQL occurs, and CPU utilization is greater than 80%.

Node High Running Session
Type: Availability
Description: The number of sessions connected to the database exceeds the system's capacity.
Risk levels:
Notice: Active sessions are greater than or equal to 2,000 and less than 100,000.
Warning: Active sessions are greater than or equal to 100,000 and less than 400,000.
Severe: Active sessions are greater than or equal to 400,000 and less than 900,000.
Critical: Active sessions are greater than or equal to 900,000.

High node page heap
Type: Availability
Description: The amount of memory used exceeds expectations.
Risk level: Notice
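
The risk level classification above follows three patterns: a metric value that falls into a range (for example, connection or disk utilization), a fixed threshold held for a minimum duration (the read and write queues), and a value where lower is worse (the oplog retention period). The following Python sketch shows how such thresholds could be evaluated in your own monitoring scripts. It is only an illustration built from the numbers listed above, not DBbrain's implementation or API; all names, such as classify_metric and RANGE_THRESHOLDS, are hypothetical.

```python
from typing import List, Optional, Tuple

# (risk level, inclusive lower bound), ordered from least to most severe.
Levels = List[Tuple[str, float]]

# Range-based items, copied from the thresholds listed above.
# The dictionary keys are hypothetical names chosen for this sketch.
RANGE_THRESHOLDS = {
    "connection_utilization_pct": [("Notice", 60), ("Warning", 70), ("Severe", 80), ("Critical", 90)],
    "disk_utilization_pct": [("Notice", 60), ("Warning", 80), ("Severe", 90), ("Critical", 95)],
    "memory_utilization_pct": [("Notice", 70), ("Warning", 80), ("Severe", 90), ("Critical", 95)],
    "cpu_utilization_pct": [("Notice", 60), ("Warning", 80), ("Severe", 90), ("Critical", 95)],
    "replication_lag_minutes": [("Notice", 1), ("Warning", 10), ("Severe", 30), ("Critical", 60)],
}

# Read/write queue items: a fixed queue length of 64, graded by how long it lasts.
QUEUE_LENGTH_THRESHOLD = 64
QUEUE_DURATION_LEVELS: Levels = [("Notice", 1), ("Warning", 10), ("Severe", 30), ("Critical", 60)]

# Oplog retention: (risk level, exclusive upper bound in minutes), most severe first.
OPLOG_RETENTION_LEVELS: Levels = [("Critical", 30), ("Severe", 60), ("Warning", 120), ("Notice", 480)]


def highest_level(levels: Levels, value: float) -> Optional[str]:
    """Return the most severe level whose lower bound the value reaches, or None."""
    matched = None
    for name, lower_bound in levels:
        if value >= lower_bound:
            matched = name
    return matched


def classify_metric(metric: str, value: float) -> Optional[str]:
    """Classify a range-based item such as connection or disk utilization."""
    return highest_level(RANGE_THRESHOLDS[metric], value)


def classify_queue(queue_length: int, duration_minutes: float) -> Optional[str]:
    """Classify a read or write waiting queue by how long it has stayed at or above 64."""
    if queue_length < QUEUE_LENGTH_THRESHOLD:
        return None
    return highest_level(QUEUE_DURATION_LEVELS, duration_minutes)


def classify_oplog_retention(retention_minutes: float) -> Optional[str]:
    """Classify the oplog retention period; a shorter period is more severe."""
    for name, upper_bound in OPLOG_RETENTION_LEVELS:
        if retention_minutes < upper_bound:
            return name
    return None
```

For example, classify_metric("connection_utilization_pct", 85) returns "Severe", classify_queue(64, 12) returns "Warning", and classify_oplog_retention(45) returns "Severe". Each function returns None when no risk level applies.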