Tencent Cloud

Stream Compute Service


Job Failure

Last updated: 2023-11-07 16:48:29

Overview

A job failure event in Stream Compute Service indicates that a Flink job's status has changed from running to failed or restarting, which may interrupt data processing, delay downstream output, and cause other issues.

Conditions

Trigger

1. The status of a Flink job changes from RUNNING to FAILED or RESTARTING. The Flink JobManager then recovers the job in about 10 seconds, and the running instance ID remains unchanged after recovery.
2. A Flink job restarts too many times or too frequently, exceeding the limit given in the Restart Policies (the threshold is generally controlled by restart-strategy.fixed-delay.attempts and defaults to 5; we recommend increasing it in a production environment). Both the JobManager and the TaskManagers then exit, and the system tries to recover the job from the last successful checkpoint within about 2 minutes, with the running instance ID incremented by 1 after recovery.
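The restart limit described above maps to Flink's restart strategy settings. As a sketch, a fixed-delay strategy could be configured in flink-conf.yaml as follows (the attempt count and delay values here are illustrative, not recommendations from this service):

```yaml
# Fail the job permanently only after 10 restart attempts,
# waiting 10 s between consecutive attempts.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 10
restart-strategy.fixed-delay.delay: 10 s
```

A higher attempt count gives transient faults (e.g., a brief sink outage) more chances to clear before the whole job is torn down and recovered from the last checkpoint.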

Clearing

After Flink or the Stream Compute Service system recovers the job to RUNNING, a job failure recovery event is generated, marking the end of this event.

Alarms

You can configure an alarm policy for this event to receive trigger and clearing notifications in real time.

Suggestions

You can search for exception logs under the instance ID of the job that generated the event, as instructed in Diagnosis with Logs. Generally, the error messages before and after the keywords from RUNNING to FAILED contain the direct cause of the job failure. We recommend analyzing the issue based on these error messages together with the JobManager and TaskManager logs.
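As a sketch of that log search, assuming the JobManager log has been exported to a local file (the file name and log lines below are stand-ins, not real service output), a grep with context lines surfaces the messages around the failure transition:

```shell
# Illustrative only: create a stand-in log file. In practice, export the
# real JobManager log as described in Diagnosis with Logs.
cat > jobmanager.log <<'EOF'
2023-11-07 16:40:00,101 INFO  Source: sample_source (1/2) is running.
2023-11-07 16:40:01,337 INFO  Job sample_job switched from state RUNNING to FAILED.
2023-11-07 16:40:01,340 ERROR java.lang.OutOfMemoryError: Java heap space
EOF

# Show each match with 3 lines of context before and after; the adjacent
# messages usually contain the direct cause of the failure.
grep -n -B 3 -A 3 "from state RUNNING to FAILED" jobmanager.log
```

The same keyword search can be run in the console's log query page instead of locally.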
If the cause still cannot be found with the above diagnosis, check whether resources are overused as instructed in Viewing Monitoring Information. Focus on critical metrics such as TaskManager CPU usage, heap memory usage, full GC count, and full GC time to check for exceptions.
