Tencent Cloud

Stream Compute Service


Job Failure

Last updated: 2023-11-07 16:48:29

Overview

A job failure event in Stream Compute Service indicates that a Flink job's status has changed from running to failed or restarting, which may interrupt data processing, delay downstream output, and cause other issues.

Conditions

Trigger

1. The status of a Flink job changes from RUNNING to FAILED or RESTARTING. The Flink JobManager then recovers the job in about 10 seconds, and the running instance ID remains unchanged after recovery.
2. A Flink job restarts too many times or too frequently, exceeding the limit given in the Restart Policies (the threshold is generally controlled by restart-strategy.fixed-delay.attempts and defaults to 5; we recommend increasing it in a production environment). Both the JobManager and the TaskManagers then exit, and the system tries to recover the job from the last successful checkpoint within about 2 minutes, with the running instance ID incremented by 1 after recovery.
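The restart limit described above maps to Flink's restart strategy settings. As a sketch, a fixed-delay strategy could be configured in flink-conf.yaml as follows (the attempt count and delay values here are illustrative, not recommendations from this service):

```yaml
# Fail the job permanently only after 10 restart attempts,
# waiting 10 s between consecutive attempts.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 10
restart-strategy.fixed-delay.delay: 10 s
```

A higher attempt count gives transient faults (e.g., a brief sink outage) more chances to clear before the whole job is torn down and recovered from the last checkpoint.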

Clearing

After Flink or the Stream Compute Service system recovers the job to RUNNING, a job failure recovery event is generated, marking the end of this event.

Alarms

You can configure an alarm policy for this event to receive trigger and clearing notifications in real time.

Suggestions

You can search for exception logs under the instance ID of the job that generated the event, as instructed in Diagnosis with Logs. Generally, the error messages before and after the keywords from RUNNING to FAILED contain the direct cause of the job failure. We recommend analyzing the issue based on these error messages together with the JobManager and TaskManager logs.
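As a sketch of that log search, assuming the JobManager log has been exported to a local file (the file name and log lines below are stand-ins, not real service output), a grep with context lines surfaces the messages around the failure transition:

```shell
# Illustrative only: create a stand-in log file. In practice, export the
# real JobManager log as described in Diagnosis with Logs.
cat > jobmanager.log <<'EOF'
2023-11-07 16:40:00,101 INFO  Source: sample_source (1/2) is running.
2023-11-07 16:40:01,337 INFO  Job sample_job switched from state RUNNING to FAILED.
2023-11-07 16:40:01,340 ERROR java.lang.OutOfMemoryError: Java heap space
EOF

# Show each match with 3 lines of context before and after; the adjacent
# messages usually contain the direct cause of the failure.
grep -n -B 3 -A 3 "from state RUNNING to FAILED" jobmanager.log
```

The same keyword search can be run in the console's log query page instead of locally.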
If the cause still cannot be found with the above diagnosis, check whether resources are overused as instructed in Viewing Monitoring Information. Focus on critical metrics such as TaskManager CPU usage, heap memory usage, full GC count, and full GC time to check for exceptions.
