[FLINK-36151] Add schema evolution related docs #3575

Open
wants to merge 2 commits into
base: master
115 changes: 115 additions & 0 deletions docs/content.zh/docs/core-concept/schema-evolution.md
@@ -0,0 +1,115 @@
---
title: "Schema Evolution"
weight: 7
type: docs
aliases:
- /core-concept/schema-evolution/
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

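# Definition

The **Schema Evolution** feature synchronizes upstream DDL change events to the downstream sink, such as creating tables, adding columns, renaming columns, changing column types, dropping columns, and truncating or dropping tables.

## Parameters

Schema evolution behavior is set with the following pipeline option:

```yaml
pipeline:
  schema.change.behavior: evolve
```

`schema.change.behavior` is an enum and can be set to `exception`, `evolve`, `try_evolve`, `lenient`, or `ignore`.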

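## Schema Evolution Behaviors

### Exception Mode

In this mode, no schema changes are allowed.
`SchemaOperator` throws an exception as soon as it receives a schema change event.
Use this mode when your downstream sink cannot handle any schema changes.

### Evolve Mode

In this mode, `SchemaOperator` applies all upstream schema change events to the downstream sink.
If an attempt fails, an exception is thrown from `SchemaRegistry` and triggers a global failover.

### TryEvolve Mode

In this mode, the schema operator also tries to apply upstream schema change events to the downstream sink.
However, if the downstream sink does not support a particular schema change event and reports a failure,
`SchemaOperator` tolerates it and, when the upstream and downstream schemas differ, tries to convert all subsequent data records.

> Warning: such data casting and conversion is not guaranteed to be lossless. Fields with incompatible data types may be lost.

### Lenient Mode

In this mode, the schema operator converts all upstream schema change events before applying them to the downstream sink, so that no data is lost.
For example, an `AlterColumnTypeEvent` is converted into two separate schema change events, `RenameColumnEvent` and `AddColumnEvent`:
the previous column (with the type before the change) is kept, and a new column (with the new type) is added.

This is the default schema evolution behavior.

> Note: In this mode, `TruncateTableEvent` and `DropTableEvent` are not sent downstream by default, to avoid unexpected data loss. This behavior can be adjusted with [Per-Event Type Control](#per-event-type-control).

### Ignore Mode

In this mode, `SchemaOperator` silently swallows all schema change events and never tries to apply them to the downstream sink.
This is useful when your downstream sink is not ready for any schema changes but you still want to keep receiving data from the unchanged columns.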

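## Per-Event Type Control

Sometimes it is not appropriate to synchronize every schema change event to the downstream sink.
For example, allowing `AddColumnEvent` but disallowing `DropColumnEvent` is a common way to avoid deleting existing data.
This can be achieved by setting the `include.schema.changes` and `exclude.schema.changes` options in the `sink` block.

### Options

| Option Key | Description | Optional |
|--------------------------|-------------------------------------------------|------|
| `include.schema.changes` | Schema change event types to apply. If not specified, all types are included by default. | yes |
| `exclude.schema.changes` | Schema change event types that should not be applied. Takes priority over `include.schema.changes`. | yes |

> In Lenient mode, `TruncateTableEvent` and `DropTableEvent` are ignored by default. In any other mode, no events are ignored by default.

The full list of configurable schema change event types is:

| Event Type | Description |
|---------------------|--------------|
| `add.column` | Append a column to a table. |
| `alter.column.type` | Change the data type of a column. |
| `create.table` | Create a new table. |
| `drop.column` | Drop a column. |
| `drop.table` | Drop a table. |
| `rename.column` | Rename a column. |
| `truncate.table` | Remove all data from a table. |

Partial matching is supported. For example, passing `drop` to the options above is equivalent to passing both `drop.column` and `drop.table`.

### Example

The following YAML configuration includes `CreateTableEvent` and column-related events, except `DropColumnEvent`.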

```yaml
sink:
  include.schema.changes: [create.table.event, column]
  exclude.schema.changes: [drop.column]
```

Review comment (Contributor) on the `include.schema.changes` line:

create.table ------ create.table.event

What are the differences between them?

I noticed that the enumerated value in the documentation is create.table, but in the example, it is given as create.table.event.

Reply (Contributor Author):

It was my mistake, create.table should be used instead of create.table.event.
113 changes: 113 additions & 0 deletions docs/content/docs/core-concept/schema-evolution.md
@@ -0,0 +1,113 @@
---
title: "Schema Evolution"
weight: 7
type: docs
aliases:
- /core-concept/schema-evolution/
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Definition

The **Schema Evolution** feature synchronizes upstream schema (DDL) changes to the downstream sink, including creating tables, adding columns, renaming columns, changing column types, dropping columns, and truncating or dropping tables.

## Parameters

Schema evolution behavior can be specified with the following pipeline option:

```yaml
pipeline:
  schema.change.behavior: evolve
```

`schema.change.behavior` is an enum and can be set to `exception`, `evolve`, `try_evolve`, `lenient`, or `ignore`.

## Behaviors

### Exception Mode

In this mode, all schema changes are forbidden. `SchemaOperator` throws an exception as soon as it captures a schema change event.
This is useful when your downstream sink is not expected to handle any schema changes.

### Evolve Mode

In this mode, the CDC pipeline's schema operator applies all upstream schema change events to the downstream sink.
If an attempt fails, an exception is thrown from the `SchemaRegistry` and triggers a global failover.

### TryEvolve Mode

In this mode, the schema operator also tries to apply upstream schema change events to the downstream sink.
However, if the downstream sink does not support a specific schema change event, the failure is tolerated, and `SchemaOperator` tries to convert all subsequent data records wherever the upstream and downstream schemas differ.

> Warning: such data casting and conversion is not guaranteed to be lossless. Some fields with incompatible data types might be lost.

### Lenient Mode

In this mode, the schema operator converts upstream schema change events before applying them to the downstream sink, so that no data is lost.
For example, an `AlterColumnTypeEvent` is converted into two individual schema change events, a `RenameColumnEvent` and an `AddColumnEvent`:
the previous column (with the unchanged type) is kept, and a new column (with the new type) is added.

This is the default schema evolution behavior.

> Notice: In this mode, `TruncateTableEvent` and `DropTableEvent` are not sent downstream by default, to avoid unexpected data loss. This behavior can be overridden with [Per-Event Type Control](#per-event-type-control).
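
As a rough sketch of such an override, and assuming that explicitly listing an event type in `include.schema.changes` is what re-enables it (this page does not spell out the exact mechanism), a lenient pipeline that still forwards table truncation and deletion might look like this. The sink's type and connection options are omitted.

```yaml
pipeline:
  # Lenient is the default; it is written out here only for clarity.
  schema.change.behavior: lenient

sink:
  # Assumption: naming every event type, including drop.table and
  # truncate.table, opts back in to the two events lenient mode skips.
  include.schema.changes: [add.column, alter.column.type, create.table, drop.column, drop.table, rename.column, truncate.table]
```

Forwarding these two events reintroduces the risk of data loss that the default is meant to prevent, so it is worth verifying the behavior on a test table first.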

### Ignore Mode

In this mode, `SchemaOperator` silently swallows all schema change events and never attempts to apply them to the downstream sink.
This is useful when your downstream sink is not ready for any schema changes but you still want to keep receiving data from the unchanged columns.

## Per-Event Type Control

Sometimes it is not suitable to synchronize every schema change event to the downstream sink.
For example, allowing `AddColumnEvent` but disallowing `DropColumnEvent` is a common way to avoid deleting existing data.
This can be achieved by setting the `include.schema.changes` and `exclude.schema.changes` options in the `sink` block.
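
For the add-but-never-drop scenario, a minimal sketch might need only an exclusion, since all event types are included when `include.schema.changes` is left unset. The sink's type and connection options are omitted here.

```yaml
sink:
  # Column additions still reach the sink; column drops are filtered out.
  exclude.schema.changes: [drop.column]
```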

### Options

| Option Key | Meaning | Optional/Required |
|--------------------------|-----------------------------------------------------------------------------------------------------------|-------------------|
| `include.schema.changes` | Schema change event types to be included. Include all types by default if not specified. | optional |
| `exclude.schema.changes` | Schema change event types **not** to be included. It has a higher priority than `include.schema.changes`. | optional |

> In Lenient mode, `TruncateTableEvent` and `DropTableEvent` will be ignored by default. In any other mode, no events will be ignored by default.

Here's a full list of configurable schema change event types:

| Event Type | Description |
|---------------------|------------------------------|
| `add.column` | Add a new column to a table. |
| `alter.column.type` | Change the type of a column. |
| `create.table` | Create a new table. |
| `drop.column` | Drop a column. |
| `drop.table` | Drop a table. |
| `rename.column` | Rename a column. |
| `truncate.table` | Truncate a table. |

Partial matching is supported. For example, passing `drop` into the options above is equivalent to passing `drop.column` and `drop.table`.
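
To illustrate, the sketch below blocks both kinds of drop events with a single entry. It relies only on the equivalence stated above; how other partial names resolve is not covered here.

```yaml
sink:
  # `drop` stands for both drop.column and drop.table.
  exclude.schema.changes: [drop]
```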

### Example

The following YAML configuration includes `CreateTableEvent` and column-related events, except `DropColumnEvent`.

```yaml
sink:
  include.schema.changes: [create.table, column]
  exclude.schema.changes: [drop.column]
```
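
Putting both levels together, a pipeline definition might carry the mode under `pipeline` and the event filters under `sink`. This is only a sketch: the `source` block and the sink's connector-specific options are left out, and `create.table` is spelled as in the event type table above.

```yaml
pipeline:
  # Apply schema changes downstream and fail the job if one cannot be applied.
  schema.change.behavior: evolve

sink:
  # `column` partially matches every column-related event type.
  include.schema.changes: [create.table, column]
  # Exclusions take priority over inclusions, so column drops are filtered out.
  exclude.schema.changes: [drop.column]
```

Under the rules described on this page, this forwards `create.table`, `add.column`, `alter.column.type`, and `rename.column` events and filters out the rest.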