
Upgrade Golang to 1.21

A routine upgrade.

Changes

  • Major
    • [x] go.mod
    • [x] Dockerfile
  • Minor
    • [x] Use the new builtin slices and log/slog packages
    • [x] Use builtin context.WithoutCancel() (see the sketch below)
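
A minimal sketch of what these minor items look like after the upgrade (hypothetical handler and identifiers, not our actual code):

Go
package main

import (
	"context"
	"log/slog"
	"slices"
	"time"
)

// notifyOrder is a hypothetical follow-up call that must outlive the incoming
// request: context.WithoutCancel (new in 1.21) detaches it from the parent's
// cancellation while keeping the parent's values (trace IDs, etc.).
func notifyOrder(ctx context.Context, id string) {
	ctx = context.WithoutCancel(ctx)
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	// orderClient.Notify(ctx, id) // placeholder for the real RPC call
	slog.InfoContext(ctx, "order notified", "id", id)
}

func main() {
	admins := []string{"alice", "bob"}
	// slices.Contains replaces the hand-rolled loops we needed before 1.21.
	if slices.Contains(admins, "alice") {
		slog.Info("alice is an admin")
	}
	notifyOrder(context.Background(), "order-42")
}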

Check

  • [x] Build all apps' images
  • [x] Call an RPC that uses context.WithoutCancel()

About 1.21

https://tonybai.com/2023/08/20/some-changes-in-go-1-21/

The article is divided into four parts:

  1. The language itself
  2. Compiler and runtime
  3. Toolchain
  4. Standard library

The most notable changes: loopvar, the GC optimizations, slog, and the compatibility work.

Kimi

Summarize the article, give an outline and the key points, and write it in Markdown:

Overview of the Go 1.21 release:

  • Language changes
    • New predefined builtin functions: min, max, and clear
    • The package initialization order algorithm is now specified
    • Improved type inference for generics
    • Fixes for two Go 1 "gotchas": panic(nil) and for-loop variables
  • Compiler and runtime changes
    • PGO enabled by default
    • Lower GC tail latency
    • WASI support
  • Toolchain updates
    • Backward compatibility: extended and formalized use of GODEBUG
    • Forward compatibility: the go line in go.mod acts as a constraint, plus the new toolchain directive and the GOTOOLCHAIN environment variable
  • Standard library updates
    • log/slog: structured logging support
    • slices, maps, and cmp: generic packages added to the standard library
    • Other changes: errors.ErrUnsupported, flag.BoolFunc, multipath TCP support in net, reflect.ValueOf optimization, sync.OnceFunc, and more
  • References
    • Go 1.21 Release Notes
    • Go 1.21 release blog post
    • Compatibility blog posts
    • GODEBUG manual
    • LoopvarExperiment wiki page
    • "How Golang Evolves without Breaking Programs" article
    • PGO user guide

This outline summarizes the main updates and changes in Go 1.21 covered by the article: language changes, compiler and runtime improvements, toolchain updates, and additions and changes to the standard library. It also lists the reference links given at the end of the article for further reading.
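
For a quick feel of two of the language items above (the new builtins and the loop-variable fix), a small self-contained example; note the loop-variable change shipped as a GOEXPERIMENT=loopvar preview in 1.21 and became the default in 1.22:

Go
package main

import "fmt"

func main() {
	// New predefined builtin functions in Go 1.21.
	fmt.Println(min(3, 1, 2), max(3, 1, 2)) // 1 3

	m := map[string]int{"a": 1, "b": 2}
	clear(m)            // removes all map entries; on a slice it zeroes the elements
	fmt.Println(len(m)) // 0

	// Loop-variable capture: without the loopvar change, every closure below
	// shares one i and usually prints "3 3 3"; with it, each iteration gets
	// its own i and the output is "0 1 2".
	var fns []func()
	for i := 0; i < 3; i++ {
		fns = append(fns, func() { fmt.Print(i, " ") })
	}
	for _, f := range fns {
		f()
	}
	fmt.Println()
}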

Migrate PostgreSQL Databases Using pglogical

Introduction

To reduce the latency of database queries, we have decided to migrate our Postgres database from one zone to a zone closer to our application.

TLDR

To minimize application downtime, we have chosen continuous migration using pglogical.

Change config

We need to modify the database configuration, enable the pglogical extension, and restart the instance. If you have multiple databases to migrate, see the "ERROR: worker registration failed" note below for configuring max_worker_processes correctly.

TOML
# Ref https://github.com/2ndQuadrant/pglogical#quick-setup

wal_level = 'logical'
max_worker_processes = 16   # one per database needed on provider node
                            # one per node needed on subscriber node
max_replication_slots = 16  # one per node needed on provider node
max_wal_senders = 16        # one per node needed on provider node
shared_preload_libraries = 'pglogical'

track_commit_timestamp = on # needed for last/first update wins conflict resolution
                            # property available in PostgreSQL 9.5+

You can use SHOW to check the config parameter:

SQL
SHOW wal_level;
SHOW shared_preload_libraries;

DDL

To start with pglogical, it's crucial to ensure that the database structures of the provider and the subscriber are identical. We use our goose-based migrations for this purpose. Alternatively, when creating a subscription with pglogical.create_subscription(), you can set the synchronize_structure parameter to true, but this requires that the owners of those structures also exist on the subscriber.

Create subscription

On the provider node:

SQL
-- Create a user for logical replication
CREATE ROLE logicalrep WITH LOGIN;
ALTER USER logicalrep WITH PASSWORD 'your_pass';


-- Here we use the pglogical extension
CREATE EXTENSION pglogical;


-- The dsn can be blank
SELECT pglogical.create_node(
    node_name := 'provider1',
    dsn := ''
);

-- Add all tables in public schema to the default replication set
SELECT pglogical.replication_set_add_all_tables('default', ARRAY['public']);

-- Grant privileges to the logicalrep user
GRANT USAGE ON SCHEMA public TO logicalrep;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO logicalrep;
GRANT USAGE ON SCHEMA pglogical TO logicalrep;
GRANT SELECT ON ALL TABLES IN SCHEMA pglogical TO logicalrep;

On the subscriber node:

Here we sync the order database.

SQL
CREATE EXTENSION pglogical;
SELECT pglogical.create_node(
    node_name := 'sub',
    dsn := 'dbname=order'
);

-- Create subscription
SELECT pglogical.create_subscription(
    subscription_name := 'sub_order',
    provider_dsn := 'host=your_host port=5432 dbname=order user=logicalrep password=your_pass'
);

-- Check it
--
-- Expected status is `replicating`.
-- If not, check the database log for details (provider or subscriber node).
SELECT * FROM pglogical.show_subscription_status();

Sync sequences

One common mistake is forgetting to sync the last values of sequences, which can lead to duplicate key errors.

We wrote a script to do this. The new last value is set to 1.1 times the original to allow for subscription latency. The primary SQL statements are:

SQL
-- Get last values
SELECT * FROM pg_sequences;

-- Apply them to subscriber
ALTER SEQUENCE public.videos_id_seq RESTART new_value;

-- To check the last value of altered sequences, you can use the following SQL.
-- Note that `SELECT * FROM pg_sequences` will show NULL last value after you alter it.
-- Refer to https://gist.github.com/lbbedendo/449ff46d3baa7838b99ec513c2de92a7
SELECT last_value FROM public.videos_id_seq;
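
Wrapped into a script, the whole sequence sync might look roughly like this. This is only a sketch, not our actual script: the connection strings are hypothetical and it assumes the github.com/lib/pq driver.

Go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"math"

	_ "github.com/lib/pq"
)

func main() {
	// Hypothetical DSNs for the provider (old) and subscriber (new) databases.
	provider, err := sql.Open("postgres", "host=old-host dbname=order user=postgres sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	subscriber, err := sql.Open("postgres", "host=new-host dbname=order user=postgres sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}

	// Read the current last values from the provider.
	rows, err := provider.Query(
		`SELECT schemaname, sequencename, last_value FROM pg_sequences WHERE last_value IS NOT NULL`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var schema, name string
		var lastValue int64
		if err := rows.Scan(&schema, &name, &lastValue); err != nil {
			log.Fatal(err)
		}
		// 1.1x headroom to absorb subscription latency.
		newValue := int64(math.Ceil(float64(lastValue) * 1.1))
		// Sequence names cannot be bound as parameters, so build the statement directly.
		stmt := fmt.Sprintf("ALTER SEQUENCE %s.%s RESTART %d", schema, name, newValue)
		if _, err := subscriber.Exec(stmt); err != nil {
			log.Fatal(err)
		}
		log.Printf("restarted %s.%s at %d", schema, name, newValue)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}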

Check rows

We've crafted a script to verify the equality of both the row count and a randomly selected subset of rows. See: https://gist.github.com/ionling/10f50bf3d77040fa8bb4f6695c23befe

Drop subscription

Once we confirm that the source and target databases are in a consistent state, we can proceed to drop the subscription.

SQL
SELECT * FROM pglogical.drop_subscription('sub_order');

Solutions

Here, we will explore several solutions for the migration process.

Dump and restore

Dumping the old database and restoring it to the new one is a viable approach. However, it causes significant downtime, which is undesirable: restoring the database, verifying the data, and restarting services all take a considerable amount of time.

Streaming replication

PostgreSQL offers native support for streaming replication, facilitating continuous synchronization of data from a source database to a new one. This approach significantly reduces service downtime. By configuring replication in advance, the migration process only requires restarting services, making it more efficient and minimizing downtime.

Multi-master

The best solution is to combine the old and new databases in a multi-master cluster. This allows us to smoothly migrate any service that relies on the old database to the new without any downtime.

One implementation for achieving this is through the spock extension. However, the performance and potential issues of this extension might be a concern. For more insights, you can refer to the blog post: How to achieve multi-master replication in PostgreSQL with Spock

Summary

To strike a balance between complexity and functionality, we ultimately opted for the continuous replication solution, implemented with pglogical.

AWS has a blog series on database upgrades that is also useful for migrations:

  1. https://aws.amazon.com/cn/blogs/database/part-1-upgrade-your-amazon-rds-for-postgresql-database-comparing-upgrade-approaches/
  2. https://aws.amazon.com/cn/blogs/database/part-2-upgrade-your-amazon-rds-for-postgresql-database-using-the-pglogical-extension/

Notes

Could not open relation with OID

When debugging the subscription, I mistakenly dropped the pglogical extension, deleted the pglogical schema, and subsequently recreated it. As a consequence, I encountered an error when running certain pg commands:

Text Only
Could not open relation with OID

To resolve it, just reconnect to the database.

See https://github.com/2ndQuadrant/pglogical/issues/347

PostgreSQL privileges

Some privileges are not intuitive. For example, this statement:

SQL
GRANT ALL ON DATABASE mydb TO admin;

grants privileges on the database itself, not things within the database. admin can now drop the database, still without being able to create tables in schema public.^1

ERROR: worker registration failed

When creating a subscription, we encountered the following error:

Text Only
ERROR: worker registration failed, you might want to increase max_worker_processes setting

According to the pglogical README:

one process per node is needed on the subscriber node.

After confirming that we were not exceeding the limit, we searched for related documentation and found two conflicting pieces of information.

One, from EnterpriseDB, states:

One per database + two per subscription on the subscriber (downstream).

The other, from GitHub Issue #7, mentions:

One worker needed for the main maintenance process, one per subscription, one per replicated database, and during startup, it can need one per any connectable database on top of that.

If we have two databases with one subscription for each database:

  • According to quote one, we will need 2*1 + 2*2 = 6 processes.
  • According to quote two, we will need 1 (maintenance) + 2 (subscriptions) + 2 (databases) = 5 processes, plus extras during startup.

Which one is correct? I don't know.

Other tools

References

Fix docker exec operation not permitted

Issue

Running docker exec on the server fails:

Text Only
# docker exec -it user-ab107i9 sh
OCI runtime exec failed: exec failed: unable to start container process: open /dev/pts/0: operation not permitted: unknown

A quick search turned up this issue:

https://github.com/moby/moby/issues/43969

The issue says the problem was fixed in runc v1.1.4. Our runc version was 1.1.3, which confirmed it was a runc problem. Since upgrading Docker would have meant stopping services, I left it alone at the time.

Today [2023-11-30 Thu] the problem came up again. After another search, I found an Alibaba Cloud article describing a way to fix it without interrupting services^aliyun. I tried it, and it worked without issues.

Solution

Bash
## Download runc binary
# https://github.com/opencontainers/runc/releases
wget https://github.com/opencontainers/runc/releases/download/v1.1.10/runc.amd64
wget https://github.com/opencontainers/runc/releases/download/v1.1.10/runc.amd64.asc
wget https://raw.githubusercontent.com/opencontainers/runc/main/runc.keyring

## Verify runc binary
gpg --import runc.keyring
gpg --verify runc.amd64.asc runc.amd64

## Replace runc
docker info | grep runc         # Show old version info
sudo mv /usr/bin/runc /usr/bin/runc_old
sudo cp runc.amd64 /usr/bin/runc
sudo chmod +x /usr/bin/runc
docker info | grep runc         # Show new version info

That completes the replacement; containers hit by the problem just need to be restarted.

About runc

runc is the lowest-level container runtime.

Take a recent version of Docker as an example: when we run a container, Docker calls containerd, containerd calls runc, and runc finally runs the container.

containerd is mainly responsible for^qikqiak:

  1. Managing the container lifecycle (from creation to destruction)
  2. Pulling/pushing container images
  3. Storage management (for image and container data)
  4. Invoking runc to run containers (interacting with container runtimes such as runc)
  5. Managing container network interfaces and networking

Architecture:

(containerd architecture diagram)

Introduction to the eventbus service

The eventbus service is the Event-Driven Architecture (EDA) implementation in adoba services.

Introduction

Here, the Redis Stream data structure serves as the message queue. Other services add events to the stream; the eventbus service then consumes these messages and uses Remote Procedure Calls (RPC) to communicate with other services.

See:

sequenceDiagram
  wallet->>redis: Add event to stream
  eventbus->>redis: Read stream
  eventbus->>order: Call RPC
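
The diagram above in rough code form: a producer adding an event and the eventbus consumer loop. This is only a sketch, assuming the go-redis client and made-up stream/group names; the real services are more involved.

Go
package main

import (
	"context"
	"log/slog"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// A producer (e.g. the wallet service) appends an event to the stream.
	rdb.XAdd(ctx, &redis.XAddArgs{
		Stream: "events:wallet",
		Values: map[string]interface{}{"type": "wallet.charged", "order_id": "42"},
	})

	// The eventbus reads the stream through a consumer group and dispatches RPCs.
	_ = rdb.XGroupCreateMkStream(ctx, "events:wallet", "eventbus", "0").Err()
	for {
		streams, err := rdb.XReadGroup(ctx, &redis.XReadGroupArgs{
			Group:    "eventbus",
			Consumer: "eventbus-1",
			Streams:  []string{"events:wallet", ">"},
			Count:    10,
			Block:    0, // block until messages arrive
		}).Result()
		if err != nil {
			slog.Error("read stream", "err", err)
			continue
		}
		for _, s := range streams {
			for _, msg := range s.Messages {
				// orderClient.HandleWalletCharged(ctx, msg.Values) // placeholder RPC
				slog.Info("dispatched event", "id", msg.ID, "values", msg.Values)
				rdb.XAck(ctx, "events:wallet", "eventbus", msg.ID)
			}
		}
	}
}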

In the past, the prevailing practice was to develop a sidecar queue service, like strandq or studioq, for each individual service. However, this approach introduced redundancy and demanded repetitive effort for each service, making it both time-consuming and resource-intensive. Consequently, a more efficient strategy emerged: consolidating the handling of all events within the eventbus service, eliminating the need for separate management. This consolidation encapsulates the core concept of an "eventbus."

When interacting with an external message system, such as Aliyun MNS, having a dedicated standalone queue service (like strandq) proves advantageous. This design choice preserves the conciseness of the eventbus service and aligns with the single responsibility principle, ensuring clear and modular responsibilities.

Why Event-Driven

Let's talk about the drawbacks first. The big one is that event-driven architecture makes the system more complex, so we have to spend more time developing new features.

But the advantages outweigh the drawbacks!

Advantages of Event-Driven Architecture:

  1. Loose Coupling: EDA promotes loose coupling between different components of a system. Components communicate through events, and they don't need to be aware of each other's existence. This characteristic makes it easier to develop, test, and maintain individual components independently.

  2. Enhanced Fault Tolerance: Since components in an event-driven system are decoupled, the failure of one component doesn't necessarily lead to the failure of others. Components can continue processing events even if some other parts of the system are temporarily unavailable. This enhanced fault tolerance ensures system stability during partial failures.

  3. Event Replay: The advantage of replay is the ability to reprocess or reevaluate past events. This is useful for debugging, testing, correcting data, and recovering from errors or failures in a system.

  4. Asynchronous Communication: Events in an event-driven system are often communicated asynchronously. This enables systems to handle high loads and peak traffic more effectively. Components can publish events without waiting for subscribers to process them immediately. Asynchronous communication contributes to the scalability and responsiveness of the system.

  5. Improved Inter-Team Collaboration: When two services are controlled by different teams, EDA becomes an optimal choice, minimizing conflicts and arguments between teams. The loosely coupled nature of EDA allows teams to work more independently without directly impacting each other, fostering smoother collaboration and reducing potential friction.

Design Patterns

Several design patterns align with Event-Driven Architecture, further enhancing its suitability for complex systems. Notable patterns include: