Migrate PostgreSQL Databases Using pglogical

Introduction

To reduce the latency of database queries, we have decided to migrate our Postgres database from one zone to a zone closer to our application.

TL;DR

To minimize application downtime, we chose a continuous-replication migration using pglogical.

Change config

We need to modify the database configuration, enable the pglogical extension, and restart the instance. If you have multiple databases to migrate, see the "ERROR: worker registration failed" note below for sizing max_worker_processes correctly.

Text Only
# Ref https://github.com/2ndQuadrant/pglogical#quick-setup

wal_level = 'logical'
max_worker_processes = 16   # one per database needed on provider node
                            # one per node needed on subscriber node
max_replication_slots = 16  # one per node needed on provider node
max_wal_senders = 16        # one per node needed on provider node
shared_preload_libraries = 'pglogical'

track_commit_timestamp = on # needed for last/first update wins conflict resolution
                            # property available in PostgreSQL 9.5+

You can use SHOW to check the config parameters after the restart:

SQL
SHOW wal_level;
SHOW shared_preload_libraries;
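
As a rough sizing aid, the per-node formulas from the comments above can be written down as a small helper (a sketch; the function and parameter names are mine, and the formulas follow the pglogical README comments quoted in the config block):

```python
def provider_settings(n_databases: int, n_subscriber_nodes: int) -> dict:
    """Minimum provider-side settings per the pglogical README:
    one worker per replicated database, one replication slot and
    one WAL sender per subscriber node."""
    return {
        "max_worker_processes": n_databases,
        "max_replication_slots": n_subscriber_nodes,
        "max_wal_senders": n_subscriber_nodes,
    }

# Example: four databases replicated to one subscriber node.
print(provider_settings(4, 1))
```

In practice, set the values well above these minimums (as the `16` in the config above does) to leave headroom.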

DDL

Before starting with pglogical, it's crucial to ensure that the database structures of the provider and subscriber are identical. We use our goose-based migrations for this purpose. Alternatively, when creating a subscription with pglogical.create_subscription(), you can set the synchronize_structure parameter to true, but this requires that the owners of those structures also exist on the subscriber.

Create subscription

On the provider node:

SQL
-- Create a user for logical replication
CREATE ROLE logicalrep WITH LOGIN;
ALTER USER logicalrep WITH PASSWORD 'your_pass';


-- Enable the pglogical extension
CREATE EXTENSION pglogical;


-- The dsn can be blank
SELECT pglogical.create_node(
    node_name := 'provider1',
    dsn := ''
);

-- Add all tables in public schema to the default replication set
SELECT pglogical.replication_set_add_all_tables('default', ARRAY['public']);

-- Grant privileges to the logicalrep user
GRANT USAGE ON schema public TO logicalrep;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO logicalrep;
GRANT USAGE ON schema pglogical TO logicalrep;
GRANT SELECT ON ALL TABLES IN SCHEMA pglogical TO logicalrep;

On the subscriber node:

Here we sync the order database.

SQL
CREATE EXTENSION pglogical;
SELECT pglogical.create_node(
    node_name := 'sub',
    dsn := 'dbname=order'
);

-- Create subscription
SELECT pglogical.create_subscription(
    subscription_name := 'sub_order',
    provider_dsn := 'host=your_host port=5432 dbname=order user=logicalrep password=your_pass'
);

-- Check it
--
-- Expected status is `replicating`.
-- If not, check the database log for details (provider or subscriber node).
SELECT * FROM pglogical.show_subscription_status();

Sync sequences

One common mistake is forgetting to sync the last values of sequences, which leads to duplicate-key errors after cutover.

We wrote a script to do this. The new last value is 1.1 times the original, to allow for rows inserted while the subscription catches up. The primary SQL statements are:

SQL
-- Get last values
SELECT * FROM pg_sequences;

-- Apply them to subscriber
ALTER SEQUENCE public.videos_id_seq RESTART new_value;

-- To check the last value of altered sequences, you can use the following SQL.
-- Note that `SELECT * FROM pg_sequences` will show NULL last value after you alter it.
-- Refer to https://gist.github.com/lbbedendo/449ff46d3baa7838b99ec513c2de92a7
SELECT last_value FROM public.videos_id_seq;
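
Our script is essentially a loop over `pg_sequences`; a minimal sketch of the idea (the helper name and the example rows are mine; the 1.1 factor is the latency headroom mentioned above):

```python
def bump_statement(schema: str, sequence: str, last_value: int, factor: float = 1.1) -> str:
    """Build an ALTER SEQUENCE statement with headroom for rows
    inserted on the provider while replication is still catching up."""
    new_value = int(last_value * factor)
    return f"ALTER SEQUENCE {schema}.{sequence} RESTART {new_value};"

# Rows as returned by: SELECT schemaname, sequencename, last_value FROM pg_sequences;
rows = [("public", "videos_id_seq", 1000)]
for schema, sequence, last_value in rows:
    print(bump_statement(schema, sequence, last_value))
# ALTER SEQUENCE public.videos_id_seq RESTART 1100;
```

The generated statements are then run on the subscriber.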

Check rows

We've crafted a script to verify the equality of both the row count and a randomly selected subset of rows. See: https://gist.github.com/ionling/10f50bf3d77040fa8bb4f6695c23befe
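
The gist contains the full script; the core idea can be sketched like this (a hypothetical sketch; it assumes integer primary keys and uses a fixed seed so both sides sample the same ids):

```python
import random

def sample_ids(max_id: int, k: int = 100, seed: int = 42) -> list:
    """Pick a reproducible random subset of ids; the same seed yields
    the same ids when run against provider and subscriber."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, max_id + 1), min(k, max_id)))

def rows_equal(provider_rows: list, subscriber_rows: list) -> bool:
    """Same row count and identical content, ignoring row order."""
    return (len(provider_rows) == len(subscriber_rows)
            and sorted(provider_rows) == sorted(subscriber_rows))

# The sampled ids feed a query like:
#   SELECT * FROM videos WHERE id = ANY(...);
# run against both nodes, after which:
print(rows_equal([(1, "a"), (2, "b")], [(2, "b"), (1, "a")]))  # True
```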

Drop subscription

Once we confirm that the source and target databases are in a consistent state, we can proceed to drop the subscription.

SQL
SELECT * FROM pglogical.drop_subscription('sub_order');

Solutions

Here, we will explore several solutions for the migration process.

Dump and restore

Dumping the old database and restoring it into the new one is a viable approach. However, it causes significant downtime: restoring the database, verifying the data, and restarting services all take a considerable amount of time.

Streaming replication

PostgreSQL offers native support for streaming replication, facilitating continuous synchronization of data from a source database to a new one. This approach significantly reduces service downtime. By configuring replication in advance, the migration process only requires restarting services, making it more efficient and minimizing downtime.

Multi-master

The best solution is to combine the old and new databases in a multi-master cluster. This allows us to smoothly migrate any service that relies on the old database to the new without any downtime.

One implementation for achieving this is through the spock extension. However, the performance and potential issues of this extension might be a concern. For more insights, you can refer to the blog post: How to achieve multi-master replication in PostgreSQL with Spock

Summary

To strike a balance between complexity and functionality, we ultimately opted for the continuous-replication approach, implemented here with pglogical (logical replication) rather than native physical streaming replication.

AWS has a blog series on database upgrades that is also useful for migrations:

  1. https://aws.amazon.com/cn/blogs/database/part-1-upgrade-your-amazon-rds-for-postgresql-database-comparing-upgrade-approaches/
  2. https://aws.amazon.com/cn/blogs/database/part-2-upgrade-your-amazon-rds-for-postgresql-database-using-the-pglogical-extension/

Notes

Could not open relation with OID

When debugging the subscription, I mistakenly dropped the pglogical extension, deleted the pglogical schema, and then recreated it. As a consequence, I encountered an error when running certain Postgres commands:

Text Only
Could not open relation with OID

To resolve it, simply reconnect to the database.

See https://github.com/2ndQuadrant/pglogical/issues/347

PostgreSQL privileges

Some privileges are not intuitive. For example:

SQL
GRANT ALL ON DATABASE mydb TO admin;

grants privileges on the database object itself, not on the objects within it. admin can now drop the database, yet still cannot create tables in the public schema.^1

ERROR: worker registration failed

When creating a subscription, we encountered the following error:

Text Only
ERROR: worker registration failed, you might want to increase max_worker_processes setting

According to the pglogical README:

one process per node is needed on the subscriber node.

After confirming that we did not exceed that limit, we searched related documents and found two conflicting pieces of information.

One, from EnterpriseDB, states:

One per database + two per subscription on the subscriber (downstream).

The other, from GitHub Issue #7, mentions:

One worker needed for the main maintenance process, one per subscription, one per replicated database, and during startup, it can need one per any connectable database on top of that.

If we have two databases with one subscription for each database:

  • According to quote one, we will need 2*1 + 2*2 = 6 processes.
  • According to quote two, we will need 1 + 2 + 2 = 5 processes (one maintenance worker, one per subscription, one per replicated database).

Which one is correct? I don't know.
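
For concreteness, the two readings can be written as functions (a sketch; the names are mine, and only steady-state workers are counted, ignoring the extra per-database startup workers issue #7 mentions):

```python
def workers_edb(n_databases: int, n_subscriptions: int) -> int:
    """Subscriber workers per the EnterpriseDB quote:
    one per database + two per subscription."""
    return n_databases + 2 * n_subscriptions

def workers_issue7(n_databases: int, n_subscriptions: int) -> int:
    """Subscriber workers per GitHub issue #7: one maintenance worker,
    one per subscription, one per replicated database."""
    return 1 + n_subscriptions + n_databases

# Two databases, one subscription each:
print(workers_edb(2, 2))     # 6
print(workers_issue7(2, 2))  # 5
```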


Fix docker exec operation not permitted

Issue

Running docker exec on a server fails:

Text Only
# docker exec -it user-ab107i9 sh
OCI runtime exec failed: exec failed: unable to start container process: open /dev/pts/0: operation not permitted: unknown

A quick search turned up this issue:

https://github.com/moby/moby/issues/43969

The issue says this bug was fixed in runc v1.1.4. Our runc version turned out to be 1.1.3, which confirmed it as a runc problem. Since upgrading Docker would have interrupted running services, I left it alone at the time.

Then today [2023-11-30 Thu] the problem came up again. Another search turned up an Alibaba Cloud article describing how to replace runc without interrupting running services^aliyun. I tried it, and it worked.

Solution

Bash
## Download runc binary
# https://github.com/opencontainers/runc/releases
wget https://github.com/opencontainers/runc/releases/download/v1.1.10/runc.amd64
wget https://github.com/opencontainers/runc/releases/download/v1.1.10/runc.amd64.asc
wget https://raw.githubusercontent.com/opencontainers/runc/main/runc.keyring

## Verify runc binary
gpg --import runc.keyring
gpg --verify runc.amd64.asc runc.amd64

## Replace runc
docker info | grep runc         # Show old version info
sudo mv /usr/bin/runc /usr/bin/runc_old
sudo cp runc.amd64 /usr/bin/runc
sudo chmod +x /usr/bin/runc
docker info | grep runc         # Show new version info

That completes the replacement; affected containers just need to be restarted.

About runc

runc is the lowest-level container runtime.

Taking the latest Docker as an example: when we run a container, Docker calls containerd, containerd calls runc, and runc actually runs the container.

containerd is mainly responsible for^qikqiak:

  1. Managing the container lifecycle (from creation to destruction)
  2. Pulling and pushing container images
  3. Storage management (for images and container data)
  4. Invoking runc to run containers (interacting with container runtimes such as runc)
  5. Managing container network interfaces and networks

Architecture (diagram): Docker → containerd → runc.