replication-manager 3.1 — What's New from 3.1.17 to 3.1.22

The 3.1 branch has matured significantly over six releases. This post summarizes the major user-visible features shipped between 3.1.17 and 3.1.22, covering schema monitoring, a production-ready Restic backup pipeline, table checksums, and operational tooling for DBAs and SRE teams.


Schema Monitoring and Drift Detection

Starting in 3.1.17, replication-manager monitors schema differences between master and replicas — tables, columns, and indexes. The feature integrates with the shardproxy layer and surfaces divergences before they become replication failures or application inconsistencies.

3.1.21 significantly hardened this feature:

  • Schema scans now bulk-load metadata with structured CRCs, reducing lock pressure on the monitored servers
  • A configurable scan timeout prevents runaway metadata queries from impacting production traffic
  • A password leak in schema scan logs was eliminated
  • The scan is skipped automatically when shardproxy is not in use, removing unnecessary overhead (3.1.19)

Why it matters: Schema drift between master and replicas is a common and silent source of replication failures. Detecting it proactively — before a query fails — gives operators time to act.
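
As a sketch, enabling schema monitoring in the cluster configuration might look like the fragment below. The `monitoring-schema-change` flag family exists in earlier releases; the scan timeout key name shown is an assumption — verify the exact name for 3.1.21 against `replication-manager --help` or the release documentation.

```toml
# config.toml — cluster section (sketch; verify key names for your version)
[cluster1]
monitoring-schema-change = true          # enable schema diff monitoring
monitoring-schema-change-script = ""     # optional hook on detected drift
# Hypothetical name for the 3.1.21 configurable scan timeout, in seconds:
# monitoring-schema-scan-timeout = 30
```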


Table Checksum and Repair

3.1.18 introduced table checksum scheduling, which requires a populated schema cache and uses bounded polling to limit load on monitored servers. 3.1.21 and 3.1.22 extended the feature substantially:

  • Multi-column primary key support with correct range conditions and chunk min/max key validity
  • Prepared statements for faster checksum queries, with a forced primary-key scan path
  • Deterministic shard table sorting by size
  • Per-chunk repair operations (3.1.22): repair divergent chunks directly from the default schema using a temporary table
  • ACL support for repair-all operations
  • Alerts on repair completion

Why it matters: Checksumming at scale requires correctness with composite keys and efficiency under load. The repair workflow closes the loop — detect divergence, then fix it, without manual intervention.
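
To illustrate the composite-key range logic — the table and column names below are invented, and this shows the general chunked-checksum pattern rather than replication-manager's exact query — a chunk over a primary key `(a, b)` can be bounded with row-constructor comparisons, which MySQL and MariaDB compare lexicographically:

```sql
-- Checksum one chunk of a table with composite primary key (a, b).
-- Row constructors give correct lexicographic range conditions, and
-- forcing the primary key keeps the scan on the clustered index.
SELECT BIT_XOR(CRC32(CONCAT_WS('#', a, b, payload))) AS chunk_crc,
       COUNT(*) AS chunk_rows
FROM orders FORCE INDEX (PRIMARY)
WHERE (a, b) >  (1000, 'k')     -- exclusive chunk lower bound
  AND (a, b) <= (2000, 'z');    -- inclusive chunk upper bound
```

Comparing tuples this way avoids the classic bug of `a > ? AND b > ?`, which silently skips rows where `a` equals the boundary value.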


Restic Backup Pipeline

The Restic integration received the most development activity across this release cycle. The pipeline is now production-ready for both local and cloud-backed deployments.

S3 Backend Enhancements (3.1.22)

  • Configurable AWS endpoint and prefix for private S3-compatible stores
  • Force-init support for existing repositories
  • Per-repo append mode control
  • Append mode correctly bypassed for AWS S3 backends
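
A configuration sketch for a private S3-compatible store follows. The `backup-restic-aws*` flag family exists in earlier releases; the endpoint and prefix key names here are assumptions modeled on that family — check the 3.1.22 flag reference for the exact spellings.

```toml
# Sketch — verify flag names against your version's documentation
backup-restic = true
backup-restic-aws = true
backup-restic-aws-access-key-id = "minio-access-key"
backup-restic-aws-access-secret = "minio-secret-key"
# New in 3.1.22 (hypothetical key names):
# backup-restic-aws-endpoint = "https://minio.internal:9000"
# backup-restic-repository-prefix = "prod/cluster1"
```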

Environment Variable Overrides (3.1.19)

Additional environment variable overrides allow fine-grained control of S3 and custom backend configuration without modifying config files — useful in containerized deployments where secrets are injected at runtime.
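
For example, a container entrypoint that injects secrets at runtime can populate the standard Restic variables directly. The variable names below are Restic's own; how replication-manager's per-backend overrides map onto them is left to the release documentation, and the values are placeholders.

```shell
# Standard Restic environment variables (Restic's own names);
# values are placeholders for a private S3-compatible store.
export RESTIC_REPOSITORY="s3:https://minio.internal:9000/repman-backups"
export RESTIC_PASSWORD_FILE="/run/secrets/restic_password"
export AWS_ACCESS_KEY_ID="replaced-at-deploy-time"
export AWS_SECRET_ACCESS_KEY="replaced-at-deploy-time"
```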

Progress Tracking (3.1.22)

  • Progress bar display in task tracking
  • API progress exposure for integration with external monitoring tools
  • Legacy snapshots API format supported via ?format=legacy for backward-compatible tooling
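
As a sketch of consuming the legacy format from older tooling — only the `?format=legacy` parameter comes from the release notes; the host, port, and path below are assumptions to be checked against your deployment's API routes:

```shell
# Build the snapshots request for backward-compatible tooling.
# Host, port, and path are hypothetical; adjust to your deployment.
BASE="https://repman.example.com:10005/api/clusters/cluster1"
SNAPSHOTS_URL="${BASE}/backups/snapshots?format=legacy"
# With a valid token, the request would be:
#   curl -sk -H "Authorization: Bearer $TOKEN" "$SNAPSHOTS_URL"
echo "$SNAPSHOTS_URL"
```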

Cross-Platform Mount Support (3.1.21)

Restic unmount now works across platforms. Mount directories are admin-selectable under strict path checks, eliminating the permission conflicts that affected earlier versions.

Human-Readable Backup Listings

Backup sizes are now displayed in human-readable format in both the API responses and the GUI table views.

Why it matters: A backup system that cannot be monitored or integrated is a liability. Progress visibility, S3 flexibility, and human-readable output reduce the operational gap between "backups configured" and "backups trusted."


Splitdump: Logical Backup Sharding

3.1.17 introduced the splitdump CLI tool. 3.1.18 integrated it fully into the cluster backup workflows.

Key capabilities:

  • Configurable shard size with UI control and per-cluster defaults
  • GTID handling during restore, ensuring replicas reconnect correctly after a splitdump restore
  • Missing table resilience: restores continue past missing tables rather than aborting
  • Binlog suppression during restore to avoid event amplification
  • Preamble helpers and schema file identification for correct restore ordering
  • Logical reseed async flow: hardened restore workflows for safer re-seeding of replicas
  • pgzip tuning exposure: configurable parallel compression for performance on large datasets
  • Local backup deletion to reclaim disk space after a restore

Why it matters: Large logical backups are difficult to parallelize and recover from selectively. Splitdump addresses both problems — sharded dumps enable parallel restore and selective table recovery without third-party tools.
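
A restore plan for a split dump can be sketched as below. The shard file names are hypothetical — the real layout comes from splitdump's preamble helpers and schema file identification — but `sql_log_bin` is the real MySQL/MariaDB session variable behind binlog suppression. The sketch only prints the commands; the data-shard lines could be fed to `xargs -P` for a parallel restore.

```shell
# Print a restore plan: schema shard first, then data shards in order.
# SET sql_log_bin=0 keeps restored rows out of the binary log, avoiding
# event amplification to downstream replicas.
for f in schema-00.sql.gz shard-01.sql.gz shard-02.sql.gz; do
  echo "zcat $f | mysql --init-command='SET sql_log_bin=0' mydb"
done
```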


Variable Override and Configuration Drift

A three-tier preserved variables system allows controlled manual overrides of database variables with UI indicators. The HasConfigDiff field and the Variables tab let operators see at a glance which variables deviate from expected configuration.

Why it matters: In production clusters, variables drift over time — applied hotfixes, emergency changes, vendor recommendations. Tracking that drift inside the orchestrator gives DBAs a single place to audit and correct it.


Replication Master Retry Count

The replication master retry count is now configurable from both the configuration file and the UI. This controls how many times a replica attempts to reconnect to the master before declaring it lost.

Why it matters: In unstable network environments, the default retry count leads to premature failovers. Tuning it per-cluster avoids unnecessary topology changes.
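
For reference, the server-side knobs this setting maps onto are standard replication options (shown here in MySQL syntax; MariaDB exposes the count as the `master_retry_count` server option). The replication-manager flag name itself is not shown here — check the configuration reference.

```sql
-- MySQL: retry interval and count are per-channel replication settings.
CHANGE MASTER TO
  MASTER_CONNECT_RETRY = 60,   -- seconds between reconnect attempts
  MASTER_RETRY_COUNT   = 10;   -- attempts before giving up on the master
```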


Docker Rootless Images

All Docker image variants now ship rootless counterparts running as the repman non-root user with fixed UID/GID 10001:10001.

To use rootless images with volume mounts:

sudo chown -R 10001:10001 /path/to/data /path/to/config

Why it matters: Container security requirements in regulated or multi-tenant environments mandate non-root execution. Fixed UID/GID ensures consistent permissions across hosts and rebuilds.


Bug Fixes Worth Noting

  • MyDumper compatibility: version-aware flag handling across MyDumper versions — no more silent restore failures in mixed environments
  • Arbitrator startup: no longer fails when the config backup directory is absent
  • SQL injection prevention: dbhelper package rewritten with parameterized queries and a vendor abstraction layer
  • Mattermost CVE-2025-12421: account takeover vulnerability fixed
  • TLS config registration: correct name used in mysql.RegisterTLSConfig
  • gRPC protobuf field collision in the Index struct resolved (3.1.21)
  • Restic purge deadlock eliminated (3.1.18)
  • Analyze table SQL injection hardening: qualified identifiers quoted, multi-dot names rejected (3.1.19)

Upgrade Notes

Docker rootless: ensure volume ownership before switching to rootless images.

Restic S3: the new endpoint and prefix fields are optional. Existing configurations continue to work without changes.

Splitdump: shard size defaults are applied per-cluster. Review the new UI control to align with your storage and restore SLA requirements.

Schema monitoring: scan timeout is configurable. Set it based on your largest table sizes to avoid monitoring loop delays.


replication-manager 3.1.22 is available on GitHub. Packages for RPM and DEB distributions are available through the standard release pipeline.