Automating Safe Postgres/Alembic Database Migrations
Background
From 2022 to 2025, I worked at a fintech company called Pinwheel. Pinwheel’s goal as a company is to help banks and financial institutions reduce time and friction in user activation. Historically, the main product for doing this has been Pinwheel’s direct deposit switch solution. With Pinwheel, an end user can easily find the payroll account associated with their employer and move their direct deposit from the old financial institution to the new bank. After the user logs into their payroll account through Pinwheel, Pinwheel uses that authenticated session to retrieve a variety of payroll data, including employment, identity, income, paystubs, and shifts.
Problem
The payroll data described above is useful, but the Pinwheel product engineering team wanted to dig deeper and build more sophisticated products. These products (earnings stream, verification of income/employment reports, derived pay frequency) used payroll data primitives (paystubs, employment, etc.) to answer questions such as:
- "When will this end user get paid, and how much?"
- "Does this end user have stable employment?"
To do this, we needed to query our existing payroll data in new ways. Pinwheel stores data in a variety of formats, but its primary workhorse is AWS Aurora PostgreSQL, a managed relational database fully compatible with PostgreSQL. Supporting new query patterns meant introducing new database tables, columns, and indexes on existing columns—operations known as Data Definition Language (DDL) commands.
We managed these DDL operations using the Python library Alembic, which integrated well with our SQLAlchemy ORM and FastAPI backend. The workflow involved:
- Generating a Python migration file using an Alembic command:

  ```sh
  poetry run alembic revision --autogenerate -m "a description of the migration"
  ```
- Defining the DDL operation in the migration file using Python code.
- Storing the migration file in source control and applying it to different environments (staging, production) before application deployment.
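For context, autogenerate works by diffing the SQLAlchemy models against the live database schema. Here is a minimal sketch of the kind of model change that drives it; this `Account` model is illustrative, not Pinwheel's actual schema:

```python
import uuid
from typing import Optional

from sqlalchemy.dialects import postgresql
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class Account(Base):
    __tablename__ = 'accounts'

    id: Mapped[uuid.UUID] = mapped_column(postgresql.UUID(as_uuid=True), primary_key=True)
    # Adding this attribute and running the autogenerate command above
    # yields an add_column migration like the example in the next section.
    new_column: Mapped[Optional[uuid.UUID]] = mapped_column(postgresql.UUID(as_uuid=True), nullable=True)
```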
Example Migration File
This is a representative example of one of these migration files.
"""
Revision ID: 518e44493d89
Revises: 2c17e948fc11
Create Date: 2025-02-02 21:30:35.955557+00:00
"""
from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects import postgresql
revision = '518e44493d89'
down_revision = '2c17e948fc11'
branch_labels = None
depends_on = None
def upgrade():
op.add_column('accounts', sa.Column('new_column', postgresql.UUID(), nullable=True))
def downgrade():
op.drop_column('accounts', 'new_column')
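Applying the migration to a given environment is then a single Alembic command, run with that environment's database configuration:

```sh
poetry run alembic upgrade head
```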
Challenges with DDL Operations in Production
Frequent DDL operations on the production PostgreSQL database introduced potential risks. Index creation, in particular, had several subtleties. By default, creating an index allows reads but blocks inserts, updates, and deletes on the table for the duration of the build, because `CREATE INDEX` takes a SHARE lock (see the PostgreSQL documentation on explicit locking).
To mitigate this, PostgreSQL supports concurrent index creation via `CREATE INDEX CONCURRENTLY` (see the PostgreSQL documentation on building indexes concurrently). This feature is supported by SQLAlchemy/Alembic, but we found that:
- There was no built-in way to mandate concurrent indexing.
- Forgetting the concurrent flag wouldn’t cause issues locally or in staging—only in production.
Initially, the team relied on a manual code review checklist to ensure best practices, but over time, rigor decreased, and errors crept in. The result? Brief spikes in customer-visible API errors during index creation.
Automating Database Migration Validation
To eliminate human error, we decided to codify these checks into a CI/CD step. We aimed to answer the question:
"Is this database migration safe to apply to production, according to our best practices?"
Design Choices
We considered building a full parser for Alembic migrations but opted for a simpler heuristic-based approach to speed up implementation. The linter:
- Identifies new migration files by comparing the branch against `main`:

  ```sh
  current_main_hash=$(git log -n 1 main --format="%H")
  git diff --name-only "$current_main_hash" HEAD alembic > altered_alembic_files.txt
  ```

- Extracts the `upgrade()` function from each migration file.
- Scans for index creation by counting occurrences of `create_index`. If none are found, the file is skipped.
- Ensures each index creation is concurrent and formatted correctly:
  - `create_index` and `postgresql_concurrently=True` have the same number of occurrences.
  - A `COMMIT` command exists and precedes index creation. (Alembic runs each migration inside a transaction, and `CREATE INDEX CONCURRENTLY` cannot run inside a transaction block, so the implicit transaction has to be committed first.)
  - Index creation is not interleaved with any other DDL operations.
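To make those heuristics concrete, here is a minimal sketch of such a linter. It is illustrative rather than Pinwheel's actual script: the file names, the regex-based `upgrade()` extraction, and the set of flagged DDL operations are all assumptions.

```python
import re
import sys


def check_upgrade(body: str) -> list[str]:
    """Apply the heuristic checks described above to an upgrade() body."""
    errors = []
    n_index = body.count("create_index")
    if n_index == 0:
        return errors  # no index creation, nothing to validate
    # Every create_index must carry the concurrent flag.
    if body.count("postgresql_concurrently=True") != n_index:
        errors.append("create_index without postgresql_concurrently=True")
    # A COMMIT must exist, and it must come before the first create_index.
    commit_pos = body.find("COMMIT")
    if commit_pos == -1 or commit_pos > body.find("create_index"):
        errors.append("COMMIT missing or placed after index creation")
    # Index creation must not be mixed with other DDL operations.
    other_ddl = re.findall(
        r"op\.(add_column|drop_column|create_table|drop_table|alter_column)", body
    )
    if other_ddl:
        errors.append(f"index creation mixed with other DDL: {sorted(set(other_ddl))}")
    return errors


if __name__ == "__main__":
    failed = False
    with open("altered_alembic_files.txt") as file_list:
        for path in filter(None, (line.strip() for line in file_list)):
            with open(path) as f:
                source = f.read()
            # Naive extraction of the upgrade() body: everything between
            # "def upgrade():" and "def downgrade():" (or end of file).
            match = re.search(r"def upgrade\(\):(.*?)(?=def downgrade\(\)|\Z)", source, re.S)
            for error in check_upgrade(match.group(1) if match else ""):
                print(f"{path}: {error}")
                failed = True
    sys.exit(1 if failed else 0)
```

Each violation prints as a one-line, human-readable reason tied to a specific file, which is what kept failures legible in CI.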
Example Checks
✅ Good Upgrade Function
```python
def upgrade():
    op.execute('COMMIT;')
    op.create_index('ix_accounts_new_column', 'accounts', ['new_column'], unique=False, postgresql_concurrently=True)
```
❌ Bad Upgrade Functions
1. Non-concurrent index creation:
```python
def upgrade():
    op.execute('COMMIT;')
    op.create_index('ix_accounts_new_column', 'accounts', ['new_column'], unique=False)
```
2. COMMIT occurs after index creation:
```python
def upgrade():
    op.create_index('ix_accounts_new_column', 'accounts', ['new_column'], unique=False, postgresql_concurrently=True)
    op.execute('COMMIT;')
```
3. Another DDL operation occurs alongside index creation:
```python
def upgrade():
    op.execute('COMMIT;')
    op.create_index('ix_accounts_new_column', 'accounts', ['new_column'], unique=False, postgresql_concurrently=True)
    op.add_column('accounts', sa.Column('even_newer_column', postgresql.UUID(), nullable=True))
```
Adoption and Impact
The linter was integrated into CircleCI, running on every commit. Since we shipped the linter in February 2024, none of Pinwheel's production incidents have been caused by database migrations.
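As a sketch of that wiring (the script path `scripts/lint_migrations.py` is hypothetical), the CI step simply chains the diff from earlier with the linter, and a nonzero exit code fails the build:

```sh
#!/usr/bin/env bash
set -euo pipefail

# Collect migration files changed relative to main, then lint them;
# a nonzero exit from the linter fails the CI job.
current_main_hash=$(git log -n 1 main --format="%H")
git diff --name-only "$current_main_hash" HEAD alembic > altered_alembic_files.txt
python scripts/lint_migrations.py
```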
Takeaways
Automating validation for database migrations improved developer productivity and system reliability by replacing an error-prone manual step. When evaluating similar tooling investments, consider:
- Signal-to-noise ratio: The linter flagged genuine issues with very few false positives (I verified this by combing through months of CircleCI builds and comparing every failure of the new linting step against the corresponding commit).
- Complexity of implementation: The linter was a few Python/shell scripts with error messages that were legible to developers, which made it straightforward to operate and easy to extend.
- Bypass mechanisms: I didn't implement an `#ignore` flag, but code reviewers could approve commenting out the checks in rare cases.
By eliminating manual review steps and enforcing best practices, we ensured that migrations were consistently safe, reducing both developer toil and potential production incidents.
TODO: Full CI/CD Script and Python Linter
To complement this post, I've included the full CI/CD shell script and Python linter. These scripts automate the validation of Alembic migrations and ensure best practices are followed before deployment.