pg_rocket 🚀

Extract referentially complete data subsets from PostgreSQL by traversing foreign key relationships.

pg_rocket is a CLI tool that intelligently extracts data from PostgreSQL databases while maintaining referential integrity. Given a SQL query, it automatically discovers and fetches all related records across tables, producing a consistent subset you can use for testing, debugging, or data migration.

Why pg_rocket?

🔍 Debug production issues locally - Pull a specific tenant's data without copying entire databases
🧪 Realistic test data - Extract production subsets with all relationships intact
📦 Safe data migration - Move related records across databases with referential integrity guaranteed
🚀 Fast and deterministic - Batched queries with consistent, repeatable output

Features

✅ Automatic foreign key traversal (upward, downward, or bidirectional)
✅ Composite primary key support
✅ Self-referential foreign keys
✅ Multiple output formats (SQL INSERTs, JSON)
✅ Direct database-to-database transfer with --exec
✅ Upsert mode for successive runs (--upsert)
✅ JSONB null literal preservation
✅ Progress indicators and colored output
✅ Configurable row limits with safety checks
✅ Dry-run mode to preview extraction plan

Installation

Homebrew (macOS/Linux)

brew tap riyasyash/tools
brew install pg_rocket

From Binary

Download the latest release for your platform from GitHub Releases.

From Source

git clone https://github.com/riyasyash/pg_rocket.git
cd pg_rocket
make build
sudo make install  # Optional: installs to /usr/local/bin

Using Go Install

go install github.com/riyasyash/pg_rocket@latest

Requirements

Go 1.21 or later (if building from source)
PostgreSQL 12+ (source database)

Quick Start

# Set your source database
export PGROCKET_SOURCE="postgres://user:pass@localhost:5432/mydb"

# Inspect the foreign key graph
pg_rocket inspect

# Extract a user and all related data
pg_rocket pull --query "SELECT * FROM users WHERE id = 42" --out user_42.sql

# Direct database-to-database transfer
export PGROCKET_TARGET="postgres://admin:pass@localhost:5432/testdb"
pg_rocket pull --query "SELECT * FROM orders WHERE id = 1001" --exec --upsert

Commands

`pg_rocket pull`

Extract referentially complete data starting from a root SQL query.

Required

--query - SQL SELECT statement returning rows from a single table

Connection

--source - Source database DSN (default: $PGROCKET_SOURCE)
--target - Target database DSN for --exec mode (default: $PGROCKET_TARGET)

Traversal

--parents - Only traverse upward to parent records
--children table1,table2 - Only traverse specified child tables (comma-separated)
(default: full bidirectional traversal)

Output

--out filename - Write to file instead of stdout
--json - Output JSON instead of SQL INSERTs
--exec - Execute INSERTs directly to target database (requires confirmation)

Control

--upsert - Use ON CONFLICT DO UPDATE for idempotent successive runs (requires --exec)
--dry-run - Show extraction plan without executing
--max-rows N - Maximum rows to extract (default: 10000)
--force - Override row limit
--verbose - Print detailed traversal logs

`pg_rocket inspect`

Display the foreign key graph of your database.

pg_rocket inspect [--source DSN]

Example output:

users
  ↑ organizations (via org_id)
  ↓ posts (via author_id)
  ↓ comments (via user_id)

`pg_rocket version`

Display version information.

pg_rocket version

Usage Examples

Extract User with All Related Data

pg_rocket pull \
  --query "SELECT * FROM users WHERE email = 'alice@example.com'" \
  --out alice.sql \
  --verbose

Fetches:

The user record
Parent records (organization, etc.)
Child records (posts, comments, orders, etc.)
All transitively related data

Parents Only (Upward Traversal)

pg_rocket pull \
  --query "SELECT * FROM orders WHERE id = 1001" \
  --parents \
  --out order_1001.sql

Fetches order → user → organization (stops at root, no children).

Selective Children (Downward Traversal)

pg_rocket pull \
  --query "SELECT * FROM projects WHERE id = 5" \
  --children tasks,comments \
  --out project_5.sql

Fetches project + only tasks and comments (skips other child tables).

JSON Output

pg_rocket pull \
  --query "SELECT * FROM organizations WHERE id = 10" \
  --json \
  --out org_10.json

Produces structured JSON:

{
  "organizations": [
    {"id": 10, "name": "Acme Corp"}
  ],
  "users": [
    {"id": 1, "org_id": 10, "name": "Alice"},
    {"id": 2, "org_id": 10, "name": "Bob"}
  ]
}

Direct Database Transfer

# Production → Staging transfer
pg_rocket pull \
  --source "postgres://readonly@prod.db:5432/proddb?sslmode=require" \
  --target "postgres://admin@staging.db:5432/stagingdb?sslmode=require" \
  --query "SELECT * FROM tenants WHERE name = 'acme'" \
  --exec \
  --verbose

The tool will:

Extract data from production
Display summary and prompt for confirmation
Insert into staging with foreign key integrity validation

Successive Runs with Upsert

# First run - inserts data
pg_rocket pull \
  --query "SELECT * FROM users WHERE id = 42" \
  --exec

# Second run - updates existing, inserts new (no duplicate key errors)
pg_rocket pull \
  --query "SELECT * FROM users WHERE id = 42" \
  --exec \
  --upsert

Configuration

Connection Strings

Format:

postgres://[user[:password]@][host][:port][/database][?param=value]

Examples:

# Local development
postgres://myuser:mypass@localhost:5432/mydb

# Production with SSL (recommended)
postgres://user:pass@prod.com:5432/db?sslmode=require

# AWS RDS
postgres://master:pass@mydb.abc.us-east-1.rds.amazonaws.com:5432/mydb?sslmode=require

# With certificate verification
postgres://user:pass@host:5432/db?sslmode=verify-full&sslrootcert=/path/to/ca.crt

SSL/TLS Security

Mode	Security	Description
`disable`	None	No encryption
`allow`	Low	Tries non-SSL first
`prefer`	Medium	Tries SSL first (default)
`require`	Good	Requires SSL (recommended for production)
`verify-ca`	Better	SSL + certificate verification
`verify-full`	Best	SSL + cert + hostname verification

Production recommendation: Always use sslmode=require or higher.

Environment Variables

# Source database (for reading)
export PGROCKET_SOURCE="postgres://user:pass@host:5432/db?sslmode=require"

# Target database (for --exec mode)
export PGROCKET_TARGET="postgres://admin:pass@host:5432/targetdb?sslmode=require"

Priority: CLI flags (--source, --target) override environment variables.

Safety Features

Pre-Execution Validation

pg_rocket validates all commands before running:

✅ Source database must be specified (flag or env var)
✅ Target database required for --exec mode
✅ Source and target must be different (prevents accidental self-writes)
✅ --exec and --out are mutually exclusive
✅ --upsert requires --exec

User Confirmation for Database Writes

When using --exec, you'll see:

⚠️  DATABASE WRITE OPERATION
============================================================
Source:
  postgres://***@prod-server:5432/proddb

Target (will be modified):
  postgres://***@staging-server:5432/stagingdb

Data to be inserted:
  Tables: 7
  Total rows: 1,423
============================================================

Are you sure you want to INSERT this data into the target database? (yes/no):

Type yes or y to proceed. Any other response safely cancels.

Row Limits

Default: 10,000 rows. Override with --max-rows or --force:

# Custom limit
pg_rocket pull --query "SELECT * FROM users" --max-rows 50000

# No limit (use with caution)
pg_rocket pull --query "SELECT * FROM users" --force

Foreign Key Integrity Validation

Before inserting into the target database, pg_rocket validates that all foreign key references exist. If validation fails, no data is inserted.

How It Works

Query Analysis: Parses your SQL query using EXPLAIN to identify the base table
Schema Discovery: Extracts foreign key and primary key metadata from pg_catalog
Graph Traversal: Performs breadth-first search following FK relationships
Topological Sorting: Orders tables to respect dependencies (parents before children)
Output Generation: Produces SQL INSERTs or JSON with deterministic ordering

Traversal Algorithm

Root Query → [users WHERE id = 42]
    ↓
Fetch Parents (upward):
    organizations ← users.org_id
        ↓
    regions ← organizations.region_id
    
Fetch Children (downward):
    posts ← users.id (author_id)
        ↓
    comments ← posts.id (post_id)
    ↓
    orders ← users.id (user_id)

All discovered rows are:

Deduplicated by primary key
Sorted deterministically
Validated for FK integrity

Schema Requirements

✅ Supported

Tables with primary keys (single or composite)
Foreign key relationships in the public schema
Self-referential foreign keys (e.g., users.manager_id → users.id)
All standard PostgreSQL data types (including JSONB, arrays, UUIDs, etc.)

❌ Not Supported

Tables without primary keys
Multi-table cyclic foreign keys (A → B → C → A)
Cross-schema relationships (only public schema)

Compatibility Check

Run pg_rocket inspect to verify your schema is compatible.

Troubleshooting

"table does not have a primary key"

Solution: Add a primary key to the table:

ALTER TABLE my_table ADD PRIMARY KEY (id);

"cycle detected in table dependencies"

Problem: Multi-table FK cycle exists (e.g., A → B → B → A).

Solution: Use --parents or --children to specify traversal direction and break the cycle:

pg_rocket pull --query "SELECT * FROM table_a WHERE id = 1" --parents

"primary key column not included in SELECT"

Problem: Your query doesn't include the primary key column.

Solution: Include the PK or use SELECT *:

-- Bad
SELECT name FROM users WHERE id = 1

-- Good
SELECT * FROM users WHERE id = 1

"foreign key integrity violation"

Problem: Extracted data references parent records that weren't extracted.

Solution: Use full traversal (remove --parents or --children flags) or check your FK relationships.

Performance

Batching: Queries are automatically batched (500 rows per query)
Memory: All data is held in memory before output
Indexing: Ensure FK columns are indexed for optimal performance
Network: Uses connection pooling for efficient database communication

Benchmarks

Typical extraction performance (on a modern workstation):

Small datasets (<100 rows): <1 second
Medium datasets (1,000-10,000 rows): 2-10 seconds
Large datasets (50,000+ rows): 30+ seconds

Development

Building from Source

make build        # Compile binary
make install      # Install to /usr/local/bin
make clean        # Remove build artifacts

Running Tests

make test         # Run integration tests (requires Docker)

Tests use Docker Compose to spin up a PostgreSQL container with test schemas.

Project Structure

pg_rocket/
├── cmd/                    # CLI command definitions
│   ├── root.go            # Root cobra command
│   ├── pull.go            # Pull command (main logic)
│   ├── inspect.go         # Inspect command
│   └── version.go         # Version command
├── internal/
│   ├── db/                # Database layer
│   │   ├── connection.go  # Connection pooling
│   │   ├── metadata.go    # FK/PK extraction
│   │   └── explain.go     # Query analysis
│   ├── graph/             # FK graph structures
│   │   ├── types.go       # Graph data structures
│   │   └── topo.go        # Topological sorting
│   ├── extractor/         # Core extraction engine
│   │   ├── traversal.go   # BFS traversal
│   │   └── progress.go    # Progress tracking
│   └── output/            # Output writers
│       ├── sql_writer.go  # SQL INSERT generation
│       ├── json_writer.go # JSON output
│       └── executor.go    # Direct DB execution
├── test/
│   └── integration/       # Docker-based integration tests
│       ├── docker-compose.yml
│       ├── run_tests.sh
│       └── test_schemas/
└── Makefile

Advanced Usage

Handling Large Datasets

# Extract with higher limit and force flag
pg_rocket pull \
  --query "SELECT * FROM events WHERE created_at > '2024-01-01'" \
  --max-rows 100000 \
  --force \
  --out events_2024.sql

Production → Staging Sync

#!/bin/bash
# sync_tenant.sh - Sync a tenant from prod to staging

TENANT_ID=$1

pg_rocket pull \
  --source "postgres://readonly@prod.db:5432/proddb?sslmode=require" \
  --target "postgres://admin@staging.db:5432/stagingdb?sslmode=require" \
  --query "SELECT * FROM tenants WHERE id = '$TENANT_ID'" \
  --exec \
  --upsert \
  --verbose

Extracting Tenant Data

# Multi-tenant app: extract everything for tenant-123
pg_rocket pull \
  --query "SELECT * FROM tenants WHERE id = 'tenant-123'" \
  --out tenant_123.sql \
  --max-rows 50000 \
  --force

Date-Scoped Extraction

# Extract last week's orders with all related data
pg_rocket pull \
  --query "SELECT * FROM orders WHERE created_at > NOW() - INTERVAL '7 days'" \
  --json \
  --out recent_orders.json

How Foreign Key Traversal Works

pg_rocket uses breadth-first search to discover related records:

Upward Traversal (Parents)

Follows FKs "up" to parent tables:

orders (id=1001) → users (id=42) → organizations (id=5)

Downward Traversal (Children)

Follows FKs "down" to child tables:

users (id=42) → posts (author_id=42) → comments (post_id=...)
               → orders (user_id=42) → line_items (order_id=...)

Full Traversal (Default)

Combines both directions to get the complete relational closure.

Cycle Handling

Self-referential FKs: Fully supported (e.g., users.manager_id → users.id)
Multi-table cycles: Not supported - use --parents or --children to break the cycle

Deterministic Output

Given identical inputs and database state, pg_rocket produces byte-for-byte identical output:

Rows are sorted by primary key within each table
Tables are sorted topologically (parents before children)
Timestamps and output are consistent

This makes pg_rocket suitable for:

Version control of test fixtures
Regression testing
Auditable data exports

Limitations

Single schema: Only traverses the public schema
PostgreSQL only: No support for MySQL, SQLite, etc.
In-memory: All data is loaded into memory before output
No data masking: Use separate tools for anonymization
No parallel extraction: Single-threaded traversal

Security Considerations

⚠️ Important Security Notes:

Credential Management: Never commit DSNs with passwords to version control. Use environment variables or secure credential stores.

Read-Only Connections: Use read-only database users for --source when possible:

CREATE USER pg_rocket_readonly WITH PASSWORD 'secure_password';
GRANT CONNECT ON DATABASE mydb TO pg_rocket_readonly;
GRANT USAGE ON SCHEMA public TO pg_rocket_readonly;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO pg_rocket_readonly;

SSL/TLS: Always use sslmode=require or higher for production:

export PGROCKET_SOURCE="postgres://user:pass@prod:5432/db?sslmode=require"

Row Limits: Default 10,000-row limit prevents accidental large extractions. Use --max-rows or --force consciously.
Confirmation Prompts: --exec mode requires explicit user confirmation before database writes.
Separate Environments: The tool validates that source and target are different to prevent accidental overwrites.

Roadmap

Planned for future releases:

Multi-schema support
Data masking/anonymization
Parallel extraction for performance
MySQL and SQLite support
Configuration files (.pg_rocket.yml)
Incremental extraction (delta mode)

Contributing

Contributions are welcome!

Fork the repository
Create a feature branch
Make your changes and add tests
Submit a pull request

See CONTRIBUTING.md for details.

License

MIT License - see LICENSE file for details.

Support

Issues: GitHub Issues
Discussions: GitHub Discussions

Built with Go and pgx

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.github		.github
cmd		cmd
docs		docs
internal		internal
test/integration		test/integration
.gitignore		.gitignore
.goreleaser.yaml		.goreleaser.yaml
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
go.mod		go.mod
go.sum		go.sum
main.go		main.go

License

riyasyash/pg_rocket

Folders and files

Latest commit

History

Repository files navigation

pg_rocket 🚀

Why pg_rocket?

Features

Installation

Homebrew (macOS/Linux)

From Binary

From Source

Using Go Install

Requirements

Quick Start

Commands

pg_rocket pull

Required

Connection

Traversal

Output

Control

pg_rocket inspect

pg_rocket version

Usage Examples

Extract User with All Related Data

Parents Only (Upward Traversal)

Selective Children (Downward Traversal)

JSON Output

Direct Database Transfer

Successive Runs with Upsert

Configuration

Connection Strings

SSL/TLS Security

Environment Variables

Safety Features

Pre-Execution Validation

User Confirmation for Database Writes

Row Limits

Foreign Key Integrity Validation

How It Works

Traversal Algorithm

Schema Requirements

✅ Supported

❌ Not Supported

Compatibility Check

Troubleshooting

"table does not have a primary key"

"cycle detected in table dependencies"

"primary key column not included in SELECT"

"foreign key integrity violation"

Performance

Benchmarks

Development

Building from Source

Running Tests

Project Structure

Advanced Usage

Handling Large Datasets

Production → Staging Sync

Extracting Tenant Data

Date-Scoped Extraction

How Foreign Key Traversal Works

Upward Traversal (Parents)

Downward Traversal (Children)

Full Traversal (Default)

Cycle Handling

Deterministic Output

Limitations

Security Considerations

Roadmap

Contributing

License

Support

About

Resources

License

Contributing

Uh oh!

`pg_rocket pull`

`pg_rocket inspect`

`pg_rocket version`

Packages