
Job Traceability, Management, and Auditability Overhaul #645

Open
bencap wants to merge 76 commits into release-2026.2.0 from feature/bencap/627/job-traceability

Conversation


@bencap bencap commented Jan 29, 2026

This PR introduces a robust, auditable, and maintainable background job system for MaveDB, supporting both standalone jobs and complex pipelines. It provides a strong foundation for future workflow automation, error recovery, and developer onboarding.

Features include:

1. Worker Job System Refactor & Enhancements
Refactored the monolithic worker job system into a modular architecture:

  • Split jobs into domain-specific modules: data_management, external_services, variant_processing, etc. (a94c2fb, new files in jobs).
  • Centralized job registration in registry.py for ARQ worker configuration (a94c2fb, 0416b2d).
  • Added standalone job definitions and improved lifecycle context for job submission (0416b2d, 60ef67d).
  • Integrated PipelineFactory for variant creation and update processes (987b38a).

2. Job & Pipeline Management Infrastructure
Implemented JobManager and PipelineManager classes for robust job and pipeline lifecycle management: (05fc52b, ae18eeb, 3799d84, 1e447a7, 7b44346)

  • Atomic state transitions, progress tracking, retry logic, and error handling.
  • Pipeline coordination, dependency management, pausing/unpausing, and cancellation.
  • Custom exception hierarchies for clear error recovery.
  • Added context manager for database session management and streamlined context handling in decorators (3ca697a, c61bd41, 010f15c).

3. Decorator System for Jobs and Pipelines
Introduced decorators for job and pipeline management:
(c2100a2, 155e549, 4a4055d, 3c4e6b9, 010f15c)

  • with_job_management: Automatic job lifecycle management, error handling, and progress tracking.
  • with_pipeline_management: Pipeline coordination after job completion.
  • with_guaranteed_job_run_record: Ensures audit trail by persisting a JobRun record before execution.
    Improved test mode support and simplified stacking/usage patterns.
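The lifecycle pattern the decorators implement can be sketched as follows. This is a simplified, hypothetical illustration of what with_job_management might do; the real decorator in this PR also manages database sessions, progress tracking, and Slack notifications, and the job function and context shape here are placeholders.

```python
import asyncio
import functools

def with_job_management(func):
    """Sketch: record a job's outcome in the worker context as it runs."""
    @functools.wraps(func)
    async def wrapper(ctx, *args, **kwargs):
        job_state = {"name": func.__name__, "status": "running"}
        try:
            result = await func(ctx, *args, **kwargs)
            job_state["status"] = "completed"
            return result
        except Exception as exc:
            job_state["status"] = "failed"
            job_state["error"] = str(exc)
            raise
        finally:
            # In the real system this state would be persisted to a JobRun row.
            ctx.setdefault("job_states", []).append(job_state)
    return wrapper

@with_job_management
async def refresh_materialized_views(ctx):  # placeholder job body
    return "ok"

ctx = {}
print(asyncio.run(refresh_materialized_views(ctx)))       # ok
print(ctx["job_states"][0]["status"])                     # completed
```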

4. Comprehensive Test Suite
Added and refactored unit and integration tests for all job modules, managers, and decorators.
(05fc52b, ae18eeb, a701d53, 806f8ed, 011522c, 010f15c, 8a22306, a716cc9, b0397b4, 8c5e225, 3c4e6b9, 4a4055d, 1fe076a, 1abe4c6)
Enhanced test coverage for error handling, state transitions, and job orchestration.
Introduced fixtures and utilities for easier test setup and mocking.
Categorized tests with markers for unit, integration, and network tests.
(16a5a50, f34939c)

5. Developer Documentation
Added detailed markdown documentation in the worker/jobs/ directory:
(1abe4c6)

  • System overview, decorator usage, manager responsibilities, pipeline management, job registry/configuration, and best practices.
  • Entry-point README and table of contents for easy navigation.
  • Guidance on error handling, job design, and testing strategies.

6. Database & Model Changes

  • Added new tables and enums for job traceability (JobRun, Pipeline, JobDependency, etc.) (1db6b68).
  • Alembic migration for pipeline and job tracking schema.
  • Updated models and enums to support new job/pipeline features.

7. Miscellaneous Improvements

  • Dependency updates (e.g., added asyncclick) (a3f36d1).

  • Logging and error reporting enhancements.
  • Cleaned up legacy code, removed obsolete files, and improved code organization.

@bencap bencap force-pushed the feature/bencap/627/job-traceability branch 6 times, most recently from da26721 to f3ea5ce Compare February 4, 2026 23:08
Base automatically changed from feature/bencap/derived-gene-name-from-mapped-output to release-2025.6.0 February 4, 2026 23:58
Base automatically changed from release-2025.6.0 to main February 6, 2026 19:08
…cture

Break down 1767-line jobs.py into domain-driven modules, improving
maintainability and developer experience.

- variant_processing/: Variant creation and VRS mapping
- external_services/: ClinGen, UniProt, gnomAD integrations
- data_management/: Database and view operations
- utils/: Shared utilities (state, retry, constants)
- registry.py: Centralized ARQ job configuration

- constants.py: Environment configuration
- redis.py: Redis connection settings
- lifecycle.py: Worker lifecycle hooks
- worker.py: Main ArqWorkerSettings class

- All job functions maintain identical behavior
- Registry provides BACKGROUND_FUNCTIONS/BACKGROUND_CRONJOBS lists for ARQ initialization
- Test structure mirrors source organization

This refactor ensures ARQ worker initialization is backwards compatible.

The modular architecture establishes a more maintainable foundation for MaveDB's automated processing workflows while preserving all existing functionality.
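The centralized registry described above can be sketched like this. The list names BACKGROUND_FUNCTIONS/BACKGROUND_CRONJOBS and the ArqWorkerSettings class come from the commit message; the job functions themselves are placeholders, not the actual MaveDB jobs, and the real settings class carries additional ARQ configuration (Redis settings, lifecycle hooks, etc.).

```python
# registry.py (sketch): each domain module's jobs are collected here so the
# ARQ worker has a single source of truth for what it can run.
async def create_variants_for_score_set(ctx, score_set_urn):
    ...  # placeholder

async def refresh_published_variants_view(ctx):
    ...  # placeholder

BACKGROUND_FUNCTIONS = [
    create_variants_for_score_set,
    refresh_published_variants_view,
]
BACKGROUND_CRONJOBS = []  # would hold arq.cron(...) entries in the real worker

# worker.py (sketch): ARQ reads these attributes at startup.
class ArqWorkerSettings:
    functions = BACKGROUND_FUNCTIONS
    cron_jobs = BACKGROUND_CRONJOBS

print([f.__name__ for f in ArqWorkerSettings.functions])
```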
Implement complete database foundation for pipeline-based job tracking and monitoring:

Database Tables:
• pipelines - High-level workflow grouping with correlation IDs for end-to-end tracing
• job_runs - Individual job execution tracking with full lifecycle management
• job_dependencies - Workflow orchestration with success/completion dependency types
• job_metrics - Detailed performance metrics (CPU, memory, execution time, business metrics)
• variant_annotation_status - Granular variant-level annotation tracking with success data

Key Features:
• Pipeline workflow management with dependency resolution
• Comprehensive job lifecycle tracking (pending → running → completed/failed)
• Retry logic with configurable limits and backoff strategies
• Resource usage and performance metrics collection
• Variant-level annotation status for debugging failures
• Correlation ID support for request tracing across the system
• JSONB metadata fields for flexible job-specific data
• Optimized indexes for common query patterns

Schema Design:
• Foreign key relationships maintain data integrity
• Check constraints ensure valid enum values and positive numbers
• Strategic indexes optimize dependency resolution and metrics queries
• Cascade deletes prevent orphaned records
• Version tracking for audit and debugging

Models & Enums:
• SQLAlchemy models with proper relationships and hybrid properties
• Comprehensive enum definitions for job/pipeline status and failure categories
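The status enums and the pending → running → completed/failed lifecycle might look roughly like this. The member names here are assumptions based on the states mentioned in this PR (PENDING, QUEUED, RUNNING, COMPLETED, FAILED, CANCELLED, SKIPPED); the actual enum definitions in the models may differ.

```python
import enum

class JobStatus(enum.Enum):
    """Illustrative job lifecycle states; exact members in the PR may differ."""
    PENDING = "pending"
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"
    SKIPPED = "skipped"

# Terminal states: once reached, a job run will not transition again.
TERMINAL_STATES = {
    JobStatus.COMPLETED,
    JobStatus.FAILED,
    JobStatus.CANCELLED,
    JobStatus.SKIPPED,
}

def is_terminal(status: JobStatus) -> bool:
    return status in TERMINAL_STATES

print(is_terminal(JobStatus.RUNNING))  # False
print(is_terminal(JobStatus.FAILED))   # True
```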
Add comprehensive job lifecycle management with status-based completion:

* Implement convenience methods for common job outcomes:
  - succeed_job() for successful completion
  - fail_job() for error handling with exception details
  - cancel_job() for user/system cancellation
  - skip_job() for conditional job skipping
* Enhance progress tracking with increment_progress() and set_progress_total()
* Add comprehensive error handling with specific exception types
* Improve job state validation and atomic transaction handling
* Implement extensive test coverage for all job operations
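The convenience methods listed above could be used roughly as follows. This is a minimal in-memory sketch; the real JobManager persists state to JobRun rows inside atomic database transactions and validates state transitions before applying them.

```python
class JobManager:
    """Sketch of the status-based completion API described in this commit."""

    def __init__(self, job_id):
        self.job_id = job_id
        self.status = "running"
        self.progress = 0
        self.progress_total = None
        self.error = None

    def set_progress_total(self, total):
        self.progress_total = total

    def increment_progress(self, amount=1):
        self.progress += amount

    def succeed_job(self):
        self.status = "completed"

    def fail_job(self, exc):
        self.status = "failed"
        self.error = repr(exc)  # real manager records richer exception details

    def cancel_job(self):
        self.status = "cancelled"

    def skip_job(self, reason=""):
        self.status = "skipped"
        self.error = reason or None

mgr = JobManager(job_id=1)
mgr.set_progress_total(3)
for _ in range(3):
    mgr.increment_progress()
mgr.succeed_job()
print(mgr.status, mgr.progress)  # completed 3
```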
- Created PipelineManager capable of coordinating jobs within a pipeline context
- Introduced `construct_bulk_cancellation_result` to standardize cancellation result structures.
- Added `job_dependency_is_met` to check job dependencies based on their types and statuses.
- Created comprehensive tests for PipelineManager covering initialization, job coordination, status transitions, and error handling.
- Implemented mocks for database and Redis dependencies to isolate tests.
- Added tests for job enqueuing, cancellation, pausing, unpausing, and retrying functionalities.
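The dependency check described above might be sketched as below. The PR distinguishes success-type dependencies (upstream must complete successfully) from completion-type dependencies (upstream must merely reach a terminal state); the string values used here are assumptions, not the actual enum members.

```python
def job_dependency_is_met(dependency_type: str, upstream_status: str) -> bool:
    """Sketch: evaluate a dependency against an upstream job's status."""
    if dependency_type == "success":
        # Upstream must have finished successfully.
        return upstream_status == "completed"
    if dependency_type == "completion":
        # Upstream just needs to be in any terminal state.
        return upstream_status in {"completed", "failed", "cancelled", "skipped"}
    raise ValueError(f"Unknown dependency type: {dependency_type}")

print(job_dependency_is_met("success", "completed"))    # True
print(job_dependency_is_met("success", "failed"))       # False
print(job_dependency_is_met("completion", "failed"))    # True
```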
Adds decorators for managed jobs and pipelines. These can be applied to async ARQ functions to automatically persist their state as they execute.
In certain instances (cron jobs in particular), worker processes are invoked
from contexts where we have not yet added a job run record to the database.
In such cases, it becomes useful to first guarantee a minimal record is added
to the database such that the job run can be tracked via existing managed job
decorators.

This feature adds such a decorator and associated tests.
Since decorators are applied at import time, this test mode path is a pragmatic solution to run decorators without side effects during unit tests. It's more straightforward and maintainable than other solutions, and still lets us import job definitions up front to register with ARQ.
Additionally contains some small updates to how decorator unit tests handle the new test mode flag.
- Integrated `send_slack_error` calls in multiple test cases across different modules to ensure error notifications are sent when exceptions occur.
- Updated tests for materialized views, published variants, ClinGen submissions, gnomAD linking, UniProt mappings, pipeline management, and variant processing to assert that Slack notifications are triggered on failures.
- Enhanced error handling in job management decorators to include Slack notifications for missing context and job failures.
@bencap bencap force-pushed the feature/bencap/627/job-traceability branch from f3ea5ce to 33be31f Compare February 17, 2026 03:16
…calls

Implements 24-hour Redis cache for ClinGen Allele Registry API responses,
significantly reducing API load when processing multiple ClinVar control
versions that query the same alleles. Converts three ClinGen functions to
async with @cached decorator, implements memory backend for testing, and
handles 404 responses as cacheable "no data" results while raising exceptions
for other API failures. Includes comprehensive test coverage and type stubs
for the untyped aiocache library.

- Add aiocache optional dependency with Redis backend support
- Create cache configuration module with environment-based backend selection
- Convert get_canonical_pa_ids, get_matching_registered_ca_ids, and
  get_associated_clinvar_allele_id to async cached functions
- Return empty string/list for "no data" cases to enable caching of modal outcomes
- Implement 404-specific error handling: cache permanent absences, raise for transient failures
- Add memory cache backend for testing without Redis dependency
- Create type stubs for aiocache.Cache and aiocache.cached decorator
- Add 43 new tests covering caching behavior, configuration, and network interactions
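The "cache permanent absences, raise for transient failures" strategy can be illustrated with a hand-rolled in-memory TTL cache. The real implementation uses the aiocache library with a Redis backend and a 24-hour TTL; the decorator below is a self-contained stand-in, and the function name get_canonical_pa_ids is taken from the commit while its body is fake.

```python
import asyncio
import functools
import time

def cached(ttl):
    """Sketch of a per-function async TTL cache (stand-in for aiocache)."""
    def decorator(func):
        store = {}
        @functools.wraps(func)
        async def wrapper(*args):
            now = time.monotonic()
            if args in store and now - store[args][0] < ttl:
                return store[args][1]  # cache hit: skip the API call
            result = await func(*args)
            store[args] = (now, result)
            return result
        return wrapper
    return decorator

calls = []

@cached(ttl=86400)  # 24 hours, matching the commit
async def get_canonical_pa_ids(hgvs):
    calls.append(hgvs)          # stands in for a ClinGen API request
    if hgvs == "missing":
        return []               # 404 => permanent "no data", cacheable
    return [f"PA:{hgvs}"]       # transient failures would raise instead

async def main():
    first = await get_canonical_pa_ids("NM_000546.6:c.215C>G")
    second = await get_canonical_pa_ids("NM_000546.6:c.215C>G")  # cache hit
    return first, second

print(asyncio.run(main()), "API calls:", len(calls))
```

Returning an empty list for the 404 case (rather than raising) is what makes the "no data" outcome cacheable, so repeated ClinVar control versions querying the same absent allele never re-hit the registry.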
@bencap bencap force-pushed the feature/bencap/627/job-traceability branch from 1aa5409 to 1fb9fdd Compare February 17, 2026 19:40
Add periodic cleanup job to detect and recover jobs stuck in QUEUED,
RUNNING, or PENDING states beyond timeout thresholds. Jobs can become
stalled due to worker crashes, race conditions during enqueue, network
issues, or database transaction failures.

Cleanup logic:
- QUEUED jobs stalled >10 min (stuck between state change and ARQ pickup)
- RUNNING jobs stalled >60 min (worker likely crashed mid-execution)
- PENDING jobs stalled >30 min (pipeline coordination failure)

Unified retry handler workflow:
1. Fail job with TIMEOUT category for being stalled
2. Check retry eligibility via should_retry()
3. If eligible: prepare retry and check dependencies
4. For pipeline jobs: validate dependencies before enqueueing
   - Skip if dependencies failed (leave in PENDING for pipeline manager)
   - Wait if dependencies not ready (leave in PENDING)
   - Enqueue if dependencies satisfied
5. If max retries exceeded or enqueue fails: mark SYSTEM_ERROR

Key features:
- Graceful handling of edge cases (missing started_at, max retries)
- Pipeline dependency awareness (avoids enqueueing guaranteed failures)
- Comprehensive test coverage (42 tests: 22 unit, 19 integration, 1 ARQ)

This safety net ensures jobs don't remain in limbo indefinitely and
provides automatic recovery from transient infrastructure failures.
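The per-state stall thresholds above can be sketched as a simple lookup. The cutoffs (10/60/30 minutes) come from the commit message, but the function name, field names, and string statuses here are illustrative, not the actual cleanup-job code.

```python
from datetime import datetime, timedelta, timezone

# Thresholds from the commit message: how long a job may sit in each
# non-terminal state before the cleanup cron treats it as stalled.
STALL_THRESHOLDS = {
    "queued": timedelta(minutes=10),   # stuck between state change and ARQ pickup
    "running": timedelta(minutes=60),  # worker likely crashed mid-execution
    "pending": timedelta(minutes=30),  # pipeline coordination failure
}

def is_stalled(status, last_updated, now=None):
    """Sketch: decide whether a job has exceeded its state's stall threshold."""
    now = now or datetime.now(timezone.utc)
    threshold = STALL_THRESHOLDS.get(status)
    if threshold is None:
        return False  # terminal states are never considered stalled
    return now - last_updated > threshold

now = datetime.now(timezone.utc)
print(is_stalled("queued", now - timedelta(minutes=15), now))   # True
print(is_stalled("running", now - timedelta(minutes=30), now))  # False
print(is_stalled("completed", now - timedelta(days=1), now))    # False
```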
@bencap bencap changed the base branch from main to release-2026.2.0 February 17, 2026 23:28
@bencap bencap marked this pull request as ready for review February 17, 2026 23:31
Successfully merging this pull request may close these issues:

- Traceability and Auditing for Variant-level Job Results
- Consider Retaining Score and Count Raw Files

1 participant