feat: add Prometheus metrics for backup recovery window #69

Open
ermakov-oleg wants to merge 1 commit into operasoftware:main from ermakov-oleg:feat/prometheus-metrics

@ermakov-oleg

Summary

Port of upstream #459, #467

Problem: No observability into backup health — operators had no way to alert on stale backups or monitor recovery point objectives (RPO) without manually querying pgBackRest.

Fix: Implements the cnpg-i Metrics service, exposing two Prometheus gauges:

  • cnpg_pgbackrest_first_recoverability_point — unix timestamp of the earliest restore point (first successful backup stop time)
  • cnpg_pgbackrest_last_available_backup_timestamp — unix timestamp of the most recent completed backup (latest backup stop time)

These allow standard Prometheus alerts like "no backup in last 24h" or "RPO exceeds 1h".
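As an illustration of the "no backup in last 24h" check, here is a minimal Go sketch of the comparison such an alert performs on the last-available-backup gauge. The function name `backupIsStale` and its parameters are hypothetical and not part of this PR. It also assumes that a gauge value of 0 (which the plugin reports when no backup is found) should count as stale.

```go
package main

import "fmt"

// backupIsStale reports whether the newest backup is older than maxAgeSeconds.
// lastBackupUnix is the gauge value (seconds since the Unix epoch).
// Assumption: a value of 0 means "no backup known" and is treated as stale.
func backupIsStale(lastBackupUnix, nowUnix, maxAgeSeconds int64) bool {
	if lastBackupUnix <= 0 {
		return true
	}
	return nowUnix-lastBackupUnix > maxAgeSeconds
}

func main() {
	const day = int64(24 * 60 * 60)
	now := int64(1_700_000_000) // fixed "current time" to keep the example deterministic

	fmt.Println(backupIsStale(now-3600, now, day))  // backup is one hour old
	fmt.Println(backupIsStale(now-2*day, now, day)) // backup is two days old
	fmt.Println(backupIsStale(0, now, day))         // no backup reported
}
```

Because an age computed against a timestamp of 0 is very large, a plain age-based alert would likely fire in the "no backups" case as well, which is usually the desired behaviour.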

Implementation:

  • New MetricsServiceImplementation in internal/cnpgi/instance/metrics.go
  • Registers TYPE_METRICS capability in plugin identity
  • Collect() calls pgbackrest info to get the backup catalog, then delegates to getRecoveryWindow() which uses catalog.FirstRecoverabilityPoint() and catalog.GetLastSuccessfulBackupTime() — these methods filter out errored backups (Start=0 or Stop=0) and use Time.Stop for recoverability
  • Returns 0 for both metrics if no backups exist or credentials fail (graceful degradation)
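The filtering described above can be sketched roughly as follows. This is not the code from the PR: the type `backupInfo` and the function `recoveryWindow` are hypothetical stand-ins for the catalog methods. It assumes only what the description states: errored backups have Start=0 or Stop=0, the stop time defines recoverability, and both values are 0 when nothing usable exists.

```go
package main

import "fmt"

// backupInfo is a hypothetical, simplified view of one catalog entry.
// Start and Stop are Unix timestamps; 0 marks a backup that did not complete.
type backupInfo struct {
	Start int64
	Stop  int64
}

// recoveryWindow returns the earliest and latest stop times among
// successful backups, or (0, 0) if there are none.
func recoveryWindow(backups []backupInfo) (first, last int64) {
	for _, b := range backups {
		if b.Start == 0 || b.Stop == 0 {
			continue // skip errored backups
		}
		if first == 0 || b.Stop < first {
			first = b.Stop
		}
		if b.Stop > last {
			last = b.Stop
		}
	}
	return first, last
}

func main() {
	catalog := []backupInfo{
		{Start: 100, Stop: 200},
		{Start: 300, Stop: 0}, // errored: never finished
		{Start: 500, Stop: 600},
	}
	first, last := recoveryWindow(catalog)
	fmt.Println(first, last) // prints "200 600"
}
```

Returning 0 for an empty or all-errored catalog mirrors the graceful degradation noted in the last bullet, so the metrics endpoint still responds when no backup is available.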

Unit tests in metrics_test.go cover: nil/empty catalog, single backup, multiple backups, errored backups filtering, all-errored catalog.

Related issues

Signed-off-by: ermakov-oleg <ermakovolegs@gmail.com>
@ermakov-oleg

Hi @Agalin, just following up on this PR - would you have a chance to review it when you have time? The changes from all my PRs have been running in our production for a while now without issues, but I’m happy to adjust anything if needed.
