feat: add Prometheus metrics for backup recovery window #69
Open
ermakov-oleg wants to merge 1 commit into operasoftware:main from
Signed-off-by: ermakov-oleg <ermakovolegs@gmail.com>
1250c04 to 331f244
Author
Hi @Agalin, just following up on this PR - would you have a chance to review it when you have time? The changes from all my PRs have been running in our production for a while now without issues, but I’m happy to adjust anything if needed.
Summary
Port of upstream #459, #467
Problem: No observability into backup health — operators had no way to alert on stale backups or monitor recovery point objectives (RPO) without manually querying pgBackRest.
Fix: Implements the cnpg-i Metrics service, exposing two Prometheus gauges:
- `cnpg_pgbackrest_first_recoverability_point`: unix timestamp of the earliest restore point (first successful backup stop time)
- `cnpg_pgbackrest_last_available_backup_timestamp`: unix timestamp of the most recent completed backup (latest backup stop time)

These allow standard Prometheus alerts like "no backup in last 24h" or "RPO exceeds 1h".
Implementation:
- `MetricsServiceImplementation` in `internal/cnpgi/instance/metrics.go`
- `TYPE_METRICS` capability in plugin identity
- `Collect()` calls `pgbackrest info` to get the backup catalog, then delegates to `getRecoveryWindow()`, which uses `catalog.FirstRecoverabilityPoint()` and `catalog.GetLastSuccessfulBackupTime()`. These methods filter out errored backups (Start=0 or Stop=0) and use `Time.Stop` for recoverability.

Unit tests in `metrics_test.go` cover: nil/empty catalog, single backup, multiple backups, errored backups filtering, all-errored catalog.

Related issues
`pgbackrest info` or an equivalent #19
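For reviewers, the recovery-window logic described under Implementation can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: the `Backup` type and `recoveryWindow` function are hypothetical stand-ins for the real catalog types and methods, and the only behavior assumed is what the description states (skip backups with Start=0 or Stop=0, use the stop time).

```go
package main

import "fmt"

// Backup is a hypothetical, simplified catalog entry.
// Times are unix seconds; a zero Start or Stop marks an errored backup.
type Backup struct {
	Start int64
	Stop  int64
}

// recoveryWindow returns the earliest and latest stop times among
// successful backups. ok is false when no successful backup exists
// (nil, empty, or all-errored catalog).
func recoveryWindow(backups []Backup) (first, last int64, ok bool) {
	for _, b := range backups {
		if b.Start == 0 || b.Stop == 0 {
			continue // errored backup: not usable for recovery
		}
		if !ok || b.Stop < first {
			first = b.Stop
		}
		if !ok || b.Stop > last {
			last = b.Stop
		}
		ok = true
	}
	return first, last, ok
}

func main() {
	catalog := []Backup{
		{Start: 100, Stop: 200},
		{Start: 0, Stop: 0}, // errored, ignored
		{Start: 300, Stop: 400},
	}
	first, last, ok := recoveryWindow(catalog)
	fmt.Println(first, last, ok) // 200 400 true
}
```

Returning an explicit `ok` flag lets the caller skip emitting the gauges when the catalog has no usable backup, which corresponds to the nil/empty and all-errored cases listed in the unit tests.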