CNTR-3: Supervisor failover test #5071
base: main

Conversation
Summary of Changes (Gemini Code Assist): This pull request introduces a new test to validate the resilience of containerized services on network devices. By simulating a control processor failover, the test checks that deployed containers and their associated volumes retain their state and functionality, exercising the system's container supervisor capabilities.
Code Review (Gemini Code Assist)
This pull request introduces a new test for container supervisor failover (CNTR-3). The changes include a `README.md` file detailing the test procedure and a Go test file, `failover_test.go`, that implements it. The test correctly follows the structure of setting up a container and volume, triggering a control processor switchover, and then verifying that the container and volume persist after recovery.
My review focuses on adherence to the repository's style guide. I've identified one instance where `time.Sleep` is used redundantly, which is discouraged. I've provided a suggestion to remove it and improve test efficiency. Otherwise, the code is well written and the test logic is sound.
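Where a wait on switchover completion is genuinely needed, the usual alternative to a fixed sleep is a telemetry-driven wait. A rough sketch, assuming the OC `redundant-role` leaf is populated on the platform and that `standbySup` holds the name of the supervisor that was standby before the switchover (both assumptions, not taken from this PR):

```go
// Sketch: wait until the formerly-standby supervisor reports PRIMARY instead of
// sleeping for a fixed switchoverWait. Assumes the ondatra gnmi helpers and the
// generated oc/ygnmi packages already imported by the test.
roleQuery := gnmi.OC().Component(standbySup).RedundantRole().State()
_, ok := gnmi.Watch(t, dut, roleQuery, 10*time.Minute,
	func(v *ygnmi.Value[oc.E_Platform_ComponentRedundantRole]) bool {
		role, present := v.Val()
		return present && role == oc.Platform_ComponentRedundantRole_PRIMARY
	}).Await(t)
if !ok {
	t.Fatalf("Supervisor %s did not become PRIMARY after switchover", standbySup)
}
```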
README.md excerpt under review:

> 1. **Setup**: Using `gnoi.Containerz`, deploy a container image, create a volume, and start a container mounting that volume. Verify the container is running and the volume exists.
> 2. **Trigger Failover**: Identify the standby control processor using gNMI. Trigger a switchover using `gnoi.System.SwitchControlProcessor`.
> 3. **Verify Recovery**: Wait for the switchover to complete. Verify that the container started in step 1 is in `RUNNING` state and the volume still exists using `gnoi.Containerz`.
I think we should check that RPCs to the container work as well.
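Agreed. For illustration, a hypothetical post-failover liveness check, assuming the deployed cntrsrv image exposes a `Ping` RPC as in the other containerz tests; the proto import path, port, and TLS settings below are assumptions to verify against the actual harness:

```go
// Hypothetical: dial the container's gRPC endpoint after recovery and call Ping.
// Assumed imports: "crypto/tls", "net", "google.golang.org/grpc",
// "google.golang.org/grpc/credentials", and the cntrsrv proto package (cpb).
func checkContainerRPC(ctx context.Context, t *testing.T, dut *ondatra.DUTDevice) {
	t.Helper()
	conn, err := grpc.NewClient(
		net.JoinHostPort(dut.Name(), "60061"), // assumed port exposed by the container
		grpc.WithTransportCredentials(credentials.NewTLS(&tls.Config{InsecureSkipVerify: true})),
	)
	if err != nil {
		t.Fatalf("Failed to dial container endpoint: %v", err)
	}
	defer conn.Close()
	if _, err := cpb.NewCntrClient(conn).Ping(ctx, &cpb.PingRequest{}); err != nil {
		t.Errorf("Container RPC failed after switchover: %v", err)
	}
}
```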
On `README.md` (`@@ -0,0 +1,71 @@`, `# CNTR-3: Container Supervisor Failover`):
This is a good start, but we should test several other scenarios, for example what happens when the backup is not available and containers are started. Do those containers get started once the backup returns?
The question is how we can map this to gNOI calls on the vendor devices: can we manually kill and restart a containerz instance on the individual supervisors?
Additional scenario to consider: what happens when the primary returns after a failover? Will the original containers still be available? Are modifications to containers/volumes that only reached the backup properly replicated back to the primary?
This could be simulated by calling SwitchControlProcessor twice, although the SwitchControlProcessor implementation may just switch the handling supervisor without actually restarting the primary; see the sketch below.
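A rough sketch of the double-switchover idea, reusing the request construction already in the test; `awaitSwitchoverDone` and `verifyContainerAndVolume` are hypothetical placeholders for the waiting and verification steps the test already performs, and the `components.GetSubcomponentPath` signature should be checked against the helper actually used:

```go
// Fail over to the standby, verify, then fail back to the original primary and
// verify again to see whether container/volume state replicated back.
doSwitchover := func(target string) {
	req := &spb.SwitchControlProcessorRequest{
		ControlProcessor: components.GetSubcomponentPath(target, useNameOnly),
	}
	if _, err := sysClient.SwitchControlProcessor(ctx, req); err != nil {
		t.Fatalf("SwitchControlProcessor(%q) failed: %v", target, err)
	}
	awaitSwitchoverDone(ctx, t, dut) // hypothetical: wait for the new primary and reconnect.
}

doSwitchover(standbySup)         // First failover: the standby becomes primary.
verifyContainerAndVolume(ctx, t) // hypothetical: re-run the CNTR-3.1 checks.
doSwitchover(originalPrimary)    // Fail back to the original primary.
verifyContainerAndVolume(ctx, t) // Were backup-only modifications replicated back?
```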
failover_test.go excerpt under review:

    const (
        imageName = "cntrsrv_image"
        tag       = "latest"
`tag` is unused and can be dropped.
failover_test.go excerpt under review:

    if err := verifyVolumeExists(ctx, cli, volName); err != nil {
        t.Fatalf("Volume not found after creation: %v", err)
    }
I would consider moving this check up, directly after the volume is created (slightly easier to read).
failover_test.go excerpt under review:

    if vol.Error != nil {
        return fmt.Errorf("error listing volumes: %w", vol.Error)
    }
    if vol.Name == name {
Is there no potential leading slash in `vol.Name` here?
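If some implementations do report volume names with a leading slash, a defensive normalization (hypothetical; only worth adding if this turns out to happen in practice) could look like:

```go
// Tolerate a possible leading slash in the reported volume name.
// Requires the "strings" import; assumes the matching branch returns nil as in
// the original check.
if strings.TrimPrefix(vol.Name, "/") == name {
	return nil
}
```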
README.md excerpt under review:

> ## CNTR-3.1: Container Supervisor Failover
>
> 1. **Setup**: Using `gnoi.Containerz`, deploy a container image, create a volume, and start a container mounting that volume. Verify the container is running and the volume exists.
"start a container mounting that volume": is the container in `failover_test.go` actually mounting the volume?
failover_test.go excerpt under review:

    time.Sleep(switchoverWait)

    // Refresh clients after reconnection.
    cli = containerztest.Client(t, dut)
Validate that the standby supervisor is not the same as the one obtained earlier before attempting reconnection; we could end up reconnecting to the same supervisor here.
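For example, a sketch of such a check, assuming the supervisor names were recorded during setup and the OC `redundant-role` leaf is available; `wantPrimary` stands for the supervisor that was standby before the switchover (an illustrative name, not from the PR):

```go
// Confirm the roles actually swapped before reconnecting, so the test does not
// silently re-attach to the same supervisor.
gotRole := gnmi.Get(t, dut, gnmi.OC().Component(wantPrimary).RedundantRole().State())
if gotRole != oc.Platform_ComponentRedundantRole_PRIMARY {
	t.Fatalf("Switchover did not promote %s; redundant-role is %v", wantPrimary, gotRole)
}
```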
failover_test.go excerpt under review:

    // Raw system client for switchover.
    sysClient := dut.RawAPIs().GNOI(t).System()

    t.Run("Setup", func(t *testing.T) {
I would like to see either a cleanup function at the end that removes everything this test created, or at least a cleanup of the volume in case this test is run more than once.
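Something along these lines, as a sketch; the stop/remove calls and their arguments depend on the containerz client/helper API used here and are assumptions to verify:

```go
// Hypothetical cleanup registered right after setup. Method names on the
// containerz client (StopContainer, RemoveVolume) and their signatures are
// assumptions and should be matched to the real API.
t.Cleanup(func() {
	cli := containerztest.Client(t, dut)
	if err := cli.StopContainer(ctx, instanceName, true /* force */); err != nil {
		t.Logf("StopContainer(%q) during cleanup: %v", instanceName, err)
	}
	if err := cli.RemoveVolume(ctx, volName, true /* force */); err != nil {
		t.Logf("RemoveVolume(%q) during cleanup: %v", volName, err)
	}
})
```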