CNTR-3: Supervisor failover test #5071
base: main

Conversation
Summary of Changes (Gemini Code Assist): This pull request introduces a new test to validate the resilience of containerized services on network devices. By simulating a control processor failover, the test checks that deployed containers and their associated volumes retain their state and functionality, exercising the system's container supervisor capabilities.
Code Review (Gemini Code Assist)
This pull request introduces a new test for container supervisor failover (CNTR-3). The changes include a `README.md` file detailing the test procedure and a Go test file, `failover_test.go`, that implements it. The test correctly follows the structure of setting up a container and volume, triggering a control processor switchover, and then verifying that the container and volume persist after recovery.
My review focuses on adherence to the repository's style guide. I've identified one instance where `time.Sleep` is used redundantly, which is discouraged. I've provided a suggestion to remove it and improve test efficiency. Otherwise, the code is well written and the test logic is sound.
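Where a wait on switchover completion is genuinely needed, the usual alternative to a fixed sleep is a telemetry-driven wait. A rough sketch, assuming the OC `redundant-role` leaf is populated on the platform and that `standbySup` holds the name of the supervisor that was standby before the switchover (both assumptions, not taken from this PR):

```go
// Sketch: wait until the formerly-standby supervisor reports PRIMARY instead of
// sleeping for a fixed switchoverWait. Assumes the ondatra gnmi helpers and the
// generated oc/ygnmi packages already imported by the test.
roleQuery := gnmi.OC().Component(standbySup).RedundantRole().State()
_, ok := gnmi.Watch(t, dut, roleQuery, 10*time.Minute,
	func(v *ygnmi.Value[oc.E_Platform_ComponentRedundantRole]) bool {
		role, present := v.Val()
		return present && role == oc.Platform_ComponentRedundantRole_PRIMARY
	}).Await(t)
if !ok {
	t.Fatalf("Supervisor %s did not become PRIMARY after switchover", standbySup)
}
```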
README.md excerpt under review:

> 1. **Setup**: Using `gnoi.Containerz`, deploy a container image, create a volume, and start a container mounting that volume. Verify the container is running and the volume exists.
> 2. **Trigger Failover**: Identify the standby control processor using gNMI. Trigger a switchover using `gnoi.System.SwitchControlProcessor`.
> 3. **Verify Recovery**: Wait for the switchover to complete. Verify that the container started in step 1 is in `RUNNING` state and the volume still exists using `gnoi.Containerz`.
I think we should check that RPCs to the container work as well.
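Agreed. For illustration, a hypothetical post-failover liveness check, assuming the deployed cntrsrv image exposes a `Ping` RPC as in the other containerz tests; the proto import path, port, and TLS settings below are assumptions to verify against the actual harness:

```go
// Hypothetical: dial the container's gRPC endpoint after recovery and call Ping.
// Assumed imports: "crypto/tls", "net", "google.golang.org/grpc",
// "google.golang.org/grpc/credentials", and the cntrsrv proto package (cpb).
func checkContainerRPC(ctx context.Context, t *testing.T, dut *ondatra.DUTDevice) {
	t.Helper()
	conn, err := grpc.NewClient(
		net.JoinHostPort(dut.Name(), "60061"), // assumed port exposed by the container
		grpc.WithTransportCredentials(credentials.NewTLS(&tls.Config{InsecureSkipVerify: true})),
	)
	if err != nil {
		t.Fatalf("Failed to dial container endpoint: %v", err)
	}
	defer conn.Close()
	if _, err := cpb.NewCntrClient(conn).Ping(ctx, &cpb.PingRequest{}); err != nil {
		t.Errorf("Container RPC failed after switchover: %v", err)
	}
}
```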
On `README.md` (`@@ -0,0 +1,71 @@`, `# CNTR-3: Container Supervisor Failover`):
This is a good start, but we should test several other scenarios, for example what happens when the backup is not available and containers are started. Do those containers get started once the backup returns?
The question is how we can map this to gNOI calls on the vendor devices: can we manually kill and restart a containerz instance on the individual supervisors?
Additional scenario to consider: what happens when the primary returns after a failover? Will the original containers still be available? Are modifications to containers/volumes that only reached the backup properly replicated back to the primary?
This could be simulated by calling SwitchControlProcessor twice, although the SwitchControlProcessor implementation may just switch the handling supervisor without actually restarting the primary; see the sketch below.
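A rough sketch of the double-switchover idea, reusing the request construction already in the test; `awaitSwitchoverDone` and `verifyContainerAndVolume` are hypothetical placeholders for the waiting and verification steps the test already performs, and the `components.GetSubcomponentPath` signature should be checked against the helper actually used:

```go
// Fail over to the standby, verify, then fail back to the original primary and
// verify again to see whether container/volume state replicated back.
doSwitchover := func(target string) {
	req := &spb.SwitchControlProcessorRequest{
		ControlProcessor: components.GetSubcomponentPath(target, useNameOnly),
	}
	if _, err := sysClient.SwitchControlProcessor(ctx, req); err != nil {
		t.Fatalf("SwitchControlProcessor(%q) failed: %v", target, err)
	}
	awaitSwitchoverDone(ctx, t, dut) // hypothetical: wait for the new primary and reconnect.
}

doSwitchover(standbySup)         // First failover: the standby becomes primary.
verifyContainerAndVolume(ctx, t) // hypothetical: re-run the CNTR-3.1 checks.
doSwitchover(originalPrimary)    // Fail back to the original primary.
verifyContainerAndVolume(ctx, t) // Were backup-only modifications replicated back?
```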
failover_test.go excerpt under review:

    const (
        imageName = "cntrsrv_image"
        tag       = "latest"
`tag` is unused and can be dropped.
failover_test.go excerpt under review:

    if err := verifyVolumeExists(ctx, cli, volName); err != nil {
        t.Fatalf("Volume not found after creation: %v", err)
    }
I would consider moving this check up, directly after the volume is created (slightly easier to read).
failover_test.go excerpt under review:

    if vol.Error != nil {
        return fmt.Errorf("error listing volumes: %w", vol.Error)
    }
    if vol.Name == name {
Is there no potential leading slash in `vol.Name` here?
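If some implementations do report volume names with a leading slash, a defensive normalization (hypothetical; only worth adding if this turns out to happen in practice) could look like:

```go
// Tolerate a possible leading slash in the reported volume name.
// Requires the "strings" import; assumes the matching branch returns nil as in
// the original check.
if strings.TrimPrefix(vol.Name, "/") == name {
	return nil
}
```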
README.md excerpt under review:

> ## CNTR-3.1: Container Supervisor Failover
>
> 1. **Setup**: Using `gnoi.Containerz`, deploy a container image, create a volume, and start a container mounting that volume. Verify the container is running and the volume exists.
"start a container mounting that volume": is the container in `failover_test.go` actually mounting the volume?
failover_test.go excerpt under review:

    time.Sleep(switchoverWait)

    // Refresh clients after reconnection.
    cli = containerztest.Client(t, dut)
Validate that the standby supervisor is not the same as the one obtained earlier before attempting reconnection; we could end up reconnecting to the same supervisor here.
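For example, a sketch of such a check, assuming the supervisor names were recorded during setup and the OC `redundant-role` leaf is available; `wantPrimary` stands for the supervisor that was standby before the switchover (an illustrative name, not from the PR):

```go
// Confirm the roles actually swapped before reconnecting, so the test does not
// silently re-attach to the same supervisor.
gotRole := gnmi.Get(t, dut, gnmi.OC().Component(wantPrimary).RedundantRole().State())
if gotRole != oc.Platform_ComponentRedundantRole_PRIMARY {
	t.Fatalf("Switchover did not promote %s; redundant-role is %v", wantPrimary, gotRole)
}
```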
failover_test.go excerpt under review:

    // Raw system client for switchover.
    sysClient := dut.RawAPIs().GNOI(t).System()

    t.Run("Setup", func(t *testing.T) {
I would like to see either a cleanup function at the end that removes everything this test created, or at least a cleanup of the volume in case this test is run more than once.
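Something along these lines, as a sketch; the stop/remove calls and their arguments depend on the containerz client/helper API used here and are assumptions to verify:

```go
// Hypothetical cleanup registered right after setup. Method names on the
// containerz client (StopContainer, RemoveVolume) and their signatures are
// assumptions and should be matched to the real API.
t.Cleanup(func() {
	cli := containerztest.Client(t, dut)
	if err := cli.StopContainer(ctx, instanceName, true /* force */); err != nil {
		t.Logf("StopContainer(%q) during cleanup: %v", instanceName, err)
	}
	if err := cli.RemoveVolume(ctx, volName, true /* force */); err != nil {
		t.Logf("RemoveVolume(%q) during cleanup: %v", volName, err)
	}
})
```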