@iliana (Contributor) commented Jan 31, 2026

@jclulow reported this job: https://buildomat.eng.oxide.computer/wg/0/details/01KG5JNVNBG9BBVWG8S874737E/Ar9EJfdSbD1qqfZ2jmuXxeF7mjn15W2b1GpI5ljIsFKgOJNn/01KG5JPR9Q4AHGSK2A6CYK3FPM#S367

Our check that the switch zone is up was, for whatever reason, hanging for approximately 195 seconds on each iteration. This resulted in what should have been a 30-second timeout becoming about 90 minutes.

This does not fix the root cause, whatever it is, but it does keep a hung check from hogging one of the two machines available to run this CI job.
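
For illustration only, and not the actual change in this PR: a minimal Python sketch of how a readiness loop can be kept within its budget even when a single attempt hangs. Every name here (`wait_until_ready`, `probe`, `per_attempt_timeout`, `pause`) is hypothetical, and the default of 30 attempts at roughly one second each is an assumption based on the "30-second timeout" described above. The idea is to pass each attempt its own timeout and also check an overall deadline, so a probe that ignores its timeout can cost at most one hang rather than one hang per iteration.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[float], bool],
    max_attempts: int = 30,            # assumed retry count
    per_attempt_timeout: float = 1.0,  # assumed cap on a single probe, in seconds
    pause: float = 1.0,                # assumed delay between attempts, in seconds
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Return the 1-based attempt number on which `probe` succeeded.

    `probe` receives the number of seconds it may spend and returns True
    once the service answers. TimeoutError is raised when the attempts or
    the overall time budget run out.
    """
    # Overall budget: the most time the loop should take if every probe
    # honors its per-attempt timeout.
    deadline = clock() + max_attempts * (per_attempt_timeout + pause)

    for attempt in range(1, max_attempts + 1):
        remaining = deadline - clock()
        if remaining <= 0:
            # A previous probe overran its timeout; stop instead of
            # repeating the hang on every iteration.
            break
        if probe(min(per_attempt_timeout, remaining)):
            return attempt
        # Never sleep past the overall deadline.
        sleep(min(pause, max(0.0, deadline - clock())))

    raise TimeoutError(f"service not ready after {max_attempts} attempts")
```

In this sketch, a probe that blocks for 195 seconds regardless of the timeout it is given makes the loop give up after that one attempt, because the deadline check fails on the next iteration. A probe that enforces the timeout itself (for example, through a socket or HTTP client timeout) keeps the whole wait within the budget.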

@FelixMcFelix (Contributor) commented Feb 2, 2026

I think this is fine. In the successful CI run here it takes around 13 attempts (~23s) to reach the service in the switch zone. It looks like in recent successful runs on main we get a few quick RSTs, then one query that takes ~20s before failing, and then a successful query. So we're waiting in the same ballpark, and being more honest about it. I don't know what the upper bound or variance on omicron1 zone bringup is, but I assume we can always adjust the retry count if it turns out we're regularly bumping into 30 retries (other than in cases where the switch zone just isn't coming up).
