Skip to content

Refine deps caching#29

Open
yaito3014 wants to merge 29 commits intomainfrom
refine-deps-caching
Open

Refine deps caching#29
yaito3014 wants to merge 29 commits intomainfrom
refine-deps-caching

Conversation

@yaito3014
Copy link
Member

related PR: iris-cpp/iris#32

@yaito3014 yaito3014 added the CI Build system issues label Feb 15, 2026
@saki7
Copy link
Member

saki7 commented Feb 16, 2026

Need to resolve the conflict and recheck if the cache can be resaved properly. If not, I think we would need to use "deps2" for now.

@saki7
Copy link
Member

saki7 commented Feb 16, 2026

https://github.com/iris-cpp/x4/actions/runs/22061450058/job/63742469284#step:15:59

Failed to save: Unable to reserve cache with key deps2-windows-2022-msvc-19.44.35222.0-C++23-Release, another job may be creating this cache.
Warning: Cache save failed.

oh 🥹

@saki7
Copy link
Member

saki7 commented Feb 16, 2026

My hypothesis:

  • Using a cache from the main branch and re-saving it within the PR scope (refs/pull/XX/merge) results in the error
    • Confirmed wrong, because using the brand new key deps2- still triggered the error
  • The old state had ~150 caches, resulting in no available disk space in our repository-wide environment

I just removed ~100 obsolete caches manually. That's roughly 20MB * 100 = 2000MB of space freed. Perhaps we should re-run the jobs and see what happens?

@saki7
Copy link
Member

saki7 commented Feb 16, 2026

I think I found out what the real problem is.

In a full build, we have two distinct sets of jobs: alloy and x4. The "alloy" job usually runs faster and completes the restore/save steps successfully. However, the "x4" job runs slower, and tries to reference the cache which is probably used by the "alloy" jobs. In theory, if the corresponding "alloy" jobs finished entirely before the "x4" job, the "x4" job should not see the error. But I think the current design of GHA somehow locks the existing cache for the entire workflow run, resulting in the error.

So I think the error can be safely ignored.

Now this is the important part: the legitimate problem we have is that our "alloy" and "x4" jobs are currently too distinct. I think we should extract the common steps into a separate job (for example, "prepare-build") and then reference it by needs: prepare-build.

This way we would have only single restore/save step executed by the "prepare-build" job.

@yaito3014 Could you test this solution?

@saki7
Copy link
Member

saki7 commented Feb 16, 2026

By the way, I think we should revert the name to "deps" because "deps2" was not working

@saki7
Copy link
Member

saki7 commented Feb 16, 2026

Very nice work on the script!!

I think the dependency chain is wrong:

image

The expected dependency is that two jobs (i.e. x4/alloy jobs) for the each build configuration connect to the exactly one "prepare-deps" job. Currently the entire two matrixes are connected to each other, preventing fully parallel runs.

@saki7
Copy link
Member

saki7 commented Feb 16, 2026

@yaito3014 Maybe related: actions/stale#1090 (comment)

@saki7
Copy link
Member

saki7 commented Feb 16, 2026

@yaito3014 Maybe related: actions/cache#485 (comment)

@saki7 saki7 self-assigned this Feb 17, 2026
@saki7
Copy link
Member

saki7 commented Feb 17, 2026

Remaining task

  • Properly specify VS 2022 toolchain in CMake domain (currently 2026 is wrongly used)
  • Make the job name cleaner
    • Drop the "CI / " prefix
    • Drop the "@ ..." suffix
    • Drop the "(pull_request)" suffix

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI Build system issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants