We have a CI pipeline spread across:
OpenHands/software-agent-sdk
OpenHands/benchmarks
OpenHands/evaluation
Add the terminalbench benchmark to each of these repos as needed, then trigger a run on the software-agent-sdk CI to confirm that it works. If a run with 5 examples fails, debug it and figure out why.