When running benchmarks (e.g., swebenchmultimodal), scripts often expect pure JSON output on stdout to be piped to tools like jq.
However, benchmarks/utils/console_logging.py currently configures the console handler to write to stdout when rich logging is enabled (or defaulted).
This causes issues when libraries (like OpenTelemetry) log errors or warnings. For example, a Failed to detach context error from OpenTelemetry is printed to stdout, which then causes jq to fail with:
jq: parse error: Invalid numeric literal
This happens because the log message is mixed with the JSON output.
Fix: Ensure all console logs are written to stderr, leaving stdout exclusively for the script's intended output (JSON).