Description
crawl4ai version
0.7.8 (also confirmed in main branch / 0.8.0)
Expected Behavior
chardet.detect() in _handle_http should not block the asyncio event loop.
Because chardet.detect() is a synchronous, CPU-bound call that can take several seconds on large pages, it should be run via await asyncio.to_thread() so that it does not block the event loop, similar to how PDF processing already uses asyncio.to_thread in the codebase.
Suggested fix in async_crawler_strategy.py line 2451:
Before (blocking):
encoding = chardet.detect(content.tobytes())['encoding'] or 'utf-8'
After (non-blocking):
detected = await asyncio.to_thread(chardet.detect, content.tobytes())
encoding = detected['encoding'] or 'utf-8'
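The same pattern can be sketched as a small standalone helper. This is only an illustration, not code from crawl4ai: the name `detect_encoding_off_loop` and its `detect` and `default` parameters are hypothetical, and the detector is passed in as a callable so the example runs without chardet installed.

```python
import asyncio
from typing import Callable


async def detect_encoding_off_loop(
    raw: bytes,
    detect: Callable[[bytes], dict],
    default: str = "utf-8",
) -> str:
    """Run a blocking charset detector in a worker thread and return the encoding."""
    # asyncio.to_thread (Python 3.9+) runs the call in the default thread pool,
    # so the event loop can keep servicing other tasks while detection runs.
    detected = await asyncio.to_thread(detect, raw)
    # A detector may report None for 'encoding'; fall back to the default then.
    return detected.get("encoding") or default
```

Inside `_handle_http` this would presumably be called as `encoding = await detect_encoding_off_loop(content.tobytes(), chardet.detect)`. If the detector is pure Python, the worker thread still shares the GIL with the loop thread, so this keeps the loop responsive rather than making detection itself faster.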
Current Behavior
When crawling pages with the HTTP strategy, _handle_http calls chardet.detect(content.tobytes()) synchronously on the event loop (line 2451 in async_crawler_strategy.py). For large pages, this blocks the event loop for multiple seconds, delaying all concurrent async tasks.
We detected this using an event loop watchdog thread that captures the blocking call stack in real time. Here is the stack trace:
Event loop lag detected: >0.500s
[BLOCKED STACK TRACE of event-loop thread]:
  File "crawl4ai/async_crawler_strategy.py", line 2451, in _handle_http
    encoding = chardet.detect(content.tobytes())['encoding'] or 'utf-8'
  File "chardet/__init__.py", line 49, in detect
    detector.feed(byte_str)
  File "chardet/universaldetector.py", line 274, in feed
    if prober.feed(byte_str) == ProbingState.FOUND_IT:
  File "chardet/charsetgroupprober.py", line 70, in feed
    state = prober.feed(byte_str)
  File "chardet/sbcharsetprober.py", line 122, in feed
    self._last_order = order
This was observed repeatedly (8 consecutive lag warnings in a single page crawl), indicating the chardet detection blocked the event loop for ~8 seconds total.
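For anyone who wants to check this in their own setup, a watchdog of the kind described above can be sketched roughly as follows. The watchdog actually used for this report is not shown here, so this is an assumption-laden illustration: `start_loop_watchdog` and all of its parameters are hypothetical names, and it relies on CPython's `sys._current_frames()` to snapshot the loop thread's stack.

```python
import asyncio
import sys
import threading
import traceback
from typing import Callable


def start_loop_watchdog(
    loop: asyncio.AbstractEventLoop,
    loop_thread_id: int,
    threshold: float = 0.5,
    interval: float = 0.1,
    report: Callable[[str], None] = print,
) -> threading.Event:
    """Start a daemon thread that reports the loop thread's stack when the loop stalls.

    Returns an Event; set it to stop the watchdog.
    """
    stop = threading.Event()

    def watch() -> None:
        while not stop.is_set():
            beat = threading.Event()
            try:
                # Ask the loop to set the flag; a responsive loop does so almost at once.
                loop.call_soon_threadsafe(beat.set)
            except RuntimeError:
                return  # the loop has been closed
            if not beat.wait(threshold):
                # The callback did not run in time: capture what the loop thread is executing.
                frame = sys._current_frames().get(loop_thread_id)
                stack = "".join(traceback.format_stack(frame)) if frame else "<no frame>\n"
                report(f"Event loop lag detected: >{threshold:.3f}s\n{stack}")
            stop.wait(interval)

    threading.Thread(target=watch, name="loop-watchdog", daemon=True).start()
    return stop
```

It would be started from inside the running loop, for example `stop = start_loop_watchdog(asyncio.get_running_loop(), threading.get_ident())`, and stopped with `stop.set()`. Because each check is independent, one long blocking call produces several consecutive warnings.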
Is this reproducible?
Yes
Inputs Causing the Bug
Steps to Reproduce
Code snippets
OS
Linux
Python version
3.11
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
No response