Fix redirect target verification in AsyncUrlSeeder and enhance tests #1622
Ahmed-Tawfik94 wants to merge 1 commit into develop
- Added `verify_redirect_targets` parameter to control redirect verification.
- Modified `_resolve_head()` to verify redirect targets based on the new parameter.
- Implemented tests for both verification modes, ensuring dead redirects are filtered out and legacy behavior is preserved.
@Ahmed-Tawfik94 Why didn't you implement the recommended solution you provided in the root cause message here? #1603 (comment)
According to my root cause analysis,
Hi there! I was waiting for this issue to be merged. Is there a timeline for when that will happen, @Ahmed-Tawfik94 @ntohidi?
Summary
Fixed a critical bug in `AsyncUrlSeeder` where `_resolve_head()` was incorrectly returning redirect targets without verifying they were alive. This could cause dead URLs to be treated as valid during URL discovery. (#1603)
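The behavior described above can be illustrated with a minimal sketch. This is not the code from `crawl4ai/async_url_seeder.py`: the function `resolve_head`, the injected `head` transport, the `max_redirects` limit, and the `a.example` URLs are all hypothetical, chosen only to show how a `verify_redirect_targets` flag might separate the fixed behavior from the legacy one.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Tuple
from urllib.parse import urljoin

# Hypothetical transport: maps a URL to (status code, Location header or None).
HeadFn = Callable[[str], Awaitable[Tuple[int, Optional[str]]]]

REDIRECT_STATUSES = (301, 302, 303, 307, 308)


async def resolve_head(
    url: str,
    head: HeadFn,
    verify_redirect_targets: bool = True,
    max_redirects: int = 5,  # assumed limit, not taken from the PR
) -> Optional[str]:
    """Return a URL considered live, or None if it is dead."""
    current = url
    for _ in range(max_redirects + 1):
        status, location = await head(current)
        if 200 <= status < 300:
            return current
        if status in REDIRECT_STATUSES and location:
            target = urljoin(current, location)
            if not verify_redirect_targets:
                # Legacy behavior: trust the redirect target without checking it.
                return target
            # Fixed behavior: issue another HEAD request against the target.
            current = target
            continue
        return None
    return None  # too many redirects


# Fake responses standing in for real HEAD requests.
FAKE_RESPONSES = {
    "https://a.example/old": (301, "/gone"),
    "https://a.example/gone": (404, None),
    "https://a.example/moved": (302, "https://a.example/ok"),
    "https://a.example/ok": (200, None),
}


async def fake_head(url: str) -> Tuple[int, Optional[str]]:
    return FAKE_RESPONSES.get(url, (404, None))
```

With this fake transport, `asyncio.run(resolve_head("https://a.example/old", fake_head))` returns `None` because the redirect target answers 404, while passing `verify_redirect_targets=False` returns `"https://a.example/gone"`, the unverified target that the legacy behavior would have accepted.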
Key Changes:
- Added `verify_redirect_targets` parameter to `AsyncUrlSeeder.__init__()` (default: `True`)
- Modified `_resolve_head()` to conditionally verify redirect targets based on the parameter

Backward Compatibility: Existing code continues to work with improved behavior by default. Users can set `verify_redirect_targets=False` to restore the previous behavior if needed.

List of files changed and why
- `crawl4ai/async_url_seeder.py` - Core bug fix and parameter addition
- `tests/test_async_url_seeder.py` - Added unit tests for both verification modes
- `test_scripts/test_async_url_seeder_fixes.py` - Comprehensive demo/test suite for all fixes
- `test_scripts/README.md` - Documentation for test scripts

How Has This Been Tested?
Created a comprehensive test suite covering:
- `verify_redirect_targets=False`

Run the test suite with:
`python test_scripts/test_async_url_seeder_fixes.py`

Checklist: