Open
Labels
🐞 Bug (Something isn't working), 📌 Root caused (identified the root cause of bug)
Description
crawl4ai version
0.7.8
Expected Behavior
Per Documentation:
- Total Score - Smart Combination
Intelligently combines intrinsic and contextual scores with fallbacks:
- When both scores available: (intrinsic * 0.3) + (contextual * 0.7)
- When only intrinsic: uses intrinsic score
- When only contextual: uses contextual score
- When neither: not calculated
head_data is returned.
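The documented fallback rules above can be sketched as a small function. This is only my reading of the documentation, not crawl4ai's actual implementation; the function and constant names are hypothetical.

```python
from typing import Optional

# Weights quoted in the documentation for the combined case.
INTRINSIC_WEIGHT = 0.3
CONTEXTUAL_WEIGHT = 0.7


def documented_total_score(
    intrinsic: Optional[float], contextual: Optional[float]
) -> Optional[float]:
    """Combine scores the way the documentation describes (hypothetical helper)."""
    if intrinsic is not None and contextual is not None:
        # Both available: weighted combination.
        return intrinsic * INTRINSIC_WEIGHT + contextual * CONTEXTUAL_WEIGHT
    if intrinsic is not None:
        # Only intrinsic: fall back to the raw intrinsic score.
        return intrinsic
    if contextual is not None:
        # Only contextual: fall back to the raw contextual score.
        return contextual
    # Neither available: not calculated.
    return None
```

Under this reading, a link with `intrinsic_score` 5.0 and no contextual score should get a `total_score` of 5.0 rather than null.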
Current Behavior
There is no intelligent fallback even though an intrinsic score is available: total_score is null for every link.
Relatedly, I'm not getting head_data back either, although that could be caused by the PDF files I'm looking at.
Below is an example output:
{
"https://www.pa.gov/en/grants/search/grant-details/dced/9": [
{
"url": "https://dced.pa.gov/programs/ben-franklin-technology-development-authority-venture-investment-program",
"text": "Program Page",
"intrinsic_score": 5.0,
"contextual_score": null,
"total_score": null,
"head_data": null
},
{
"url": "https://grants.pa.gov/",
"text": "Go to Application (opens in a new tab)",
"intrinsic_score": 4.0,
"contextual_score": null,
"total_score": null,
"head_data": null
},
{
"url": "https://dced.pa.gov/download/bftda-venture-investment-program-guidelines?wpdmdl=87903",
"text": "Program Guidelines",
"intrinsic_score": 4.0,
"contextual_score": null,
"total_score": null,
"head_data": null
},
{
"url": "https://www.pa.gov/privacy-policy",
"text": "Privacy Policy(opens in a new tab)",
"intrinsic_score": 3.5,
"contextual_score": null,
"total_score": null,
"head_data": null
},
{
"url": "https://www.pa.gov/en/agencies/dced.html",
"text": "Visit the DCED Website (opens in a new tab)",
"intrinsic_score": 2.7857142857142856,
"contextual_score": null,
"total_score": null,
"head_data": null
}
]
}
Additionally, for an apples-to-apples comparison, total_score shouldn't just be the raw intrinsic value when contextual isn't available, but rather the weighted intrinsic value. From the documentation, it seems total_score would simply be the raw intrinsic score in that case (if intrinsic is 5, it returns 5, not 5 x 0.3 = 1.5).
| Scenario | Apples-to-Apples Approach | Resulting Formula / Score |
|---|---|---|
| Both available | Weighted average | (intrinsic * 0.3) + (contextual * 0.7) |
| Only intrinsic | Uses weighted intrinsic score | intrinsic * 0.3 |
| Only contextual | Uses weighted contextual score | contextual * 0.7 |
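The apples-to-apples proposal in the table above could look like the following sketch. It assumes the 0.3/0.7 weights from the documentation apply even when only one score is present; the function name is hypothetical and this is not code from crawl4ai.

```python
from typing import Optional

# Weights taken from the documented combined formula.
INTRINSIC_WEIGHT = 0.3
CONTEXTUAL_WEIGHT = 0.7


def weighted_total_score(
    intrinsic: Optional[float], contextual: Optional[float]
) -> Optional[float]:
    """Always apply the weights so totals stay comparable (hypothetical helper)."""
    if intrinsic is None and contextual is None:
        # Neither available: not calculated.
        return None
    total = 0.0
    if intrinsic is not None:
        # Weighted intrinsic contribution, even when it is the only score.
        total += intrinsic * INTRINSIC_WEIGHT
    if contextual is not None:
        # Weighted contextual contribution, even when it is the only score.
        total += contextual * CONTEXTUAL_WEIGHT
    return total
```

With this approach a link scored 5.0 intrinsically and with no contextual score would get roughly 1.5, which stays on the same scale as links that have both scores.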
Is this reproducible?
Yes
Inputs Causing the Bug
See example above.
Steps to Reproduce
See Code Snippet
Code snippets
The relevant code looks something like this:
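from crawl4ai import (
    CacheMode,
    CrawlerRunConfig,
    DefaultMarkdownGenerator,
    LinkPreviewConfig,
)

# `t` and `llm_strategy` are defined elsewhere in my code and omitted here.
md_generator = DefaultMarkdownGenerator()
config = CrawlerRunConfig(
    url_matcher=str(t.url),
    markdown_generator=md_generator,
    excluded_tags=['nav', 'footer', 'header'],
    extraction_strategy=llm_strategy,
    cache_mode=CacheMode.BYPASS,
    stream=True,
    score_links=True,  # request link scores
    exclude_all_images=True,
    link_preview_config=LinkPreviewConfig(
        include_internal=True,
        include_external=True,
        max_links=20,
        concurrency=5,
        timeout=10,
        query='my query here',  # trailing comma was missing in the original snippet
        score_threshold=0.2,
    ),
)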
Happy to provide more details privately.
OS
Linux
Python version
3.12.3
Browser
Chrome
Browser version
144.0.7559.59
Error logs & Screenshots (if applicable)
No response