feat: Add web research node with Tavily search integration #56

1wos wants to merge 3 commits into WithModulabs:v2-main from
Conversation
@1wos is attempting to deploy a commit to robertchoi's projects Team on Vercel. A member of the Team first needs to authorize it.
📝 Walkthrough

Adds a health-check GET /api/chk, extends Post schemas with title/content/tags and response metadata, and integrates Tavily-based web research into the blog-writer flow (new WebResearch node, search_with_tavily/web_search functions, SearchProvider enum, and prompt updates).
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor User
    participant HumanSelectKeywords as HumanSelectKeywords
    participant WebResearch as WebResearch
    participant TavilyAPI as Tavily API
    participant WriteBlog as WriteBlog
    participant State as BlogState
    User->>HumanSelectKeywords: request blog with keywords
    HumanSelectKeywords->>State: store selected_keywords
    State-->>WebResearch: trigger with selected_keywords
    WebResearch->>TavilyAPI: search keywords (top N)
    TavilyAPI-->>WebResearch: return search results
    WebResearch->>State: store search_results
    State-->>WriteBlog: provide analyzed_content + selected_keywords + search_results
    WriteBlog->>WriteBlog: format search_results_section and render blog
    WriteBlog-->>User: return blog content with sources
```
Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/cast_tests/blog_writer_test.py (1)
29-41: ⚠️ Potential issue | 🟡 Minor
`"web_research"` is missing from the expected nodes list.

The graph now contains 9 nodes (including the new `web_research`), but this test still only checks for 8. While the test won't fail (it only verifies listed nodes exist, not exclusivity), it silently skips coverage for the newly added node.

Proposed fix
```diff
 expected_nodes = [
     "fetch_content",
     "analyze_content",
     "suggest_keywords",
     "human_select_keywords",
+    "web_research",
     "write_blog",
     "optimize_seo",
     "generate_images",
     "convert_to_html",
 ]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/cast_tests/blog_writer_test.py` around lines 29 - 41, The expected_nodes list in the test (variable expected_nodes used in the loop with node_name) is missing the newly added "web_research" node; update the expected_nodes array to include "web_research" so the test checks for that node as well (i.e., add the string "web_research" alongside "fetch_content", "analyze_content", etc.) to ensure the graph's new node is covered by the assertions.
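The review notes that the membership-only loop cannot catch an omitted node. A sketch of a set-equality assertion that would fail on both missing and unexpected nodes; the node names come from the review comment, while `actual_nodes` is a stand-in for whatever the real test extracts from the compiled graph:

```python
# Node names as listed in the review's proposed fix.
expected_nodes = {
    "fetch_content", "analyze_content", "suggest_keywords",
    "human_select_keywords", "web_research", "write_blog",
    "optimize_seo", "generate_images", "convert_to_html",
}

# Stand-in for something like set(graph.nodes) minus framework-internal nodes.
actual_nodes = set(expected_nodes)

# Set equality fails on BOTH missing and unexpected nodes, unlike a loop
# that only checks membership of each listed name.
assert actual_nodes == expected_nodes
```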
🧹 Nitpick comments (7)
api/schemas/post.py (3)
25-25: `tags` is missing `Field(...)` for consistency.

`title` and `content` both use `Field(None, ...)` with descriptions/constraints, but `tags` uses a bare default. This inconsistency makes the schema harder to read and leaves out API documentation for this field.

♻️ Suggested fix
```diff
- tags: Optional[List[str]] = None
+ tags: Optional[List[str]] = Field(None, description="태그 목록")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/schemas/post.py` at line 25, The tags field in the Pydantic model is using a bare default (Optional[List[str]] = None) and should be made consistent with title/content by using Field to provide metadata; update the tags declaration (the tags attribute in the model) to use Field(None, description="List of tag strings for the post") (or similar wording consistent with your project's descriptions) so the schema includes API docs and any constraints.
6-11: Consider adding a per-item `max_length` on tag strings.

`tags: List[str]` places no upper bound on the length of individual tag strings. Without a `max_length` constraint on each item, an API client can submit arbitrarily long tag values, which may hit database column limits silently or be exploitable for oversized payloads.

🛡️ Suggested constraint
```diff
- tags: List[str] = Field(default_factory=list, description="태그 목록")
+ tags: List[Annotated[str, StringConstraints(max_length=50)]] = Field(
+     default_factory=list, description="태그 목록"
+ )  # adjust the per-item limit as appropriate
```

Note: in Pydantic v2, a `max_length` passed directly to `Field` on a `List[str]` field bounds the number of items in the list, not the length of each string; a per-item bound requires constraint metadata on the item type, e.g. `Annotated[str, StringConstraints(max_length=50)]`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/schemas/post.py` around lines 6 - 11, The tags field on PostBase currently allows arbitrarily long strings per item; update the tags definition (PostBase.tags) to include a per-item max_length via Field (e.g., Field(default_factory=list, max_length=50, description="태그 목록")) so each tag string is bounded (choose an appropriate max_length such as 50 or 100 for your DB constraints) and keep the default_factory and description intact.
1-3: Consider replacing `typing` generics with built-in equivalents.

The project targets Python 3.11, which natively supports the `list[str]` and `str | None` syntax without requiring `from __future__ import annotations`. Migrating from the legacy `List` and `Optional` aliases improves code readability and aligns with modern Python best practices.

♻️ Suggested migration
```diff
-from pydantic import BaseModel, Field, ConfigDict
-from typing import Optional, List
-from datetime import datetime
+from pydantic import BaseModel, Field, ConfigDict
+from datetime import datetime
```

Then in the schema bodies:
```diff
- tags: List[str] = Field(...)
+ tags: list[str] = Field(...)

- title: Optional[str] = Field(...)
+ title: str | None = Field(...)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/schemas/post.py` around lines 1 - 3, Replace legacy typing aliases with built-in generics in the post schema: remove imports of Optional and List from typing and update any type annotations using Optional[...] to the PEP 604 union form (e.g., str | None) and List[...] to the built-in bracket form (e.g., list[str]); keep imports of BaseModel, Field, ConfigDict and datetime intact and adjust any annotated fields in the BaseModel subclasses to use the new syntax so the file uses native Python 3.11 types.

api/chk.py (1)
6-8: Consider stabilizing the `message` string.
`"Standalone routing is working!"` reads like a debug phrase rather than a stable health-check response. If this endpoint is permanent, a neutral string such as `"ok"` or `"healthy"` is more appropriate for consumers.

✏️ Proposed change
```diff
- return {"status": "ok", "message": "Standalone routing is working!"}
+ return {"status": "ok", "message": "ok"}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/chk.py` around lines 6 - 8, The health-check handler check (decorated with `@app.get`("/api/chk")) returns a debug-like message; change its response to a stable, neutral value (e.g., {"status": "ok", "message": "healthy"} or simply {"status": "ok"}) so consumers get a consistent, production-safe health string; update the return value in the check function accordingly.

casts/blog_writer/modules/state.py (1)
53-58: Consider adding `search_provider` to `BlogWriterConfig`.

The other provider enums (`llm_provider`, `image_provider`, `scraper_type`) are all configurable via `BlogWriterConfig`, but `SearchProvider` is not. Currently there's only one search provider, so it's not urgent, but adding it now keeps the config surface consistent and prepares for future providers.

Suggested addition
```diff
 class BlogWriterConfig(BaseModel):
     """Configuration for Blog Writer cast."""

     llm_provider: LLMProvider = LLMProvider.OPENAI
     image_provider: ImageProvider = ImageProvider.DALLE
     scraper_type: ScraperType = ScraperType.BEAUTIFULSOUP
+    search_provider: SearchProvider = SearchProvider.TAVILY
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@casts/blog_writer/modules/state.py` around lines 53 - 58, Add a new field to BlogWriterConfig named search_provider of type SearchProvider with a sensible default (e.g., SearchProvider.TAVILY, matching the suggested diff) so the config surface matches the other provider enums; update the imports to include SearchProvider if not present and add search_provider: SearchProvider = SearchProvider.TAVILY to the BlogWriterConfig class definition so code referencing BlogWriterConfig (and any Pydantic parsing) will include the search provider option.

casts/blog_writer/modules/nodes.py (1)
214-250: WebResearch node: solid error handling, but the config-based API key path is likely dead code.

The `api_keys.get_tavily_key()` call at lines 227-230 requires `api_keys` to be an object with a `get_tavily_key` method. However, `BlogState.config` is typed as `dict`, so `api_keys` obtained via `state["config"].get("api_keys")` will typically be a dict (or `None`). The `hasattr` guard prevents a crash, but it means `api_key` will always be `None` here, falling through to the `TAVILY_API_KEY` env var in `search_with_tavily`.

This isn't broken (the env var fallback works), but if you intend to support runtime API key injection via config, consider aligning with a dict-based approach:
Suggested fix
```diff
 api_key = None
 if state.get("config"):
     api_keys = state["config"].get("api_keys")
     if api_keys:
-        api_key = (
-            api_keys.get_tavily_key()
-            if hasattr(api_keys, "get_tavily_key")
-            else None
-        )
+        api_key = (
+            api_keys.get("tavily_key")
+            if isinstance(api_keys, dict)
+            else getattr(api_keys, "get_tavily_key", lambda: None)()
+        )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@casts/blog_writer/modules/nodes.py` around lines 214 - 250, The config path in WebResearch.execute assumes api_keys is an object with get_tavily_key, but BlogState.config is a dict so that branch is effectively dead; update the logic that extracts api_key from state["config"].get("api_keys") to support dict-based injection (e.g., check if api_keys is a dict and read a well-known key like "tavily" or "tavily_api_key"), while still preserving the existing hasattr(api_keys, "get_tavily_key") branch for backward compatibility; ensure the value you extract is passed through to web_search (SearchProvider.TAVILY) so runtime config API keys override the env fallback.casts/blog_writer/modules/tools.py (1)
236-247: Error entries are included as search results and will appear in the LLM prompt.

When a keyword search fails, the error message (line 245: `f"검색 실패: {e}"`) is stored in the `content` field of a result dict. This entry flows through `WriteBlog`'s formatting (nodes.py line 276) and into the LLM prompt. While not a security issue, the LLM seeing error messages like stack traces or connection errors in the `<web_research>` block could degrade output quality.

Consider filtering out error entries before passing to the prompt, or using a distinct marker so `WriteBlog` can skip them.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@casts/blog_writer/modules/tools.py` around lines 236 - 247, The current except block in tools.py appends an error-result dict (with content f"검색 실패: {e}") into all_results so it flows into WriteBlog (nodes.py WriteBlog formatting) and into the LLM prompt; change the handling so error entries are not treated as regular search results: either (A) do not append any result on exception (remove the all_results.append in the except), or (B) append a clearly typed marker such as {"keyword": keyword, "url":"", "title":"", "content":"", "is_error": True} and then update WriteBlog (nodes.py formatting function) to skip any result where is_error is True before building the <web_research> block. Ensure you update all call sites that expect the result shape to tolerate the new marker if you choose option B.
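Option B above can be sketched as a small formatting filter. The `is_error` field is hypothetical until added to tools.py, and the output layout is illustrative, not the actual `WriteBlog` format:

```python
def format_web_research(results: list[dict]) -> str:
    """Build the text for a <web_research> prompt block, skipping
    error markers so failed-search noise never reaches the LLM."""
    lines = []
    for r in results:
        if r.get("is_error"):
            continue  # drop entries tagged by the except branch
        lines.append(
            f"- {r.get('title', '')} ({r.get('url', '')}): {r.get('content', '')}"
        )
    return "\n".join(lines)
```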
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@casts/blog_writer/modules/tools.py`:
- Around line 153-203: The search_with_tavily function currently accepts and
passes an api_key to TavilySearch (which ignores it), doesn't parse the JSON
string returned by tool.ainvoke, and thus returns malformed results; update
search_with_tavily to remove the api_key parameter (or mark it unused),
keep/validate tavily_key via os.getenv("TAVILY_API_KEY") and raise if missing,
instantiate TavilySearch without an api_key argument, and after await
tool.ainvoke({"query": query}) call parse the returned JSON string with
json.loads() before normalizing the "results" list into the expected List[dict]
with keys "url", "title", "content".
In `@pyproject.toml`:
- Line 15: Add a Tavily API key field to the API key schema and wire header
extraction into the per-request key resolution: add a new field (e.g.,
TAVILY_API_KEY) to the APIKeys dataclass/schema in api/schemas/api_keys.py, then
update the get_api_keys dependency/function (where other X-... headers are read)
to accept and map the incoming "X-Tavily-API-Key" header into
APIKeys.TAVILY_API_KEY so multi-tenant callers can supply their own key per
request; ensure any Tavily client initialization consumes APIKeys.TAVILY_API_KEY
if present before falling back to env config.
```python
async def search_with_tavily(
    query: str, max_results: int = 3, api_key: str | None = None
) -> list[dict]:
    """Search the web using Tavily API.

    Uses langchain-tavily TavilySearch tool (LangChain v1).

    Args:
        query: Search query string
        max_results: Maximum number of results to return
        api_key: Optional Tavily API key (falls back to env var)

    Returns:
        List of search results with url, title, content
    """
    from langchain_tavily import TavilySearch

    tavily_key = api_key or os.getenv("TAVILY_API_KEY")
    if not tavily_key:
        raise ValueError("TAVILY_API_KEY가 설정되어 있지 않습니다.")

    tool = TavilySearch(
        max_results=max_results,
        api_key=tavily_key,
    )

    results = await tool.ainvoke({"query": query})

    # Normalize results to a consistent format
    if isinstance(results, dict) and "results" in results:
        # TavilySearch returns {"query": ..., "results": [...]}
        return [
            {
                "url": r.get("url", ""),
                "title": r.get("title", ""),
                "content": r.get("content", ""),
            }
            for r in results["results"]
        ]
    if isinstance(results, str):
        return [{"content": results, "url": "", "title": ""}]
    if isinstance(results, list):
        return [
            {
                "url": r.get("url", ""),
                "title": r.get("title", ""),
                "content": r.get("content", ""),
            }
            for r in results
        ]
    return []
```
🧩 Analysis chain
🌐 Web query:
langchain-tavily TavilySearch constructor parameters api_key max_results ainvoke
💡 Result:
langchain_tavily.TavilySearch constructor params
TavilySearch(...) is configured via keyword args like:
- `max_results: int = 5`
- `topic: str = "general"` (e.g. `"general"`, `"news"`, `"finance"`)
- `include_answer: bool`
- `include_raw_content: bool`
- `include_images: bool`
- `include_image_descriptions: bool`
- `include_favicon: bool`
- `include_usage: bool`
- `search_depth: str` (e.g. `"basic"`, `"advanced"`)
- `time_range: str | None` (e.g. `"day"`, `"week"`, `"month"`, `"year"`)
- `start_date: str | None` / `end_date: str | None` (format `YYYY-MM-DD`)
- `include_domains: list[str] | None` / `exclude_domains: list[str] | None` (plus `country` is shown in the GitHub README example)
[1][2]
api_key parameter?
The langchain-tavily docs show credentials are read from the TAVILY_API_KEY environment variable (not an api_key= constructor argument in the examples). [1][2]
ainvoke
Like other LangChain tools/runnables, you can call it asynchronously:
```python
result = await tool.ainvoke({"query": "..."})
```

`ainvoke` is the async counterpart to `invoke` (the default implementation runs `invoke` in a thread if native async isn't provided). [3]
Sources:
[1] Tavily LangChain integration docs (langchain-tavily)
[2] langchain-tavily README / PyPI description (parameters + defaults)
[3] LangChain API docs (Runnable/BaseTool ainvoke)
🌐 Web query:
TavilySearch ainvoke return format langchain-tavily results structure
💡 Result:
In LangChain’s Tavily integration, the async call await tool.ainvoke(...) returns the same shape as tool.invoke(...) (just executed asynchronously).
langchain-tavily (from langchain_tavily import TavilySearch)
1) Invoking with plain args
```python
from langchain_tavily import TavilySearch
import json

tool = TavilySearch(max_results=5)
raw = await tool.ainvoke({"query": "euro 2024 host nation"})
data = json.loads(raw)  # raw is a JSON string
```

`raw` is a JSON string representing an object like: (docs.langchain.com)
```json
{
  "query": "...",
  "follow_up_questions": null,
  "answer": null,
  "images": [],
  "results": [
    {
      "title": "...",
      "url": "...",
      "content": "...",
      "score": 0.91,
      "raw_content": null
    }
  ],
  "response_time": 1.67
}
```

2) Invoking with a ToolCall (model-style)
```python
tool_call = {"args": {"query": "euro 2024 host nation"}, "id": "1", "name": "tavily", "type": "tool_call"}
tool_msg = await tool.ainvoke(tool_call)
raw = tool_msg.content  # JSON string (same structure as above)
data = json.loads(raw)
```

When invoked as a ToolCall, you get a `ToolMessage`, whose `.content` is that same JSON string. (docs.langchain.com)
Legacy langchain-community (`from langchain_community.tools import TavilySearchResults`)
This older tool can be configured with response_format="content_and_artifact" so the ToolMessage carries a parsed artifact dict (in addition to JSON string content). (api.python.langchain.com)
Citations:
- 1: https://docs.langchain.com/oss/python/integrations/tools/tavily_search
- 2: https://docs.langchain.com/oss/python/integrations/tools/tavily_search
- 3: https://api.python.langchain.com/en/latest/community/tools/langchain_community.tools.tavily_search.tool.TavilySearchResults.html?utm_source=openai
Fix critical API integration bugs in the `search_with_tavily` function (lines 153-203).
Three critical issues prevent this function from working:
- `api_key` parameter is not accepted by `TavilySearch`: the constructor does not recognize an `api_key` argument. The `langchain-tavily` integration reads the API key exclusively from the `TAVILY_API_KEY` environment variable. The passed `api_key` parameter (line 167) is silently ignored, making the function signature misleading.
- `ainvoke` returns a JSON string, not a dict: the call to `await tool.ainvoke({"query": query})` returns a JSON string in the format `{"query": "...", "results": [...], ...}`, not a Python dict. The code does not parse this string. When `isinstance(results, str)` is True (line 181), it returns the entire JSON string as a single result's content field, producing malformed output instead of the expected `list[dict]` structure.
- The error-handling check is ineffective: the `tavily_key` variable computed at line 164 is never actually used, since the `api_key` parameter to `TavilySearch` is invalid. The `ValueError` will not prevent initialization if the environment variable is set.
Required fix:
- Remove the `api_key` parameter from the `TavilySearch` constructor and ensure `TAVILY_API_KEY` is set in the environment.
- Parse the JSON string returned by `ainvoke` using `json.loads()` before attempting to access the `"results"` key.
- Revise the function signature and documentation to clarify that the `api_key` parameter is unused, or remove it entirely.
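The second bullet, parsing before normalizing, is the core of the fix and can be sketched stdlib-only. This assumes the documented response shape shown earlier; the function name is illustrative, and the real code would call it on whatever `tool.ainvoke` returns:

```python
import json


def normalize_tavily_response(raw) -> list[dict]:
    """Normalize TavilySearch output (a JSON string, or an already-parsed
    dict/list) into a list of {"url", "title", "content"} dicts."""
    if isinstance(raw, str):
        try:
            raw = json.loads(raw)  # ainvoke returns a JSON string
        except json.JSONDecodeError:
            # Not JSON: fall back to wrapping the raw text as one result.
            return [{"url": "", "title": "", "content": raw}]
    if isinstance(raw, dict):
        # Documented shape: {"query": ..., "results": [...], ...}
        raw = raw.get("results", [])
    if isinstance(raw, list):
        return [
            {
                "url": r.get("url", ""),
                "title": r.get("title", ""),
                "content": r.get("content", ""),
            }
            for r in raw
            if isinstance(r, dict)
        ]
    return []
```

With this in place, the `isinstance(results, dict)` branch in the original code becomes reachable, because the JSON string is decoded before any shape checks run.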
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@casts/blog_writer/modules/tools.py`:
- Around line 168-169: Replace the private import from
langchain_tavily._utilities by using the public API: remove the "from
langchain_tavily._utilities import TavilySearchAPIWrapper" import and instead
import the public wrapper path "from langchain_community.utilities.tavily_search
import TavilySearchAPIWrapper" or, better, refactor to use TavilySearch directly
(the TavilySearch class from langchain_tavily) so you don't instantiate a
private wrapper; update all uses of TavilySearchAPIWrapper in this module (e.g.,
where you construct or call the wrapper) to either use the public
TavilySearchAPIWrapper symbol or call TavilySearch's official methods per the
migration guidance.
---
Duplicate comments:
In `@casts/blog_writer/modules/tools.py`:
- Around line 181-205: The code treats the value returned by tool.ainvoke as
possibly a dict but TavilySearch.ainvoke returns a JSON string, so parse the
JSON string into Python objects before the isinstance checks: call json.loads on
results (handle json.JSONDecodeError and fall back to the original string), then
run the existing branches that expect dict/list/str; update the logic around the
results variable used after tool.ainvoke (and any functions that consume it) so
the dict branch (the {"results": [...]}) is reachable and individual result
objects are extracted rather than wrapping the raw JSON blob into a single
content entry.
#53
Uses the `langchain-tavily` package (`TavilySearch`)

Summary by CodeRabbit
New Features
New Data / Schema
Style
Chores