Skip to content

feat: add support parse_url#3563

Open
rafafrdz wants to merge 12 commits intoapache:mainfrom
rafafrdz:feat/add-support-parse_url
Open

feat: add support parse_url#3563
rafafrdz wants to merge 12 commits intoapache:mainfrom
rafafrdz:feat/add-support-parse_url

Conversation

@rafafrdz
Copy link
Contributor

@rafafrdz rafafrdz commented Feb 21, 2026

Summary

  • Added support for Spark parse_url by mapping ParseUrl to the native scalar function parse_url in expression serde.
  • Updated spark_expressions_support.md to mark parse_url as supported.
  • Add tests

Why
This closes one of the missing DataFusion 50 migration functions from issue #2443

Part of #2443

@rafafrdz rafafrdz marked this pull request as draft February 21, 2026 10:22
@rafafrdz rafafrdz marked this pull request as ready for review February 21, 2026 11:05
@mbutrovich
Copy link
Contributor

Thanks @rafafrdz! Kicking off CI.

@andygrove
Copy link
Member

This looks good so far @rafafrdz.

CI shows all three parse_url tests failing on Spark 4.0 with COMET: invoke is not supported. This happens because ParseUrl was changed to RuntimeReplaceable in Spark 4.0. It gets replaced by an Invoke expression during analysis, so Comet's serde never sees the ParseUrl node.

On the compatibility side, I noticed that the PR doesn't distinguish between ANSI and legacy mode for invalid URL handling. DataFusion's parse_url throws errors on malformed URLs, which matches Spark's behavior when failOnError=true (ANSI mode), but in legacy mode (failOnError=false), Spark returns null for invalid URLs instead of throwing.

@rafafrdz
Copy link
Contributor Author

Thanks for the suggestion @andygrove

What I did:

  • Spark 4.0 RuntimeReplaceable / Invoke path: ParseUrl is now handled when Spark rewrites it to Invoke(ParseUrlEvaluator.evaluate, ...), so Comet serde no longer misses it and we avoid "the invoke is not supported" fallback (please check it out if it's correct or not)
  • ANSI vs legacy behavior: serde now respects failOnError:
    • failOnError = true (ANSI) -> parse_url (error on malformed URLs)
    • failOnError = false (legacy) -> try_parse_url (returns NULL on malformed URLs)

I also registered try_parse_url on the native side and added a test for invalid URL handling in legacy mode (Spark 4.0 path).

@rafafrdz rafafrdz marked this pull request as draft February 26, 2026 10:59
rafafrdz added 4 commits March 4, 2026 12:53
- Resolve conflict in jni_api.rs: add ParseUrl/TryParseUrl registrations
  alongside SparkSpace and SparkBitCount from upstream
- Update datafusion-spark dependency to 52.2.0 (via upstream Cargo.toml)
- Simplify CometParseUrl serde:
  - Spark 3.5 path reads failOnError directly from ParseUrl.failOnError
    (no reflection needed)
  - Spark 4.0 Invoke path uses reflection only on ParseUrlEvaluator
  - Remove dead 'invoke' branch from convertExpression; rename
    entry point to convertFromInvoke for clarity
- Expand parse_url.sql test: add PATH/FILE/REF/AUTHORITY/USERINFO
  queries, ftp URL row, and try_parse_url coverage
- Add CometStringExpressionSuite tests: REF/AUTHORITY/USERINFO parts,
  ANSI mode (Spark 3.5), and a dedicated try_parse_url test
- Remove try_parse_url SQL queries: try_parse_url is a Comet-internal
  DataFusion function name used when serializing parse_url with
  failOnError=false. It is not a registered Spark SQL function and
  calling it directly via SQL raises UNRESOLVED_ROUTINE on any Spark
  version. The NULL-on-invalid-URL behaviour is covered by the
  'parse_url with invalid URL in legacy mode' Scala test.
- Replace ftp port 21 with 2121 in test URLs: the Rust url crate omits
  well-known default ports (ftp=21) when serialising authority(), while
  Java URI.getRawAuthority() preserves them verbatim. Using a
  non-default port avoids this pre-existing semantic gap and keeps the
  AUTHORITY test case meaningful on both Spark 3.5 and 4.0.
@rafafrdz rafafrdz force-pushed the feat/add-support-parse_url branch from 7c8a8d9 to 9d7720f Compare March 4, 2026 15:01
@rafafrdz rafafrdz marked this pull request as ready for review March 4, 2026 15:02
Use Expression directly instead of a generic type parameter T <: Expression.
This eliminates the asInstanceOf cast in convertFromInvokeUnchecked and the
unchecked method itself, since QueryPlanSerde can now call convertFromInvoke
directly without existential type issues.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants