Oscarld/obfuscation http parity by Eldolfin · Pull Request #1684 · DataDog/libdatadog

Eldolfin · 2026-03-06T15:57:11Z

What does this PR do?

Ran a fuzzer to find output differences between this obfuscator and the agent's obfuscator, then fixed the issues one by one, including the nonsensical edge cases.

Motivation

Reach 100% parity between obfuscation libs.

Additional Notes

Anything else we should know when reviewing?

How to test the change?

Describe here in detail how the change can be validated.

github-actions · 2026-03-06T16:07:49Z

Clippy Allow Annotation Report

Comparing clippy allow annotations between branches:

Base Branch: origin/main
PR Branch: origin/oscarld/obfuscation-http-parity

Summary by Rule

Rule	Base Branch	PR Branch	Change
expect_used	0	1	⚠️ +1 (N/A)
unwrap_used	1	1	No change (0%)
Total	1	2	⚠️ +1 (+100.0%)

Annotation Counts by File

File	Base Branch	PR Branch	Change
`libdd-trace-obfuscation/src/http.rs`	1	2	⚠️ +1 (+100.0%)

Annotation Stats by Crate

Crate	Base Branch	PR Branch	Change
`clippy-annotation-reporter`	5	5	No change (0%)
`datadog-ffe-ffi`	1	1	No change (0%)
`datadog-ipc`	27	27	No change (0%)
`datadog-live-debugger`	6	6	No change (0%)
`datadog-live-debugger-ffi`	10	10	No change (0%)
`datadog-profiling-replayer`	4	4	No change (0%)
`datadog-remote-config`	3	3	No change (0%)
`datadog-sidecar`	59	59	No change (0%)
`libdd-common`	10	10	No change (0%)
`libdd-common-ffi`	12	12	No change (0%)
`libdd-crashtracker`	12	12	No change (0%)
`libdd-data-pipeline`	5	5	No change (0%)
`libdd-ddsketch`	2	2	No change (0%)
`libdd-dogstatsd-client`	1	1	No change (0%)
`libdd-profiling`	13	13	No change (0%)
`libdd-telemetry`	19	19	No change (0%)
`libdd-tinybytes`	4	4	No change (0%)
`libdd-trace-normalization`	2	2	No change (0%)
`libdd-trace-obfuscation`	9	10	⚠️ +1 (+11.1%)
`libdd-trace-utils`	15	15	No change (0%)
Total	219	220	⚠️ +1 (+0.5%)

About This Report

This report tracks Clippy allow annotations for specific rules, showing how they've changed in this PR. Decreasing the number of these annotations generally improves code quality.

Go's url.Parse rejects bare '%' and other invalid percent-encoding sequences, returning an error which causes obfuscateURLString to return "?". The url crate silently re-encodes them as '%25', so add an explicit pre-check matching Go's behavior. Fixes fuzzing testcase: http_fuzzing_594901251
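The pre-check described above can be sketched as follows. The function name `has_invalid_percent_encoding` is hypothetical and not taken from the PR; the sketch only assumes that a '%' must be followed by exactly two ASCII hex digits, which is the syntactic rule the commit message refers to.

```rust
/// Returns true if `s` contains a '%' that is not followed by two ASCII hex
/// digits. Hypothetical helper: the name and shape are illustrative only.
fn has_invalid_percent_encoding(s: &str) -> bool {
    let bytes = s.as_bytes();
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'%' {
            // A valid escape needs two more bytes, both hex digits.
            let valid = i + 2 < bytes.len()
                && bytes[i + 1].is_ascii_hexdigit()
                && bytes[i + 2].is_ascii_hexdigit();
            if !valid {
                return true;
            }
            i += 3;
        } else {
            i += 1;
        }
    }
    false
}
```

A caller would return "?" whenever this check fires, before handing the input to the url crate.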

Go's url.Parse stores "." and ".." path segments literally, while the url crate's join() resolves them via RFC 3986 normalization (making them empty after stripping the base). Return the original input when go_like_reference returns empty for a non-empty input that already passed all error checks (control chars, invalid percent-encoding). Fixes fuzzing testcase: http_fuzzing_3638045804

Go's url.Parse succeeds for relative URLs (like "0") and applies path-digit removal to them. The Rust code was returning early from the go_like_reference path without applying digit removal. Add remove_relative_path_digits() helper and call it for relative URL results when remove_path_digits=true. Fixes fuzzing testcase: http_fuzzing_1928485962
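A minimal sketch of what such a helper might look like. The name `remove_relative_path_digits` comes from the commit message, but the body is an assumption: it supposes that digit removal replaces every '/'-separated segment containing an ASCII digit with "?", which may not match the PR's implementation in every detail.

```rust
/// Replaces each '/'-separated segment that contains an ASCII digit with "?".
/// Assumed behavior, written for illustration; not copied from the PR.
fn remove_relative_path_digits(path: &str) -> String {
    path.split('/')
        .map(|seg| {
            if seg.bytes().any(|b| b.is_ascii_digit()) {
                "?"
            } else {
                seg
            }
        })
        .collect::<Vec<_>>()
        .join("/")
}
```

Under that assumption, the relative URL "0" from the commit message becomes "?", while a digit-free input such as "abc" passes through unchanged.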

…URLs Go's url.shouldEscape for encodePath does not allow !, ', (, ), * even though RFC 3986 considers them valid sub-delimiters in path segments. The url crate follows RFC 3986 and keeps them unencoded. Post-process go_like_reference output to encode these characters to match Go's behavior. Fixes fuzzing testcase: http_fuzzing_4273565798

Go's validEncoded() has an explicit allowlist for !, ', (, ), * so these are only re-encoded when the path has non-ASCII chars (which forces Go to call escape() instead of using RawPath). For pure-ASCII inputs, Go's EscapedPath() returns the RawPath unchanged, keeping ! as-is. Only apply encode_go_path_chars() when the original input contains non-ASCII. Fixes fuzzing testcase: http_fuzzing_1457007156

Go's url.Parse percent-encodes non-ASCII chars in fragments (e.g., '#ჸ' → '#%E1%83%B8'). Our early-return fragment handler was returning the raw fragment without encoding. Delegate non-empty fragments to go_like_reference which uses the url crate's join() to correctly encode them. Fixes fuzzing testcase: http_fuzzing_1092426409

…e error) Go's url.Parse rejects ":" (missing protocol scheme) and "1:b" (first path segment cannot contain colon per RFC 3986 §4.2). The url crate accepts them as path characters. Add an explicit check to return "?" for these inputs. Fixes fuzzing testcase: http_fuzzing_3119724369
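The two rejection cases can be illustrated with a small sketch. Both function names are hypothetical, and the scheme scan is a simplified model of how Go is understood to detect a scheme (letters, then digits, '+', '-', '.' after the first character); it is not the PR's code.

```rust
/// Splits off a scheme in a Go-like way. Returns Err for a leading ':'.
/// Hypothetical helper modelling the behavior described in the commit message.
fn split_scheme(raw: &str) -> Result<(Option<&str>, &str), &'static str> {
    for (i, b) in raw.bytes().enumerate() {
        match b {
            b'a'..=b'z' | b'A'..=b'Z' => {}
            b'0'..=b'9' | b'+' | b'-' | b'.' => {
                // These cannot start a scheme, so there is no scheme at all.
                if i == 0 {
                    return Ok((None, raw));
                }
            }
            b':' => {
                if i == 0 {
                    return Err("missing protocol scheme");
                }
                return Ok((Some(&raw[..i]), &raw[i + 1..]));
            }
            _ => return Ok((None, raw)),
        }
    }
    Ok((None, raw))
}

/// True when the input should be rejected (obfuscated to "?") because of a colon.
fn rejected_for_colon(raw: &str) -> bool {
    // Only the part before the fragment takes part in this check.
    let before_fragment = raw.split('#').next().unwrap_or("");
    match split_scheme(before_fragment) {
        Err(_) => true,
        Ok((Some(_), _)) => false,
        Ok((None, rest)) => {
            let rest = rest.split('?').next().unwrap_or("");
            if rest.starts_with('/') {
                return false;
            }
            // Without a scheme, the first path segment must not contain ':'.
            rest.split('/').next().unwrap_or("").contains(':')
        }
    }
}
```

With this model, ":" and "1:b" are rejected, while "a:b" is accepted because "a" parses as a scheme.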

Go's url.Parse percent-encodes control chars in fragments (e.g., '#\x01' → '#%01'). The url crate silently drops them from fragments, returning '#'. Pre-encode control bytes in the fragment manually before passing to go_like_reference via base.join(). Fixes fuzzing testcase: http_fuzzing_1323831861

The previous fix iterated bytes and used 'b as char' which converts u8 to a Unicode scalar, corrupting multi-byte sequences like Georgian ჸ. Iterate over chars instead to preserve multi-byte Unicode correctly. Fixes fuzzing testcase: http_fuzzing_35626170
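The difference between the two iteration strategies is easy to reproduce in isolation. The sketch below uses hypothetical function names and only standard-library calls; the bytewise variant is included purely to show the corruption the commit message describes.

```rust
/// Percent-encodes ASCII control characters and copies every other char
/// unchanged, so multi-byte characters survive. Hypothetical helper.
fn pre_encode_control_chars(fragment: &str) -> String {
    let mut out = String::with_capacity(fragment.len());
    for c in fragment.chars() {
        if c.is_ascii_control() {
            out.push_str(&format!("%{:02X}", c as u32));
        } else {
            out.push(c);
        }
    }
    out
}

/// The buggy shape, for comparison: `b as char` turns each UTF-8 byte into
/// its own Unicode scalar, so one 3-byte character becomes three characters.
fn pre_encode_control_chars_bytewise(fragment: &str) -> String {
    let mut out = String::with_capacity(fragment.len());
    for b in fragment.bytes() {
        if b.is_ascii_control() {
            out.push_str(&format!("%{:02X}", b));
        } else {
            out.push(b as char);
        }
    }
    out
}
```

For the input "\u{1}ჸ" the char-based version yields "%01ჸ", while the bytewise version produces three unrelated Latin-1 range characters in place of ჸ.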

github-actions · 2026-03-06T16:09:53Z

📚 Documentation Check Results

⚠️ 520 documentation warning(s) found

📦 `libdd-trace-obfuscation` - 520 warning(s)

Updated: 2026-03-06 16:25:24 UTC | Commit: 176984b | missing-docs job results

Go's shouldEscape always encodes '\', '^', '{', '}', '|', '<', '>', '`', and ' ' in paths (they're not in validEncoded's allowlist). The url crate keeps them unencoded. Separate from the '!'-etc. class which are only encoded when non-ASCII chars trigger the escape() fallback. Fixes fuzzing testcase: http_fuzzing_618280270
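As an illustration of this always-encoded class, here is a sketch over a path-only string. The function name is hypothetical, and the character list is copied from the commit message rather than from Go's source.

```rust
/// Percent-encodes the characters the commit message lists as always escaped
/// in paths. Expects the path portion only (no fragment). Hypothetical helper.
fn encode_always_escaped_path_chars(path: &str) -> String {
    let mut out = String::with_capacity(path.len());
    for c in path.chars() {
        match c {
            '\\' | '^' | '{' | '}' | '|' | '<' | '>' | '`' | ' ' => {
                // All listed characters are ASCII, so one %XX escape suffices.
                out.push_str(&format!("%{:02X}", c as u32));
            }
            _ => out.push(c),
        }
    }
    out
}
```

Characters from the conditional class such as '!' are deliberately left alone here, since they are only encoded when the escape() fallback is triggered.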

Like '!', '\'', '(', ')', '*', the '[' and ']' characters are in Go's validEncoded allowlist but get encoded when escape() is called due to non-ASCII chars in the path. Fixes fuzzing testcase: http_fuzzing_1505427946

…ath split The url crate treats '\' as a path separator, consuming it silently. Go treats '\' as a path character and encodes it as '%5C'. Pre-encode '\' as '%5C' before calling go_like_reference so base.join() preserves it rather than using it as a path segment separator. Fixes fuzzing testcase: http_fuzzing_backslash_unicode

…ring() Go's url.URL.String() omits a bare '#' with no fragment content. The url crate keeps it. Strip trailing '#' from go_like_reference results. Fixes fuzzing testcase: http_fuzzing_2438023093
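A small sketch of the stripping step, under the assumption that only a '#' acting as the fragment delimiter (the first '#' in the string) should be dropped. The function name is hypothetical.

```rust
/// Drops a trailing '#' when it introduces an empty fragment. A trailing '#'
/// that is itself fragment content (as in "##") is kept. Hypothetical helper.
fn strip_empty_fragment(url: &str) -> &str {
    match url.strip_suffix('#') {
        Some(rest) if !rest.contains('#') => rest,
        _ => url,
    }
}
```

The guard matters for inputs like "##", where the second '#' belongs to the fragment and is handled by a separate encoding step.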

… ! etc. When determining whether to encode !, ', (, ), *, [, ] (Cat2 chars), only check the path portion (before '#') for non-ASCII bytes. A non-ASCII character in the fragment does not trigger Go's escape() fallback for the path, so the path chars should stay unencoded. Fixes fuzzing testcase: http_fuzzing_2729083127
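The path-only trigger can be sketched as below. This is a simplified, hypothetical model: it works on a single string and leaves the non-ASCII characters themselves untouched, whereas the PR describes checking the original input and post-processing the already-encoded output.

```rust
/// Encodes the Cat2 characters in the path only when the path portion
/// (before '#') contains non-ASCII. The fragment is copied unchanged.
/// Hypothetical helper; non-ASCII encoding is assumed to happen elsewhere.
fn encode_cat2_if_path_non_ascii(url: &str) -> String {
    let (path, fragment) = match url.find('#') {
        Some(i) => (&url[..i], &url[i..]),
        None => (url, ""),
    };
    if path.is_ascii() {
        // Non-ASCII in the fragment alone does not trigger the path fallback.
        return url.to_string();
    }
    let mut out = String::with_capacity(url.len());
    for c in path.chars() {
        match c {
            '!' | '\'' | '(' | ')' | '*' | '[' | ']' => {
                out.push_str(&format!("%{:02X}", c as u32));
            }
            _ => out.push(c),
        }
    }
    out.push_str(fragment);
    out
}
```

For "a!#ჸ" the path is pure ASCII, so '!' stays as-is even though the fragment contains a Georgian character.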

…eference For input '/ჸ', base.join('/ჸ') resolves to 'https://example.invalid/%E1%83%B8'. Stripping base_prefix='https://example.invalid/' (with trailing slash) drops the leading '/'. For inputs starting with '/', use the no-trailing-slash strip to preserve the leading '/' in the output. Fixes fuzzing testcase: http_fuzzing_slash_unicode

Go's shouldEscape('#', encodeFragment) returns true, so '#' within a fragment is encoded as '%23'. The url crate keeps it raw. For input '##', Go returns '#%23' (second '#' encoded). Pre-encode '#' in fragment content. Fixes fuzzing testcase: http_fuzzing_3710129001
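A minimal sketch of the pre-encoding step, assuming the first '#' is the fragment delimiter and every later '#' is fragment content. The function name is hypothetical.

```rust
/// Encodes every '#' after the first one as "%23", leaving the delimiter
/// itself in place. Hypothetical helper for illustration.
fn pre_encode_hash_in_fragment(url: &str) -> String {
    match url.split_once('#') {
        Some((before, fragment)) => {
            format!("{}#{}", before, fragment.replace('#', "%23"))
        }
        None => url.to_string(),
    }
}
```

Applied to "##", this produces "#%23", the value the commit message reports for Go.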

Go's url.Parse rejects control chars in the PATH (returning '?') but percent-encodes them in the FRAGMENT. Only check path portion (before '#') for control char rejection. Pre-encode control chars in the fragment before calling go_like_reference. Fixes fuzzing testcase: http_fuzzing_1009954227

encode_go_path_chars was operating on the whole URL string including fragment. Go's encodeFragment mode allows '!', '(', ')', '*' (shouldEscape returns false). Stop path-char encoding at '#' so the fragment portion is preserved unchanged. Fixes fuzzing testcase: http_fuzzing_hash_exclamation

Go's url.Parse rejects invalid percent-encoding sequences even in the fragment portion. Add the same check to the fragment handler. Fixes fuzzing testcase: http_fuzzing_578834728

Go's escape() for encodeFragment encodes these chars (they're in validEncoded's allowlist but not in shouldEscape's 'return false' cases). When non-ASCII chars trigger the escape() fallback, these also get encoded. Pre-encode them in the fragment when non-ASCII is detected. Fixes fuzzing testcase: http_fuzzing_3991369296

…CII fragment When the URL has non-ASCII chars in the fragment, Go's escape() also encodes cat2 chars (!, ', (, ), *, [, ]) in the fragment. Apply the same encoding to the result's fragment portion when the original URL's fragment had non-ASCII. Fixes fuzzing testcase: http_fuzzing_path_frag_quote

…n-ASCII present Go's shouldEscape for encodeFragment returns false for !, (, ), * explicitly, so those are NOT encoded even when escape() is triggered by non-ASCII. Only ', [, ] (in validEncoded's allowlist but not among shouldEscape's return-false cases) get encoded when non-ASCII chars trigger the escape() fallback.

…gits not path digits

…re-encode)

… to Cat2 for fragment

…riginal (not '?')

…(Go passthrough)

…) fallback

…uery string

…ue URIs
- Opaque URIs ending with bare '#' (e.g. "C:#") now strip the empty fragment to match Go's url.URL.String(), which omits it.
- When a URL has control chars in the fragment, also check the path for invalid percent-encoding before pre-encoding. Previously this branch returned early and skipped the path validity check, causing inputs like "ჸ#%\u{1}" to return a percent-encoded result instead of "?".

…uote Go's url.EscapedPath() calls escape() on the whole path whenever validEncoded() returns false. validEncoded() returns false for any char not in its explicit allowlist — including '\"' (double-quote). When escape() is called, it also encodes Category 2 chars (!, ', (, ), *). Add '\"' to the has_cat1 trigger check so that inputs containing '\"' in the path also get Category 2 encoding, matching Go's behavior.

Go's url.unescape validates that percent-encoded bytes in path/fragment form valid UTF-8 sequences. The Rust implementation only checked for syntactically invalid percent-encoding (wrong hex digit count), missing cases like %80 (a lone UTF-8 continuation byte), which Go rejects. Fix: collect consecutive percent-encoded bytes and validate with from_utf8.
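The fix described above (collect consecutive percent-encoded bytes, then validate with from_utf8) might look roughly like this. Both function names are hypothetical, and syntactically malformed escapes are simply treated as literal text here, on the assumption that a separate check rejects them.

```rust
/// Decodes one ASCII hex digit. Hypothetical helper.
fn hex_val(b: u8) -> Option<u8> {
    (b as char).to_digit(16).map(|d| d as u8)
}

/// Returns true when every run of consecutive %XX escapes decodes to valid
/// UTF-8. Sketch of the approach named in the commit message.
fn pct_runs_are_valid_utf8(s: &str) -> bool {
    let bytes = s.as_bytes();
    let mut run: Vec<u8> = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        let decoded = if bytes[i] == b'%' && i + 2 < bytes.len() {
            match (hex_val(bytes[i + 1]), hex_val(bytes[i + 2])) {
                (Some(hi), Some(lo)) => Some(hi * 16 + lo),
                _ => None,
            }
        } else {
            None
        };
        match decoded {
            Some(b) => {
                run.push(b);
                i += 3;
            }
            None => {
                // A literal byte ends the current run, so validate it now.
                if std::str::from_utf8(&run).is_err() {
                    return false;
                }
                run.clear();
                i += 1;
            }
        }
    }
    std::str::from_utf8(&run).is_ok()
}
```

A lone "%80" fails the check, while "%E1%83%B8" (the encoding of ჸ) passes.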

Two bugs fixed:
1. go_like_reference() dropped the fragment when stripping the query. Fix: after finding path_end (at '?'), extract the '#...' fragment and include it in the returned string.
2. obfuscate_url_string() returned '?' for '?#frag' inputs with remove_query_string=true, discarding the fragment entirely. Fix: when after_q starts with '#' (empty query + fragment), fall through to go_like_reference which encodes and preserves it.

When remove_query_string=true and the URL has both a query and a fragment (e.g. "?ჸ#ჸ"), the previous fix only handled "?#frag" (empty query). Extend the fix to any URL starting with '?' that contains a '#' fragment. Fall through to go_like_reference which strips the query and preserves the fragment with correct percent-encoding.

Go's url.URL.String() omits an empty trailing fragment (bare '#'). For query-only URL references like '?query#', the previous code returned the original string including the bare '#', while Go returns '?query'.

For URLs starting with '?' that have a fragment (e.g. '?#ჸ'), Go's url.URL.String() percent-encodes non-ASCII chars in the fragment via EscapeFragment. Also, Go omits an empty trailing fragment ('?#' → '?'). Handle these cases early before the 'restore original query' pass which would otherwise undo the encoding.

The restore-original-query pass was splicing &url[q_start..] which includes any trailing '#' (empty fragment), overriding the empty-fragment stripping done by go_like_reference. Now only restores up to '#', and appends the (already-encoded/stripped) fragment from go_like_reference.

Go's url.Parse decodes %XX sequences where the decoded byte is an unreserved char (A-Z, a-z, 0-9, -, ., _, ~) as part of path normalization. E.g. %30 → 0, %41 → A. The url crate preserves them as-is. Add normalize_pct_encoded_unreserved() and apply it in go_like_reference on the path portion of all returned values.
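The normalization described in this commit message can be sketched as follows. The name `normalize_pct_encoded_unreserved` appears in the commit message, but the body below is an assumption written for illustration; it decodes only escapes whose value is an unreserved character and leaves all other bytes as they are.

```rust
/// Decodes %XX escapes whose decoded byte is unreserved (A-Z, a-z, 0-9,
/// '-', '.', '_', '~'). Illustrative body; not copied from the PR.
fn normalize_pct_encoded_unreserved(path: &str) -> String {
    let bytes = path.as_bytes();
    let mut out: Vec<u8> = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'%' && i + 2 < bytes.len() {
            let hi = (bytes[i + 1] as char).to_digit(16);
            let lo = (bytes[i + 2] as char).to_digit(16);
            if let (Some(hi), Some(lo)) = (hi, lo) {
                let decoded = (hi * 16 + lo) as u8;
                if decoded.is_ascii_alphanumeric()
                    || matches!(decoded, b'-' | b'.' | b'_' | b'~')
                {
                    out.push(decoded);
                    i += 3;
                    continue;
                }
            }
        }
        out.push(bytes[i]);
        i += 1;
    }
    // Only ASCII escapes were replaced by ASCII bytes, so this stays valid UTF-8.
    String::from_utf8_lossy(&out).into_owned()
}
```

Reserved characters keep their escapes: "%2F" is not turned into "/", since that would change how the path splits into segments.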

Go's url.Parse first splits on '#', then parses the pre-fragment portion. If that portion starts with ':' (empty scheme), getScheme returns "missing protocol scheme" and ObfuscateURLString returns '?'. The Rust code had a check for ':' in the first path segment, but it was placed after the CTL-in-fragment pre-encode block which returned early, so inputs like ":#<ctrl>" bypassed the check. Move the colon check to before the CTL-in-fragment block so it fires regardless of what the fragment contains. Fixes parity for input ":#\u{1}" (http_fuzzing_4114246193).

Copilot

Pull request overview

This PR aims to reach parity with Go’s URL parsing/serialization behavior for obfuscation by adding Go-specific normalization, encoding, and edge-case handling found via fuzzing.

Changes:

Added Go-like URL reference parsing fallback and percent-encoding/normalization helpers.
Expanded obfuscation logic to mirror Go’s handling of control chars, fragments, opaque URIs, and query preservation.
Added many fuzz/regression tests for parity edge cases.


Copilot · 2026-03-06T16:12:04Z