feat(string): implement RopeString with thin vtable and lazy flattening#5006
feat(string): implement RopeString with thin vtable and lazy flattening#5006akash-R-A-J wants to merge 14 commits intoboa-dev:mainfrom
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #5006 +/- ##
===========================================
+ Coverage 47.24% 58.72% +11.47%
===========================================
Files 476 565 +89
Lines 46892 63124 +16232
===========================================
+ Hits 22154 37069 +14915
- Misses 24738 26055 +1317 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
core/string/src/vtable/rope.rs
Outdated
| if depth > 32 { | ||
| // Auto-flatten if we hit the depth limit, unless the string is "insanely" large. | ||
| // This bounds access time and recursion depth for other components. | ||
| if left.len() + right.len() < 1_000_000 { | ||
| let mut vec = Vec::with_capacity(left.len() + right.len()); | ||
| for s in [&left, &right] { | ||
| match s.variant() { | ||
| crate::JsStrVariant::Latin1(l) => { | ||
| vec.extend(l.iter().map(|&b| u16::from(b))); | ||
| } | ||
| crate::JsStrVariant::Utf16(u) => vec.extend_from_slice(u), | ||
| } | ||
| } | ||
| return JsString::from(&vec[..]); | ||
| } | ||
| } | ||
|
|
||
| let rope = Box::new(Self { | ||
| header: RawJsString { | ||
| vtable: &ROPE_VTABLE, | ||
| len: left.len() + right.len(), | ||
| refcount: 1, | ||
| kind: JsStringKind::Rope, | ||
| hash: 0, | ||
| }, | ||
| left, | ||
| right, | ||
| flattened: OnceCell::new(), | ||
| depth, | ||
| }); |
There was a problem hiding this comment.
You might need to rebalance the rope following some kind of heuristic. Otherwise, you kinda lose the time complexity of O(log N) to get an element, since one side could get arbitrarily large, which converts indexing into an O(N) operation.
There was a problem hiding this comment.
Yes, I've addressed rebalancing in two ways:
Batch Concatenations: The concat_strings_balanced helper uses a mid-point split to build a perfectly balanced tree (log N depth).
Incremental Concatenations: I've implemented a hard depth limit of 32 in RopeString::create. If this limit is exceeded, the rope auto-flattens into a contiguous string. This effectively bounds indexing to O(32), which is O(1) in practice, preventing the O(N) worst-case.
This keeps the implementation simple while providing a guaranteed bound on access time
There was a problem hiding this comment.
The concat_strings_balanced helper uses a mid-point split to build a perfectly balanced tree (log N depth).
That's not a balanced tree because you only consider the top strings, not if the strings themselves are ropes. If you pass an array of 3 strings, where 2 strings are ropes of depth 2 and 1 string is a rope of depth 28, you will have a very unbalanced tree.
This effectively bounds indexing to O(32), which is O(1) in practice, preventing the O(N) worst-case.
This is not O(N) in practice, because if you have thousands of strings of depth 32 and you need to index through all of them, you would pay the cost of O(N).
There was a problem hiding this comment.
The current implementation uses a depth cap to avoid pathological trees, but you're right that it doesn't guarantee logarithmic access in all cases.
Would it make sense to adopt a Fibonacci-based rebalancing heuristic (similar to classical rope implementations) so that we guarantee weight-balanced ropes instead of relying on flattening?
There was a problem hiding this comment.
Yep, that's pretty much the standard for ropes
core/string/src/lib.rs
Outdated
| // SAFETY: Caller must ensure the type matches. | ||
| unsafe { self.ptr.cast::<T>().as_ref() } | ||
| #[must_use] | ||
| pub fn as_str(&self) -> JsStr<'static> { |
There was a problem hiding this comment.
This is unsound. It's returning a 'static reference to a temporary JsString, so you could deallocate the original string and the compiler will happily let you access the JsStr<'static>:
let s = JsString::from_str("undefined behaviour!");
let temp = s.as_str();
drop(s)
println!("{}", temp.display_lossy()); // UB!There was a problem hiding this comment.
Thanks for pointing this out! you're absolutely right, that was a leftover from the initial pointer-based POC while focusing on the memory layout. I've now refactored the vtable to use HRTBs for as_str which replaces that temporary shim with a lifetime binding. It was my fault.
zhuzhu81998
left a comment
There was a problem hiding this comment.
ci shows segmentaion fault (potential cause below)
| let specifier = specifier.cow_replace('/', "\\"); | ||
|
|
||
| let short_path = Path::new(&specifier); | ||
| let short_path = Path::new(&*specifier); |
There was a problem hiding this comment.
what does this have to do with rope strings?
| pub(crate) mod rope; | ||
| pub(crate) use rope::RopeString; | ||
|
|
||
| /// Header for all `JsString` allocations. |
There was a problem hiding this comment.
Then call this JsStringHeader instead of RawJsString perhaps?
There was a problem hiding this comment.
sounds good, will replace RawJsString with the suggested JsStringHeader
core/string/src/vtable/mod.rs
Outdated
| unsafe impl Sync for RawJsString {} | ||
| // SAFETY: RawJsString contains only thread-safe data. | ||
| unsafe impl Send for RawJsString {} |
There was a problem hiding this comment.
do you need this somewhere?
core/string/src/vtable/mod.rs
Outdated
| // SAFETY: We only mutate refcount and hash via atomic-casts when kind != Static. | ||
| unsafe impl Sync for JsStringVTable {} | ||
| // SAFETY: JsStringVTable contains only thread-safe data. | ||
| unsafe impl Send for JsStringVTable {} |
There was a problem hiding this comment.
same question: where is the need?
| /// A rope string that is a tree of other strings. | ||
| #[repr(C)] | ||
| pub(crate) struct RopeString { | ||
| /// Standardized header for all strings. | ||
| pub(crate) header: RawJsString, | ||
| pub(crate) left: JsString, | ||
| pub(crate) right: JsString, | ||
| flattened: OnceCell<JsString>, | ||
| pub(crate) depth: u8, | ||
| } |
There was a problem hiding this comment.
hmmm so it is my understanding that JsString here will most likely GC-tracked. And there is no way for GC to know that RopeString is holdig reference to the left and right here?
are yo doing something to prevent this?
There was a problem hiding this comment.
To clarify: JSStrings are not GC-tracked, they're independently ref-counted, so it's fine to not trace through them. I'm fairly certain the UB comes from the modifications made to as_str that removed the lifetime inheritance for JsStr<'static>.
There was a problem hiding this comment.
indeed its not gc. disabling the gc does not resolve the segmentation fault XD.
6587b0c to
951985f
Compare
Test262 conformance changes
Broken tests (1):Tested main commit: |
core/string/src/vtable/rope.rs
Outdated
| pub(crate) header: JsStringHeader, | ||
| pub(crate) left: JsString, | ||
| pub(crate) right: JsString, | ||
| flattened: OnceLock<JsString>, |
There was a problem hiding this comment.
Uhh I doubt you need a thread safe OnceCell here. JsString implements !Send and !Sync, so it's kinda unnecessary to make this thread safe.
There was a problem hiding this comment.
Thanks! I've switched it to OnceCell to avoid any unnecessary thread-safe overhead.
There was a problem hiding this comment.
Reverting this back to OnceLock as OnceCell panics on reentrant initialization, and since rope trees can share nodes symmetrically, flattening may recursively call .as_str() on the same node, leading to reentrant initialization and test262 failures.
There was a problem hiding this comment.
Uhh if you have that behaviour that's a bug though... because if you have a children that recursively references itself, things like code_point_at could infinitely recurse.
There was a problem hiding this comment.
This is not a self reference cycle. Since JsString is refcounted, ropes can share subtrees and form DAGs. During flattening the same node may be reached through multiple paths before its cache is initialized, which triggers reentrant initialization in OnceCell.
There was a problem hiding this comment.
Then I don't see the reentrancy. If you have a children that is shared between two nodes, the first access would build the flat version, and the second access would just fetch the already constructed array, no?
But I think this is all null because your code already does a DFS so it should only call as_str for the leaves, which are never ropes.
There was a problem hiding this comment.
Ahh, you're right that the DFS traversal only calls as_str() on leaf strings, not ropes, so rope_as_str itself shouldn't recursively reenter. The earlier panic I observed with OnceCell likely came from another path (possibly inside concat_array). I'll investigate that path more closely.
|
@jedel1043 I believe all requested changes have now been addressed. CI is green across all platforms, and test262 results match Please let me know if anything else needs adjustment. |
c21f4cb to
b6134ad
Compare
| // Using a raw buffer cache instead of `JsString` in `OnceCell` solves the "refcount aliasing" | ||
| // problem. Storing a refcounted `JsString` inside a `OnceCell` could lead to ownership cycles | ||
| // or use-after-free errors if shared nodes were dropped while still cached. | ||
| // By storing raw `Box<[u8]>` or `Box<[u16]>`, we decouple the cache from the refcounting system. | ||
| flattened: OnceCell<Flattened>, |
There was a problem hiding this comment.
Sorry it should be Cell<Option<JsString>>
There was a problem hiding this comment.
Otherwise you won't be able to modify it 😅
There was a problem hiding this comment.
Ah wait that doesn't work, you need to have something referenceable
There was a problem hiding this comment.
Hmmm... what if we just make as_str return Option<JsStr>? That kinda makes more sense since flattening ropes should be reserved to only exceptional situations such as reaching the maximum depth.
There was a problem hiding this comment.
This also nicely reflects in our inner API that not all strings have a JsStr available to use, and the caller should be the one responsible for accessing the string in other ways if that's the case
There was a problem hiding this comment.
That makes sense from an API design perspective. However, changing as_str() to return Option<JsStr> would require a fairly large refactor across the engine since many built-ins and internal methods currently rely on it always returning a JsStr. Would it be okay if I explore that in a separate follow-up PR?
This Pull Request fixes/closes #5005 .
This PR introduces Rope strings to
boa_stringand integrates them into the engine’s concatenation pipeline to eliminate the pathological O(N²) behavior of repeated string concatenation.Key Changes
Rope Strings
RopeStringrepresentation (core/string/src/vtable/rope.rs)Engine Integration
JsValue::addnow creates ropes for large concatenations (len(lhs) + len(rhs) > 1024)ConcatToStringupdated to use balanced batch concatenationString.prototype.concatrefactored to use the new APIThin VTable & Performance Recovery
clone,drop,as_str,code_unit_at)u64hash to speed up property lookupsptr_eq+ hash checks to optimize equality comparisonsBenchmarks
V8 Combined Suite
≈ 1% improvement (within noise)
Concatenation Stress Test
≈ 2.7× faster
Summary
This change removes quadratic concatenation behavior while maintaining baseline performance for general workloads. The thin vtable and devirtualization ensure that ropes introduce minimal overhead when not used.