feat: Add glyph atlas cache with LRU eviction for vello spare strips renderers#548
feat: Add glyph atlas cache with LRU eviction for vello spare strips renderers#548
Conversation
61beb5f to
c299b2b
Compare
e7a0634 to
96e3aff
Compare
|
Some tests are failing due to minor differences, I skimmed through the Kompari HTML report, and they’re very likely caused by recent changes in Vello that slightly affect the rendered output. |
LaurenzV
left a comment
There was a problem hiding this comment.
Still a lot missing, but that's as far as I got today. :( The overall approach looks great to me though!
| } | ||
|
|
||
| /// Print all collected timings, split into one-time and per-frame groups. | ||
| pub fn print_summary(&self) { |
There was a problem hiding this comment.
Am I understanding correctly that we have a number of different frames which exercise different paths (different layouts, different caching behavior, etc.), and then we are summarizing the number of all frames across the different stages? If yes, isn't that a bit meaningless since each frame tests something different? So the numbers will just always vary wildly.
There was a problem hiding this comment.
The aggregated stats are intentional, they give an amortized view across the full cache lifecycle (cold, warm, partial overlap, etc.), which is useful for comparing overall cached vs uncached performance. Each phase also prints its own per-phase stats. This is a development/benchmarking aid and the mixed summary is helpful for getting a quick read on the overall picture. Happy to revisit the output format later if needed.
| /// Uses `queue.write_texture` to write transparent pixels to each clear rect, | ||
| /// preventing stale data from evicted glyphs from bleeding through when the | ||
| /// slot is reused on a subsequent frame. | ||
| fn clear_atlas_regions(queue: &wgpu::Queue, renderer: &Renderer, rects: &[PendingClearRect]) { |
There was a problem hiding this comment.
Wouldn't it be possible to clear all regions in one pass? Is this just for simplicity for now?
There was a problem hiding this comment.
Yes, this is just for simplicity for now. The proper approach would be to use vello_hybrid's GPU clear shader to clear all rects in a single render pass, but that requires exposing a batched clear API on vello_hybrid's Renderer. I'll do that as a follow-up in Vello. For now I've at least reused a single zeroed buffer across rects to avoid the per-rect allocation.
taj-p
left a comment
There was a problem hiding this comment.
Looking good! Just flushing my comments before I go into meetings
parley_draw/src/atlas/cache.rs
Outdated
| /// Take all pending bitmap uploads, leaving the internal queue empty. | ||
| fn take_pending_uploads(&mut self) -> Vec<PendingBitmapUpload>; | ||
|
|
||
| /// Take all pending atlas command recorders (one per dirty page), | ||
| /// leaving the internal collection empty. | ||
| fn take_pending_atlas_commands(&mut self) -> Vec<AtlasCommandRecorder>; | ||
|
|
||
| /// Take all pending clear rects, leaving the internal queue empty. | ||
| /// | ||
| /// Each rect describes an atlas region that was freed during | ||
| /// [`maintain`](GlyphCache::maintain) and must be zeroed to transparent. | ||
| /// Drain these **after** calling `maintain`. | ||
| fn take_pending_clear_rects(&mut self) -> Vec<PendingClearRect>; |
There was a problem hiding this comment.
take_pending_atlas_commands uses mem::take, which transfers ownership of the SmallVec and every inner Vec<AtlasCommand> to the caller. After replay the caller drops everything, so the heap allocations are lost every frame and re-allocated from scratch next frame.
Since consumers only need to iterate and drain the commands, we could avoid this by keeping the recorders in place and letting consumers borrow them. Something like:
pub fn replay_pending_atlas_commands(
&mut self,
mut f: impl FnMut(&mut AtlasCommandRecorder),
) {
for slot in &mut self.pending_atlas_commands {
if let Some(recorder) = slot.as_mut() {
if !recorder.commands.is_empty() {
f(recorder);
recorder.commands.clear();
}
}
}
}Consumers change from:
for mut recorder in glyph_caches.glyph_atlas.take_pending_atlas_commands() {
glyph_renderer.reset();
replay_atlas_commands(&mut recorder.commands, glyph_renderer);
// ...
}to
glyph_caches.glyph_atlas.replay_pending_atlas_commands(|recorder| {
glyph_renderer.reset();
replay_atlas_commands(&recorder.commands, glyph_renderer);
// ...
});The benefit of this approach is that the command list for each AtlasCommandRecorder isn't freed each frame. I think that's beneficial for operations like zoom, where we need to insert into the cache every frame (since resolved font size will presumably change each frame).
There was a problem hiding this comment.
Good idea! Switched to a replay_pending_atlas_commands callback API that keeps recorders in place and only clears their command Vecs after replay, preserving capacity across frames. For the CPU backend, added a replay_pending_atlas_commands_with_pixmaps helper that splits the borrow to give the closure access to both the recorder and the page pixmaps.
parley_draw/src/glyph.rs
Outdated
| /// and then render that into the actual scene, in a similar fashion to | ||
| /// bitmap glyphs. | ||
| pub struct ColorGlyph<'a> { | ||
| pub struct ColrGlyph<'a> { |
There was a problem hiding this comment.
Since we renamed the other similar structs to use the a "Glyph" prefix, should we do the same here?
There was a problem hiding this comment.
Oh yeah, of course! Renamed ColrGlyph → GlyphColr for consistency with the other structs.
| &mut self, | ||
| glyph: PreparedGlyph<'_>, | ||
| glyph_atlas: &mut C, | ||
| image_cache: &mut ImageCache, |
There was a problem hiding this comment.
Is the plan to (sometime in the future) move these vello_common dependencies (and the shelf allocator) into a separate crate so that Glifo doesn't depend on vello_common?
There was a problem hiding this comment.
Yes, the long-term plan is to extract shared types (color, basic geometry) and the allocator into their own crate(s) so glifo doesn't depend on vello_common directly. For now the coupling is pragmatic: those types are stable and we'd just be re-exporting the same things. Happy to track this as a follow-up issue if that would be helpful?
There was a problem hiding this comment.
Happy to track this as a follow-up issue if that would be helpful?
However you like 🙏 !
|
|
||
| /// Configuration for glyph cache eviction behavior. | ||
| #[derive(Clone, Debug)] | ||
| pub struct GlyphCacheConfig { |
There was a problem hiding this comment.
Is it possible to configure minimum font size (after which we use renderer sampling to interpolate smaller sizes) and a maximum size (after which we always directly render)?
There was a problem hiding this comment.
Added max_cached_font_size to GlyphCacheConfig (default 128 ppem) — glyphs above this threshold now bypass the atlas and render directly each frame.
For min_cached_font_size: this would involve caching at the clamped size and applying a downscale transform at composite time, plus adjustments to hinting, metrics, and subpixel offsets across all three glyph branches. I think it's a good idea but a bit too involved for this PR. I'll track it as a follow-up. Happy to hear what you think!
There was a problem hiding this comment.
I think it's a good idea but a bit too involved for this PR. I'll track it as a follow-up. Happy to hear what you think!
SGTM!
| image_cache: &mut ImageCache, | ||
| key: GlyphCacheKey, | ||
| raster_metrics: RasterMetrics, | ||
| ) -> Option<(u16, u16, AtlasSlot, &mut AtlasCommandRecorder)>; |
There was a problem hiding this comment.
It's unfortunate that this form of API necessitates a redundant hash map lookup on cache miss. I.e., on cache miss, we perform another hash map operation to insert (being unable to re-use the RawEntry or similar).
We could precompute the hash once and use something like HashMap::raw_entry_mut().from_key_hashed_nocheck(...) but let's maybe leave that as a follow up
There was a problem hiding this comment.
Agreed, the double lookup on miss is a known trade-off of the current get/insert split. I'll keep raw_entry_mut with precomputed hashes in mind as a follow-up optimisation. Thanks for flagging it.
| let cache_key = prepared_glyph.cache_key; | ||
| let transform = prepared_glyph.transform; | ||
| let tint_color = renderer.get_context_color(); | ||
|
|
||
| if let Some(ref key) = cache_key { | ||
| if let CacheResult::CachedAndRendered = render_outline_via_cache::<B>( | ||
| renderer, | ||
| &glyph.path, | ||
| transform, | ||
| key, | ||
| glyph_atlas, | ||
| image_cache, | ||
| tint_color, | ||
| ) { | ||
| return; | ||
| } |
There was a problem hiding this comment.
Isn't this the same code as for fill_glyphs? Where is the stroking happening here? 🤔
There was a problem hiding this comment.
Good catch, thanks! This is actually already like that on main — I just didn't notice. Split render_outline_directly into fill_outline_directly and stroke_outline_directly, with the stroke variant now correctly calling stroke_path.
parley_draw/src/glyph.rs
Outdated
| // Sub-pixel x offset is quantized into the cache key so that glyphs | ||
| // at different fractional positions get distinct cached bitmaps. | ||
| let cache_key = self.atlas_cache_enabled.then(|| { | ||
| let fractional_x = transform.translation().x.fract() as f32; |
There was a problem hiding this comment.
Now that we encode the fractional translation in the subpixel offset, do we not need to update the transform to discard the fraction from there? Also, why are y fractions not handled here?
There was a problem hiding this comment.
I think the fractional removal is handled in render_outline_glyph_from_atlas
There was a problem hiding this comment.
Good question! The fractional removal at render time is handled by render_outline_glyph_from_atlas which uses tx.floor()/ty.floor() — so the full transform is passed through but only the integer part is used for placement.
For y fractions: when hinting is enabled, the y translation is already rounded in calculate_outline_transform. When hinting is off, if I'm not missing something, vertical sub-pixel shifts are less perceptually impactful for horizontal text, and adding a y dimension would multiply cache entries. Added a comment to clarify this.
| // shifted to the glyph's slot within the page. Contrast with the hybrid | ||
| // backend, which gets per-allocation origins from the image cache. |
There was a problem hiding this comment.
Not sure I understand this, I thought vello_cpu and vello_hybrid work the same, where we have one (or multiple) huge atlas pages and glyphs are drawn into that at specific offsets. Why is there different behavior here?
There was a problem hiding this comment.
Both backends do use atlas pages with glyphs at specific offsets. The difference is in how the paint transform is set up when reading back from the atlas:
-
vello_cpu: The
ImageSourcereferences the entire page pixmap, so we need to shift the paint origin to the glyph's(x, y)position within the page. -
vello_hybrid: Vello's image cache resolves each allocation to its own origin, so we only need to compensate for the
GLYPH_PADDINGinset.
The underlying atlas structure is the same — it's just the coordinate system for the paint source that differs between the two rendering paths.
| vello_hybrid = { git = "https://github.com/linebender/vello.git", branch = "gemberg/glyph-cache-rect-allocator" } | ||
|
|
||
| [profile.profiling] | ||
| inherits = "release" |
There was a problem hiding this comment.
strip = "none" might also be worth adding!
There was a problem hiding this comment.
Added strip = "none" to make sure symbols are preserved for profiling.
| let padding = GLYPH_PADDING as f64; | ||
| Affine::translate((-padding, -padding)) |
There was a problem hiding this comment.
Do we not need glyph padding for vello_cpu?
There was a problem hiding this comment.
The vello_cpu doesn't need an explicit GLYPH_PADDING offset in paint_transform because atlas_slot.x / atlas_slot.y are already inset by GLYPH_PADDING from the allocation origin (see GlyphAtlas::insert_entry). So the translate to (-slot.x, -slot.y) already moves past the padding. The hybrid backend's image cache resolves to the allocation origin (before padding), so it needs the explicit (-padding, -padding) shift.
| /// A COLR glyph cached in the atlas. | ||
| /// The `Rect` parameter contains the fractional area dimensions | ||
| /// to preserve sub-pixel accuracy during rendering. | ||
| Colr(Rect), |
There was a problem hiding this comment.
Why do Colr and Outline have different behavior here? Because we don't use subpixel offsets here, since they are not as critical for emojis?
There was a problem hiding this comment.
Colr glyphs use fractional scaled_bbox dimensions for their area (computed from the glyph's actual bounding box), while outline glyphs use integer atlas_slot.width/height since they snap to pixel boundaries. The Rect carries these fractional bounds so the rendered quad matches the original glyph dimensions rather than the padded integer atlas allocation, which avoids subtle scaling artifacts.
| /// Outline and COLR glyph commands awaiting replay, indexed by atlas page. | ||
| /// Uses `SmallVec` with inline capacity of 1 because most applications use | ||
| /// a single atlas page; the common case avoids heap allocation entirely. | ||
| pending_atlas_commands: SmallVec<[Option<AtlasCommandRecorder>; 1]>, |
There was a problem hiding this comment.
Is there a way to configure a maximum value for this?
There was a problem hiding this comment.
Yes, the underlying AtlasConfig (from vello_common) already has a max_atlases field that caps the number of atlas pages. When the limit is reached, allocate() fails and the glyph is drawn directly without caching. This is configured on the ImageCache side rather than GlyphCacheConfig.
taj-p
left a comment
There was a problem hiding this comment.
Nice! This looks really good! Some comments
| hinted: false, | ||
| subpixel_x: 0, | ||
| context_color: BLACK, | ||
| var_coords: SmallVec::new(), |
There was a problem hiding this comment.
Depending on var_coords for the run, this could allocate every time. I think we either should consider:
- removing
var_coordsfrom the key and rely on the 2 level map structure (which encodes thevar_coordsanyway); or - we should use the
Equivalenttrait to create borrowed keys (shown below) and convert it to an owned key on cache misses only - seeparley/src/lru_cache.rsfor an example implementation
#[derive(Clone, Copy)]
pub struct GlyphCacheKeyRef<'a> {
pub font_id: u64,
pub font_index: u32,
pub glyph_id: u32,
pub size_bits: u32,
pub hinted: bool,
pub subpixel_x: u8,
pub context_color_packed: u32,
pub var_coords: &'a [NormalizedCoord],
}96e3aff to
d560c82
Compare
taj-p
left a comment
There was a problem hiding this comment.
LGTM 🎉 ! Let's goooo!!! I didn't re-read all the code line-by-line again, but skimmed through the comments and their resolutions. All seems absolutely BRILLIANT!
6c5bc19 to
12cc565
Compare
Introduces a glyph atlas cache that rasterizes glyphs once and reuses the bitmaps on subsequent frames, reducing redundant work for text-heavy scenes. The cache uses LRU age-based eviction, deterministic hashing for reproducible atlas packing, subpixel-quantized cache keys (supporting variable fonts and COLR glyphs), and a multi-page atlas backed by
vello_common::ImageCache.