
feat: Add glyph atlas cache with LRU eviction for vello sparse strips renderers #548

Merged

grebmeg merged 6 commits into main from gemberg/glyph-cache on Mar 3, 2026

Conversation

@grebmeg (Contributor) commented Feb 13, 2026

Introduces a glyph atlas cache that rasterizes glyphs once and reuses the bitmaps on subsequent frames, reducing redundant work for text-heavy scenes. The cache uses LRU age-based eviction, deterministic hashing for reproducible atlas packing, subpixel-quantized cache keys (supporting variable fonts and COLR glyphs), and a multi-page atlas backed by vello_common::ImageCache.
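The subpixel quantization mentioned above can be sketched roughly like this (the bin count and function name are assumptions for illustration, not the crate's actual API):

```rust
// Illustrative sketch only: quantize a fractional x offset into a small
// number of bins so that nearby positions share one cached bitmap while
// clearly distinct fractional positions get distinct cache entries.
const SUBPIXEL_BINS: u32 = 4; // assumed quantization level

fn quantize_subpixel_x(fractional_x: f32) -> u8 {
    // fractional_x is in [0, 1); map it to one of SUBPIXEL_BINS buckets.
    ((fractional_x * SUBPIXEL_BINS as f32) as u32 % SUBPIXEL_BINS) as u8
}
```

The quantized bucket would then be stored in the cache key alongside font id, glyph id, size, and variation coordinates.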

vello_cpu_render

@grebmeg force-pushed the gemberg/glyph-cache branch 2 times, most recently from 61beb5f to c299b2b on February 25, 2026
@grebmeg changed the title from "[WIP] feat: glyph cache" to "feat: Add glyph atlas cache with LRU eviction for vello sparse strips renderers" on Feb 25, 2026
@grebmeg force-pushed the gemberg/glyph-cache branch from e7a0634 to 96e3aff on February 26, 2026
@grebmeg (Contributor, Author) commented Feb 26, 2026

Some tests are failing due to minor differences. I skimmed through the Kompari HTML report, and they're very likely caused by recent changes in Vello that slightly affect the rendered output.

@grebmeg marked this pull request as ready for review on February 26, 2026
@LaurenzV (Contributor) left a comment:

Still a lot missing, but that's as far as I got today. :( The overall approach looks great to me though!

}

/// Print all collected timings, split into one-time and per-frame groups.
pub fn print_summary(&self) {
Contributor commented:

Am I understanding correctly that we have a number of different frames which exercise different paths (different layouts, different caching behavior, etc.), and then we are summarizing the number of all frames across the different stages? If yes, isn't that a bit meaningless since each frame tests something different? So the numbers will just always vary wildly.

grebmeg (Contributor, Author) replied:

The aggregated stats are intentional: they give an amortized view across the full cache lifecycle (cold, warm, partial overlap, etc.), which is useful for comparing overall cached vs uncached performance. Each phase also prints its own per-phase stats. This is a development/benchmarking aid, and the mixed summary is helpful for getting a quick read on the overall picture. Happy to revisit the output format later if needed.

/// Uses `queue.write_texture` to write transparent pixels to each clear rect,
/// preventing stale data from evicted glyphs from bleeding through when the
/// slot is reused on a subsequent frame.
fn clear_atlas_regions(queue: &wgpu::Queue, renderer: &Renderer, rects: &[PendingClearRect]) {
Contributor commented:

Wouldn't it be possible to clear all regions in one pass? Is this just for simplicity for now?

grebmeg (Contributor, Author) replied:

Yes, this is just for simplicity for now. The proper approach would be to use vello_hybrid's GPU clear shader to clear all rects in a single render pass, but that requires exposing a batched clear API on vello_hybrid's Renderer. I'll do that as a follow-up in Vello. For now I've at least reused a single zeroed buffer across rects to avoid the per-rect allocation.
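The "single zeroed buffer" reuse described above could look roughly like this (a hedged sketch: `ClearRect` and the writer callback stand in for the actual wgpu types and `queue.write_texture` call):

```rust
// Hypothetical sketch: clear each freed atlas rect by writing zeros,
// reusing one scratch buffer sized for the largest rect instead of
// allocating a fresh zeroed Vec per rect.
struct ClearRect {
    width: u32,
    height: u32,
}

fn clear_rects(
    rects: &[ClearRect],
    bytes_per_pixel: usize,
    mut write_region: impl FnMut(&ClearRect, &[u8]),
) {
    // One allocation, sized for the largest rect, shared by all writes.
    let max_bytes = rects
        .iter()
        .map(|r| (r.width as usize) * (r.height as usize) * bytes_per_pixel)
        .max()
        .unwrap_or(0);
    let zeroed = vec![0u8; max_bytes];
    for rect in rects {
        let len = (rect.width as usize) * (rect.height as usize) * bytes_per_pixel;
        // In the real code this would be a queue.write_texture call.
        write_region(rect, &zeroed[..len]);
    }
}
```

A batched GPU clear pass, as discussed, would replace this loop entirely.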

@taj-p (Contributor) left a comment:

Looking good! Just flushing my comments before I go into meetings

Comment on lines +94 to +106
/// Take all pending bitmap uploads, leaving the internal queue empty.
fn take_pending_uploads(&mut self) -> Vec<PendingBitmapUpload>;

/// Take all pending atlas command recorders (one per dirty page),
/// leaving the internal collection empty.
fn take_pending_atlas_commands(&mut self) -> Vec<AtlasCommandRecorder>;

/// Take all pending clear rects, leaving the internal queue empty.
///
/// Each rect describes an atlas region that was freed during
/// [`maintain`](GlyphCache::maintain) and must be zeroed to transparent.
/// Drain these **after** calling `maintain`.
fn take_pending_clear_rects(&mut self) -> Vec<PendingClearRect>;
Contributor commented:

take_pending_atlas_commands uses mem::take, which transfers ownership of the SmallVec and every inner Vec<AtlasCommand> to the caller. After replay the caller drops everything, so the heap allocations are lost every frame and re-allocated from scratch next frame.

Since consumers only need to iterate and drain the commands, we could avoid this by keeping the recorders in place and letting consumers borrow them. Something like:

pub fn replay_pending_atlas_commands(
    &mut self,
    mut f: impl FnMut(&mut AtlasCommandRecorder),
) {
    for slot in &mut self.pending_atlas_commands {
        if let Some(recorder) = slot.as_mut() {
            if !recorder.commands.is_empty() {
                f(recorder);
                recorder.commands.clear();
            }
        }
    }
}

Consumers change from:

for mut recorder in glyph_caches.glyph_atlas.take_pending_atlas_commands() {
    glyph_renderer.reset();
    replay_atlas_commands(&mut recorder.commands, glyph_renderer);
    // ...
}

to

glyph_caches.glyph_atlas.replay_pending_atlas_commands(|recorder| {
    glyph_renderer.reset();
    replay_atlas_commands(&recorder.commands, glyph_renderer);
    // ...
});

The benefit of this approach is that the command list for each AtlasCommandRecorder isn't freed each frame. I think that's beneficial for operations like zoom, where we need to insert into the cache every frame (since resolved font size will presumably change each frame).

grebmeg (Contributor, Author) replied:

Good idea! Switched to a replay_pending_atlas_commands callback API that keeps recorders in place and only clears their command Vecs after replay, preserving capacity across frames. For the CPU backend, added a replay_pending_atlas_commands_with_pixmaps helper that splits the borrow to give the closure access to both the recorder and the page pixmaps.

/// and then render that into the actual scene, in a similar fashion to
/// bitmap glyphs.
pub struct ColorGlyph<'a> {
pub struct ColrGlyph<'a> {
Contributor commented:

Since we renamed the other similar structs to use a "Glyph" prefix, should we do the same here?

grebmeg (Contributor, Author) replied:

Oh yeah, of course! Renamed ColrGlyph → GlyphColr for consistency with the other structs.

&mut self,
glyph: PreparedGlyph<'_>,
glyph_atlas: &mut C,
image_cache: &mut ImageCache,
Contributor commented:

Is the plan to (sometime in the future) move these vello_common dependencies (and the shelf allocator) into a separate crate so that Glifo doesn't depend on vello_common?

grebmeg (Contributor, Author) replied:

Yes, the long-term plan is to extract shared types (color, basic geometry) and the allocator into their own crate(s) so glifo doesn't depend on vello_common directly. For now the coupling is pragmatic: those types are stable and we'd just be re-exporting the same things. Happy to track this as a follow-up issue if that would be helpful?

Contributor replied:

> Happy to track this as a follow-up issue if that would be helpful?

However you like 🙏 !


/// Configuration for glyph cache eviction behavior.
#[derive(Clone, Debug)]
pub struct GlyphCacheConfig {
Contributor commented:

Is it possible to configure minimum font size (after which we use renderer sampling to interpolate smaller sizes) and a maximum size (after which we always directly render)?

grebmeg (Contributor, Author) replied:

Added max_cached_font_size to GlyphCacheConfig (default 128 ppem) — glyphs above this threshold now bypass the atlas and render directly each frame.

For min_cached_font_size: this would involve caching at the clamped size and applying a downscale transform at composite time, plus adjustments to hinting, metrics, and subpixel offsets across all three glyph branches. I think it's a good idea but a bit too involved for this PR. I'll track it as a follow-up. Happy to hear what you think!
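The max_cached_font_size gate described above can be sketched like this (a minimal sketch, assuming the field name from the discussion and the 128 ppem default; the real struct carries more fields):

```rust
// Sketch of the size gate: glyphs larger than `max_cached_font_size`
// bypass the atlas and render directly each frame.
struct GlyphCacheConfig {
    max_cached_font_size: f32,
}

impl Default for GlyphCacheConfig {
    fn default() -> Self {
        // Default of 128 ppem, per the discussion above.
        Self { max_cached_font_size: 128.0 }
    }
}

fn should_cache(ppem: f32, config: &GlyphCacheConfig) -> bool {
    ppem <= config.max_cached_font_size
}
```

A future min_cached_font_size would add a lower bound here plus a downscale transform at composite time, as noted.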

Contributor replied:

> I think it's a good idea but a bit too involved for this PR. I'll track it as a follow-up. Happy to hear what you think!

SGTM!

image_cache: &mut ImageCache,
key: GlyphCacheKey,
raster_metrics: RasterMetrics,
) -> Option<(u16, u16, AtlasSlot, &mut AtlasCommandRecorder)>;
Contributor commented:

It's unfortunate that this form of API necessitates a redundant hash map lookup on cache miss. I.e., on cache miss, we perform another hash map operation to insert (being unable to re-use the RawEntry or similar).

We could precompute the hash once and use something like HashMap::raw_entry_mut().from_key_hashed_nocheck(...) but let's maybe leave that as a follow up

grebmeg (Contributor, Author) replied:

Agreed, the double lookup on miss is a known trade-off of the current get/insert split. I'll keep raw_entry_mut with precomputed hashes in mind as a follow-up optimisation. Thanks for flagging it.

Comment on lines +213 to +228
let cache_key = prepared_glyph.cache_key;
let transform = prepared_glyph.transform;
let tint_color = renderer.get_context_color();

if let Some(ref key) = cache_key {
if let CacheResult::CachedAndRendered = render_outline_via_cache::<B>(
renderer,
&glyph.path,
transform,
key,
glyph_atlas,
image_cache,
tint_color,
) {
return;
}
Contributor commented:

Isn't this the same code as for fill_glyphs? Where is the stroking happening here? 🤔

grebmeg (Contributor, Author) replied:

Good catch, thanks! This is actually already like that on main — I just didn't notice. Split render_outline_directly into fill_outline_directly and stroke_outline_directly, with the stroke variant now correctly calling stroke_path.

// Sub-pixel x offset is quantized into the cache key so that glyphs
// at different fractional positions get distinct cached bitmaps.
let cache_key = self.atlas_cache_enabled.then(|| {
let fractional_x = transform.translation().x.fract() as f32;
Contributor commented:

Now that we encode the fractional translation in the subpixel offset, do we not need to update the transform to discard the fraction from there? Also, why are y fractions not handled here?

Contributor replied:

I think the fractional removal is handled in render_outline_glyph_from_atlas

grebmeg (Contributor, Author) replied:

Good question! The fractional removal at render time is handled by render_outline_glyph_from_atlas which uses tx.floor()/ty.floor() — so the full transform is passed through but only the integer part is used for placement.

For y fractions: when hinting is enabled, the y translation is already rounded in calculate_outline_transform. When hinting is off, if I'm not missing something, vertical sub-pixel shifts are less perceptually impactful for horizontal text, and adding a y dimension would multiply cache entries. Added a comment to clarify this.
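The placement math described above can be sketched as follows (function name assumed; the real code lives in render_outline_glyph_from_atlas):

```rust
// Sketch: the fractional x goes into the cache key at rasterization time,
// while only the integer parts of the translation position the cached
// bitmap, so the full transform can be passed through unmodified.
fn atlas_placement(tx: f64, ty: f64) -> (f64, f64, f32) {
    let fractional_x = tx.fract() as f32; // quantized into the cache key
    (tx.floor(), ty.floor(), fractional_x) // floor() used for placement
}
```

Note that y carries no fractional component in the key: with hinting it is already rounded, and without hinting the vertical shift is deemed less perceptually impactful, per the comment above.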

Comment on lines +419 to +420
// shifted to the glyph's slot within the page. Contrast with the hybrid
// backend, which gets per-allocation origins from the image cache.
Contributor commented:

Not sure I understand this, I thought vello_cpu and vello_hybrid work the same, where we have one (or multiple) huge atlas pages and glyphs are drawn into that at specific offsets. Why is there different behavior here?

grebmeg (Contributor, Author) replied:

Both backends do use atlas pages with glyphs at specific offsets. The difference is in how the paint transform is set up when reading back from the atlas:

  • vello_cpu: The ImageSource references the entire page pixmap, so we need to shift the paint origin to the glyph's (x, y) position within the page.

  • vello_hybrid: Vello's image cache resolves each allocation to its own origin, so we only need to compensate for the GLYPH_PADDING inset.

The underlying atlas structure is the same — it's just the coordinate system for the paint source that differs between the two rendering paths.

vello_hybrid = { git = "https://github.com/linebender/vello.git", branch = "gemberg/glyph-cache-rect-allocator" }

[profile.profiling]
inherits = "release"
Contributor commented:

strip = "none" might also be worth adding!

grebmeg (Contributor, Author) replied:

Added strip = "none" to make sure symbols are preserved for profiling.

Comment on lines +279 to +280
let padding = GLYPH_PADDING as f64;
Affine::translate((-padding, -padding))
Contributor commented:

Do we not need glyph padding for vello_cpu?

grebmeg (Contributor, Author) replied:

vello_cpu doesn't need an explicit GLYPH_PADDING offset in paint_transform because atlas_slot.x / atlas_slot.y are already inset by GLYPH_PADDING from the allocation origin (see GlyphAtlas::insert_entry). So the translate to (-slot.x, -slot.y) already moves past the padding. The hybrid backend's image cache resolves to the allocation origin (before padding), so it needs the explicit (-padding, -padding) shift.
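The two origin conventions can be summarized in a small sketch (GLYPH_PADDING's value and the function names here are assumptions for illustration):

```rust
// Assumed padding width; the real constant lives in the crate.
const GLYPH_PADDING: f64 = 1.0;

// vello_cpu: atlas_slot coordinates are already inset by GLYPH_PADDING,
// so translating by the negated slot position lands on the glyph bitmap.
fn cpu_paint_translation(slot_x: f64, slot_y: f64) -> (f64, f64) {
    (-slot_x, -slot_y)
}

// vello_hybrid: the image cache resolves to the allocation origin
// (before padding), so the padding inset must be applied explicitly.
fn hybrid_paint_translation() -> (f64, f64) {
    (-GLYPH_PADDING, -GLYPH_PADDING)
}
```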

Comment on lines +77 to +80
/// A COLR glyph cached in the atlas.
/// The `Rect` parameter contains the fractional area dimensions
/// to preserve sub-pixel accuracy during rendering.
Colr(Rect),
Contributor commented:

Why do Colr and Outline have different behavior here? Because we don't use subpixel offsets here, since they are not as critical for emojis?

grebmeg (Contributor, Author) replied:

Colr glyphs use fractional scaled_bbox dimensions for their area (computed from the glyph's actual bounding box), while outline glyphs use integer atlas_slot.width/height since they snap to pixel boundaries. The Rect carries these fractional bounds so the rendered quad matches the original glyph dimensions rather than the padded integer atlas allocation, which avoids subtle scaling artifacts.

/// Outline and COLR glyph commands awaiting replay, indexed by atlas page.
/// Uses `SmallVec` with inline capacity of 1 because most applications use
/// a single atlas page; the common case avoids heap allocation entirely.
pending_atlas_commands: SmallVec<[Option<AtlasCommandRecorder>; 1]>,
Contributor commented:

Is there a way to configure a maximum value for this?

grebmeg (Contributor, Author) replied:

Yes, the underlying AtlasConfig (from vello_common) already has a max_atlases field that caps the number of atlas pages. When the limit is reached, allocate() fails and the glyph is drawn directly without caching. This is configured on the ImageCache side rather than GlyphCacheConfig.

@taj-p (Contributor) left a comment:

Nice! This looks really good! Some comments

hinted: false,
subpixel_x: 0,
context_color: BLACK,
var_coords: SmallVec::new(),
Contributor commented:

Depending on var_coords for the run, this could allocate every time. I think we either should consider:

  1. removing var_coords from the key and rely on the 2 level map structure (which encodes the var_coords anyway); or
  2. we should use the Equivalent trait to create borrowed keys (shown below) and convert it to an owned key on cache misses only - see parley/src/lru_cache.rs for an example implementation
#[derive(Clone, Copy)]
pub struct GlyphCacheKeyRef<'a> {
    pub font_id: u64,
    pub font_index: u32,
    pub glyph_id: u32,
    pub size_bits: u32,
    pub hinted: bool,
    pub subpixel_x: u8,
    pub context_color_packed: u32,
    pub var_coords: &'a [NormalizedCoord],
}
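Option 1 above could look roughly like this (a sketch with simplified stand-in types; the real cache carries more key fields and stores atlas slots rather than plain values):

```rust
use std::collections::HashMap;

// Stand-in for the font's normalized variation coordinate type.
type NormalizedCoord = i16;

// Per-glyph key with `var_coords` removed: small, Copy-friendly, and
// hashable without any allocation.
#[derive(Clone, PartialEq, Eq, Hash)]
struct GlyphKey {
    glyph_id: u32,
    size_bits: u32,
    subpixel_x: u8,
}

// Outer map keyed by the variation coords; inner map keyed per glyph.
#[derive(Default)]
struct TwoLevelCache {
    by_coords: HashMap<Vec<NormalizedCoord>, HashMap<GlyphKey, u64>>,
}

impl TwoLevelCache {
    // Lookup borrows the coords as `&[NormalizedCoord]` (via
    // Borrow<[T]> on Vec<T>), so the hot path performs no allocation.
    fn get(&self, coords: &[NormalizedCoord], key: &GlyphKey) -> Option<u64> {
        self.by_coords.get(coords)?.get(key).copied()
    }

    fn insert(&mut self, coords: &[NormalizedCoord], key: GlyphKey, value: u64) {
        self.by_coords
            .entry(coords.to_vec()) // allocation only on the miss path
            .or_default()
            .insert(key, value);
    }
}
```

Option 2 (a borrowed key via the Equivalent trait, as in parley's LRU cache) achieves the same goal with a flat map and a precomputed hash.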

@grebmeg force-pushed the gemberg/glyph-cache branch from 96e3aff to d560c82 on March 2, 2026
@grebmeg requested review from LaurenzV and taj-p on March 2, 2026
@LaurenzV (Contributor) left a comment:

I only took a very coarse second look because it would take ages to review again carefully, but I don't think I would have much to add in addition to my first review. So LGTM if LGTM to @taj-p 😄 Awesome work!

@taj-p (Contributor) left a comment:

LGTM 🎉 ! Let's goooo!!! I didn't re-read all the code line-by-line again, but skimmed through the comments and their resolutions. All seems absolutely BRILLIANT!

@grebmeg force-pushed the gemberg/glyph-cache branch from 6c5bc19 to 12cc565 on March 3, 2026
@grebmeg added this pull request to the merge queue on Mar 3, 2026
Merged via the queue into main with commit b384946 on Mar 3, 2026
24 checks passed
@grebmeg deleted the gemberg/glyph-cache branch on March 3, 2026