Share more data between HarfRust and Parley #561

@conor-93

Currently, Parley uses ICU for two things: Unicode character information, which drives general text analysis, and codepoint normalization, which drives font selection. We use HarfRust for text shaping, and it ships its own copy of this character and normalization data, so the data is included twice. It looks like sharing it would reduce the binary size and memory usage of Parley's core functionality by ~75kB (~40kB from normalization data, ~35kB from Unicode codepoint data).

Normalization data
For normalization, we would need to ask whether HarfRust can expose the relevant symbols, i.e. in lib.rs:

pub use hb::unicode::{compose, decompose, Codepoint};

ICU normalization data currently costs ~40kB of compiled size:

  • normalizer_nfd_data_v1.rs.data: 28,208 B
  • normalizer_nfd_supplement_v1.rs.data: 4,486 B
  • normalizer_nfd_tables_v1.rs.data: 2,177 B
  • normalizer_nfc_v1.rs.data: 5,332 B
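
As a quick sanity check on the ~40kB figure, the four file sizes listed above can be summed. This is only arithmetic on the numbers quoted in this issue; the constant and function names are made up for illustration.

```rust
/// Sizes in bytes of the ICU normalization data files, as quoted above.
/// (Names of the constant and function are illustrative only.)
const NORMALIZATION_DATA_SIZES: [(&str, usize); 4] = [
    ("normalizer_nfd_data_v1.rs.data", 28_208),
    ("normalizer_nfd_supplement_v1.rs.data", 4_486),
    ("normalizer_nfd_tables_v1.rs.data", 2_177),
    ("normalizer_nfc_v1.rs.data", 5_332),
];

/// Total size in bytes of all listed files.
fn total_normalization_bytes() -> usize {
    NORMALIZATION_DATA_SIZES.iter().map(|&(_, size)| size).sum()
}
```

The total comes to 40,203 B, which matches the ~40kB estimate.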

Unicode codepoint data
Parley's ICU-derived Unicode codepoint information costs ~70kB of compiled size.

Not all of the information we pack into our trie table exists in HarfRust, so we would still need to maintain this table. We could, however, shrink its backing data from u32 to u16, saving ~35kB of compiled size.

Script, GeneralCategory and is_variation_selector are clearly supported. is_regional_indicator is not clearly marked in HarfRust, but is supported; it's solved with a simple range check (see also: original HarfBuzz). We could just copy the range check, though it'd be nicer to defer to HarfRust for this.
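
For reference, the range check in question is small enough to sketch here. This is a standalone illustration, not HarfRust's actual code: it relies only on the fact that the Regional Indicator symbols occupy U+1F1E6 through U+1F1FF, and the function is a local stand-in.

```rust
/// Returns true if `c` is a Regional Indicator symbol.
/// Regional Indicator symbols occupy U+1F1E6..=U+1F1FF (26 codepoints, A to Z).
/// Local stand-in for illustration; not an existing HarfRust or Parley function.
fn is_regional_indicator(c: char) -> bool {
    matches!(c, '\u{1F1E6}'..='\u{1F1FF}')
}
```

Copying this into Parley would remove one bit from the table cell without any HarfRust API change, at the cost of duplicating the check.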

If we use HarfRust for the rest, each cell still needs 12 bits (5 + 5 + 1 + 1), so a u16:

    //const SCRIPT_BITS: u32 = 8; // In HarfRust
    //const GC_BITS: u32 = 5; // In HarfRust
    const GCB_BITS: u32 = 5;
    const BIDI_BITS: u32 = 5;
    const IS_EMOJI_OR_PICTOGRAPH_BITS: u32 = 1;
    //const IS_VARIATION_SELECTOR_BITS: u32 = 1; // In HarfRust
    //const IS_REGION_INDICATOR_BITS: u32 = 1; // In HarfRust
    const IS_MANDATORY_LINE_BREAK_BITS: u32 = 1;

If we could retrieve just one of GraphemeClusterBreak or BidiClass from HarfRust, we could further reduce the cell size to u8 (12 - 5 = 7 bits), but I don't see these available there, so this will probably never be possible.
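
To make the cell-size reasoning concrete, here is a minimal sketch of packing the four remaining fields into a u16. The bit widths come from the constants above; the field order, the `PackedProps` type and its accessors are assumptions for illustration and do not reflect Parley's actual layout.

```rust
const GCB_BITS: u32 = 5;
const BIDI_BITS: u32 = 5;
const IS_EMOJI_OR_PICTOGRAPH_BITS: u32 = 1;
const IS_MANDATORY_LINE_BREAK_BITS: u32 = 1;

const TOTAL_BITS: u32 =
    GCB_BITS + BIDI_BITS + IS_EMOJI_OR_PICTOGRAPH_BITS + IS_MANDATORY_LINE_BREAK_BITS;

// Hypothetical layout: fields are placed from the lowest bit upwards.
const GCB_SHIFT: u32 = 0;
const BIDI_SHIFT: u32 = GCB_SHIFT + GCB_BITS;
const EMOJI_SHIFT: u32 = BIDI_SHIFT + BIDI_BITS;
const LINE_BREAK_SHIFT: u32 = EMOJI_SHIFT + IS_EMOJI_OR_PICTOGRAPH_BITS;

// Compile-time check that the remaining fields fit in a u16 cell.
const _: () = assert!(TOTAL_BITS <= u16::BITS);

/// Mask with the lowest `bits` bits set.
const fn mask(bits: u32) -> u16 {
    (1u16 << bits) - 1
}

/// One table cell holding the properties HarfRust does not provide.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct PackedProps(u16);

impl PackedProps {
    fn new(gcb: u8, bidi: u8, is_emoji: bool, is_mandatory_line_break: bool) -> Self {
        // Raw enum values must fit in their 5-bit fields.
        debug_assert!(u32::from(gcb) < (1 << GCB_BITS));
        debug_assert!(u32::from(bidi) < (1 << BIDI_BITS));
        Self(
            (u16::from(gcb) << GCB_SHIFT)
                | (u16::from(bidi) << BIDI_SHIFT)
                | (u16::from(is_emoji) << EMOJI_SHIFT)
                | (u16::from(is_mandatory_line_break) << LINE_BREAK_SHIFT),
        )
    }

    fn gcb(self) -> u8 {
        ((self.0 >> GCB_SHIFT) & mask(GCB_BITS)) as u8
    }

    fn bidi(self) -> u8 {
        ((self.0 >> BIDI_SHIFT) & mask(BIDI_BITS)) as u8
    }

    fn is_emoji_or_pictograph(self) -> bool {
        (self.0 >> EMOJI_SHIFT) & 1 != 0
    }

    fn is_mandatory_line_break(self) -> bool {
        (self.0 >> LINE_BREAK_SHIFT) & 1 != 0
    }
}
```

With this layout, dropping either 5-bit field would leave 7 bits, which is why sourcing GraphemeClusterBreak or BidiClass elsewhere would allow a u8 cell.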

Addendum: Beyond this, we could eventually use packTab for the remaining 12 bits, to see if it condenses the remaining 35kB further and/or improves access latency. This is lower priority and should probably be separate work from the reuse work proposed here.

Labels: enhancement (New feature or request)
