GB18030-2005, Unicode 4.1, and "web variant decoder" #16

@Artoria2e5

The current data for GB18030's two-byte (DBCS) subset, aka "GBK":

  • should be updated with the U+E7C7/U+1E3F swap introduced in GB18030-2005,
  • may use the 24 Unicode 4.1 mappings instead of PUA code points for decoder output. This can be implemented as an extra converter filter.
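A rough sketch of what such a filter could look like, in Python. It assumes the existing decoder emits GB18030-2000 output, so exchanging the two code points afterwards gives the 2005 assignment. The names `GB18030_2005_SWAP` and `apply_filter` are made up for illustration, and the table of 24 Unicode 4.1 mappings is deliberately not spelled out here; it would be passed in as a second table in the same form.

```python
# Hypothetical post-decode filter: remaps individual code points in decoder output.
# Assumption: the underlying decoder follows GB18030-2000.

# GB18030-2005 exchanges the assignments of U+E7C7 and U+1E3F.
GB18030_2005_SWAP = {0xE7C7: 0x1E3F, 0x1E3F: 0xE7C7}


def apply_filter(text: str, *tables: dict) -> str:
    """Apply one or more code point -> code point tables, in order."""
    for table in tables:
        # str.translate maps each character independently, so a two-way
        # swap inside a single table is safe.
        text = text.translate(table)
    return text


if __name__ == "__main__":
    decoded = "\ue7c7"  # what a 2000-style decoder is assumed to emit
    print(hex(ord(apply_filter(decoded, GB18030_2005_SWAP))))
```

Keeping each remapping as its own table means the 2005 swap and the optional PUA replacement can be switched on independently.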

The offset table should be replaced with one that conforms to GB18030-2005, like WHATWG's. The current offset table omits a lot of tiny gaps, like the one at U+00A5, so you get off-by-a-few errors. (Or you can "do it the glibc way" for speed: treat only the bigger continuous blocks as linear and hardcode the smaller fragments. These holes are annoying indeed.)
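To make the off-by-a-few problem concrete, here is a minimal sketch of a ranges-style lookup in Python. The pointer formula is the usual linear numbering of four-byte sequences. `RANGES_EXCERPT` holds only the first three ranges and is not the full table, so the sketch refuses pointers past the excerpt. The function names are illustrative.

```python
from bisect import bisect_right

# (pointer offset, first code point) pairs; excerpt only, NOT the full table.
# Each tiny range ends where code points already covered by the two-byte
# (GBK) part interrupt the sequence, e.g. U+00A4 sits between the first two.
RANGES_EXCERPT = [(0, 0x0080), (36, 0x00A5), (38, 0x00A9)]
EXCERPT_END = 45  # the next range starts here; the excerpt stops before it


def four_byte_pointer(b1: int, b2: int, b3: int, b4: int) -> int:
    """Linear index of a four-byte sequence (bytes assumed already validated)."""
    return (b1 - 0x81) * 12600 + (b2 - 0x30) * 1260 + (b3 - 0x81) * 10 + (b4 - 0x30)


def pointer_to_code_point(pointer: int) -> int:
    if not 0 <= pointer < EXCERPT_END:
        raise ValueError("pointer outside the excerpted ranges")
    offsets = [offset for offset, _ in RANGES_EXCERPT]
    # Find the last range whose offset is <= pointer.
    offset, first_cp = RANGES_EXCERPT[bisect_right(offsets, pointer) - 1]
    return first_cp + (pointer - offset)


if __name__ == "__main__":
    ptr = four_byte_pointer(0x81, 0x30, 0x84, 0x36)
    print(ptr, hex(pointer_to_code_point(ptr)))
```

If the `(36, 0x00A5)` entry were missing, pointer 36 would resolve to U+00A4 instead of U+00A5, and every later code point in that stretch would be shifted by one.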

A "web gb18030" decoder, implemented as GB18030-WEB, that tolerates GBK's 0x80 → U+20AC would be helpful too.
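For illustration only, the tolerant behaviour could be sketched in Python as a thin wrapper around the built-in `gb18030` codec. The function name `decode_gb18030_web` is hypothetical, and the per-sequence delegation is chosen for clarity, not speed. The one subtle point is that 0x80 is also a valid trail byte in two-byte sequences, so the wrapper has to track sequence boundaries rather than blindly replace every 0x80.

```python
def decode_gb18030_web(data: bytes, errors: str = "strict") -> str:
    """Decode GB18030, additionally accepting a lone 0x80 as U+20AC."""
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b == 0x80:
            # 0x80 in lead position: the GBK-style euro sign.
            out.append("\u20ac")
            i += 1
            continue
        if b < 0x80 or b == 0xFF:
            length = 1  # ASCII, or an invalid byte left to the codec
        elif i + 1 < n and 0x30 <= data[i + 1] <= 0x39:
            length = 4  # lead byte followed by a digit: four-byte form
        else:
            length = 2  # ordinary two-byte form
        # Delegate the actual mapping to the standard codec.
        out.append(data[i:i + length].decode("gb18030", errors))
        i += length
    return "".join(out)


if __name__ == "__main__":
    print(decode_gb18030_web(b"A\x80B"))
```

With `errors="strict"`, malformed or truncated sequences still raise `UnicodeDecodeError` from the underlying codec; only the lone 0x80 case is relaxed.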