Conversation
DHowett left a comment:
How do you feel about non-canonical overlong encodings (which is a problem UTF-8 also suffers from)? That would be something like encoding 0x01 as 0x17 0x00 0x00 0x00, if I have parsed your description correctly.
Yeah, I thought about that. SQLite's varint, for instance, doesn't support overlong encodings and has a more efficient encoding. I intentionally decided against that, for one because decoding becomes faster, and also because non-canonical encodings are quite beneficial: When an LSH instruction jumps further down into the instruction stream, the address offset depends on the number of bytes in between. That number depends on the encoded size of all the varints in between, and those in turn could be downward jumps which, again, depend on the encoded size of other varints. I'm sure I'll come up with a solution to this recursive problem at some point, if I want to. But I'm fairly certain that having non-canonical encodings will allow for easy "tie breakers" for any such algorithm.
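For reference, a sketch of what a decoder tolerant of overlong encodings could look like. The scheme isn't spelled out in this thread, so this is a reconstruction from the `0x17 0x00 0x00 0x00` example: the count of trailing 1 bits in the first byte gives the number of extra bytes, a 0 bit terminates the tag, and the remaining bits (little-endian) form the value. The `decode` name is hypothetical, and the `u32::MAX` special case is deliberately omitted here.

```rust
/// Decodes one varint from `buf`, returning the value and the number of
/// bytes consumed, or `None` if the buffer is too short.
///
/// Assumed scheme (reconstructed from the example in this thread): the
/// count of trailing 1 bits in the first byte is the number of extra
/// bytes, followed by a terminating 0 bit, followed by the payload bits;
/// the extra bytes contribute the high payload bits (little-endian).
/// Overlong encodings decode just fine: no minimality check is done.
fn decode(buf: &[u8]) -> Option<(u32, usize)> {
    let first = *buf.first()?;
    // At most 3 extra bytes (4 bytes total); u32::MAX special-casing omitted.
    let extra = (first.trailing_ones() as usize).min(3);
    let rest = buf.get(1..1 + extra)?;
    let mut raw = first as u32;
    for (i, &b) in rest.iter().enumerate() {
        raw |= (b as u32) << (8 * (i + 1));
    }
    // Shift out the tag: `extra` 1 bits plus the terminating 0 bit.
    Some((raw >> (extra + 1), 1 + extra))
}

fn main() {
    // Canonical one-byte encoding of 1: payload 1 shifted past the 0 tag bit.
    assert_eq!(decode(&[0x02]), Some((1, 1)));
    // The overlong four-byte encoding of 1 from the comment decodes identically.
    assert_eq!(decode(&[0x17, 0x00, 0x00, 0x00]), Some((1, 4)));
}
```

Under this reading, every value has one encoding per length from its minimal length up to four bytes, which is where the "tie breaker" flexibility for the jump-offset fixpoint comes from.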
```rust
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.

//! Variable-length `u32` encoding and decoding, with efficient storage of `u32::MAX`.
```
I guess I don't understand: it's not a u32 encoding, it's a u28 encoding with a special case for u32::MAX and a pretty significant gap between 268435455 (2^28 - 1) and 4294967295 (u32::MAX).
Yeah, that's fair. Perhaps I should move this into the lsh project now that I made it a library. 🤔 The reason it's a "u28" is that lsh really doesn't need values >2^28, while an efficient encoding for one value >2^28 is still useful (it's used for setting the input offset to the maximum when matching a `.*`).
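To make the u28-plus-special-case shape concrete, here is a hedged sketch of a canonical encoder, using the same tag scheme reconstructed from the `0x17 0x00 0x00 0x00` example earlier in the thread (trailing 1 bits in the first byte count the extra bytes). The thread doesn't say how `u32::MAX` is actually stored; this sketch guesses that a lone `0xFF` first byte, which has no terminating 0 bit, denotes `u32::MAX` in a single byte, which would fit the "efficient storage" claim, but that rule is an assumption.

```rust
/// Encodes `v` canonically, i.e. in the smallest number of bytes.
/// Panics for values above 2^28 - 1 other than `u32::MAX`, which fall
/// into the gap the scheme cannot represent.
///
/// Assumed tag scheme: `k` trailing 1 bits in the first byte, then a
/// 0 bit, then the payload; `k` extra bytes follow. Each added byte
/// buys 7 more payload bits (7, 14, 21, 28 bits total).
/// The `u32::MAX` rule below is a guess, not confirmed by the thread.
fn encode(v: u32) -> Vec<u8> {
    if v == u32::MAX {
        return vec![0xFF]; // assumed single-byte special case
    }
    assert!(v < 1 << 28, "only u28 values (plus u32::MAX) are encodable");
    // Smallest k such that v fits in 7 * (k + 1) payload bits.
    let extra = match v {
        0..=0x7F => 0usize,
        0x80..=0x3FFF => 1,
        0x4000..=0x1F_FFFF => 2,
        _ => 3,
    };
    // Payload shifted past the tag: `extra` 1 bits, then the 0 bit.
    let raw = (v << (extra + 1)) | ((1u32 << extra) - 1);
    raw.to_le_bytes()[..=extra].to_vec()
}

fn main() {
    assert_eq!(encode(1), vec![0x02]); // minimal, unlike the overlong 0x17 form
    assert_eq!(encode(0x80), vec![0x01, 0x02]); // first value needing two bytes
    assert_eq!(encode(u32::MAX), vec![0xFF]); // assumed special case
}
```

With four bytes the payload tops out at 28 bits, which is exactly the 268435455 ceiling and the gap up to 4294967295 pointed out above.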
For now, this module has no purpose.
I wrote it as an experiment for encoding VM instructions.