A Small Variable-Length Bit/Byte Encoding

    This is an encoding that unifies the treatment of booleans, bit sets, integers and arbitrary byte data. For a while, I’ve been toying with the idea of unifying variable-length integer encoding with encoding the length of byte data. It’s dogmatically annoying that Protocol Buffers doesn’t support uint128 or bigger, and I have to resort to bytes. I want to express an RSA modulus as a big integer, not as a byte sequence! Or conversely, I want to treat byte sequences and integers the same at the encoding level. This grew out of that mental exercise.

    The goals are to unify the treatment of booleans, bit sets, integers and arbitrary byte data, to keep small values compact on the wire, and to avoid heavy bit fiddling for large values.

    Background

    There are lots of variable-length integer encodings out there, but they mostly fall into three categories: continuation bits, length prefixes, and hybrids of the two. Some examples:

    1. Protocol Buffers uses seven value bits per octet, with the upper bit indicating whether another octet follows. WebAssembly’s LEB128 is basically the same.
    2. ASN.1 BER/CER/DER lengths and integers use length prefixes.
    3. ASN.1 BER/CER/DER tags use the top bit to encode continuation bytes/octets.
    4. UTF-8 uses a variable number of top bits in the first byte to say how many bytes follow. A single test can be used to discover the full size, which is nice. It still uses a bit in each following byte to allow re-synchronizing the code point sequence, so it is a bit of a hybrid.

    The ones using continuation bits always require lots of bit fiddling, even when the values are so large that optimizing is just wasteful. The ones using length prefixes don’t optimize for small values.
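    For reference, this is roughly the bit fiddling a continuation-bit decoder does for an unsigned LEB128/Protocol Buffers varint (a minimal Python sketch; naming is mine):

        def decode_uvarint(data: bytes) -> tuple[int, int]:
            # Decode an unsigned LEB128/Protocol Buffers varint.
            # Returns (value, number of bytes consumed).
            value = 0
            shift = 0
            for i, b in enumerate(data):
                value |= (b & 0x7F) << shift  # low 7 bits carry the payload
                if b & 0x80 == 0:             # top bit clear: last octet
                    return value, i + 1
                shift += 7
            raise ValueError("truncated varint")

    For example, decode_uvarint(b"\xac\x02") returns (300, 2): every byte needs a mask, a shift and a continuation test.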

    The Encoding Format

    There are three forms: we’ll call them “short”, “medium” and “long”. The most significant bits of the first byte determine the form:

    Short:  0vvv vvvv
            11vv vvvv
    Medium: 100n nsvv vvvv...
    Long:   101N Nlll llll... svvv...
    

    where v is a value bit, n and l are value length bits, and N is a length-of-length bit. s is a sign bit, which is only present for integer types.
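    A decoder can pick the form with a couple of mask tests on the first byte; a minimal sketch in Python (naming is mine):

        def form_of(first_byte: int) -> str:
            # Classify the first byte of an encoded value.
            if first_byte & 0x80 == 0:   # 0vvv vvvv
                return "short"
            if first_byte & 0x40:        # 11vv vvvv
                return "short"
            if first_byte & 0x20 == 0:   # 100n nsvv
                return "medium"
            return "long"                # 101N Nlll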

    We use big-endian to preserve the natural order of things. Little-endian is great for casting integer sizes during computations, but for communication, the most significant value is normally written first. (Well, except in DNS domain names, of course.)

    The value bits are mostly straightforward. In the medium form, we have room for the top two value bits in the first byte, plus the sign bit. We use two’s complement for signed integers, which lets us strip leading zero bytes (or leading one bytes for negative values) as appropriate.
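    In Python, finding the minimal two’s-complement width is a one-liner (a sketch; to_bytes does the trimming arithmetic for us):

        def min_twos_complement(value: int) -> bytes:
            # Smallest big-endian two's-complement encoding of value:
            # bit_length() + 1 bits (one for the sign), rounded up to bytes.
            n = max(1, (value.bit_length() + 8) // 8)
            return value.to_bytes(n, "big", signed=True)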

    In the medium form, the n is the log2 of the number of value bytes that follow: n = 0 means one byte, n = 1 two bytes, n = 2 four bytes and n = 3 eight bytes.

    This is only useful for bit-aligned values like integers and bit sets. For byte-aligned values, being limited to small power-of-two sizes is a bit awkward, and we don’t generally use the medium form for byte sequences. (If we do, the initial three svv bits are zero, and reserved for future use.)
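    To make the bit layout concrete, here is a hypothetical medium-form integer encoder (a sketch under my reading of the format; in practice, small values would use the short form instead):

        def encode_medium_int(value: int) -> bytes:
            # Medium form: 100n nsvv + 2**n big-endian value bytes.
            # The sign bit and the top two value bits ride in the first byte.
            for n in range(4):
                nbytes = 1 << n
                nbits = 3 + 8 * nbytes  # svv in the first byte + value bytes
                if -(1 << (nbits - 1)) <= value < (1 << (nbits - 1)):
                    break
            else:
                raise ValueError("too wide for the medium form")
            u = value & ((1 << nbits) - 1)  # two's complement, nbits wide
            first = 0b10000000 | (n << 3) | (u >> (8 * nbytes))
            rest = u & ((1 << (8 * nbytes)) - 1)
            return bytes([first]) + rest.to_bytes(nbytes, "big")

    For example, encode_medium_int(-1) yields the bytes 87 FF: the sign bit set in the first byte, followed by one 0xFF value byte.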

    In the long form, the N is a modified log2 of the number of length bytes that follow: N = 0 means one length byte, N = 1 two, N = 2 zero (the length fits in the first byte’s three l bits) and N = 3 eight.

    This modification allows us to encode short byte sequences (up to 8 bytes) with a single byte of overhead. The somewhat awkward position of the zero makes it easy to compute the length from N using a bit shift and a bitwise-AND. If the length of the value requires four bytes to describe (i.e. the value is larger than 2^(3+2·8) = 2^19 bytes long), the overhead of the leading length bytes is so small anyway that we might as well skip straight from two length bytes to eight. But at the smaller end, allowing zero length bytes for short strings is useful.
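    Concretely, my reading of the shift-and-AND trick, plus a header builder for byte sequences (a sketch; the mask and the helper names are my assumptions, not from the format description):

        def long_form_length_bytes(N: int) -> int:
            # N = 0..3 maps to 1, 2, 0 and 8 length bytes (assumed mapping).
            return (1 << N) & 0b1011

        def encode_long_header(length: int) -> bytes:
            # Long form: 101N Nlll + big-endian length bytes.
            if length < 8:
                # N = 2: no length bytes; the length is just the lll bits.
                return bytes([0b10100000 | (2 << 3) | length])
            for n in (0, 1, 3):
                nbytes = long_form_length_bytes(n)
                if length < (1 << (3 + 8 * nbytes)):
                    top = length >> (8 * nbytes)  # top three bits go into lll
                    rest = length & ((1 << (8 * nbytes)) - 1)
                    return bytes([0b10100000 | (n << 3) | top]) + rest.to_bytes(nbytes, "big")
            raise ValueError("length too large")

    For example, encode_long_header(5) is the single byte B5, while encode_long_header(1000) is A3 E8 (N = 0, one length byte).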

    Examples of the three forms are available as a live, interactive demo in the original post (requires JavaScript).

    Properties

    Floating Point Numbers

    The approach commonly taken today is to use IEEE 754 and copy values directly from memory onto the wire. Protocol Buffers and WebAssembly, for example, do that even though they use variable-length encodings for integers. Ideally, an encoding format would be able to express round values in shorter form, just like we do with small integers.

    The binary32 encoding stores floating point numbers as

        seee eeee efff ffff ffff ffff ffff ffff

    i.e. one sign bit, eight exponent bits and 23 significand (fraction) bits.

    The biggest differences from integers are that floating point numbers consist of three parts, and that significand bits are left-aligned while integer bits are right-aligned. This is because the significand is always normalized to start with an implicit 1, and the remaining bits are fractional. Arguably, the sign bit belongs to the significand, so perhaps we can see this as two parts.

    Regardless of the parts, we can simply reinterpret the float32 (in this case) as a uint32, and attempt to encode that. As our integer representation assumes zeros on the left can be removed, we have to do something to be able to cut off zeros on the right instead. We have three choices: cut off trailing zeros, byte-swap so the trailing zeros become leading zeros, or reverse all bits.
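    In Python, the reinterpretation is a struct round-trip; note how round values leave plenty of trailing zeros to cut:

        import struct

        def float32_bits(x: float) -> int:
            # Reinterpret a float as its IEEE 754 binary32 bit pattern.
            return struct.unpack(">I", struct.pack(">f", x))[0]

        assert float32_bits(1.0) == 0x3F800000
        assert float32_bits(-2.0) == 0xC0000000
        assert float32_bits(0.5) == 0x3F000000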

    One argument in favor of keeping the exponent to the left (i.e. not byte-swapping) is that the exponent’s value is well-known for common values. The exponent in binary32 is encoded as excess-127, which means it normally hovers around 127, translating to an actual exponent of zero. This works surprisingly well (in terms of coding size) with our encoding due to two things:

    1. the right-shift due to the sign bit means the exponent range in the first byte is really [0, 63], and
    2. even if the sign bit is set, as long as the second bit is set (exponent is >=0), it triggers the 11₂ case, which is efficient.

    Byte-swapping also doesn’t reverse bits, so we’d likely have extra zero (upper) bits that can’t be compressed away. Reversing at the bit level would help, but is an uncommon operation, and is sure to mess with the exponent’s nice property mentioned above. So we rule out byte-swapping and bit reversing.

    Let’s look closer at cutting off trailing zeros. If we simply count trailing zeros, and shift them away, we need to carry how many zeros were removed somehow. Otherwise, we can’t decode the value unambiguously. The most significant bit can be either zero or one, so it doesn’t help. We could add a “start” bit and decoding could shift left until that first bit goes away. Or we could add a bit counter. (If our encoded n and l counted bits instead of bytes, that would solve it, but since n is in log2 to be compact, the granularity wouldn’t be enough.)

    As an aside, JavaScript is gifted with a Count Leading Zeros function (Math.clz32), but no Count Trailing Zeros! C++20 has std::countr_zero and GCC has __builtin_ctz. Even WebAssembly supports both (i32.clz and i32.ctz). Modern CPUs can do both efficiently.

    A more enticing option is to somehow infer the size from the encoded value. We can accomplish this by the integer decoder telling the floating point number decoder how many value bits it saw. As long as the trailing bit removal is aligned to bytes, while taking the three bits in the initial byte into account for the medium form, that is enough to be able to shift left. One way is by having an option in the integer decoder to left-align the output to the nearest byte boundary. That would make our encoded leading zeros into trailing zeros again, by the right amount, without having to pass meta information about the input encoding.
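    A sketch of that left-aligning idea, simplified to whole bytes and a binary32 target (it ignores the three first-byte bits of the medium form; naming is mine):

        def left_align32(value: int, nbytes_seen: int) -> int:
            # Shift a decoded integer left so it fills a full binary32
            # pattern again; nbytes_seen is the count the integer
            # decoder reports having consumed.
            return value << (8 * (4 - nbytes_seen))

        # 1.0 is the pattern 3F 80 00 00; with the two trailing zero
        # bytes cut, the decoder sees 0x3F80 over two bytes.
        assert left_align32(0x3F80, 2) == 0x3F800000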

    We can over-engineer this encoding by splitting the exponent and significand. As the appendix shows, it makes things worse, as we now have two integer overheads.

    Should You Use This Encoding?

    While at Google, I was told the servers spend a significant portion of processing power parsing Protocol Buffer messages. Much of that time is probably due to the varint encoding and the byte counting. Would it have been better to use simple XDR with generic compression? It would have allowed relatively simple hardware offloading of compression. And it’s certainly the direction we are moving in today. Google has created FlatBuffers in this vein, though it had by no means replaced Protocol Buffers internally as of 2021.

    Most protocols defined today do what low-level protocols like IP and USB have always been doing: use a schema with fixed-length integers, like XDR. E.g. MessagePack, Cap’n Proto, TLS.

    The TLS 1.3 presentation language is actually very nice, and you should probably be using that for describing data structures. It’s expressive, but not as cumbersome as ASN.1. The encoding follows clearly from the presentation language. There’s even “eTPL: An Enhanced Version of the TLS Presentation Language Suitable for Automated Parser Generation”, which discusses a few shortcomings and includes a prototype code generator for C++.

    Adding generic compression on top of that makes sense. Cap’n Proto suggests using LZ4 (for speed), or zlib (for size).

    Bottom Line

    If you can’t use fixed-length integers and a generic compression algorithm, but still want small wire transfers and a generic parser, perhaps the presented encoding is useful. You can use the same length-value encoding for data type tags, integers, big integers, octet sequences, floating point numbers, bit sets, tuples and more. If the value is small, the length can be compressed or even omitted.

    Appendix: Splitting Floating Point Numbers

    The sign really belongs with the significand, but the exponent and significand are otherwise just two signed integers of 8 bits and 24 bits. So we can encode them as such (keeping the exponent first for good measure).

    For the exponent, if it is within [-64, +127], it can be encoded as a single byte. The special values 0 and 0xFF, which are used for values close to zero, infinity and NaN, fall into this category.

    The significand (with the moved sign bit at the top) is just another integer. If trailing zeros can be removed, they are not part of the encoding. Note that binary32 doesn’t use two’s complement for the significand: there’s simply a sign bit and the absolute fraction.
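    A sketch of the split, moving the sign bit to the top of the 24-bit significand as described:

        import struct

        def split_binary32(x: float) -> tuple[int, int]:
            # Split a binary32 into (exponent, significand); the sign
            # bit is moved to the top of the 24-bit significand.
            bits = struct.unpack(">I", struct.pack(">f", x))[0]
            exponent = (bits >> 23) & 0xFF
            significand = ((bits >> 31) << 23) | (bits & 0x7FFFFF)
            return exponent, significand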

    This means that the special values (zero, the infinities and NaN) encode in very few bytes.

    This assumes that 01₂ is our special prefix instead of 10₂; -infinity would otherwise take three bytes. The drawback of this modification is that not all ASCII characters fit in [0, 63]: the string “0” (ASCII 48) would occupy one byte, but “A” (ASCII 65) would require two.

    The NaN cases are special. Since only the first and last bits are used, we’ve compressed them into a single significand byte.

    Encodings of other values depend on how many leading zeros can be removed from the exponent, and how many trailing zeros from the significand. In general, since decimal and binary fractions don’t line up, you will incur a two-byte overhead on top of the four or eight bytes normally used. However, if your input is often integers, with perhaps the odd 0.5 at the end, and you want to support double precision, this can save some bytes. E.g. the value 100.5 is encoded as three bytes whether the source is binary32 or binary64. Integers up to 1,023 fit in two bytes.

    The complexity this adds, and the overhead of always encoding two separate integers, make this unappealing.