Justin's profileVarious Technical TopicsPhotosBlog Tools Help

Blog


    February 05

    128bit Encoding Explained

    First, here's the promised VB version of the code. It's almost identical to the Java version, except that...
    1. Unsigned integers remove the need for >>>= (The java unsigned shift right operator)
    2. Pass by reference eliminates the need for the Int128 wrapper class, and simplifies the logic in the strToInt function slightly.
    3. Lack of decrement operator means two statements are required to access the character buffer and decrement the index.
    4. Support for Modules makes it clear that I'm providing utility functions. (As compared to using static methods in a final class.)
    5. Type inferencing eliminates the need to explicitly declare the type for local variables. (Not sure that I like this feature yet, because now it's sometimes hard to tell the type at a glance.
    6. The Select Case statement used in CharToMod64() is much easier to read than the corresponding Java if statements.
    Module Utils
    
        Const bUnder As Byte = 95
        Const bDollar As Byte = 36
        Const bQuest As Byte = 63
        Const bFirstUAlpha As Byte = 65
        Const bFirstLAlpha As Byte = 97
        Const bFirstNum As Byte = 48
        Const mask6Bits As UInt64 = 63
        Const mask4bits As UInt64 = 15
        Const mask2Bits As UInt64 = 3
        Const bitsPerChar As Integer = 6
    
        Dim CharMap() As Char = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_$".ToCharArray
    
        Function UInt128ToStr(ByVal msb As UInt64, ByVal lsb As UInt64) As String
            Const MAX_BYTE As Integer = 31
            Dim buf(0 To MAX_BYTE) As Char
            Dim i = MAX_BYTE
    
            ' 64 bit number has ten 6bit encoded values plus four bits left over
            For n = 1 To 10
                If lsb = 0 AndAlso msb = 0 Then
                    Exit For ' Eliminate leading zeros
                End If
                Dim b = CByte(lsb And mask6Bits)
                buf(i) = CharMap(b)
                i -= 1
                lsb >>= bitsPerChar
            Next
    
            ' 4 bits from the lsb, and 2 from the msb
            If lsb > 0 OrElse msb > 0 OrElse i = MAX_BYTE Then
                Dim leftOver = lsb And mask4bits
                Dim firstTwo = msb And mask2Bits
                Dim b = CByte((firstTwo << 4) Or leftOver)
                buf(i) = CharMap(b)
                i -= 1
                msb >>= 2
            End If
    
            Do While msb <> 0
                Dim b = CByte(msb And mask6Bits)
                buf(i) = CharMap(b)
                i -= 1
                msb >>= bitsPerChar
            Loop
    
            ' It's easiest to simply prefix an underscore to avoid
            ' illegal identifiers and clashes with keywords and literals.
            buf(i) = "_"c
            i -= 1
    
            Return New String(buf, i + 1, MAX_BYTE - i)
        End Function
    
        Public Sub StrToUInt128(ByVal s As String, ByRef msb As UInt64, ByRef lsb As UInt64)
            Dim buf() = ASCIIEncoding.ASCII.GetBytes(s)
            Dim maxByte = buf.Length - 1
            Dim i = maxByte
            Dim minByte = 0
            If buf(0) = bUnder Then
                minByte = 1
            End If
            msb = 0
            lsb = 0
    
            For n = 0 To 9
                If i < minByte Then
                    Return
                End If
                Dim b = CharToMod64(buf(i))
                If b <> 0 Then
                    If n <> 0 Then
                        b <<= (n * bitsPerChar)
                    End If
                    lsb = lsb Or b
                End If
                i -= 1
            Next
    
            If i >= minByte Then
                Dim b = CharToMod64(buf(i))
                If b <> 0 Then
                    Dim leftOver = b And mask4bits
                    msb = (b >> 4) And mask2Bits
                    leftOver <<= 60
                    lsb = lsb Or leftOver
                End If
                i -= 1
            End If
    
            For m = 0 To 10
                If i < minByte Then
                    Return
                End If
                Dim b = CharToMod64(buf(i))
                If b <> 0 Then
                    b <<= (m * bitsPerChar + 2)
                    msb = msb Or b
                End If
                i -= 1
            Next
        End Sub
    
        Private Function CharToMod64(ByVal c As Byte) As Byte
            Const b10 As Byte = 10
            Select Case c
                Case bFirstNum To bFirstNum + 10
                    Return c - bFirstNum
                Case bFirstUAlpha To bFirstUAlpha + 26
                    Return c - bFirstUAlpha + b10
                Case bFirstLAlpha To bFirstLAlpha + 26
                    Return c - bFirstLAlpha + bDollar
                Case bUnder
                    Return 62
                Case bDollar
                    Return 63
                Case Else
                    Debug.Assert(False, "Unexpected Character " & c)
                    Return 0
            End Select
        End Function
    
    End Module

    Turning Numbers Into Strings

    Because neither language has native support for 128bit integers, we have to resort to using two 64 bit integers. The algorithm I devised simply encodes every 6 bits of the input integers as a single character. I thought this worked out perfectly, because my first reading of Section 3.8 of the Java Language Spec lead me to believe that Java supports A-Z, a-z, 0-9, $, and _ as the only legal characters in identifiers. I learned later that it actually supports many more, so this algorithm could probably be extended to use N bits instead of only 6, which would make the resulting strings even smaller. In the end, I decided against this, because the current choice is less likely to have problems with character sets and encoding/decoding.

    With 6 bits per character and 128 bits total, you can see that (128/6= 21.3~) at most 22 characters are required to hold any number. To ensure legal Java identifiers we also prefix an underscore to each generated string.

    The first step in the algorithm is to allocate a temporary buffer to hold the decoded characters.  I used 32, because I'm a power-of-two kind of guy.

    Next we process the least significant 64 bit number, repeatedly masking out the lowest 6 bits, converting them to a Character using a lookup table, and then shifting right 6 bits to get the next piece. In Java, we must use the unsigned shift operator for this, because a signed shift will pull in 1's from the left, destroying the number. Notice that we only break out of the loop when we've either taken the first 60 bits of the 64 bit number, or both msb and lsb are zero. If only lsb is zero, then we just repeatedly divide 0/6 until 60 bits have been processed. This ensures that we get any trailing zeros. (e.g. Because 100 <> 100000.)

    The middle section of code is to handle the remaining 4 bits from the lsb, and the first 2 bits of the msb. This is accomplished using masks and shifts appropriately. The i = MAX_BYTE condition is there to ensure that the number zero is written out as "_0" instead of "_".

    Finally we process the remaining 62 bits from the msb. As soon as msb = 0 we're free to break out of the loop, because we don't want any leading zeros to pad the string.

    One nice property of this algorithm is that it encodes sequential numbers in an efficient and almost human readable format. The first 10 numbers are just encoded as "_0" through "_9", followed by "_A" through "_Z", "_a" through "_z", then "_" and "$". It then rolls over to the next digit with the next 64 encoded as "_10" through "_1$". Anyone used to hexidecimal to string encoding should find this intuitive.  Furthermore it should be obvious that for random 128bit numbers this algorithm gives an optimal encoding to 64 characters.

    Turning Strings Into Numbers

    Reversing the process is only slightly trickier than the initial encoding. First we convert the input string into an array of Bytes. One complication is that I didn't want to assume in the decoder that every input string would start with an underscore, so we check for this at the start and set minByte to either 1 or 0. This was originally there, because in the original Int128ToString() function I only prepended the underscore when necessary to make a legal Java identifier. However, I removed that code due to all the complications in checking for keyword and literal conflicts, and the flexibility in the decoding seemed nice so I left it.

    Just as with encoding, the decoder steps through the first 60 bits worth of characters. This time, we use a function to convert the input characters back to base64 numbers using CharToMod64. Each pass through the loop using a bitwise OR operation to combine the returned base64 number with the current value of the lsb.  The trick here is that I found it most straightforward to shift the decoded base64 number to the correct position (b <<= n * 6) and then combine with the lsb using a bitwise OR.

    The middle section once again converts the 6 bits from the decoded number into the remaining 4 bits for the lsb and the first 2 bits of the msb.

    The final section processes the remaining characters into the final 62 bits of the msb.

    Performance

    I found a few interesting things with the performance of this algorithm.

    First, I wrote a simple program to time how long it takes to encode the first million numbers, encode the last million,  encode and decode the first million, and finally encode and decode the last million. Here's the code in VB.

    Public Sub Main()
        Dim startTime = DateTime.Now
        For i As UInt64 = 0 To 999999
            Dim enc = UInt128ToStr(i, i)
        Next
        Dim stopTime = DateTime.Now
        Console.WriteLine("Encode took " & (stopTime - startTime).TotalMilliseconds)
    
        startTime = DateTime.Now
        For i As Int64 = -1L To -1000000 Step -1
            Dim enc = UInt128ToStr(CULng(i), CULng(i))
            Dim msb, lsb As UInt64
            StrToUInt128(enc, msb, lsb)
        Next
        stopTime = DateTime.Now
        Console.WriteLine("Encode big took " & (stopTime - startTime).TotalMilliseconds)
    
        startTime = DateTime.Now
        For i As UInt64 = 0 To 999999
            Dim enc = UInt128ToStr(i, i)
            Dim msb, lsb As UInt64
            StrToUInt128(enc, msb, lsb)
        Next
        stopTime = DateTime.Now
        Console.WriteLine("Roundtrip took " & (stopTime - startTime).TotalMilliseconds)
    
        startTime = DateTime.Now
        For i As Int64 = -1L To -1000000 Step -1
            Dim enc = UInt128ToStr(CULng(i), CULng(i))
            Dim msb, lsb As UInt64
            StrToUInt128(enc, msb, lsb)
        Next
        stopTime = DateTime.Now
        Console.WriteLine("Roundtrip big took " & (stopTime - startTime).TotalMilliseconds)
    End Sub

    Although both Java and .NET are probably more than fast enough, Java was 2-3 times faster at encoding, but a round trip encode/decode took about the same time for both. (Further testing of decode seemed to show that the .NET version really is faster at decoding.) It's not worth it to me at this point to figure out why, but if someone is interested I would recommend using ILDASM to inspect the generated IL assembly code for the .NET version. I don't think this is the kind of thing that's going to be helped by source analysis or profiling.

    Java

    Encode took 188
    Encode big took 249
    Roundtrip took 763
    Roundtrip big took 985

    VB

    Encode took 555.6285
    Encode big took 565.3935
    Roundtrip took 778.2705
    Roundtrip big took 917.91

    For comparison I also found source code online for a fast Base64 encoder (http://migbase64.sourceforge.net/).

    Base64

    Encode took 375
    Encode big took 374
    Roundtrip took 1352
    Roundtrip big took 1332

    This is not a slight on Mikael Grev's algorithm at all, because mine doesn't even pretend to generate compliant RFC2054 Base64 encoded strings.

    In the end, I think I came up with a pretty tight little algorithm to solve a particular problem. I hope someone finds it useful.