![]() |
|
Spaces home Various Technical TopicsProfileFriendsBlog | ![]() |
|
Various Technical TopicsAugust 02 Quantifying ExperienceIt's hard to find good developers. And it's worse than useless to filter candidates by "experience" as included on a resume. Statements like "10 years Java experience" or "15 years SQL experience" are meaningless. The problem is that you can't sum up real experience in a single concise number, and if you could it wouldn't be measured in years. We could try to come up with a formula, but any attempt to do so makes it clear that "Software Developer" encompasses a very wide variety of actual skills. What I'd really like to see on a resume, or in addition to a resume, is a real list summarizing all the things you've actually done, and the lessons you learned while doing them. This includes all of the following…
The above list probably does give us a fairly good measure of experience. If you can work through each of the above items completely in a fairly terse writing style and finish in less than a month, then you're probably not very experienced. Looking at this list, I can't even estimate how long it would take me to thoroughly explore the 5 items. I'm not sure, but it might be worth working through the exercise for yourself. Just don't expect anyone else to read it. My next post, … Experience Isn't Everything May 26 Comparing Strings Using Natural OrderingThis topic has been brought up numerous times before, but I wanted to take a shot at another solution, because I think there is still room for improvement, and I wanted a simple example to experiment with some new VB9 functionality. Here are a bunch of links to the most recent information on the problem and a bunch of potential solutions. By spelunking through these, you ought to be able to find a solution in any language you like, and every possible combination of algorithms. http://www.codinghorror.com/blog/archives/001018.html http://www.davekoelle.com/alphanum.html http://nedbatchelder.com/blog/200712/human_sorting.html http://www.interact-sw.co.uk/iangblog/2007/12/13/natural-sorting http://www.codeproject.com/KB/string/NaturalComparer.aspx Probably the simplest example of sorting above is the following Python program which originally came from a comment posted to Ned Batchelder's blog above… 1: def natural_sort(lst): 2: to_int = lambda text: int(text) if text.isdigit() else text 3: alphanum_key = lambda key: [ to_int(c) for c in re.split('([0-9]+)', key) ] 4: lst.sort( key=alphanum_key )Ian Griffiths was even able to more or less duplicate this solution in C# 3 by writing a couple of general purpose utility functions. After a little cleanup, it looks like this… 1: static IOrderedEnumerable<string> NaturalSort(List<string> lst) { 2: Func<string, object> ToInt = s => { 3: try { 4: return int.Parse(s); 5: } catch { 6: } 7: return s; 8: }; 9: return lst.OrderBy(s => Regex.Split(s.Replace(" ", ""), "([0-9]+)") 10: .Select(ToInt), new EnumerableComparer<object>()); 11: } One problem with Ian's solution is that it doesn't do an in-place sort, which makes it not quite a fair comparison with the original. Also, OrderBy uses deferred execution, so you must be careful when comparing the performance that you force the sort to happen (e.g. Call List.GetEnumerator().GetNext()) Of course, if it's fair to write a missing EnumerableComparer class, then why not just implement a NaturalComparer, which would make the resulting code even simpler… lst.Sort(NaturalComparer.CurrentCultureIgnoreCase); or even… lst.OrderBy(s => s, NaturalComparer.CurrentCultureIgnoreCase); This seems to give you the best solution with the added flexibility of working with OrderBy and Sort. (Interestingly, OrderBy often seems to be slightly faster than Sort, which seems strange considering that Sort doesn't have to allocate an entirely new structure to contain the results.) Before I show you my implementation for NaturalComparer, here are some problems found in most of the previous solutions I've seen.
The following is the Compare function from my solution to the problem. You can download the full source from . Public Function Compare(ByVal x As String, ByVal y As String) _ As Integer Implements IComparer(Of String).Compare If x Is Nothing AndAlso y Is Nothing Then Return 0 End If If x Is Nothing Then Return -1 End If If y Is Nothing Then Return 1 End If Dim xpos, ypos As Integer Do While xpos < x.Length AndAlso ypos < y.Length xpos = FindFirstIndexOf(x, xpos, myIsLetOrDig) ypos = FindFirstIndexOf(y, ypos, myIsLetOrDig) If xpos = -1 AndAlso ypos = -1 Then Return 0 ElseIf xpos = -1 Then Return -1 ElseIf ypos = -1 Then Return 1 ElseIf Char.IsNumber(x(xpos)) AndAlso Char.IsNumber(y(ypos)) Then Dim xtmp = FindNextIndexOf(x, xpos, myIsNonZero) Dim ytmp = FindNextIndexOf(y, ypos, myIsNonZero) Dim xend = FindNextIndexOf(x, xpos, myIsNotNum) Dim yend = FindNextIndexOf(y, ypos, myIsNotNum) xpos = If(xtmp = xend, xtmp - 1, xtmp) ypos = If(ytmp = yend, ytmp - 1, ytmp) If xend - xpos < yend - ypos Then Return -1 ElseIf xend - xpos > yend - ypos Then Return 1 Else Dim iy = ypos For ix = xpos To xend - 1 If x(ix) < y(iy) Then Return -1 ElseIf x(ix) > y(iy) Then Return 1 End If iy += 1 Next End If ElseIf Char.IsNumber(x(xpos)) Then Return -1 ElseIf Char.IsNumber(y(ypos)) Then Return 1 Else Dim xend = FindNextIndexOf(x, xpos, myIsNotLet) Dim yend = FindNextIndexOf(y, ypos, myIsNotLet) Dim l = xend - xpos Dim r = myCmpInfo.Compare(x, xpos, l, y, ypos, l, myCmpOpt) If r <> 0 Then Return r End If xpos = xend - 1 ' -1, because we're about to +1 ypos = yend - 1 End If xpos += 1 ypos += 1 Loop If xpos >= x.Length AndAlso ypos >= y.Length Then Return 0 ElseIf xpos >= x.Length Then Return -1 ElseIf ypos >= y.Length Then Return 1 End If Return 0 End Function The basic idea is to iterate forward through the two input strings, using the FindXXX functions to find the separators between numbers, letters, and ignored characters. We return from Compare as soon as possible without having to look at any characters following the first difference. No extra allocations are performed. For example, when comparing strings I used the form of CompareInfo.Compare that takes offsets and lengths for the two strings to avoid having to allocate substrings. Originally the code used inline lambda expressions for the arguments to FindXXX, however testing showed a significant performance increase from predefining those functions outside any individual Compare, as they are always the same. For example, here's the definition for myIsNonZero, and the others are much the same. Private Shared ReadOnly myIsNonZero As CharPred = Function(c) c <> "0"c Finally, here's what FindNextIndexOf looks like… 1: Public Delegate Function CharPred(ByVal c As Char) As Boolean 2: 3: Public Function FindNextIndexOf(ByVal s As String, _ 4: ByVal start As Integer, ByVal p As CharPred) As Integer 5: 6: If s Is Nothing OrElse s.Length = 0 Then 7: Return s.Length 8: End If 9: For i = start To s.Length - 1 10: Dim c = s(i) 11: If p(c) Then 12: Return i 13: End If 14: Next 15: Return s.Length 16: End Function Some NumbersI manually did a few test comparisons between my solution and an optimized form of Ian's (compiled regex and int.TryParse) I compared using two lists of strings shuffled randomly. The first list contained the numbers 1-5000, and the second contained alternating strings, numbers, and special characters of the form "string n string n string n" where n was 1-5000. Optimized IanG, string type one (108-115ms) Optimized IanG, string type two (291-316ms) Mine, string type one (16-20ms) Mine, string type two (158-171ms) Once again, these numbers aren't really fair to compare, because my solution handles things like case insensitivity, ignoring non alphanums, culture idiosyncracies, etc. But it's nice to know that the extra features are more than paid for with the more efficient algorithm. There's still room for improvement, so I may make updates from time to time if I need to use this code on a real project.
Feel free to use the code any way you see fit, and please let me know if you find any problems or have any other suggestions. February 05 128bit Encoding ExplainedFirst, here's the promised VB version of the code. It's almost identical to the Java version, except that...
Module Utils Const bUnder As Byte = 95 Const bDollar As Byte = 36 Const bQuest As Byte = 63 Const bFirstUAlpha As Byte = 65 Const bFirstLAlpha As Byte = 97 Const bFirstNum As Byte = 48 Const mask6Bits As UInt64 = 63 Const mask4bits As UInt64 = 15 Const mask2Bits As UInt64 = 3 Const bitsPerChar As Integer = 6 Dim CharMap() As Char = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_$".ToCharArray Function UInt128ToStr(ByVal msb As UInt64, ByVal lsb As UInt64) As String Const MAX_BYTE As Integer = 31 Dim buf(0 To MAX_BYTE) As Char Dim i = MAX_BYTE ' 64 bit number has ten 6bit encoded values plus four bits left over For n = 1 To 10 If lsb = 0 AndAlso msb = 0 Then Exit For ' Eliminate leading zeros End If Dim b = CByte(lsb And mask6Bits) buf(i) = CharMap(b) i -= 1 lsb >>= bitsPerChar Next ' 4 bits from the lsb, and 2 from the msb If lsb > 0 OrElse msb > 0 OrElse i = MAX_BYTE Then Dim leftOver = lsb And mask4bits Dim firstTwo = msb And mask2Bits Dim b = CByte((firstTwo << 4) Or leftOver) buf(i) = CharMap(b) i -= 1 msb >>= 2 End If Do While msb <> 0 Dim b = CByte(msb And mask6Bits) buf(i) = CharMap(b) i -= 1 msb >>= bitsPerChar Loop ' It's easiest to simply prefix an underscore to avoid ' illegal identifiers and clashes with keywords and literals. buf(i) = "_"c i -= 1 Return New String(buf, i + 1, MAX_BYTE - i) End Function Public Sub StrToUInt128(ByVal s As String, ByRef msb As UInt64, ByRef lsb As UInt64) Dim buf() = ASCIIEncoding.ASCII.GetBytes(s) Dim maxByte = buf.Length - 1 Dim i = maxByte Dim minByte = 0 If buf(0) = bUnder Then minByte = 1 End If msb = 0 lsb = 0 For n = 0 To 9 If i < minByte Then Return End If Dim b = CharToMod64(buf(i)) If b <> 0 Then If n <> 0 Then b <<= (n * bitsPerChar) End If lsb = lsb Or b End If i -= 1 Next If i >= minByte Then Dim b = CharToMod64(buf(i)) If b <> 0 Then Dim leftOver = b And mask4bits msb = (b >> 4) And mask2Bits leftOver <<= 60 lsb = lsb Or leftOver End If i -= 1 End If For m = 0 To 10 If i < minByte Then Return End If Dim b = CharToMod64(buf(i)) If b <> 0 Then b <<= (m * bitsPerChar + 2) msb = msb Or b End If i -= 1 Next End Sub Private Function CharToMod64(ByVal c As Byte) As Byte Const b10 As Byte = 10 Select Case c Case bFirstNum To bFirstNum + 10 Return c - bFirstNum Case bFirstUAlpha To bFirstUAlpha + 26 Return c - bFirstUAlpha + b10 Case bFirstLAlpha To bFirstLAlpha + 26 Return c - bFirstLAlpha + bDollar Case bUnder Return 62 Case bDollar Return 63 Case Else Debug.Assert(False, "Unexpected Character " & c) Return 0 End Select End Function End Module Turning Numbers Into StringsBecause neither language has native support for 128bit integers, we have to resort to using two 64 bit integers. The algorithm I devised simply encodes every 6 bits of the input integers as a single character. I thought this worked out perfectly, because my first reading of Section 3.8 of the Java Language Spec lead me to believe that Java supports A-Z, a-z, 0-9, $, and _ as the only legal characters in identifiers. I learned later that it actually supports many more, so this algorithm could probably be extended to use N bits instead of only 6, which would make the resulting strings even smaller. In the end, I decided against this, because the current choice is less likely to have problems with character sets and encoding/decoding. With 6 bits per character and 128 bits total, you can see that (128/6= 21.3~) at most 22 characters are required to hold any number. To ensure legal Java identifiers we also prefix an underscore to each generated string. The first step in the algorithm is to allocate a temporary buffer to hold the decoded characters. I used 32, because I'm a power-of-two kind of guy. Next we process the least significant 64 bit number, repeatedly masking out the lowest 6 bits, converting them to a Character using a lookup table, and then shifting right 6 bits to get the next piece. In Java, we must use the unsigned shift operator for this, because a signed shift will pull in 1's from the left, destroying the number. Notice that we only break out of the loop when we've either taken the first 60 bits of the 64 bit number, or both msb and lsb are zero. If only lsb is zero, then we just repeatedly divide 0/6 until 60 bits have been processed. This ensures that we get any trailing zeros. (e.g. Because 100 <> 100000.) The middle section of code is to handle the remaining 4 bits from the lsb, and the first 2 bits of the msb. This is accomplished using masks and shifts appropriately. The i = MAX_BYTE condition is there to ensure that the number zero is written out as "_0" instead of "_". Finally we process the remaining 62 bits from the msb. As soon as msb = 0 we're free to break out of the loop, because we don't want any leading zeros to pad the string. One nice property of this algorithm is that it encodes sequential numbers in an efficient and almost human readable format. The first 10 numbers are just encoded as "_0" through "_9", followed by "_A" through "_Z", "_a" through "_z", then "_" and "$". It then rolls over to the next digit with the next 64 encoded as "_10" through "_1$". Anyone used to hexidecimal to string encoding should find this intuitive. Furthermore it should be obvious that for random 128bit numbers this algorithm gives an optimal encoding to 64 characters. Turning Strings Into NumbersReversing the process is only slightly trickier than the initial encoding. First we convert the input string into an array of Bytes. One complication is that I didn't want to assume in the decoder that every input string would start with an underscore, so we check for this at the start and set minByte to either 1 or 0. This was originally there, because in the original Int128ToString() function I only prepended the underscore when necessary to make a legal Java identifier. However, I removed that code due to all the complications in checking for keyword and literal conflicts, and the flexibility in the decoding seemed nice so I left it. Just as with encoding, the decoder steps through the first 60 bits worth of characters. This time, we use a function to convert the input characters back to base64 numbers using CharToMod64. Each pass through the loop using a bitwise OR operation to combine the returned base64 number with the current value of the lsb. The trick here is that I found it most straightforward to shift the decoded base64 number to the correct position (b <<= n * 6) and then combine with the lsb using a bitwise OR. The middle section once again converts the 6 bits from the decoded number into the remaining 4 bits for the lsb and the first 2 bits of the msb. The final section processes the remaining characters into the final 62 bits of the msb. PerformanceI found a few interesting things with the performance of this algorithm. First, I wrote a simple program to time how long it takes to encode the first million numbers, encode the last million, encode and decode the first million, and finally encode and decode the last million. Here's the code in VB. Public Sub Main() Dim startTime = DateTime.Now For i As UInt64 = 0 To 999999 Dim enc = UInt128ToStr(i, i) Next Dim stopTime = DateTime.Now Console.WriteLine("Encode took " & (stopTime - startTime).TotalMilliseconds) startTime = DateTime.Now For i As Int64 = -1L To -1000000 Step -1 Dim enc = UInt128ToStr(CULng(i), CULng(i)) Dim msb, lsb As UInt64 StrToUInt128(enc, msb, lsb) Next stopTime = DateTime.Now Console.WriteLine("Encode big took " & (stopTime - startTime).TotalMilliseconds) startTime = DateTime.Now For i As UInt64 = 0 To 999999 Dim enc = UInt128ToStr(i, i) Dim msb, lsb As UInt64 StrToUInt128(enc, msb, lsb) Next stopTime = DateTime.Now Console.WriteLine("Roundtrip took " & (stopTime - startTime).TotalMilliseconds) startTime = DateTime.Now For i As Int64 = -1L To -1000000 Step -1 Dim enc = UInt128ToStr(CULng(i), CULng(i)) Dim msb, lsb As UInt64 StrToUInt128(enc, msb, lsb) Next stopTime = DateTime.Now Console.WriteLine("Roundtrip big took " & (stopTime - startTime).TotalMilliseconds) End Sub Although both Java and .NET are probably more than fast enough, Java was 2-3 times faster at encoding, but a round trip encode/decode took about the same time for both. (Further testing of decode seemed to show that the .NET version really is faster at decoding.) It's not worth it to me at this point to figure out why, but if someone is interested I would recommend using ILDASM to inspect the generated IL assembly code for the .NET version. I don't think this is the kind of thing that's going to be helped by source analysis or profiling. Java Encode took 188
VB Encode took 555.6285
For comparison I also found source code online for a fast Base64 encoder (http://migbase64.sourceforge.net/). Base64 Encode took 375
This is not a slight on Mikael Grev's algorithm at all, because mine doesn't even pretend to generate compliant RFC2054 Base64 encoded strings. In the end, I think I came up with a pretty tight little algorithm to solve a particular problem. I hope someone finds it useful. January 28 Encoding 128bit Numbers as StringsMy current project needed a way to convert between 128 bit numbers (e.g. java.util.UUID) and strings. A special requirement is that the strings be legal Java identifiers, because we use them for variable names in generated code. For the same reason, I wanted to ensure that the strings were of minimal size. I was able to come up with a trivial algorithm in a few minutes that works OK, but I felt that I should be able to come up with the most optimal solution given a little more time. It turns out that the optimal solution consumed most of this weekend, but I finally got it working. There's no way I can bill my customer for this exercise, so I thought I'd post it here in case anyone else finds it interesting. I'm also curious to see if anyone can come up with a better or faster solution. Basically my approach is to treat the 128 bit number as up to 21 groups of 6 bits each. This gives me 64 possible characters which works out perfectly, because Java has exactly 64 legal characters for identifiers (0-9, A-Z, a-z, _, and $). I also prepend an additional underscore to avoid illegal starting characters and clashes with keywords and literals. For now, here's the Java source to ponder. Tomorrow I'll discuss the code in more detail, and post a VB version which I actually wrote first, then ported to Java.
| ||