| Justin's profileVarious Technical TopicsPhotosBlog | Help |
|
May 26 Comparing Strings Using Natural OrderingThis topic has been brought up numerous times before, but I wanted to take a shot at another solution, because I think there is still room for improvement, and I wanted a simple example to experiment with some new VB9 functionality. Here are a bunch of links to the most recent information on the problem and a bunch of potential solutions. By spelunking through these, you ought to be able to find a solution in any language you like, and every possible combination of algorithms. http://www.codinghorror.com/blog/archives/001018.html http://www.davekoelle.com/alphanum.html http://nedbatchelder.com/blog/200712/human_sorting.html http://www.interact-sw.co.uk/iangblog/2007/12/13/natural-sorting http://www.codeproject.com/KB/string/NaturalComparer.aspx Probably the simplest example of sorting above is the following Python program which originally came from a comment posted to Ned Batchelder's blog above… 1: def natural_sort(lst): 2: to_int = lambda text: int(text) if text.isdigit() else text 3: alphanum_key = lambda key: [ to_int(c) for c in re.split('([0-9]+)', key) ] 4: lst.sort( key=alphanum_key )Ian Griffiths was even able to more or less duplicate this solution in C# 3 by writing a couple of general purpose utility functions. After a little cleanup, it looks like this… 1: static IOrderedEnumerable<string> NaturalSort(List<string> lst) { 2: Func<string, object> ToInt = s => { 3: try { 4: return int.Parse(s); 5: } catch { 6: } 7: return s; 8: }; 9: return lst.OrderBy(s => Regex.Split(s.Replace(" ", ""), "([0-9]+)") 10: .Select(ToInt), new EnumerableComparer<object>()); 11: } One problem with Ian's solution is that it doesn't do an in-place sort, which makes it not quite a fair comparison with the original. Also, OrderBy uses deferred execution, so you must be careful when comparing the performance that you force the sort to happen (e.g. Call List.GetEnumerator().GetNext()) Of course, if it's fair to write a missing EnumerableComparer class, then why not just implement a NaturalComparer, which would make the resulting code even simpler… lst.Sort(NaturalComparer.CurrentCultureIgnoreCase); or even… lst.OrderBy(s => s, NaturalComparer.CurrentCultureIgnoreCase); This seems to give you the best solution with the added flexibility of working with OrderBy and Sort. (Interestingly, OrderBy often seems to be slightly faster than Sort, which seems strange considering that Sort doesn't have to allocate an entirely new structure to contain the results.) Before I show you my implementation for NaturalComparer, here are some problems found in most of the previous solutions I've seen.
The following is the Compare function from my solution to the problem. You can download the full source from . Public Function Compare(ByVal x As String, ByVal y As String) _ As Integer Implements IComparer(Of String).Compare If x Is Nothing AndAlso y Is Nothing Then Return 0 End If If x Is Nothing Then Return -1 End If If y Is Nothing Then Return 1 End If Dim xpos, ypos As Integer Do While xpos < x.Length AndAlso ypos < y.Length xpos = FindFirstIndexOf(x, xpos, myIsLetOrDig) ypos = FindFirstIndexOf(y, ypos, myIsLetOrDig) If xpos = -1 AndAlso ypos = -1 Then Return 0 ElseIf xpos = -1 Then Return -1 ElseIf ypos = -1 Then Return 1 ElseIf Char.IsNumber(x(xpos)) AndAlso Char.IsNumber(y(ypos)) Then Dim xtmp = FindNextIndexOf(x, xpos, myIsNonZero) Dim ytmp = FindNextIndexOf(y, ypos, myIsNonZero) Dim xend = FindNextIndexOf(x, xpos, myIsNotNum) Dim yend = FindNextIndexOf(y, ypos, myIsNotNum) xpos = If(xtmp = xend, xtmp - 1, xtmp) ypos = If(ytmp = yend, ytmp - 1, ytmp) If xend - xpos < yend - ypos Then Return -1 ElseIf xend - xpos > yend - ypos Then Return 1 Else Dim iy = ypos For ix = xpos To xend - 1 If x(ix) < y(iy) Then Return -1 ElseIf x(ix) > y(iy) Then Return 1 End If iy += 1 Next End If ElseIf Char.IsNumber(x(xpos)) Then Return -1 ElseIf Char.IsNumber(y(ypos)) Then Return 1 Else Dim xend = FindNextIndexOf(x, xpos, myIsNotLet) Dim yend = FindNextIndexOf(y, ypos, myIsNotLet) Dim l = xend - xpos Dim r = myCmpInfo.Compare(x, xpos, l, y, ypos, l, myCmpOpt) If r <> 0 Then Return r End If xpos = xend - 1 ' -1, because we're about to +1 ypos = yend - 1 End If xpos += 1 ypos += 1 Loop If xpos >= x.Length AndAlso ypos >= y.Length Then Return 0 ElseIf xpos >= x.Length Then Return -1 ElseIf ypos >= y.Length Then Return 1 End If Return 0 End Function The basic idea is to iterate forward through the two input strings, using the FindXXX functions to find the separators between numbers, letters, and ignored characters. We return from Compare as soon as possible without having to look at any characters following the first difference. No extra allocations are performed. For example, when comparing strings I used the form of CompareInfo.Compare that takes offsets and lengths for the two strings to avoid having to allocate substrings. Originally the code used inline lambda expressions for the arguments to FindXXX, however testing showed a significant performance increase from predefining those functions outside any individual Compare, as they are always the same. For example, here's the definition for myIsNonZero, and the others are much the same. Private Shared ReadOnly myIsNonZero As CharPred = Function(c) c <> "0"c Finally, here's what FindNextIndexOf looks like… 1: Public Delegate Function CharPred(ByVal c As Char) As Boolean 2: 3: Public Function FindNextIndexOf(ByVal s As String, _ 4: ByVal start As Integer, ByVal p As CharPred) As Integer 5: 6: If s Is Nothing OrElse s.Length = 0 Then 7: Return s.Length 8: End If 9: For i = start To s.Length - 1 10: Dim c = s(i) 11: If p(c) Then 12: Return i 13: End If 14: Next 15: Return s.Length 16: End Function Some NumbersI manually did a few test comparisons between my solution and an optimized form of Ian's (compiled regex and int.TryParse) I compared using two lists of strings shuffled randomly. The first list contained the numbers 1-5000, and the second contained alternating strings, numbers, and special characters of the form "string n string n string n" where n was 1-5000. Optimized IanG, string type one (108-115ms) Optimized IanG, string type two (291-316ms) Mine, string type one (16-20ms) Mine, string type two (158-171ms) Once again, these numbers aren't really fair to compare, because my solution handles things like case insensitivity, ignoring non alphanums, culture idiosyncracies, etc. But it's nice to know that the extra features are more than paid for with the more efficient algorithm. There's still room for improvement, so I may make updates from time to time if I need to use this code on a real project.
Feel free to use the code any way you see fit, and please let me know if you find any problems or have any other suggestions. |
|
|