
Justin Michel


Various Technical Topics

May 13

Time for an SSD

The time is right to buy an SSD, because they are finally at the price/performance level where it’s just ridiculous to use anything else. Naysayers will claim that it’s still too early to switch, because you can buy a terabyte hard disk for $50 while a decent 60GB SSD costs $200. However, I think this price difference is irrelevant even in the current economy, because the performance difference is enormous and most of us don’t want or need a drive that big. Those that do need more storage (for a PVR?) can use an extra $50 drive in addition to a primary fast drive.

SSD technology is improving so rapidly that there are a lot of sub-par products still available. For example, you could buy an OCZ Apex 60GB drive for $145, and it will likely have similar performance to the more expensive new OCZ Vertex drives in some benchmarks. However, it may also suffer from some stuttering problems common to this generation of drives. For more info see the recent SSD articles at http://www.anandtech.com/storage/.

The best advice is to buy only what you need, as the prices are dropping quickly. Common wisdom is that the price is cut in half each year. However, I paid $31/GB for a small MemoRight SLC drive for my laptop 1.5 years ago, and the latest drives I bought offer much higher performance at 1/10th the price. No one predicted that prices would drop this precipitously, and I see they’re still selling my drive for $21/GB even today. A year from now they’ll probably have terabyte DRAM speed drives for $5 at Walgreens. :-)

The safest bet is probably the Intel X25-M, which is available for $630 @ 160GB or about half that price for 80GB. Personally I took a little more risk and bought 4 60GB OCZ Vertex drives for $200 each. I set up two of them as RAID0 in each of my two home computers, and I believe I’m getting much better performance than the Intel drive in most cases, although I’ve had a few headaches. I’ve had to flash firmware updates to the drives to fix bugs and get new features, which erases the drives. While I was at it I switched to Windows 7 RC, which has some new features to work better with an SSD.

So what are the benefits?

· SSD is completely silent

· I can read files at ~400MB/s and write at ~300MB/s

· Seek time is ~0.1ms

· Applications load instantly and never stutter or freeze.

· Lower power consumption and heat generation

My 2.66GHz Core 2 Duo NVidia 680Sli machine takes about 40 seconds to start Windows 7 64bit from the point of pushing the power button, with half of that time (20.5s) consisting of BIOS stuff. Here’s a picture of a common disk benchmark. Notice that it writes much faster than it reads in many cases. That seems to be an anomaly with this particular motherboard RAID controller, and was the same when I was running Vista 32bit. I haven’t noticed it in practice.

My 3.16GHz Core 2 Duo Intel 975 machine takes 49 seconds to start Windows Vista 32bit, with 16s of that consisting of BIOS stuff (and I hit Esc to skip the memory test). The same benchmark test on this machine shows much better (and more normal) results.

This SSD thing is also a big deal for software developers. It’s time once again to adjust your perceptions and learn the new physical reality brought on by this change. Stop designing for outdated equipment. Planning your software architecture around the performance characteristics of a mechanical hard disk now makes no more sense than planning to run your software on a Cassette Tape Drive. The rules have changed. Databases should be re-architected. Persistence should be revisited. Why load your files into in-memory objects when you may get better overall performance leaving them on the disk? Change the way you do things, and your software is likely to be better than legacy solutions. Some companies and people have a hard time coping with changing rules, so now’s a great time to challenge the status quo.

Let me close with an analogy.

If a truck were available that can teleport 30 miles at a time, and gets 100,000 mpg on regular gasoline, would you still want to buy a traditional truck? What if you could get a giant dump truck for $500 and the teleporting 100,000 mpg one cost $63,000? What if a sports car version were available for $20,000 that could go 500mph when you didn’t feel like teleporting and still get 100,000mpg? Does it really matter if 5 years from now you could get a 200,000mpg version that could teleport 50 miles at a time for $10,000? City planners, architects, shipping companies, and other affected parties would have to get their act together to take advantage of “physics 2.0” or new governments, communities, and businesses would replace them.

If you are even slightly inconvenienced or annoyed by the speed, noise, stuttering, power consumption, or other problems associated with a legacy mechanical hard disk then a solution is available for as little as $200.

Update 5/20/2009 : Newegg has the G.Skill Falcon 128GB for $309. I think this is identical to the OCZ Vertex 120GB, but ~$70 less.

December 01

My Blog Personality

Try out Typealyzer to find out what your blog says about your personality. Here's mine…

INTJ - The Scientists

The long-range thinking and individualistic type. They are especially good at looking at almost anything and figuring out a way of improving it - often with a highly creative and imaginative touch. They are intellectually curious and daring, but might be physically hesitant to try new things.

The Scientists enjoy theoretical work that allows them to use their strong minds and bold creativity. Since they tend to be so abstract and theoretical in their communication they often have a problem communicating their visions to other people and need to learn patience and use concrete examples. Since they are extremely good at concentrating they often have no trouble working alone.

August 02

Quantifying Experience

It's hard to find good developers. And it's worse than useless to filter candidates by "experience" as included on a resume. Statements like "10 years Java experience" or "15 years SQL experience" are meaningless. The problem is that you can't sum up real experience in a single concise number, and if you could it wouldn't be measured in years.

We could try to come up with a formula, but any attempt to do so makes it clear that "Software Developer" encompasses a very wide variety of actual skills. What I'd really like to see on a resume, or in addition to a resume, is a real list summarizing all the things you've actually done, and the lessons you learned while doing them. This includes all of the following…

  1. List each book you've read that seems pertinent to software development, and summarize what you learned. Books are your number one resource for accumulating some subset of the experience of others. If you're not the type to read a book, then I hope you're able to make up for it in other ways. It's not likely though. You might hope to get the same benefit from blogs, but the best of them are just going to restate something that's been said better and more thoroughly in a book.
  2. List each project you have worked on, no matter how small, and what you learned from each one. There's a lot to be learned from projects big and small. I've personally built full products by myself from scratch, and worked on a variety of teams ranging in size from 2 to 20 developers. I've even spent some time working on one of hundreds of small teams working together on a truly massive project with something like 10,000 developers spread throughout the country. I've learned different lessons from each of these experiences, and I feel it adds up to more than "14 years experience". Many people in our industry get sucked into giant corporations, where they work on some tiny piece of a large maintenance-mode project for years at a time. They tend to slow down over time, cranking out a new SQL report once every month or two, and gradually falling further behind. I've nearly been trapped in similar situations myself, so I know you have to be ever-diligent to avoid it.
  3. List the various roles you've played in the software development process, and what you learned from each experience. Have you ever led a team of people? Have you taken a back seat to another tech lead? Have you ever designed a real software product by yourself? Have you written documentation, or even a book? Have you taught courses, mentored junior developers, pair-programmed, designed web pages, designed a non-web GUI, played the part of a non-technical manager, or any of dozens of other roles that developers may experience in their career? What did you learn? Did you have good mentors in your early years, or did you have to learn everything the hard way?
  4. List all the types of processes and tools you've experienced in real professional development, not counting hobbies you've tinkered with at home (unless you built something that people actually use). I've built both GUI and server-side applications, both commercial software products and internal applications, used 8 different languages on non-trivial projects, become proficient in 6 different IDE/Editors, etc. I've found each of these experiences valuable.
  5. List your major successes and failures. There are always some high points and low points that stand out from the rest. When you look back on projects that were less successful than you'd hoped, then take the time to think hard about what really went wrong, and what you personally could have done better. In your internal dialogue, accept full responsibility for any problems, and don't lay the blame on incompetent management, lazy coworkers, indecisive customers, or other distractions. Try instead to imagine what you could have done to make the project a perfect success. Maybe you should have tried harder to understand the customer requirements instead of trusting a Systems Analyst. Maybe you should have found a way to work around management incompetence? Maybe you should have devoted more time to mentoring junior developers instead of forcing them through the trial-by-fire that you suffered through? Or maybe you shouldn't have coddled those junior developers, and let them learn a few things the hard way? Then again maybe you yourself are falling short in some ways. Do you understand the business point of view? Do you over-engineer, under-engineer, fail to test, test too much, write sloppy code, spend too little time designing before coding, or fail in any of thousands of other ways?

The above list probably does give us a fairly good measure of experience. If you can work through each of the above items completely in a fairly terse writing style and finish in less than a month, then you're probably not very experienced. Looking at this list, I can't even estimate how long it would take me to thoroughly explore the 5 items.

I'm not sure, but it might be worth working through the exercise for yourself. Just don't expect anyone else to read it.

My next post, …

Experience Isn't Everything

May 26

Comparing Strings Using Natural Ordering

This topic has been brought up numerous times before, but I wanted to take a shot at another solution, because I think there is still room for improvement, and I wanted a simple example to experiment with some new VB9 functionality.

Here are a bunch of links to the most recent information on the problem and a bunch of potential solutions. By spelunking through these, you ought to be able to find a solution in any language you like, and every possible combination of algorithms.

http://www.codinghorror.com/blog/archives/001018.html

http://www.davekoelle.com/alphanum.html

http://nedbatchelder.com/blog/200712/human_sorting.html

http://www.interact-sw.co.uk/iangblog/2007/12/13/natural-sorting

http://www.codeproject.com/KB/string/NaturalComparer.aspx

Probably the simplest example of sorting above is the following Python program which originally came from a comment posted to Ned Batchelder's blog above…

   import re

   def natural_sort(lst):
     to_int = lambda text: int(text) if text.isdigit() else text
     alphanum_key = lambda key: [ to_int(c) for c in re.split('([0-9]+)', key) ]
     lst.sort( key=alphanum_key )

Ian Griffiths was even able to more or less duplicate this solution in C# 3 by writing a couple of general purpose utility functions. After a little cleanup, it looks like this…

static IOrderedEnumerable<string> NaturalSort(List<string> lst) {
    Func<string, object> ToInt = s => {
        try {
            return int.Parse(s);
        } catch {
        }
        return s;
    };
    return lst.OrderBy(s => Regex.Split(s.Replace(" ", ""), "([0-9]+)")
        .Select(ToInt), new EnumerableComparer<object>());
}

One problem with Ian's solution is that it doesn't do an in-place sort, which makes it not quite a fair comparison with the original. Also, OrderBy uses deferred execution, so when comparing performance you must be careful to force the sort to actually happen (e.g. call ToList() on the result, or GetEnumerator().MoveNext()).

Of course, if it's fair to write a missing EnumerableComparer class, then why not just implement a NaturalComparer, which would make the resulting code even simpler…

lst.Sort(NaturalComparer.CurrentCultureIgnoreCase);

or even…

lst.OrderBy(s => s, NaturalComparer.CurrentCultureIgnoreCase);

This seems to give you the best solution with the added flexibility of working with OrderBy and Sort. (Interestingly, OrderBy often seems to be slightly faster than Sort, which seems strange considering that Sort doesn't have to allocate an entirely new structure to contain the results.)

Before I show you my implementation for NaturalComparer, here are some problems found in most of the previous solutions I've seen.

  • They often rely on converting sequences of numeric characters to integers or some other numeric type. This is completely unnecessary, and if you think about what must be going on behind the scenes, fairly costly. Maybe I'll post an article on how to convert strings to integers so that you can get an appreciation for how much work is going on behind that seemingly innocent call to int.Parse.
  • Even if converting to Integers were a good idea, it wouldn't be advisable to throw and catch exceptions repeatedly as part of the algorithm. This can be fixed by using int.TryParse in the code above, but there is no point since we don't want to go down that road at all because…
  • Converting to Integers doesn't work for longer sequences of digits. For example, try calling ToInt(new String('1', 500)) and see what happens.
  • Many solutions use a split() function to divide each string into a list of strings that are either numbers or alphabetic strings. This is completely unnecessary, and wasteful. For example, if I want to compare "1a2a3a…Na" with "2a3a…Ma", then this (silly, stupid, wasteful, inefficient, …) algorithm would allocate two lists of size N and M, whereas a reasonable algorithm would realize the string1 is less than string2 after comparing the first character.
  • Many solutions also tend to use regular expressions to do the split(). Although, this may be convenient, it's not a very efficient approach. I've seen some solutions that use a simple lookup table for the '0'-'9' to help somewhat, but the split() approach seems like a dead end anyway.
  • By the way, if you were to use a Regex here, then many platforms (including .NET) let you compile it to get better performance. When I compare performance later, I'll use a modified version of Ian's code that uses a compiled Regex and int.TryParse.
  • Most of the solutions I've seen ignore all culture issues with string comparison. Considering that the most likely usages are for sorting lists of user visible strings this seems pretty shortsighted.
  • Many solutions are also case sensitive, which is also pretty fundamentally wrong. Any correct NaturalSorting solution should at least have the option to sort case-insensitively.
  • Another requirement for NaturalSorting is to support ignoring non-alphanumeric characters. Typically you want "a " = "a", but "a b" != "ab". In general terms any character that is not a number or letter should only act as a separator. Many solutions, including the python above, ignore this requirement.
  • A general class of problems with almost every solution is the side effect of creating lots of stuff during comparison. Some solutions dynamically create temporary strings, lists of strings, numbers, or even more complex objects. Sometimes this creation is somewhat hidden in a call to string.substring on a platform with immutable strings. Some authors try to use something like StringBuilder to help, but this does not address the real issue. NaturalCompare can probably be implemented without any need for dynamic allocation at all.
  • Another class of problems is algorithms that tend to do the same thing twice. One example is using a regex to divide an input string into a list of strings and numbers, and then having to use another pass to check whether each one is a number again. It would be better to record the fact of whether it's a string or number on the first pass so you don't have to do the same work again. However, I confess that my own algorithm often has to make two passes through each section of a substring, once to search for a delimiter, and once to do the comparison.
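To make the point about integer conversion concrete, here is a minimal Python sketch of comparing two runs of digits by value without ever parsing them. The function name `compare_digit_runs` is made up for this illustration, and it assumes both arguments contain only ASCII digits. The idea is simply: ignore leading zeros, let the longer run win, and otherwise let the first differing digit decide.

```python
def compare_digit_runs(a: str, b: str) -> int:
    """Compare two strings of ASCII digits by numeric value without parsing.

    Returns -1, 0 or 1. Hypothetical helper for illustration only.
    """
    # Skip leading zeros, but keep one digit so "000" still compares as "0".
    i = 0
    while i < len(a) - 1 and a[i] == "0":
        i += 1
    j = 0
    while j < len(b) - 1 and b[j] == "0":
        j += 1

    len_a, len_b = len(a) - i, len(b) - j
    # With leading zeros gone, the longer run is the larger number.
    if len_a != len_b:
        return -1 if len_a < len_b else 1

    # Same length: the first differing digit decides.
    for k in range(len_a):
        ca, cb = a[i + k], b[j + k]
        if ca != cb:
            return -1 if ca < cb else 1
    return 0
```

Because nothing is converted, a run of 500 digits is handled the same way as a run of 2 digits, and no overflow can occur.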

The following is the Compare function from my solution to the problem. You can download the full source from .

Public Function Compare(ByVal x As String, ByVal y As String) _
    As Integer Implements IComparer(Of String).Compare

    If x Is Nothing AndAlso y Is Nothing Then
        Return 0
    End If
    If x Is Nothing Then
        Return -1
    End If
    If y Is Nothing Then
        Return 1
    End If
    Dim xpos, ypos As Integer
    Do While xpos < x.Length AndAlso ypos < y.Length
        xpos = FindFirstIndexOf(x, xpos, myIsLetOrDig)
        ypos = FindFirstIndexOf(y, ypos, myIsLetOrDig)
        If xpos = -1 AndAlso ypos = -1 Then
            Return 0
        ElseIf xpos = -1 Then
            Return -1
        ElseIf ypos = -1 Then
            Return 1
        ElseIf Char.IsNumber(x(xpos)) AndAlso Char.IsNumber(y(ypos)) Then
            Dim xtmp = FindNextIndexOf(x, xpos, myIsNonZero)
            Dim ytmp = FindNextIndexOf(y, ypos, myIsNonZero)
            Dim xend = FindNextIndexOf(x, xpos, myIsNotNum)
            Dim yend = FindNextIndexOf(y, ypos, myIsNotNum)

            xpos = If(xtmp = xend, xtmp - 1, xtmp)
            ypos = If(ytmp = yend, ytmp - 1, ytmp)

            If xend - xpos < yend - ypos Then
                Return -1
            ElseIf xend - xpos > yend - ypos Then
                Return 1
            Else
                Dim iy = ypos
                For ix = xpos To xend - 1
                    If x(ix) < y(iy) Then
                        Return -1
                    ElseIf x(ix) > y(iy) Then
                        Return 1
                    End If
                    iy += 1
                Next
            End If
        ElseIf Char.IsNumber(x(xpos)) Then
            Return -1
        ElseIf Char.IsNumber(y(ypos)) Then
            Return 1
        Else
            Dim xend = FindNextIndexOf(x, xpos, myIsNotLet)
            Dim yend = FindNextIndexOf(y, ypos, myIsNotLet)
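            ' Each run needs its own length: reusing x's length for y either
            ' runs past y's letter run or ignores the tail of a longer one.
            Dim xlen = xend - xpos
            Dim ylen = yend - ypos
            Dim r = myCmpInfo.Compare(x, xpos, xlen, y, ypos, ylen, myCmpOpt)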
            If r <> 0 Then
                Return r
            End If
            xpos = xend - 1    ' -1, because we're about to +1
            ypos = yend - 1
        End If
        xpos += 1
        ypos += 1
    Loop
    If xpos >= x.Length AndAlso ypos >= y.Length Then
        Return 0
    ElseIf xpos >= x.Length Then
        Return -1
    ElseIf ypos >= y.Length Then
        Return 1
    End If
    Return 0
End Function

The basic idea is to iterate forward through the two input strings, using the FindXXX functions to find the separators between numbers, letters, and ignored characters. We return from Compare as soon as possible without having to look at any characters following the first difference. No extra allocations are performed. For example, when comparing strings I used the form of CompareInfo.Compare that takes offsets and lengths for the two strings to avoid having to allocate substrings. Originally the code used inline lambda expressions for the arguments to FindXXX, however testing showed a significant performance increase from predefining those functions outside any individual Compare, as they are always the same. For example, here's the definition for myIsNonZero, and the others are much the same.

      Private Shared ReadOnly myIsNonZero As CharPred = Function(c) c <> "0"c  

Finally, here's what FindNextIndexOf looks like…

Public Delegate Function CharPred(ByVal c As Char) As Boolean

' Returns the index of the first character at or after start that
' satisfies p, or s.Length if there is none.
Public Function FindNextIndexOf(ByVal s As String, _
    ByVal start As Integer, ByVal p As CharPred) As Integer

    If s Is Nothing Then
        Return 0    ' s.Length would throw on Nothing
    End If
    For i = start To s.Length - 1
        Dim c = s(i)
        If p(c) Then
            Return i
        End If
    Next
    Return s.Length
End Function
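For readers who don't use VB, the overall shape of the approach can be sketched in Python. This is not a port of the code above, just an illustration of the same idea under some simplifying assumptions: digits are ASCII only, letters are compared with `str.casefold()` rather than a culture-aware comparer, and all names (`find_next`, `natural_compare` and the predicates) are invented for the example. Python will still allocate small objects behind the scenes, so the sketch shows the structure of the walk, not the performance.

```python
from functools import cmp_to_key


def is_digit(c):
    return "0" <= c <= "9"


def is_alnum(c):
    return is_digit(c) or c.isalpha()


def not_digit(c):
    return not is_digit(c)


def not_letter(c):
    return not c.isalpha()


def find_next(s, start, pred):
    """Index of the first char at or after start satisfying pred, else len(s)."""
    i = start
    while i < len(s) and not pred(s[i]):
        i += 1
    return i


def natural_compare(x, y):
    xi = yi = 0
    while True:
        # Anything that is not a letter or digit only acts as a separator.
        xi = find_next(x, xi, is_alnum)
        yi = find_next(y, yi, is_alnum)
        if xi == len(x) or yi == len(y):
            # The string that still has a token left sorts later.
            return (xi < len(x)) - (yi < len(y))
        x_num, y_num = is_digit(x[xi]), is_digit(y[yi])
        if x_num != y_num:
            return -1 if x_num else 1      # numbers sort before letters
        if x_num:
            xe, ye = find_next(x, xi, not_digit), find_next(y, yi, not_digit)
            while xi < xe - 1 and x[xi] == "0":   # skip leading zeros
                xi += 1
            while yi < ye - 1 and y[yi] == "0":
                yi += 1
            if xe - xi != ye - yi:                # longer run is bigger
                return -1 if xe - xi < ye - yi else 1
            for k in range(xe - xi):
                if x[xi + k] != y[yi + k]:
                    return -1 if x[xi + k] < y[yi + k] else 1
        else:
            xe, ye = find_next(x, xi, not_letter), find_next(y, yi, not_letter)
            for k in range(min(xe - xi, ye - yi)):
                cx, cy = x[xi + k].casefold(), y[yi + k].casefold()
                if cx != cy:
                    return -1 if cx < cy else 1
            if xe - xi != ye - yi:                # shorter word sorts first
                return -1 if xe - xi < ye - yi else 1
        xi, yi = xe, ye


names = ["file10", "file2", "File1"]
ordered = sorted(names, key=cmp_to_key(natural_compare))
```

The predicates are defined once at module level rather than as inline lambdas, mirroring the change that helped in the VB version.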

Some Numbers

I manually did a few test comparisons between my solution and an optimized form of Ian's (compiled regex and int.TryParse). I compared using two lists of strings shuffled randomly. The first list contained the numbers 1-5000, and the second contained alternating strings, numbers, and special characters of the form "string n string n string n" where n was 1-5000.

Optimized IanG, string type one (108-115ms)

Optimized IanG, string type two (291-316ms)

Mine, string type one (16-20ms)

Mine, string type two (158-171ms)

Once again, these numbers aren't really fair to compare, because my solution handles things like case insensitivity, ignoring non alphanums, culture idiosyncrasies, etc. But it's nice to know that the extra features are more than paid for with the more efficient algorithm.

There's still room for improvement, so I may make updates from time to time if I need to use this code on a real project.

  • The code doesn't handle floating point numbers or number separators (e.g. ','). If this is added, then it should probably be optional, as it's not clear you would always want it.
  • Although I have some unit tests, I don't have anything to verify the Culture constraints. For example, I'm trusting Char.IsLetter to work correctly.
  • I also assume that numeric characters can be compared by ordinal position. (e.g. '0' < '9')

Feel free to use the code any way you see fit, and please let me know if you find any problems or have any other suggestions.

February 05

128bit Encoding Explained

First, here's the promised VB version of the code. It's almost identical to the Java version, except that...
  1. Unsigned integers remove the need for >>>= (The java unsigned shift right operator)
  2. Pass by reference eliminates the need for the Int128 wrapper class, and simplifies the logic in the strToInt function slightly.
  3. Lack of decrement operator means two statements are required to access the character buffer and decrement the index.
  4. Support for Modules makes it clear that I'm providing utility functions. (As compared to using static methods in a final class.)
  5. Type inference eliminates the need to explicitly declare the type for local variables. (Not sure that I like this feature yet, because now it's sometimes hard to tell the type at a glance.)
  6. The Select Case statement used in CharToMod64() is much easier to read than the corresponding Java if statements.

Imports System.Text    ' for ASCIIEncoding

Module Utils

    Const bUnder As Byte = 95
    Const bDollar As Byte = 36
    Const bQuest As Byte = 63
    Const bFirstUAlpha As Byte = 65
    Const bFirstLAlpha As Byte = 97
    Const bFirstNum As Byte = 48
    Const mask6Bits As UInt64 = 63
    Const mask4bits As UInt64 = 15
    Const mask2Bits As UInt64 = 3
    Const bitsPerChar As Integer = 6

    Dim CharMap() As Char = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_$".ToCharArray

    Function UInt128ToStr(ByVal msb As UInt64, ByVal lsb As UInt64) As String
        Const MAX_BYTE As Integer = 31
        Dim buf(0 To MAX_BYTE) As Char
        Dim i = MAX_BYTE

        ' 64 bit number has ten 6bit encoded values plus four bits left over
        For n = 1 To 10
            If lsb = 0 AndAlso msb = 0 Then
                Exit For ' Eliminate leading zeros
            End If
            Dim b = CByte(lsb And mask6Bits)
            buf(i) = CharMap(b)
            i -= 1
            lsb >>= bitsPerChar
        Next

        ' 4 bits from the lsb, and 2 from the msb
        If lsb > 0 OrElse msb > 0 OrElse i = MAX_BYTE Then
            Dim leftOver = lsb And mask4bits
            Dim firstTwo = msb And mask2Bits
            Dim b = CByte((firstTwo << 4) Or leftOver)
            buf(i) = CharMap(b)
            i -= 1
            msb >>= 2
        End If

        Do While msb <> 0
            Dim b = CByte(msb And mask6Bits)
            buf(i) = CharMap(b)
            i -= 1
            msb >>= bitsPerChar
        Loop

        ' It's easiest to simply prefix an underscore to avoid
        ' illegal identifiers and clashes with keywords and literals.
        buf(i) = "_"c
        i -= 1

        Return New String(buf, i + 1, MAX_BYTE - i)
    End Function

    Public Sub StrToUInt128(ByVal s As String, ByRef msb As UInt64, ByRef lsb As UInt64)
        Dim buf() = ASCIIEncoding.ASCII.GetBytes(s)
        Dim maxByte = buf.Length - 1
        Dim i = maxByte
        Dim minByte = 0
        If buf(0) = bUnder Then
            minByte = 1
        End If
        msb = 0
        lsb = 0

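        ' b is declared as UInt64 on purpose: shifting a Byte in VB masks the
        ' shift count to 3 bits and truncates the result to 8 bits.

        ' The last ten characters hold the low 60 bits of the lsb.
        For n = 0 To 9
            If i < minByte Then
                Return
            End If
            Dim b As UInt64 = CharToMod64(buf(i))
            If b <> 0 Then
                If n <> 0 Then
                    b <<= (n * bitsPerChar)
                End If
                lsb = lsb Or b
            End If
            i -= 1
        Next

        ' The next character straddles both halves: its low 4 bits are the
        ' top of the lsb, and its high 2 bits are the bottom of the msb.
        If i >= minByte Then
            Dim b As UInt64 = CharToMod64(buf(i))
            If b <> 0 Then
                Dim leftOver = b And mask4bits
                msb = (b >> 4) And mask2Bits
                leftOver <<= 60
                lsb = lsb Or leftOver
            End If
            i -= 1
        End If

        ' Any remaining characters fill the upper 62 bits of the msb.
        For m = 0 To 10
            If i < minByte Then
                Return
            End If
            Dim b As UInt64 = CharToMod64(buf(i))
            If b <> 0 Then
                b <<= (m * bitsPerChar + 2)
                msb = msb Or b
            End If
            i -= 1
        Next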
    End Sub

    Private Function CharToMod64(ByVal c As Byte) As Byte
        Const b10 As Byte = 10
        Select Case c
            Case bFirstNum To bFirstNum + 9         ' "0".."9" -> 0..9
                Return c - bFirstNum
            Case bFirstUAlpha To bFirstUAlpha + 25  ' "A".."Z" -> 10..35
                Return c - bFirstUAlpha + b10
            Case bFirstLAlpha To bFirstLAlpha + 25  ' "a".."z" -> 36..61
                Return c - bFirstLAlpha + bDollar   ' bDollar (36) doubles as the offset
            Case bUnder
                Return 62
            Case bDollar
                Return 63
            Case Else
                Debug.Assert(False, "Unexpected Character " & c)
                Return 0
        End Select
    End Function

End Module

Turning Numbers Into Strings

Because neither language has native support for 128bit integers, we have to resort to using two 64 bit integers. The algorithm I devised simply encodes every 6 bits of the input integers as a single character. I thought this worked out perfectly, because my first reading of Section 3.8 of the Java Language Spec led me to believe that Java supports A-Z, a-z, 0-9, $, and _ as the only legal characters in identifiers. I learned later that it actually supports many more, so this algorithm could probably be extended to use N bits instead of only 6, which would make the resulting strings even smaller. In the end, I decided against this, because the current choice is less likely to have problems with character sets and encoding/decoding.

With 6 bits per character and 128 bits total, at most 22 characters are required to hold any number (128/6 ≈ 21.3, rounded up). To ensure legal Java identifiers we also prefix an underscore to each generated string.

The first step in the algorithm is to allocate a temporary buffer to hold the encoded characters. I used 32, because I'm a power-of-two kind of guy.

Next we process the least significant 64 bit number, repeatedly masking out the lowest 6 bits, converting them to a character using a lookup table, and then shifting right 6 bits to get the next piece. In Java, we must use the unsigned shift operator for this, because a signed shift will pull in 1's from the left, destroying the number. Notice that we only break out of the loop when we've either taken the first 60 bits of the 64 bit number, or both msb and lsb are zero. If only lsb is zero, then we keep shifting zero and emitting "0" characters until 60 bits have been processed. This ensures that the zeros below the msb's digits are kept, because they are significant. (e.g. Because 100 <> 100000.)
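The signed versus unsigned shift issue is easy to see in a small Python sketch. Python integers are unbounded, so this assumes we emulate a 64 bit value by masking; the helper names are invented for the illustration. Python's `>>` on a negative number behaves like Java's signed `>>`, while masking to 64 bits first behaves like Java's `>>>`.

```python
MASK64 = (1 << 64) - 1


def signed_shift_right(value, bits):
    # Arithmetic shift: the sign bit is copied in from the left.
    return value >> bits


def unsigned_shift_right(value, bits):
    # Reinterpret the value as an unsigned 64 bit pattern, then shift,
    # so zeros come in from the left.
    return (value & MASK64) >> bits


def six_bit_groups(value):
    """Split a 64 bit pattern into 6 bit groups, least significant first."""
    value &= MASK64
    groups = []
    while value:
        groups.append(value & 0x3F)
        value = unsigned_shift_right(value, 6)
    return groups


all_ones = -1  # every bit set in two's complement
groups = six_bit_groups(all_ones)  # ten groups of 63, then the 4 left over bits
```

With the signed shift, `-1` stays `-1` no matter how often it is shifted, so a loop that waits for the value to reach zero would never finish.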

The middle section of code is to handle the remaining 4 bits from the lsb, and the first 2 bits of the msb. This is accomplished using masks and shifts appropriately. The i = MAX_BYTE condition is there to ensure that the number zero is written out as "_0" instead of "_".

Finally we process the remaining 62 bits from the msb. As soon as msb = 0 we're free to break out of the loop, because we don't want any leading zeros to pad the string.

One nice property of this algorithm is that it encodes sequential numbers in an efficient and almost human readable format. The first 10 numbers are just encoded as "_0" through "_9", followed by "_A" through "_Z", "_a" through "_z", then "__" and "_$". It then rolls over to the next digit with the next 64 encoded as "_10" through "_1$". Anyone used to hexadecimal string encoding should find this intuitive. Furthermore, for random 128bit numbers this algorithm gives the most compact encoding possible with a 64 character alphabet.
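The sequence described above is easy to check with a short Python sketch. Python has arbitrary precision integers, so this version assumes the 128 bit value is passed as a single int instead of an msb/lsb pair; the function name `uint128_to_str` is made up for the example, and the alphabet is copied from CharMap in the VB code.

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_$"


def uint128_to_str(value):
    """Encode 0 <= value < 2**128, 6 bits per character, with a '_' prefix."""
    if not 0 <= value < (1 << 128):
        raise ValueError("value does not fit in 128 bits")
    chars = []
    while True:
        chars.append(ALPHABET[value & 63])  # lowest 6 bits pick one character
        value >>= 6
        if value == 0:                      # stop early: no leading zeros
            break
    return "_" + "".join(reversed(chars))


first = [uint128_to_str(n) for n in (0, 9, 10, 35, 36, 61, 62, 63, 64, 127)]
longest = uint128_to_str((1 << 128) - 1)
```

The largest value produces 22 characters plus the underscore, which matches the 128/6 estimate above.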

Turning Strings Into Numbers

Reversing the process is only slightly trickier than the initial encoding. First we convert the input string into an array of Bytes. One complication is that I didn't want to assume in the decoder that every input string would start with an underscore, so we check for this at the start and set minByte to either 1 or 0. This was originally there, because in the original Int128ToString() function I only prepended the underscore when necessary to make a legal Java identifier. However, I removed that code due to all the complications in checking for keyword and literal conflicts, and the flexibility in the decoding seemed nice so I left it.

Just as with encoding, the decoder steps through the first 60 bits worth of characters. This time, we use CharToMod64 to convert the input characters back to base 64 digits. Each pass through the loop uses a bitwise OR operation to combine the returned digit with the current value of the lsb. The trick here is that I found it most straightforward to shift the decoded digit to the correct position (b <<= n * 6) first and then combine it with the lsb using a bitwise OR.

The middle section once again converts the 6 bits from the decoded number into the remaining 4 bits for the lsb and the first 2 bits of the msb.

The final section processes the remaining characters into the final 62 bits of the msb.
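The decoding steps can be sketched in Python as well. As before, this is an illustration rather than a port: it assumes an unbounded Python int as the accumulator and only splits it into (msb, lsb) at the end, and the name `str_to_uint128` is invented for the example. Like the VB decoder, it treats a leading underscore as an optional prefix.

```python
from typing import Tuple

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_$"
DIGIT_VALUE = {ch: i for i, ch in enumerate(ALPHABET)}
MASK64 = (1 << 64) - 1


def str_to_uint128(s: str) -> Tuple[int, int]:
    """Decode an encoded string back into an (msb, lsb) pair of 64 bit halves."""
    if s.startswith("_"):
        s = s[1:]                       # optional prefix, as in the VB decoder
    value = 0
    # Walk from the last character: character n contributes bits 6n..6n+5.
    for n, ch in enumerate(reversed(s)):
        value |= DIGIT_VALUE[ch] << (6 * n)
    if value >> 128:
        raise ValueError("string encodes more than 128 bits")
    return value >> 64, value & MASK64


small = str_to_uint128("_10")
straddle = str_to_uint128("_$0000000000")  # 11th character spans both halves
```

In the `straddle` case the eleventh character from the end is "$" (63): its low 4 bits land in the top of the lsb and its high 2 bits in the bottom of the msb, which is exactly what the middle section of the VB code handles by hand.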

Performance

I found a few interesting things with the performance of this algorithm.

First, I wrote a simple program to time how long it takes to encode the first million numbers, encode the last million,  encode and decode the first million, and finally encode and decode the last million. Here's the code in VB.

Public Sub Main()
    Dim startTime = DateTime.Now
    For i As UInt64 = 0 To 999999
        Dim enc = UInt128ToStr(i, i)
    Next
    Dim stopTime = DateTime.Now
    Console.WriteLine("Encode took " & (stopTime - startTime).TotalMilliseconds)

    startTime = DateTime.Now
    For i As Int64 = -1L To -1000000 Step -1
        Dim enc = UInt128ToStr(CULng(i), CULng(i))
    Next
    stopTime = DateTime.Now
    Console.WriteLine("Encode big took " & (stopTime - startTime).TotalMilliseconds)

    startTime = DateTime.Now
    For i As UInt64 = 0 To 999999
        Dim enc = UInt128ToStr(i, i)
        Dim msb, lsb As UInt64
        StrToUInt128(enc, msb, lsb)
    Next
    stopTime = DateTime.Now
    Console.WriteLine("Roundtrip took " & (stopTime - startTime).TotalMilliseconds)

    startTime = DateTime.Now
    For i As Int64 = -1L To -1000000 Step -1
        Dim enc = UInt128ToStr(CULng(i), CULng(i))
        Dim msb, lsb As UInt64
        StrToUInt128(enc, msb, lsb)
    Next
    stopTime = DateTime.Now
    Console.WriteLine("Roundtrip big took " & (stopTime - startTime).TotalMilliseconds)
End Sub

Although both Java and .NET are probably more than fast enough, Java was 2-3 times faster at encoding, but a round trip encode/decode took about the same time for both. (Further testing of decode seemed to show that the .NET version really is faster at decoding.) It's not worth it to me at this point to figure out why, but if someone is interested I would recommend using ILDASM to inspect the generated IL assembly code for the .NET version. I don't think this is the kind of thing that's going to be helped by source analysis or profiling.

Java

Encode took 188
Encode big took 249
Roundtrip took 763
Roundtrip big took 985

VB

Encode took 555.6285
Encode big took 565.3935
Roundtrip took 778.2705
Roundtrip big took 917.91

For comparison I also found source code online for a fast Base64 encoder (http://migbase64.sourceforge.net/).

Base64

Encode took 375
Encode big took 374
Roundtrip took 1352
Roundtrip big took 1332

This is not a slight on Mikael Grev's algorithm at all, because mine doesn't even pretend to generate compliant RFC 2045 Base64 encoded strings.

In the end, I think I came up with a pretty tight little algorithm to solve a particular problem. I hope someone finds it useful.

January 28

Encoding 128bit Numbers as Strings

My current project needed a way to convert between 128 bit numbers (e.g. java.util.UUID) and strings. A special requirement is that the strings be legal Java identifiers, because we use them for variable names in generated code. For the same reason, I wanted to ensure that the strings were of minimal size.

I was able to come up with a trivial algorithm in a few minutes that works OK, but I felt that I should be able to come up with an optimal solution given a little more time. It turns out that the optimal solution consumed most of this weekend, but I finally got it working.

There's no way I can bill my customer for this exercise, so I thought I'd post it here in case anyone else finds it interesting. I'm also curious to see if anyone can come up with a better or faster solution.

Basically my approach is to treat the 128 bit number as up to 22 groups of 6 bits each (the last group only carries the 2 bits left over). This gives me 64 possible characters which works out perfectly, because Java has exactly 64 legal ASCII characters for identifiers (0-9, A-Z, a-z, _, and $). I also prepend an additional underscore to avoid illegal starting characters and clashes with keywords and literals.
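For cross-checking, the same grouping can be written slowly but obviously with BigInteger. This is only a reference sketch under my own naming (ReferenceEncoder and encode are not part of the code in this post), and it makes no attempt to be fast.

```java
import java.math.BigInteger;
import java.nio.ByteBuffer;

final class ReferenceEncoder {
    private static final String ALPHABET =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_$";
    private static final BigInteger MASK = BigInteger.valueOf(63);

    // Treats msb:lsb as one unsigned 128-bit number and peels off 6 bits at a time.
    static String encode(long msb, long lsb) {
        byte[] bytes = ByteBuffer.allocate(16).putLong(msb).putLong(lsb).array();
        BigInteger value = new BigInteger(1, bytes);   // signum 1 = read as unsigned
        StringBuilder sb = new StringBuilder();
        do {
            sb.append(ALPHABET.charAt(value.and(MASK).intValue()));
            value = value.shiftRight(6);
        } while (value.signum() != 0);
        // Digits were produced lowest first, so add the prefix and reverse.
        return sb.append('_').reverse().toString();
    }
}
```

Comparing its output against a hand-optimized encoder for a few million random inputs is a cheap way to catch off-by-one mistakes around the 64-bit boundary.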

For now, here's the Java source to ponder. Tomorrow I'll discuss the code in more detail, and post a VB version which I actually wrote first, then ported to Java.

public final class Utils {

    private static final byte bUnder = 95;        // '_'
    private static final byte bDollar = 36;       // '$'
    private static final byte bQuest = 63;        // '?'
    private static final byte bFirstUAlpha = 65;  // 'A'
    private static final byte bFirstLAlpha = 97;  // 'a'
    private static final byte bFirstNum = 48;     // '0'
    private static final long mask6Bits = 63;
    private static final long mask4Bits = 15;
    private static final long mask2Bits = 3;
    private static final int bitsPerChar = 6;

    private static final char[] charMap = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_$"
                            .toCharArray();

    public final static class Int128 {
        public final long msb;
        public final long lsb;
        public Int128(long m, long l) {
            msb = m;
            lsb = l;
        }
        public Int128() {
            this(0, 0);
        }
        @Override
        public final String toString() {
            return "" + msb + "," + lsb;
        }
    }

    public static String int128ToStr(long msb, long lsb) {
        final int MAX_BYTE = 31;
        char[] buf = new char[32];
        int i = MAX_BYTE;

        // Process the least significant bits
        // 64 bit number has 10 six bit encoded values plus four bits left over
        for (int n = 0; n < 10; ++n) {
            if (lsb == 0 && msb == 0) {
                break; // Eliminate leading zeros
            }
            byte b = (byte) (lsb & mask6Bits);
            buf[i--] = charMap[b];
            lsb >>>= bitsPerChar;
        }

        // 4 bits from the lsb, and 2 from the msb
        // (i == MAX_BYTE means nothing was written yet, so the value is zero)
        if (lsb != 0 || msb != 0 || i == MAX_BYTE) {
            long leftOver = lsb & 15;
            long firstTwo = msb & 3;
            byte b = (byte) ((firstTwo << 4) | leftOver);
            buf[i--] = charMap[b];
            msb >>>= 2;
        }

        // Process the most significant bits
        while (msb != 0) {
            byte b = (byte) (msb & mask6Bits);
            buf[i--] = charMap[b];
            msb >>>= bitsPerChar;
        }

        // It's easiest to simply prefix an underscore to avoid
        // illegal identifiers and clashes with keywords and literals.
        buf[i--] = '_';

        return new String(buf, i + 1, MAX_BYTE - i);
    }

    public static Int128 strToInt128(String s) {
        byte[] buf = s.getBytes();
        int maxByte = buf.length - 1;
        int i = maxByte;
        int minByte = 0;
        // Skip the leading underscore if there is one
        if (buf[0] == bUnder) {
            minByte = 1;
        }
        long msb = 0;
        long lsb = 0;

        // The last 10 characters hold the low 60 bits of the lsb
        for (int n = 0; n < 10; ++n) {
            if (i < minByte) {
                return new Int128(msb, lsb);
            }
            long b = charToMod64(buf[i]);
            if (b != 0) {
                if (n != 0) {
                    b <<= (n * 6);
                }
                lsb |= b;
            }
            --i;
        }

        // The next character holds the top 4 bits of the lsb
        // and the bottom 2 bits of the msb
        if (i >= minByte) {
            long b = charToMod64(buf[i]);
            if (b != 0) {
                long leftOver = b & mask4Bits;
                msb = (b >>> 4) & mask2Bits;
                leftOver <<= 60;
                lsb |= leftOver;
            }
            --i;
        }

        // The remaining characters hold the top 62 bits of the msb
        for (int n = 0; n < 11; ++n) {
            if (i < minByte) {
                return new Int128(msb, lsb);
            }
            long b = charToMod64(buf[i]);
            if (b != 0) {
                b <<= (n * bitsPerChar + 2);
                msb |= b;
            }
            --i;
        }
        return new Int128(msb, lsb);
    }

    // Converts one encoded character back to its value in the range 0..63
    private static byte charToMod64(byte c) {
        if (c >= bFirstNum && c < bFirstNum + 10) {
            return (byte) (c - bFirstNum);
        } else if (c >= bFirstUAlpha && c < bFirstUAlpha + 26) {
            return (byte) (c - bFirstUAlpha + 10);
        } else {
            if (c >= bFirstLAlpha && c < bFirstLAlpha + 26) {
                return (byte) (c - bFirstLAlpha + 26 + 10);
            } else if (c == bUnder) {
                return 62;
            } else if (c == bDollar) {
                return 63;
            } else {
                assert false;
                return 0;
            }
        }
    }
}

October 16

Windows Environment Editor released on SourceForge

I finally got around to submitting the Windows Environment Editor to SourceForge. You may recall this as the project originally created for the ill-fated OCI Summer of Code programming contest. We haven't done much work on it since then, but the major bugs have been addressed, and I've been using it regularly, as have a few brave volunteers. I think it's the best of the many replacement environment editors currently available, but you can judge that for yourself.

Technologies

One of my primary purposes for this project was to learn how to create GUI applications using the relatively new Windows Presentation Foundation (WPF 1.0) for .NET. This framework, while similar to others in many respects, is really something fundamentally different than libraries like Swing, WinForms, GTK, MFC, etc. For more information I recommend reading http://arstechnica.com/reviews/os/pretty-vista.ars

One of the benefits of WPF is the separation between GUI design using a markup language (XAML) similar to HTML, and a programming API for making that GUI work. In theory, this will allow designers to use programs like Microsoft Expression Blend to make applications like mine look better. For this reason, we tried to express as much of the GUI as possible using XAML, rather than reverting to code. We also wasted huge amounts of time using Blend to play with the GUI design, but in the end we gave up and went with a simple (ugly?) design that eschews all the fancy gradient effects.

Of course, for reasons best explained in my previous posts, I used Visual Basic 2005 as the programming language. All told, the application is somewhat less than 2000 lines of code and XAML.

In the future, maybe I'll port everything to the upcoming new releases for WPF and VB.

Features

  • Resizable dialog allows viewing more variables at once.
  • System and User settings are contained in collapsible sections.
  • Search box to quickly jump to a variable for editing.
  • Modify variables directly in the search box using Variable=Value syntax.
  • Edit variables directly without having to pop up an additional dialog.
  • Both keyboard and mouse interfaces are supported.
  • Keyboard shortcuts for common tasks.
  • Size, Position, and other state is remembered between runs.
  • Vista support for running as a user or administrator.
  • Many small details enhance the user experience.

Conclusion

I estimate that this application represents several hundred hours of work by myself and Ngan that may have been better spent playing Scrabble or Age Of Empires. I hope that at least a few people find it useful. If you have any questions, want to contribute features or fixes, or just want to try it out, I encourage you to check out the project.

https://sourceforge.net/projects/winenvedit/

July 29

NOT the World's Most Advanced Mouse

I thought I'd share my experience with a recent consumer electronics purchase. Maybe I can save you some pain and trouble, and salvage something from my own wasted effort.

The Problem

I started the search for a new mouse because I am occasionally annoyed by the cord on my other mice, and I was hoping that wireless mouse technology was finally at a usable point. I first tried cordless mice two years ago, using a Christmas gift certificate to purchase the top-of-the-line Microsoft solution (Wireless Laser 6000). However, I found this to be completely unusable because of its penchant for missing clicks and its poor tracking.

The Solution?

So, I plunked down $100 last Wednesday to purchase a Logitech MX Revolution, which claims to be "The World's Most Advanced Mouse". I was disappointed to find that this mouse shares many of the flaws of the MS Wireless mouse, and even manages to bring back some problems I thought we'd left behind with the old ball mice. It has also upped the ante with a product design so flawed that at first I could not believe it. But first…

What I Want In a Mouse

5 Buttons

I routinely make use of 5 buttons while using a mouse. I use the obvious left button to click and double-click my way through GUI interfaces. I use the right button to pull up context menus. I use the middle button to open browser links in separate tabs and to close those tabs with a single click. And I use the back and forward button to retrace my steps through browser-style interfaces. Any other buttons are more likely to be accidentally pressed than activated on purpose, and should be minimal and unobtrusive, or better yet, absent.

Scroll Wheel

I constantly use the scroll wheel (if I'm using the mouse at all) for anything with a document-like interface. However, I'm skeptical of the value of side-to-side scrolling found on many newer mice. This might be useful for working with large pictures or something, but I don't see any value for me, and would always choose to disable the feature.

Still, the scroll wheel has a lot of room for improvement. For me, it often scrolls too much with each step. Even the default Windows mouse driver assumes that the scroll wheel should scroll one or more lines at minimum. What I really want is something more precise, which I can currently only get by clicking and dragging the scroll bar "puck" or "thumb". This is an area where the MX Revolution really had a chance to innovate to fix one of my major gripes with current scroll wheel implementations.

Precision

Many people don't notice the precision of their mouse, but the quickest way I know to grasp the concept is to either a) try to write your name using a paint program, or b) try to play a game. Under these and similar scenarios, differences in mouse tracking become very apparent. When I'm playing Age Of Empires III, I only have so much time to click all the little guys to tell them where to attack. If my mouse causes me to miss, or makes it difficult to select the right units in the heat of battle, then it makes me upset. Of course, if you're observant you can readily notice the difference between a good mouse and a poor one when using any GUI. A good mouse makes it much easier to click on the myriad buttons, hyperlinks, and other elements of the modern graphical interface.

Clicking

This should go without saying, but the mouse buttons should respond as expected when clicked. Both of the wireless mice I've used have occasionally failed to register a click the first time I've pressed a button. This could be a driver issue, or a problem related to RF interference, or any number of things, but it's definitely unacceptable.

The Review

I went with the Logitech because I've been extremely happy with an MX518 wired gaming mouse, which is, to date, the best mouse I've ever owned. My hope was that they would essentially provide an MX518 in wireless form, perhaps with fixes for my few gripes with the wired version. (It doesn't remember my DPI setting between reboots, and I occasionally hit the DPI and property buttons by accident.)

Gimmick One, A New Wheel

The coolest thing about the new mouse is the idea for a new scroll wheel. They installed a nice heavy wheel in the center which works in one of two software-controlled modes. In the first mode it behaves like any other scroll wheel, clicking as you spin it, and scrolling N lines for each click. The second mode disengages something internally, allowing the wheel to spin freely. This could have been so cool had it been implemented correctly. What I expected was that the scroll wheel would finally give the illusion of a physical connection to the document. If I moved it the tiniest amount, then the document should scroll by a single pixel or less. If I were to spin it, then the document should scroll at the speed of the wheel until it lost momentum or I stopped the spinning manually. This is NOT what happened. Instead, the free-spinning mode seemed to simply control the same N-line scroll wheel mechanism as before. Perhaps a driver update could fix this issue by tying the free-spinning mode to sub-pixel manipulation of the scroll thumb instead of hooking into the scroll wheel mechanism. This wouldn't quite be the right effect for long documents, because you'd want to limit the maximum speed, but it would be better, and I'm not sure if perfection can be achieved without specific software support.

The scroll wheel also supports the ubiquitous horizontal scrolling feature that I dislike, but I can't remember if it was possible to disable it.

No Middle Button

When using the Logitech mouse driver, pushing the scroll wheel either switches scrolling modes between free-spinning and "clicking", or can be disabled entirely. There is no provision at all for supporting a middle button, which I personally find totally unacceptable, especially considering the uselessness of their scroll wheel feature.

Gimmick Two, Search

Logitech added a new button behind the scroll wheel, which can only be used for their new search feature. When clicked on a highlighted word, presumably it will pull up the word in the search engine of your choice. I tried the feature only a few times, and never accepted the default Yahoo service. However, it rarely worked, and I was never quite sure what I was doing wrong, or if the feature is just buggy or non-intuitive. In any case, the only thing I really wanted to do was remap this button to control the scroll-wheel mode, and use the scroll wheel button for its usual middle-click duties. Of course, the Logitech software didn't support this. To be fair, clicking the search button on the mouse seemed to do the same thing as clicking the search button on my keyboard (MS Comfort Curve 2000. The best keyboard ever. I bought 3.). It brings up the new Vista search dialog, which I don't really like. Maybe there is simply some software conflict, or Logitech doesn't correctly support Vista.

Gimmick Three, A Button On My Thumb

Logitech added another strange feature to this mouse. There is a little toggle under your right thumb, which can be pressed forward or backward. When you do so, it will pop up a very ugly window where you can choose between running applications. This seemed equivalent, but inferior, to the Alt-Tab switcher, or Flip3D found in Windows Vista. I wish I had thought to capture a picture so you could see just how poorly implemented this was. Once again, it was impossible to map these to more logical functions such as the usual forward/back buttons, which might have made sense. It also might have been cool to tie into Flip3D or the Alt-Tab switcher, but I probably would have mapped this feature to the buttons on the left of the mouse, as I would use it much less than browser forward/back navigation.

The Software

Strangely, this mouse seems to require more than the usual driver. To enable most of the advanced features you must run the SetPoint software in the system tray. If you exit this program, then the mouse reverts to normal MS mouse behavior. This does re-enable the middle button, but the thumb button and search button are disabled. If I were forced to keep this mouse, then this is probably the way I would try to use it.

The biggest problem with the software is the lack of useful customizability. As mentioned above, most features were hardcoded to specific buttons, and could only be disabled rather than reassigned. There were several sliders for controlling the scroll wheel speed and acceleration, but I could find no setting that made this feature usable, or in any way better than any other scroll wheel.

Precision Problems

The most notable problems with this mouse had nothing to do with the flawed or missing features detailed above, but instead reflect my previous experience with a wireless mouse.

Can't Click

Several times while using my computer, I would click on a link while web browsing, or click a Settler in AOE3, only to find that the click didn't register. At the time, I blamed myself or even Windows Vista. I thought that perhaps I'd moved slightly, and the game thought I'd dragged rather than clicked. However, over time, and with back to back comparison with an MX518 wired mouse, it became apparent that the mouse was just occasionally missing clicks. Maybe I have too much interference. I do have a Wireless-G network and 2.4GHz portable phones. However, I don't really think that's any excuse for Logitech here, because this sort of environment is commonplace among the user base for this product.

Can't Move

If you used a mouse before 1998 or so, then you probably remember ball mice, and all the problems associated with them. Somehow Logitech has managed to recreate the experience of using one of these. This probably has the same root cause as my click problem above, but frequently when using this mouse, the pointer would freeze to a spot on the screen, and I could get it to move only by quickly moving the mouse to get it going again. This only seemed to happen when I was moving the mouse slowly in the first place. For example, I might be trying to click on an AOE3 Settler or army unit to issue new orders.

Lest you think that I'm just being picky, my girlfriend also noticed the same problems with both wireless mice while playing Age Of Empires, which was quickly remedied by reverting to an old Microsoft IntelliMouse Optical. These problems occurred with any software, it was just much more noticeable in a game, because you often need to move at a precise rate to click on a moving object.

Summary

In my first hardware review, I give the Logitech MX Revolution a 0/10. Although many of the problems might be fixable by future driver, firmware, and software updates, I don't think many of the design decisions, such as the need to run a separate software program instead of just a driver, will ever be addressed.

I know of other people who love this mouse, so feel free to experiment with it yourself, but I advise you to purchase from a local retailer with a forgiving return policy, and to hang on to your receipt. If your experience is anything like mine, then you'll probably regret the wasted effort.

On Friday, I took the mouse back for a refund. In the end I'll probably just buy another Logitech MX518, but I'm still stuck with a wired mouse whose cable occasionally interferes with whatever I'm trying to do. Maybe I should just clean my desk.

July 27

RAII For .NET and Java

One of the complaints that many C++ developers have with Java and .NET is the lack of destructor semantics. I've even included this as a proposal for a Scoped keyword in my post about VB shortcomings. Basically, with C++ you can implement a special method on a class which will automatically be called when instances go out of scope. Note that this really has absolutely nothing to do with freeing memory, although that was its most common usage in older (pre-1990?) C++ code. These days this mechanism is much more common, and used for all kinds of things, mostly still involving release of resources. (Files, Mutexes, Database Connections, etc.) It's even become a best practice with its own acronym, RAII (Resource Acquisition Is Initialization).

C# and VB have a similar general-purpose mechanism with Using statements, but the semantics can get unwieldy. Java and older versions of VB are stuck using try/finally (also available in C#).

C++

void foo() {
    Connection a("1");
    Connection z("26");
    // Do stuff
}

C#

void Foo() {
    using (Connection a = new Connection("1"))
    using (Connection z = new Connection("26")) {
        // Do stuff
    }
}

VB

Sub Foo

Using a As New Connection("1")

… Using z As New Connection("26")

VB (older versions)

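Sub Foo
    ' Declared outside the Try so they are visible in the Finally block
    Dim a As Connection = Nothing
    Dim z As Connection = Nothing
    Try
        a = New Connection("1")
        z = New Connection("26")
        ' Do stuff
    Finally
        If Not a Is Nothing Then a.Dispose()
        If Not z Is Nothing Then z.Dispose()
    End Try
End Sub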

Java

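void foo() {
    // Declared outside the try so they are visible in the finally block
    Connection a = null;
    Connection z = null;
    try {
        a = new Connection("1");
        z = new Connection("26");
        // Do stuff
    } finally {
        if (a != null) a.close();
        if (z != null) z.close();
    }
}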

The C++ code is very clean, because the special destructor method is called to close each connection as it goes out of scope at the bottom of the method. The C# code is much more verbose, but stacking the using statements at least alleviates the nesting problems of the VB version. The older VB and Java versions are stuck manually closing everything in the finally block, which can be especially cumbersome when you have to check each instance for null.

I think this is definitely an area where these languages need some more syntactic sugar. I think a Scoped keyword would be ideal. Here's how it could look in Java.

void foo() {
    scoped Connection a = new Connection("1");
    scoped Connection z = new Connection("26");
    // Do stuff
}

Each Connection object would be responsible for implementing a special method which would automatically be called when the references went out of scope. This could either tie into the Finalizer mechanism similar to how .NET does it, or it could use a new approach perhaps consisting of a compile-time annotation for the Connection class. (For more information on .NET IDisposable) C# and VB could really use something similar, because the Using syntax often remains too bitter for my taste.
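For what it's worth, Java did later gain something close to this: since Java 7, the try-with-resources statement closes anything that implements AutoCloseable when the block exits, in reverse order of declaration. Here is a small sketch of how the example above looks with it. The Connection class is a hypothetical stand-in that just logs what happens, and ScopedDemo and LOG are names made up for the demo.

```java
import java.util.ArrayList;
import java.util.List;

final class ScopedDemo {
    // Records the order of events so the cleanup order is visible.
    static final List<String> LOG = new ArrayList<>();

    // Hypothetical stand-in for the Connection class used in the examples.
    static final class Connection implements AutoCloseable {
        private final String name;
        Connection(String name) {
            this.name = name;
            LOG.add("open " + name);
        }
        @Override
        public void close() {
            LOG.add("close " + name);
        }
    }

    static void foo() {
        try (Connection a = new Connection("1");
             Connection z = new Connection("26")) {
            LOG.add("do stuff");
        } // z is closed first, then a, even if the body throws
    }

    public static void main(String[] args) {
        foo();
        System.out.println(LOG);
    }
}
```

It still costs one nesting level, so it isn't quite the scoped keyword, but there are no null checks and no hand-written finally block.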

For C# and especially VB, we can probably do better using something like Scott McMaster's suggestion. Here we trade some efficiency for convenience. I would write it like so…

void Foo() {
    using (ScopeManager scope = new ScopeManager()) {
        Connection a = new Connection("1");
        scope.add(a);
        Connection z = new Connection("26");
        scope.add(z);
    }
}

Here we save ~26 lines of using statements, each possibly with its own nesting level. It's still not as nice as the C++ mechanism, but it's available now, and preferable to any alternative I can think of at the moment. This could potentially be useful in Java too, but less so due to the lack of Using semantics.
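To give an idea of what such a helper might look like on the Java side, here is a minimal sketch. It is my own guess at a ScopeManager, not Scott McMaster's code: it assumes resources implement AutoCloseable, closes them in reverse order of registration, and wraps the first failure in an unchecked exception so it can be called from a plain finally block.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class ScopeManager {
    // Most recently added resource sits on top, so cleanup runs in reverse order.
    private final Deque<AutoCloseable> resources = new ArrayDeque<>();

    // Registers a resource and hands it back so creation and registration fit on one line.
    <T extends AutoCloseable> T add(T resource) {
        resources.push(resource);
        return resource;
    }

    // Closes everything that was added; keeps going if one close() fails.
    void close() {
        RuntimeException failure = null;
        while (!resources.isEmpty()) {
            try {
                resources.pop().close();
            } catch (Exception e) {
                if (failure == null) {
                    failure = new IllegalStateException("close failed", e);
                } else {
                    failure.addSuppressed(e);
                }
            }
        }
        if (failure != null) {
            throw failure;
        }
    }
}
```

Usage would be a single try/finally per method: create the ScopeManager before the try, call scope.add(new Connection("1")) and so on inside it, and call scope.close() in the finally block.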

 

 

May 22

What Does “Loosely Typed” Mean?

Eric questions the meaning of "Loosely Typed"

I think that there are two separate issues here.

Required Declarations

This corresponds to the "Option Explicit" feature added to VB3, and which still exists in VB.NET. You can choose for a particular file whether to require variable declarations or not. This feature helps catch bugs at compile time where you accidentally misspell the name of some symbol. Consider the following code.

Class Person
    Private name
    Private age As Double
    Private money As Decimal = 100.0
End Class

The name field is declared, but no type is specified, whereas the age field is both declared and given a type, and money is declared, given a type, and an initial value.

Required Type

This somewhat corresponds to the "Option Strict" feature in VB, except that the feature enables quite a few other behaviors that are somewhat related, such as the ability to implicitly convert between Strings and Numbers. If I have "Option Strict Off" and "Option Explicit Off" at the top of a VB file, then the following will compile.

to = "Hello"
to = to.Length
to = to + too

This shows one problem with making variable declarations optional, because I was able to misspell "to" as "too" in the last line, and the compiler couldn't check it. This is why I like to be able to specify "Option Explicit On" in VB even when I set "Option Strict Off". I prefer to get a little red squiggly underline of "too" rather than waiting to catch this in a unit test.

Inferred Type

VB9 and C#3 add another twist to this issue, because they now support type inference. With this feature, you can still get strong type safety, but without the hassle of explicitly stating the type. Take the following C# example:

var name = "Justin";
var len = name.Length;
name = 42.0;

In this example the first line is exactly equivalent to "string name = "Justin";", and the third line will be a compile time error, because the inferred type of name is String, which doesn't support implicit conversion from Double.
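Java eventually picked up the same idea: since Java 10, local variables can be declared with var and the compiler infers the static type. The sketch below mirrors the C# example; InferenceDemo and nameLength are illustrative names only, and the offending assignment is left commented out because it would not compile.

```java
final class InferenceDemo {
    static int nameLength() {
        var name = "Justin";       // inferred as String
        var len = name.length();   // inferred as int
        // name = 42.0;            // would not compile: a double is not a String
        return len;
    }

    public static void main(String[] args) {
        System.out.println(nameLength());
    }
}
```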

The only thing I don't like about the current implementation of the type inference feature is that it's done by the compiler. What I'd really like is for the IDE to display the name of the inferred type, and perhaps let me modify its choice without making the type "sticky". Maybe the IDE could automatically display the above as "var string name = "Justin";", where "string" is non-editable and mutates itself automatically to match changes to the inferred type when refactoring.

Strong vs. Loose

These terms are currently pretty ambiguous. I think the term Strong should only mean that at runtime, the code will result in an error if you attempt an operation that is not known to be safe. Examples might include:

  • Calling a derived-type method on a base reference
  • Implicit numeric conversions
  • Calling a method that is unknown on the current reference

The VB "Option Strict Off" setting basically throws "Loose Typing" in with "No Type Declaration Required", and makes the following code legal.

Class A
    Public Sub Foo()
    End Sub
End Class

Dim v As Object
v = New A()
v.Foo()
v = "42"
v += 1
Assert(v.Equals(43))

This is true Loose Type behavior, and can be useful in certain limited contexts. However, I wish you could have finer-grained control over the VB options so that I could retain some type safety while still disabling the need for explicit type specifications. I guess some of the need for this is obviated by the type-inference feature.

Dynamic vs. Static

This seems to clearly refer to whether type information is known at compile time. Static Typing means that all type errors short of invalid explicit casts will be caught at compile time.
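Java is a handy example of that split. In this sketch (CastDemo and tryCast are names invented for the illustration), the compiler rejects a call to a method the static type doesn't have, but an explicit cast is taken on trust and only checked when the code runs.

```java
final class CastDemo {
    // Tries to treat v as an Integer; the cast compiles for any Object
    // but fails at runtime if v is actually something else.
    static String tryCast(Object v) {
        try {
            Integer n = (Integer) v;
            return "ok: " + (n + 1);
        } catch (ClassCastException e) {
            return "ClassCastException";
        }
    }

    // By contrast, this is rejected at compile time, because Object has no length():
    //     Object v = "42";
    //     v.length();

    public static void main(String[] args) {
        System.out.println(tryCast(42));
        System.out.println(tryCast("42"));
    }
}
```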

Conclusion

So I guess I'm a little unclear about what kind of type system is found in Ruby. If it claims to be "Strongly Typed" then that's pretty ambiguous. Is there a summary of exactly which operations result in an error vs which are implicitly allowed? What value does "Strongly Typed" provide at runtime? If you're going to wait until runtime to detect errors, then why not just use "Loosely Typed"?

May 21

VB and Language Choice

A recent Coding Horror post got me thinking again about C# vs. Visual Basic. As you know, I myself think that VB is probably the best general purpose managed language available right now, so I thought I'd add my $.02 to the discussion.

VB vs. C# is like Coke vs. Pepsi

I don't think this analogy holds up, because it doesn't take into account the complexities of the situation. I believe that there are real substantive differences that make VB a better tool for writing programs than C#. While it's true that tools like CodeRush and Resharper can greatly improve the C# experience, they still fall short of what's theoretically possible from a VB-based toolset, because, no matter what you do to C# (or Java or any other C-style language) you will always be missing the three key features that make VB better. Namely the Readable Keywords, Line-Orientation, and Case-Insensitivity I detailed in a previous post.

I'm not claiming VB/Visual Studio is better in every way than Eclipse/Java, IntelliJ/Java, Resharper+VS2005/C#, etc. I'm just saying that if all these tools were refined to their highest potential, I would like VB best because it has a better foundation. VB is currently missing some major features such as the ability to automatically handle Imports, that make it difficult to compare it favorably to these other tools.

Case Insensitivity is Right and Case Sensitivity is Wrong

I hope that most VB users would agree that it's not the Case Insensitivity of VB that we actually like. In fact, at the language level I believe it's a mistake for VB to be Case Insensitive. What people really like about VB is that most of the time it's not the author's responsibility to worry about case. The feature that we really like is that the IDE will fix the case to match the declaration or keyword. I think a better way to implement this feature would be to require Case Hyper-Sensitivity at the language level which would disallow multiple symbols in scope that differ only by case.
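The check itself would be cheap for a compiler or IDE to perform. As a rough sketch of the idea (CaseClashes and find are made-up names, and a real implementation would work on a symbol table rather than a list of strings), this groups the symbols in a scope by their case-folded form and reports any group containing more than one distinct spelling.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class CaseClashes {
    // Returns each group of symbols that differ only by case.
    static List<List<String>> find(Collection<String> symbols) {
        Map<String, List<String>> byFolded = new TreeMap<>();
        for (String s : symbols) {
            byFolded.computeIfAbsent(s.toLowerCase(Locale.ROOT), k -> new ArrayList<>()).add(s);
        }
        List<List<String>> clashes = new ArrayList<>();
        for (List<String> group : byFolded.values()) {
            // The same spelling repeated is fine; two different spellings is a clash.
            if (new HashSet<>(group).size() > 1) {
                clashes.add(group);
            }
        }
        return clashes;
    }
}
```

With that rule in place, the editor can always correct the case of whatever you type, because there is never more than one candidate to correct it to.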

No competent VB programmer would ever want something like "sySTeM.cONsolE.WRitELINE(foo)" to compile. We just think it's ridiculous that C# and Java IDE editors make us constantly contort our hands with two-key combinations. This only slows down our typing, causes hand cramps, and makes us 2% (est.) less happy. For those of you who've grown accustomed to Resharper, IntelliJ or Eclipse, this is exactly equivalent to many of the features you would miss if switching to Notepad/Java. In fact, my first impression using IntelliJ was that somebody copied 20% of the good features from VB, and then thought of a bunch of new must-have features that I didn't even consider. The problem is that none of the IDE vendors are really doing a good job of incorporating the best features of their competition. What would be ideal is a JetBrains/VB based on Mono, or better yet a completely new language that takes the important ideas of VB while stripping away the too-long keywords, and other features that I dislike in VB. (Overall I like the readable keywords in VB, but some are just too much.)

Strongly Typed

The Ruby and Python movements have convinced many people that they really want a loosely typed language. Of course, one of the great things about VB is that you can choose within a given file whether you want strong or loose typing, but I'm not convinced that most people even truly want loose typing in most cases. What people really seem to like is not having to waste time keying in all the type information. They like the flexibility that the tool can just figure it out.

You can get most of this benefit without having to give up the performance and other benefits of a strongly typed language. Upcoming versions of C# and VB will have a feature called Type Inference, wherein the language will simply figure out the type from the context in which it is used. For instance, in C# 3.0 I can write "var x = foo.ToString();" and C# will infer that the type of x is String. One thing that I don't like about the new VB/C# type-inference feature is that it seems to be implemented at the language level.

Language vs. IDE

I believe that the single most important key to the future of programming is the realization that Programming Languages as classically defined are an idea whose time has passed. Tools like VB, IntelliJ, and Eclipse have blurred the line between IDE and language, but the real key is realizing that the language side of the line is no longer needed at all. In fact, it's a huge detriment. This is the primary reason why I can't really get excited about any new language such as Ruby. I see these all as vestiges of a legacy mode of thinking.

Multi-Language

One of the benefits of .NET was supposed to be that you could use any language that you like. However, as Jeff has noticed, this dream has not been realized. The problem is that we're not really free to choose any language we like, because the fact is that most code is going to have to be maintained and written by multiple people each with different preferences. Jeff's experience is that most people have settled on C# as the de facto .NET standard language despite the superiority of VB. Over time he seems to have been worn down by the C#-zealots, and their fanatical devotion to a language attuned to those with masochistic tendencies, but I have to believe that he still realizes deep-down that he knows a better way.

The fundamental problem with a multi-language platform like .NET is that what we really want is the ability to maintain the same code in multiple languages. If I write a class in VB, then someone else needs to be able to maintain that class in C#, Python, or Ruby without having to translate the language. The key is that the code itself can't know what language it's written in, and that can only happen when we get rid of this silly old-fashioned notion of parsing languages written as text files.

April 13

5 Things

I think this meme jumped the shark a long time ago, so I won't tag anyone else. However, I'll try to come up with my 5 things.
  1. I'm from a relatively small town of 35,000, although I grew up more on the fringes of that metropolis. We had one neighbor next door, and a farm across the street. There weren't any other kids around so I had to learn to play with myself most of the time. ;)
  2. I still play video games, although I haven't upgraded to the Xbox 360 yet. I'm currently in the middle of Psychonauts and Half-Life 2, so I'm a little behind the times.
  3. I like to cook. I watch the Food Network frequently, and I think I'm getting decent at making a few things although I'm a long way from Iron Chef. I did win the 1st annual OCI Chili cookoff last year though. Ok, I tied. :)
  4. I used to be a Ham. I got my first amateur radio license when I was 11 or 12, but I lost interest when I found out that computers could be more than my Commodore 64.
  5. Although my first computer was a C64, and I wrote some simple BASIC programs for it, I never *really* learned to program until college, when I learned Modula-2. Maybe it's because my first two languages were of the no-curly-braces variety that I prefer Python, VB, and Pascal to C++, Java, C#, or Perl.

When I started this blog, I didn't think I would have any trouble keeping a steady stream of posts going, but it's been a while since my last entry. I'll try to do better going forward.

Thanks for getting me back on track, Weiqi. :)

January 18

Things I Would Change in VB

VB is a great language, but I still think it has room for improvement. For example, there are far too many keywords, some of the keywords are confusing, and some keywords have too many overloaded meanings. There are also problems with the IDE that should be addressed. More than any other language, VB has always been about a partnership with the IDE, and this functionality is crucial for its ease of use and power.

Language Changes

Declare instead of Dim for variable declaration

The use of the keyword ‘Dim’ for variable declaration is an historical artifact of Basic. A more descriptive name would be ‘Declare’ which is already a VB keyword, but used for a feature of dubious value. Here are some examples of what I would prefer; I'll keep using Declare in subsequent examples as well:

Declare x,y,z As Integer 
Declare names(1 to 10) As String 
Declare teams() As String = GetTeams()

Too many uses of parentheses

I personally find it confusing that parentheses are used for expression grouping, array declarations, indexing, and function calling. I think it would be more readable if we used brackets for indexing and arrays.

Declare names[10] As String 
names[0] = "Justin"
names[(x + (y \ 2))] = "Michel"
Console.WriteLine("Name = " & GetNamePrefix() & names[2])

GetType, TypeOf

It doesn’t seem like we should need both of these keywords. One option would be to use TypeOf(T) for retrieving the type of a class. I think this is more consistent with VB’s readable syntax. Even better would be to support accessing a Shared GetType method or read-only Type property on any type, which would eliminate the need for a keyword in this use case completely. Best might be to allow the type name itself to be used.

Declare t As Type = FileStream.GetType()

—or—

Declare t As Type = TypeOf(FileStream)

—or—

Declare t As Type = FileStream.Type

—or—

Declare t As Type = FileStream

Is, IsA, and TypeOf

The current keyword "Is" should be used to compare for identity only. If used with two reference types then it returns true if both refer to the same exact object. If used with value types then it returns true if both have the same value. Basically, I’m suggesting elimination of “If TypeOf X Is Y Then” as a special case, although if we retain TypeOf as outlined above, then “If TypeOf(X) Is Y Then” and “If TypeOf(X) = Y Then” are both legal and equivalent.

A new keyword IsA should be introduced to allow easy comparison of types. A IsA B should return true if A is the same type as B, A derives from B, or if B is an interface implemented by A.

Declare A, B As Foo 
A IsA Foo ' returns true 
Foo IsA A ' won’t compile, because left hand side is a type. 
A IsA B ' won’t compile, because B is an object. 
A IsA B.GetType() ' returns true
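
For comparison, Python's built-in isinstance already behaves much like the IsA proposed here: it is true for the exact type, for a derived type, and for an implemented abstract base class. The sketch below is only an illustration in another language; the class names Shape, Foo, and DerivedFoo are hypothetical, and an abstract base class stands in for a .NET interface.

```python
from abc import ABC, abstractmethod


class Shape(ABC):  # hypothetical stand-in for an interface
    @abstractmethod
    def area(self) -> float: ...


class Foo:  # hypothetical base class
    pass


class DerivedFoo(Foo, Shape):  # derives from Foo and implements Shape
    def area(self) -> float:
        return 0.0


a = DerivedFoo()
b = Foo()

same_type = isinstance(a, DerivedFoo)   # exact type
derived = isinstance(a, Foo)            # a's type derives from Foo
implements = isinstance(a, Shape)       # a's type implements Shape
via_instance = isinstance(a, type(b))   # analogue of "A IsA B.GetType()"
unrelated = isinstance(b, Shape)        # Foo itself does not implement Shape
```

As with the proposal, the right-hand side has to be a type: isinstance(a, b) with b an instance raises a TypeError, although Python reports this at run time rather than at compile time.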

It’s also confusing that Is must be used in a Select expression. It seems like the extra keyword should be unnecessary.

Select age 
Case 5 To 10, 12 To 15 
Case < 5, 13, 16, > 80 ' No need for “Case Is < 5, 13, 16, Is > 80” 
Case Else
End Select

Too many conversion keywords

I think all of CBool, CByte, CChar, CDate, CDbl, etc. can be eliminated by replacing them with a single Convert function. It would behave exactly the same as the current keywords.

Declare d As Double = 3.14 
Declare x As Integer = Convert(d, Integer) 

-- but also allowing --

Declare x As Integer = Convert(d) 

My proposed syntax is a little more verbose, but eliminates 15 keywords, some of which can be hard to remember. It also supports implicit detection of the desired type if possible, as in the example above where it can see that the target of the Convert is an Integer, and saves you the trouble of explicitly stating that.

I would also allow a third parameter to Convert for specifying the type converted from. This can be useful when the source type is not always obvious, and when you want a warning when the source type is refactored.

Declare x As UInt32 = Convert(GetAge(), From := UInt64) 

This takes advantage of the current named argument syntax to allow both implicit detection of the To type and explicit specification of the From type. If the GetAge function is someday updated to return an Int64, then the code would no longer compile, preventing a possible bug when dealing with negative values.

DirectCast too verbose

This keyword is unnecessarily verbose. Fortunately there is already an intuitive and logical replacement in the language.

Declare x As Object = GetFoo() 
Declare y As Foo 
y = x As Foo 

The As keyword is already used to declare the type of an object, and the above syntax should be obvious in meaning. As with DirectCast, if x is not a Foo, then an exception would be thrown.

It would also be nice to allow this syntax to be used when it’s known that the source and destination types differ. In this case, “As” would be equivalent to Convert.

Declare d As Double = 3.14 
Declare x As Integer = d As Integer

Eliminate TryCast

The use of TryCast doesn’t really add any value. Instead of:

Declare x As Foo = TryCast(y, Foo) 
If x IsNot Nothing Then
  x.Blah() 
End If 

you could just use:

If y IsA Foo Then 
  Declare x As Foo = y As Foo 
  x.Blah() 
End If 

The compiler should be able to figure out how to make the latter syntax just as efficient as the former.

No need for CType

There should be no need for this keyword, which is kind of a hybrid between casting and conversion. This should be replaced by either As or Convert. The use of CType for defining conversion operators on a class would use the keyword Convert instead.

Nothing vs. Null

The VB concept of Nothing is almost always referred to as Null in other languages, and even within much of the VB community (probably because of SQL). Having a separate term for such a common concept is just confusing for users old and new.

Declare x As Foo = Null 
Assert(x IsNot Null) 
Assert(x = Null)

No need for AddressOf keyword

Nowhere else in VB is the notion of pointer or address exposed, and it’s likely that AddressOf doesn’t really return an address anyway. We could just use the Function or Sub keyword in place of AddressOf.

From the examples in AddressOf documentation:

AddHandler Button1.Click, Sub Button1_Click 
Declare t As New Thread(Sub CountSheep)

Add the ability to Define aliases

The C++ typedef keyword is often useful for writing maintainable code, and sometimes C++ references are handy in situations other than parameter passing. I propose that a new Define keyword could provide this functionality for VB in an intuitive way.

It can improve readability. For example, by defining a new IntList type we save typing and make the code easier to read wherever List(Of Integer) is used. This becomes more valuable with more complex generic types.

Define IntList As List(Of Integer) 
Declare v As IntList 

It can also be useful for changing a type without having to update all the code, or to allow conditional compilation to use different types.

#If SafetyEnabled Then 
  Define MyInt As SafeInteger 
#Else 
  Define MyInt As Integer 
#End If 

Define is already partially available with Imports. I would remove the ability for Imports to define new aliases and use the Define syntax instead.

Imports Con = System.Console 

becomes

Define Con As System.Console 

This keyword could also be useful for defining aliases for other things besides Types.

Sub Foo 
  Declare aReallyReallyLongName As SomeType = GetSomeType() 
  Define s As aReallyReallyLongName 
  s.SomeMethod() 
  Define sm As s.SomeReallyLongMethodName 
  s.sm() 
End Sub

More flexible Imports

I’d like to Import a symbol within a small scope. I see no reason why this couldn’t be allowed anywhere instead of only at the top of a file.

Sub Log() 
  Imports System.Console 
  Write(a) 
  Write(b) 
  WriteLine(c) 
End Sub

No need for Delegate keyword

We should just be able to define a delegate type like any other.

Define Sub OnClick(ByVal e As EventArgs) 
AddHandler Button1.Click, Sub OnClick 
Public Event As OnClick

Remove Alias, Ansi, Auto, Declare, Lib, and Unicode

These keywords for declaring access to external functions make it marginally easier to access external Win32 routines, but cause unnecessary clutter in the language, making it that much harder to understand. They are also unnecessary, because it’s not much harder to just use the DllImport attribute.

Consider removing keywords for basic types

The CLR already defines nice unambiguous names for all the basic types such as Byte, Int16, Int32, Int64, etc. It might be better just to use those, and eliminate the unnecessary keywords from VB. Many of the VB keywords are the same as the names defined in System anyway, so this wouldn’t be that onerous. The proposed Define feature could even provide easy compatibility for old code.

No need for the Call keyword

In VB you can call a subroutine by saying “Call Foo”, but you can also get exactly the same behavior by saying “Foo()”. I don’t see any benefit to the former syntax, and it adds yet another unnecessary keyword to the language.

Select instead of “Select Case”

There’s no need to support an optional Case keyword in the first line of a Select statement. We should just standardize on the following syntax, and eliminate a little more clutter from the language.

Select age ' vs. Select Case age
Case 0 To 20 
  Foo1() 
Case 21 
  Foo2() 
Case 25, > 30 
  Foo3() 
Case Else 
  Foo4() 
End Select

Literals

There are currently some basic types that have no syntax for literal declaration, such as Byte. There are also some types, such as Char, where the literal syntax is unintuitive. A better character literal might use the back-tick (e.g. `c`).

Eliminate the colon

One of the strengths of VB is its readable line-oriented syntax. The colon operator allows you to subvert this. Eliminating this would help keep VB code readable to all VB programmers.

The other use for the colon is to declare labels for GoTo statements, but I’m also proposing elimination of GoTo for VB.

Eliminate the underscore

Similarly the underscore is a crutch that should not be needed. Code should either rely on wrapping in the IDE, or introduction of temporary variables to make code more readable.

However, to make this work, we need to eliminate some of the most egregious causes of lengthy lines. The UsesAttribute keyword fixes one such area, new syntax for Implementing interfaces fixes another, and the proposed Define keyword another.

Bring Back the underscore

Instead of using the underscore for line continuation, I would use it to disambiguate VB keywords that match symbols. Currently the [] brackets are used for this purpose, but I think those work better for array access as stated previously, and prepending an underscore seems a more intuitive solution. Hopefully the elimination of many keywords will make the need for disambiguation much less prevalent.

Remove the \ symbol

Although integer division is probably fairly frequently used, I find the inclusion of two separate division symbols confusing. Integer / Integer should leave the arguments as integers, while Int / Float and Float / Int should convert the result to a floating point type. Anything else can be handled by explicit conversions and/or separate math library functions. It would also be helpful if the IDE would colorize integer types differently than floating point.

Eliminate If … Then … Else … Statements

One of the primary values of VB is its line-oriented nature. I was completely unaware that the current VB allows If Then Else on a single line without an End If. This should not be allowed, as it subverts the nature of VB, by allowing a tiny syntactic convenience with far greater potential for misuse than valid use cases.

For example, this currently compiles, and I don’t think it should.

If x > 0 Then Return ">" Else Return "<="

Introduce a short-circuiting ternary expression

One of the problems with the C ternary expression is that the default case is separated from the context by the boolean expression. For example, if I want to Trim() only a non-null string in C#:

String trimmed = s != null ? s.Trim() : "";

Notice how the code we really want to write “String trimmed = s.Trim();” is very different from the code we have to write to handle the null.

I think VB has the opportunity to make this common expression more readable by reordering the “arguments” to allow:

String trimmed = s.Trim() ? s != null : "";

Or in VB:

Declare trimmed As String = s.Trim() IIf s IsNot Nothing Else ""
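
For what it's worth, Python's conditional expression already uses this value-first ordering and only evaluates the branch it needs. The sketch below is a comparison in another language, not a VB feature; trimmed, get_default, and calls are hypothetical names used only to show the short-circuiting.

```python
def trimmed(s):
    # Value first, then the condition, then the fallback.
    return s.strip() if s is not None else ""


calls = []  # records whether the fallback was ever evaluated


def get_default():
    calls.append("get_default")
    return 5


b = 3
# Mirrors "b IIf b < 5 Else GetDefault()": get_default() is never called here.
result = b if b < 5 else get_default()
```

Because the condition holds, result is 3 and calls stays empty, which is the short-circuiting behavior the proposed IIf ... Else form would need.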

Other examples:

If a = (b IIf b < 5 Else 5) Then 
If a = (b IIf b IsNot Nothing Else GetDefault()) Then 
Foo(a, (x IIf a > 0 Else GetDefault()), z) 
y = 10 IIf x Is Nothing Else x.Foo()

Introduce UsesAttribute

Instead of overloading the < and > operators for applying attributes, I propose we add a new keyword to make this feel more natural in VB, and to eliminate one of the major needs for the underscore line continuation character.

Sub Foo(ByVal s As String, ByVal n As Int32) 
  UsesAttribute Conditional("A")
  UsesAttribute WebMethod 
  ' Body of subroutine 
End Sub 

Class SomeService 
  Inherits Foo 
  Implements Bar 
  UsesAttribute WebService(Namespace:="blah")

  Sub New() 
  End Sub 

  Sub Calculate 
    UsesAttribute WebMethod 
  End Sub 
End Class

Unwieldy Interface-based polymorphism

A strength of VB is its readable English keywords, but in practice this can sometimes get out of hand.

Public Interface Shape 
  Function CalculateArea(ByVal X As Double, ByVal Y As Double) As Double 
End Interface 
Public Class RightTriangleClass 
  Implements Shape 

  Function CalculateArea(ByVal X As Double, _ 
    ByVal Y As Double) As Double Implements Shape.CalculateArea 
    Return 0.5 * (X * Y) 
  End Function 
End Class 

In general, any time the underscore has to be introduced to allow a logical line to continue on the next physical line, we see an area for improvement. The following would be a much nicer syntax for the above, with the option of reverting to the really verbose syntax only when there’s a conflict, or when you want to change the name of the interface.

Public Class RightTriangleClass2 
  Implements Shape2 

  Function Implements CalculateArea(ByVal X As Double, ByVal Y As Double) As Double
    Return 0.5 * (X * Y) 
  End Function 
End Class

Erase

There’s no need for this keyword, as you can get the same effect by assigning Null. The Erase statement supports a variable number of arguments, but the limited utility of this is not worth the introduction of a keyword. Instead, if this functionality is important, then it would be more useful to allow ParamArray to use ByRef making it easy to implement Erase as a user function, as well as other possibly useful functions.

ReDim and Preserve

This functionality seems easy enough to reproduce with a few small functions, possibly added to the Array class.

On, Error, Resume

These keywords are no longer necessary in a modern VB. We now have exception handling, and these just clutter the language in the name of backward compatibility.

No public fields in Class types

By eliminating the capability to create Public fields in a class type, we can use the following syntax to declare simple properties.

Class Foo 
  Public ReadOnly name As String 
  Public age As Int32 
End Class 

More complex properties could revert to the older syntax using Property. Public fields would still be allowed in value types.

No more Set or Get

“Set” is just too valuable to waste on a keyword. I’d rather have a System.Collections.Generic.Set for holding sets of objects. When declaring properties we can just take advantage of the fact that the mutator is always a Sub, while the accessor is always a Function. Then we no longer need the two keywords.

Private myName As String 
Public Property Name As String 
  Sub (ByVal s As String) 
    myName = s 
  End Sub 
  Function 
    Return myName 
  End Function 
End Property

Goodbye Goto

Let’s be the first language to discard goto. CS programs usually go to great lengths to teach students that goto is almost always the wrong solution. The variety of languages available on the .NET platform means that some languages like VB and C# can eliminate goto, while others (C++?) could keep it.

Generic RAII scope

The Using statement is one useful way to ensure that objects are disposed correctly, but sometimes it can be cumbersome, and can lead to unnecessarily deep nesting. It would be useful to introduce Scoped variables to handle this more cleanly in some cases.

Sub Test 
  If blah Then 
    Scoped x, y, z As Connection 
    ' x, y, and z are disposed here
  End If 
  If blahblah Then 
    Scoped lock1 As Mutex = … 
    Scoped lock2 As Mutex = … 
    … 
    ' lock1, lock2, … are released
  End If 
End Sub

Eliminate SyncLock

More general Scoped and Using keywords can replace the need for this.

Modules, Namespace, and Shared Class members

I find it a little confusing that all these exist. Maybe this could be simplified by allowing global functions in a Namespace as an alternative to Modules, or eliminating the Namespace keyword entirely and using the Module keyword for this concept.

No Next

It would make VB easier to understand and more consistent if “Loop” were used to end For Loops. This is consistent with the other loop variants, and makes it clearer what’s happening.

Eliminate While Loops

Do While … Loop should be sufficient without the need for While…End While. This doesn’t actually save keywords, but simplifies the language by eliminating an unnecessary and more verbose syntax for while loops. The IDE could allow leaving off the Do keyword, and filling it in automatically.

Simplify Continue and Exit

Continue is followed by Do, While, or For to indicate which type of loop to continue. This added flexibility is unnecessary, and Continue should just jump to the top of the current loop.

For I = 1 to 10 
  Do While blah 
    … 
    If x Then 
      Continue 
    End If 
  Loop 
Loop 

The above would simply restart the Do While loop when x is True. If you really need to restart at the beginning of the For loop then the code can be refactored into multiple methods or some other approach.

Similarly, only “Exit Loop” should be supported in the future. Exiting a sub, function, or property is accomplished with “Return”, and exiting a Select is unnecessary.

ElseIf is inconsistent

"For Each" and "End If" use two separate words, so Else If should also.

Declare x As Object

It would be nice if the “As Object” part of the declaration were optional even when Option Strict and Option Explicit are enabled. The default type for any variable would be Object. So I would advocate:

Declare x

More flexible Options

As fancy new features such as Lambda Expressions and Closures are added to the language, we may want more flexible control over the options. For example, we may want to turn Strict and Explicit off for Lambdas, while leaving them on for everything else. We might also want finer control of the scope for these options, perhaps turning off Option Strict within the scope of a single Function.

Rather than have keywords for these two statements, just introduce new VB attributes.

StrictMethods(on|off) – controls whether late binding is allowed

StrictConversions(on|off) – controls whether Int64 auto-converts to Int32

ExplicitLambdas(on|off) – controls whether lambda expressions require types

etc.

Consider eliminating or extending With

By allowing an empty with statement we can introduce an artificial scope, which can sometimes be useful to avoid having too many variables in scope. Usually it’s better to split such a method into separate subroutines, but this can sometimes make an algorithm harder to understand.

With 
  Declare x As Integer
End With

' x is not in scope

It would also be useful to assign an alias for the With object.

Sub Foo() 
  With f = GetEmployee.Name.FirstName 
    f.blah() 
    Log(f) 
  End With 
End Sub 

However, this is not so necessary with the proposed Define keyword:

Sub Foo() 
  Define f As GetEmployee.Name.FirstName 
  f.blah() 
  Log(f) 
End Sub

RaiseEvent unnecessary

This keyword is similar to Call, and is also unnecessary when the obvious syntax of calling the event as a function would be more intuitive.

Event LogonCompleted(ByVal UserName As String) 
Sub Logon 
  LogonCompleted("Justin")
End Sub

AddHandler, RemoveHandler

Using language keywords for these is confusing. The C# syntax might be fine, or VB could automatically allow any Event/Delegate to have implicit Add/RemoveHandler methods. Note that += should NOT be used, because we’re appending to the list of handlers; the &= concatenation operator is therefore more appropriate.

Public Event LogonCompleted(ByVal UserName As String) 
Sub OnLogon(ByVal x As String) 
End Sub 
Sub New 
  LogonCompleted &= Sub OnLogon 
  LogonCompleted.AddHandler(Sub OnLogon) 
End Sub

No more REM, and add comment blocks

There’s no point in having two mechanisms for line comments, although it might be useful to introduce a multiline comment.

''' Anything between here
and here
is a comment
'''

No more implicit return variable for functions

Having two ways to return a value from a function just adds complexity to the language with very little benefit. At best it saves a line of code or two to declare the return value and return it, but often modern programs don’t (and shouldn’t) declare a single return value anyway.

Rename Shadows

The description for this feature provides the answer for what the keyword should be. “Specifies that a declared programming element redeclares and hides an identically named element, or set of overloaded elements, in a base class.” By renaming the keyword to Hides, it becomes more accessible to a new or infrequent VB programmer.

Hides should also be usable to explicitly state desired behavior.

Sub Foo 
  Declare I As Integer 
  If blah Then 
    Hides I As String 
  End If 
End Sub

No End or Stop

The Stop keyword should be removed from the language. There’s no need for this to be a language-specific feature, as it’s easy enough to use the portable System.Diagnostics.Debugger.Break() method.

The End keyword (that halts the process) should also be removed for similar reasons. Normally, you would either let an uncaught exception percolate up and kill the process, or you could call Process.Kill() possibly followed by Process.WaitForExit().

Allow automatic assignment of Arrays to variables

Python has some useful syntax for working with Tuples. VB could add some of the same convenient syntax for working with arrays.

Allow automatically assigning array values to variables. The a, b, and c variables would be filled with the first three values from the array.

Function Foo() As Integer()
  … 
End Function 
Declare a, b, c As Integer = Foo() 
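
For reference, this is roughly what the Python feature looks like. It is a minimal sketch; foo is a hypothetical function. Note one difference from the proposal: plain unpacking in Python requires the lengths to match exactly, so taking only the first three values needs a slice or a starred target.

```python
def foo():
    # Hypothetical function returning more than three values.
    return [10, 20, 30, 40]


# Plain unpacking needs exactly three items on the right, hence the slice.
a, b, c = foo()[:3]

# Starred unpacking takes the first three and collects the remainder.
x, y, z, *rest = foo()
```

Writing a, b, c = foo() without the slice would raise a ValueError here, because the list has four items.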

Also, allow more flexibility for array initializers.

Declare x, y, z As Integer = 1, 2, 3

Add Slices

Python has the useful ability to access a subset of any List, String, or Tuple. VB could support similar syntax without complicating the language much.

Declare s As String = "Hello World"
s[0] equals 'H'
s[0 To 2] equals "Hel"
s[-1] equals 'd'
s[-4 To -2] equals "orl"

etc.
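
The equivalent Python slices are shown below as a point of comparison. This sketch assumes the proposed To is inclusive at both ends, as the s[-4 To -2] example implies; Python's slice end is exclusive, so the upper bounds differ by one.

```python
s = "Hello World"

first = s[0]      # 'H'
head = s[0:3]     # 'Hel'  (proposed VB: s[0 To 2])
last = s[-1]      # 'd'
tail = s[-4:-1]   # 'orl'  (proposed VB: s[-4 To -2])
```

The inclusive form reads more naturally next to VB's existing For ... To loops, at the cost of differing from Python's half-open convention.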

Remove Mid

By allowing slices to be assigned, you make a more flexible Mid that is also more intuitive to read.

Eliminate need for GetChar

A single element slice should automatically return a Char instead of a one element string. The new `a` character literal should be equivalent to “a”[0].

No need for { } array initializers

VB should be able to figure this out without the braces, and should be able to use parentheses to make the association explicit where it would be ambiguous.

Declare x, y, z As Integer = 1, 2, 3 
Declare x[] As Integer = 1, 2, 3 
Declare x[][] As Integer = (1, 2, 3), (1, 2, 3) 
Declare x[2, 2] As Integer = (1, 2, 3), (1, 2, 3) 
Sub Foo(ByVal a() As Int32, ByVal b As Integer) 
End Sub
Foo((1, 2, 3), 4)

Array Bounds vs Last index confusion

I still find the syntax choice for array declarations confusing, and it doesn’t help much that the To keyword is supported. (Unless the IDE were to add it automatically.)

Declare x(5) As Integer should create a 5 element array with a lower bound of zero.

This use of the To keyword should be removed again, because allowing flexible lower bounds just makes arrays more confusing to work with. For those who want or need explicit lower bounds, a generic BoundedArray class could be provided.

Automatic ToString when using concatenation

One of the benefits of having a separate, unambiguous concatenation operator ‘&’ in VB is that it can automatically convert its arguments to strings.

Declare I As Integer 
Declare f As New Foo 
f & I & "blah"
I & f & "blah"

This is currently possible only by explicitly implementing the concatenation operator for Foo, but it requires providing an implementation for every possible value type, or inefficiently overriding for just Object, forcing boxing. This could be partially helped by allowing generic operator overloading, but that would still be far too tedious for the common case. It might even be desirable to remove operator overloading for the ‘&’ operator, and instead force it to always represent string concatenation and be handled automatically.

Separate operator for assignment and equality

One of the keys to VB’s ease of use is that it doesn’t allow assignment statements to return a value. Unlike other languages, you can’t do:

Declare a, b, c As Integer 
a = b = c = 5 

One result is that VB can always tell from the context whether ‘=’ is meant as an assignment or a comparison. This is actually a valuable feature, because it saves typing and prevents a common error from other languages.

What I propose is that we use the := operator that is already used for named arguments as the only allowed assignment operator, and have the IDE automatically change ‘=’ into ‘:=’ as it’s parsed in the same way that it automatically changes “Endif” into “End If” and other similar transformations. One benefit is that we get immediate visual feedback that VB understood what we meant when it makes the transformation.

For Loop should iterate only over the specified range

The following statement currently increments M after each pass, then checks whether it is still less than or equal to 5.

For M = 0 To 5 

This is confusing given the VB syntax, because the syntax seems to imply that “M” will only take on the values 0, 1, 2, 3, 4 and 5, but it will actually be 6 after the loop.
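
The behavior described above can be sketched as a plain while loop. This is a Python emulation of the increment-then-test semantics as the post describes them, not VB's actual implementation; vb_style_for is a hypothetical helper.

```python
def vb_style_for(start, stop):
    """Emulate 'For M = start To stop' as described: increment, then test."""
    m = start
    visited = []
    while m <= stop:
        visited.append(m)
        m += 1  # the counter is bumped before the bound is checked again
    return visited, m


visited, final = vb_style_for(0, 5)
print(visited)  # [0, 1, 2, 3, 4, 5]
print(final)    # 6
```

The body only ever sees 0 through 5, but the counter ends at 6, and that final increment is the step that can overflow when the upper bound is the largest value of the counter's type.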

Besides being confusing, it can easily lead to infinite loops if the incremented variable overflows. If overflow checking is enabled, then instead of an infinite loop, you get an exception. Either is unwarranted.

Eliminate MyClass

I find this keyword hard to remember, and I’d just like to be able to use the name of my class to be explicit when necessary.

Class Foo 
  Protected Overridable Sub Bar() 
    ... 
  End Sub 
  Public Sub CallMyBar() 
    Foo.Bar() ' calls our own Bar 
    Bar() ' calls the bar of our most derived class 
  End Sub 
End Class 

Allow alternate spellings for Overridable, NotOverridable

It would be consistent with the English language to also allow Overrideable and NotOverrideable as synonyms.

Eliminate Type Characters

Old style code that used type characters should no longer be supported.

Function StrFunc(ByVal x&, ByVal y$, ByVal z#) 

They also shouldn’t be allowed for literals.

Literal DateTime values should still be supported however, because they’re useful, and easier to read than alternatives.

Declare d As DateTime = #11/11/1970 13:32# 

Literal Values for every type

tmp = 42    ' Int32 
tmp = 42UI  ' Unsigned Int32 
tmp = 42L   ' Int64 
tmp = 42UL  ' Unsigned Int64 
tmp = 42S   ' Int16 
tmp = 42US  ' Unsigned Int16 
tmp = 42B   ' Byte 
tmp = 42SB  ' Signed Byte 
tmp = 42.5F ' Float32 
tmp = 42.5  ' Float64 

Currently, S, US, UI, L, UL, and F are already supported; only Byte and SByte lack a literal suffix.

Eliminate Assembly, Option Compare, Binary, and Text Keywords

We shouldn’t have to use language keywords (even unreserved ones) to change these behaviors. The IDE or command-line tools should be able to handle it when necessary, or UsesAttribute could be used instead.

Option Compare may be completely unnecessary, as String already provides explicit control.

Add missing Volatile keyword

This should have the same meaning as in other languages.

GoSub, Let, and Wend

These have been deprecated long enough and can all be eliminated if making the kind of sweeping changes I’m advocating.

Bring Back Variant

Option Strict is currently used for many related features, such as implicit type conversions and duck typing. However, it might be even better to reintroduce the Variant keyword to allow duck typing for individual identifiers. This would work even with Option Strict enabled.

No VB-specific extension libraries by default

I’d like to be able to write portable code that doesn’t rely on VB-specific libraries. I don’t care about access to legacy APIs that are currently in the Microsoft.VisualBasic namespace. Any features that are needed, such as the Convert function, should be available without having to pull in other baggage as well.

Lambda Expressions

This feature is partially planned for the next release, but I’m afraid it’s going to allow code that is too cryptic and terse for Basic.

The proposed syntax is:  

inc = Function(y) y + 1

This creates an anonymous single-parameter function that returns its parameter incremented by 1. It really doesn’t feel like VB, because a) it’s not line-oriented, and b) it implicitly returns y + 1.

Even though it would prevent some common lambda use-cases, I’d prefer VB to allow lambda expressions only as variable initializers, and to require multiple lines.

inc = Function(y) 
        Return y + 1 
      End Function 

This is only a fairly minor extension to existing Function and Sub usages. The above would only be allowed with Option Strict and Explicit disabled. Otherwise you’d get the usual:

Dim inc As Function(ByVal y As Integer) As Integer 
             Return y + 1 
           End Function 

Must Intrinsic Functions Be Keywords?

Several VB features involve mapping syntax that looks like a function call to CLR features. For example, Convert(x) becomes an appropriate ILAsm conversion function for the target assignment. If none of these had to be reserved as keywords, then it would simplify the language quite a bit. Assuming all the intrinsic functions are in a suitable namespace, the usual namespace disambiguation rules would apply.

If these keywords can become intrinsic functions in a namespace, then more useful variations could be added such as explicit SafeConvert and UnsafeConvert functions that map to the various ILAsm type conversion features.

Support Unmanaged Code

I’d like to be able to use VB syntax to write lower-level code. Instead of dropping into C++, I’d like to forgo garbage collection and other fancy features to write small, efficient code without giving up the readability of VB. A separate, smaller set of System namespace libraries could be provided to give me everything I need to write device drivers, protocol stacks, and programs that need an absolutely minimal footprint.

Summary

In summary, the above proposals would eliminate nearly 90 of the roughly 200 existing keywords while adding only about a dozen new ones. They would also eliminate some overloaded uses of current keywords, and introduce several powerful new features from other languages. The cost of all this is reduced compatibility with existing programs, which could no longer simply be recompiled. If this isn’t desirable, then a new Basic-style language could be introduced. I believe the advantages of a much simpler yet more powerful language are worth it for writing new programs, and possibly even porting existing ones.

Although the language would retain approximately 120 keywords after these changes, the remaining ones seem to make for more readable code than the alternatives from other languages. Many of the remaining words can only be used as part of compound keywords, as with “Each”, which can only be used in a “For Each” statement.

New Keywords (13)

[], Declare, Null, IsA, Hides, Scoped, Volatile, ''', UsesAttribute, Define, ‘_’, Overrideable, NotOverrideable

Removed Keywords (86)

{}, CBool, CChar, CDec, CDbl, CInt, CLng, CObj, CSByte, CShort, CSng, CStr, CUInt, CULng, CUShort, CDate, CType, DirectCast, AddHandler, RemoveHandler, AddressOf, Alias, Declare, Ansi, Unicode, Lib, Dim, Object, Byte, Char, String, Integer, Short, Double, Single, Long, UInteger, ULong, UShort, SByte, Decimal, Date, Call, Declare, Delegate, Erase, ReDim, Preserve, Error, Resume, GetType, GoTo, Let, GoSub, Wend, Namespace (Or Module), Next, Nothing, Option, Explicit, Strict, Compare, Text, Binary, On, Off, RaiseEvent, REM, Shadows, Stop, SyncLock, TryCast, TypeOf, Assembly, IsTrue, IsFalse, Mid, ':', '_', ‘%’, ‘@’, ‘!’, ‘$’, Set, Get

IDE Features

Another problem with the current VB is some very annoying or broken IDE functionality.

Case Fixing Broken

VB is supposed to update the case of symbols to match their declaration, but this is currently broken in several instances. For example, when I type “imports system.data\n” at the top of the file, it should automatically change it to “Imports System.Data\n”.

A concerted effort should be made to fix this in the various places where it doesn’t currently work, and additionally a simple Ctrl-K, Ctrl-D should fix up the whole document.

Indentation is broken

If you type “if blah then\n” then VB will automatically indent to the proper position on the next line. However, if you move the cursor to a different line, and then go back, it will no longer be at the correct position. This wouldn’t be too bad if the Tab key would take you to the correct position, but it does not.

Improve syntax recognition

The VB IDE used to be better about recognizing syntax as I type. For example, if I were to type “for<space>each<space>” on a line, then the IDE would capitalize “For” as soon as I hit the first <space> key. Currently, the IDE only colorizes “for” which tells me that it does recognize the keyword, but it doesn’t fix capitalization until the whole line is entered. I do think that most syntax errors shouldn’t be highlighted and indentation shouldn’t be changed until the line is finished, but case-fixing, coloring, and other formatting changes should happen as soon as possible. This used to be one of the best things about working with VB.

Tab key shouldn’t indent

It should be possible to redefine the Tab key to always indent the current line to the correct position, and NEVER to actually insert indentation. Basically it should behave the same as the Emacs Tab key, but without having to choose Emacs emulation.

Allow assigning macros to tab

In previous versions of Visual Studio I was easily able to implement my own EmacsTab using a macro, and I was able to bind this to the Tab key. Allowing this again will require cleaning up the Code Snippets feature so that it is not hard-coded to use the Tab key for moving between editable fields.

Macros Too Difficult

A macro should be able to programmatically duplicate anything I can do using the keyboard. I've tried repeatedly to write a macro that would duplicate the functionality of the Emacs Tab key, but I haven't been able to get what I want. This macro took me only a few minutes to figure out with Visual C++ 6. In general the macro API is just too difficult to use for the amount of time I have to devote to customization. Maybe a simplified facade should be provided.

More Assistance

VB should be more helpful by letting me avoid typing long keywords such as NotInheritable. For example, it could let me type the shorter text “not”, and then automatically expand it to the real keyword based on the current context. In general the VB parser should provide more context-sensitive assistance as I type code. It should always provide immediate but non-intrusive feedback that it understands what I’m typing.

Non-Intrusive Layout

Often when I’m editing code, the current VB will immediately make layout changes as I type. This can make it very confusing in practice, because the indentation could “jump around” causing me to lose track of what I was doing. In general this is an example of VB taking the viewpoint of the parser rather than providing the illusion that it’s reading my mind. If I want to surround some code with “If … Then … End If”, then it only upsets me when VB immediately matches my newly inserted If with a previously existing End If, and indents my code accordingly. It’s only a slight distraction that some previous If now has a squiggly red underline indicating that it no longer has a matching End If. The solution is to ensure that VB doesn’t “jump to conclusions” on incomplete information. If a block of code contains syntax errors, then it should not be formatted, and it probably shouldn’t even be shown as a syntax error until I do something (Ctrl-S, Ctrl-K D, etc.) to indicate that I think the code is now correct. Perhaps VB could use a squiggly beige underline to tell me that it knows the code is not yet correct, but is currently waiting to see what I’ll type next.

Customizable

I’d like some more advanced control over how the code is formatted. In addition to the indentation style and coloring options that are currently available, I’d like control over how keywords are capitalized, the ability to disallow language features, and other controls.

No More Compiling

When working with Eclipse/Java there is no concept of Compiling per se. Instead, the syntax is highlighted as you type, and certain actions like saving files or running the program will cause the compilation to happen in the background. This seems very much in the spirit of VB, and seems almost there already.

MultiFile Assemblies

It would make unit testing much better if VB allowed you to easily create unit tests in the same assembly as the tested code, but in a separate project. This allows me to expose testing interfaces as Friends. The current workaround is to just include the tests in the same project as the tested code.

Intellisense shows invalid values

Often, after typing “xyz<dot><ctrl-space>” I’ll be given a dropdown list of irrelevant information. If VB doesn’t know the actual members of the xyz object then it shouldn’t provide a list at all.

Automatic Imports

One extremely nice feature of Eclipse/Java is that it can automatically figure out the imports. If I type in “Dim sw As StringWriter<ctrl-space>” then VB should prompt me with a list of namespaces that have StringWriter types. If there is only one possible choice then it should not even prompt. Additionally it should insert the necessary Imports statement at the top of the file.

A corresponding feature from Eclipse/Java is the Ctrl-Shift-O keystroke which optimizes Imports. This automatically removes Imports that are no longer needed, and attempts to add any that are necessary. VB would require a new syntax for only importing selected types from a namespace.

Code Snippets

Several things about this feature really annoy me. Foremost is that the replacement fields stay active until the file is closed. Maybe it wouldn’t bother me as much if they were formatted differently, such as with a solid underline that only highlighted when the field was the active one. The ugly green color is enough to prevent me from using this feature. The workaround I’ve decided on is to make the background color white. This makes the code snippet fields invisible unless the cursor is inside the field. Still, I’d like to be able to hit Enter (or maybe Ctrl-Enter) and have the fields finalized as if I’d closed and reopened the file.

Eliminate My

I’ve never liked the My feature very much. It seems like it could be replaced with the ability to assign duplicate Imports aliases. This would also allow me to pick a different name than “My”, and better control exactly what is included. For example, I might do the following:

Imports Foo As System.Text.RegularExpressions 
Imports Foo As Microsoft.VisualBasic.Devices 
Imports Foo As Microsoft.VisualBasic.FileIO 

This would combine all the listed namespaces into a single Foo namespace. Any ambiguities would have to be resolved explicitly as usual. It would also work with classes, allowing Shared methods to be called directly as usual.

Summary

Even though the list of things I would change about VB seems longer than my previous post about the things I like about VB, that shouldn’t give you the impression that I dislike the language. On the contrary, I think it’s the best choice available on the .NET platform for most problems. C++/CLI provides more control, and the unique ability to commingle unmanaged and managed code. C# is similar to Java, more portable (Mono on Linux), and has a few features that haven’t yet made it into VB (yield, anonymous methods, unchecked, etc.). However, VB is the easiest to read and type, has the best editor, and is generally easier to work with for many problems.

January 17

Why VB is Best

As I said in my previous post, I believe that Visual Basic is the best language for many types of applications. It provides all the power of C# and Java, but has arguably better tools and syntax. For now, I’ll concentrate on the positive, but in a later post I’ll discuss what I feel are VB’s shortcomings, and my suggestions for addressing them.

For me, the most valuable VB language features are its Readable English Keywords, Line-Orientation, and Case-Insensitivity. The ability to quickly create GUI applications using drag-and-drop may have been the original selling point, but to me what makes VB great is the language. It’s currently my preferred tool for working on personal projects.

Readable English Keywords

For the most part, VB is a very easy language to read, and the keywords chosen come closer to conveying the correct meaning than with other languages.

Many languages have a long list of special symbols that the user is required to memorize before the language is readable. Often the same symbol will mean completely different things depending on the context. While VB has the commonly used Math symbols, and even a separate concatenation symbol, overall its philosophy is to favor the use of keywords to help the user to understand the code. (As you’ll see later, I even think some of the current symbols should be removed.)

The best VB keywords are those that help the reader to understand what’s going on without having to be fluent in VB or any other OO language. This helps beginners to learn the concepts, and also helps experienced programmers write more maintainable code.

Examples

Inherits, NotInheritable, MustInherit

The VB syntax for object derivation is very clear and explicit as long as you’re familiar with the concept of inheritance in object oriented languages.

Class Foo 
  Inherits Bar 
End Class

Furthermore the NotInheritable keyword is clearly related, and has the obvious meaning.

NotInheritable Class Bar 
End Class

This would cause the Foo class above to have a compilation error.

The MustInherit keyword is also clearly related to the other two.

MustInherit Class Bar 
End Class

Even someone new to the language should be able to guess that this prevents the Bar class from being instantiated on its own.

If you contrast the above with the corresponding concepts in Java (extends, final, and abstract) and C# (‘:’, sealed, abstract), it should be clear that the VB keywords are superior. Although Extends, NotExtendable, and MustExtend would work almost as well, the term “inheritance” for the whole concept is usually used even in Java circles.

Overrides, Overridable, MustOverride, NotOverridable

Similarly, the concept of “virtual” is overused in computer science, and therefore a poor choice for describing class methods that can be overridden in derived classes. The VB keywords have a more obvious meaning, and may even make OO polymorphism easier to grasp for beginners.

The Java equivalents ([implicit], [implicit], abstract, and final) and the C# equivalents (override, virtual, abstract, and sealed) are not as clear and don’t work as well together.

Shared, Static

VB uses the keyword Shared for methods and fields that apply at the class level. The C#, Java, and C++ keyword static seems a little less clear to me, and in C++ it gets muddled up with the separate concept of local data within a function that lives past a single call to the function. VB reserves Static for that second concept alone; C# and Java have no direct equivalent for it.

Class Foo 
  Public Shared Function CreateFoo() As Foo 
  End Function 

  Private Sub Bar() 
    Static called As Integer = 0 
    called += 1 
  End Sub 
End Class
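Here is a small runnable sketch of the two keywords side by side. The Counter class and its member names are invented for this example; it assumes nothing beyond standard VB.NET.

```vb
Imports System

Public Class Counter
    ' Shared: belongs to the class and is called without an instance.
    Public Shared Function Create() As Counter
        Return New Counter()
    End Function

    ' Static: a local variable that keeps its value between calls.
    Public Function Bump() As Integer
        Static called As Integer = 0
        called += 1
        Return called
    End Function
End Class

Module StaticDemo
    Sub Main()
        Dim c As Counter = Counter.Create()
        c.Bump()
        Console.WriteLine(c.Bump())   ' prints 2
    End Sub
End Module
```

The initializer on the Static local runs only once, so the second call continues counting from 1 instead of starting over.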

And, Or, Not

Using these simple keywords instead of the common &&, ||, and ! is more approachable and actually easier to type, because the latter require shift-key combinations. (Strictly speaking, And and Or always evaluate both operands; the short-circuiting equivalents of && and || are AndAlso and OrElse.) The C-style symbols are also fairly arbitrary and make C-style languages just a little harder to learn and use.
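One detail worth knowing when coming from C-style languages: And evaluates both operands, while AndAlso stops as soon as the result is known. A minimal sketch, with helper names that are purely illustrative, counts how many operands actually get evaluated:

```vb
Imports System

Module LogicDemo
    Public Calls As Integer = 0

    ' Records that it was called, then returns the value it was given.
    Function Touch(ByVal result As Boolean) As Boolean
        Calls += 1
        Return result
    End Function

    Function CallsWithAnd() As Integer
        Calls = 0
        Dim r As Boolean = Touch(False) And Touch(True)
        Return Calls
    End Function

    Function CallsWithAndAlso() As Integer
        Calls = 0
        Dim r As Boolean = Touch(False) AndAlso Touch(True)
        Return Calls
    End Function

    Sub Main()
        Console.WriteLine(CallsWithAnd())       ' prints 2
        Console.WriteLine(CallsWithAndAlso())   ' prints 1
    End Sub
End Module
```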

ByRef, ByVal

As in C#, these do require you to understand the concepts of passing by value versus passing by reference, but I find the VB keywords a little clearer. Allowing the programmer to be explicit about ByVal also enhances readability, because I no longer have to remember which is the default. (It’s ByVal.)

Do, While, Until, Loop

I find the VB loop syntax to be flexible enough for any purpose, and much more intuitive than Java, C#, or C++. All the following are allowed.

Do Until x 
Loop 
Do While x 
Loop 
Do 
Loop Until x 
Do 
Loop While x

You can also Continue or Exit a loop explicitly.
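The practical difference between the forms is where the condition is tested. This sketch (function names are made up for the example) uses a condition that is already satisfied before the loop starts, and counts how many passes each form makes:

```vb
Imports System

Module LoopDemo
    ' Pre-test: the condition is checked before the first pass.
    Function PreTestPasses() As Integer
        Dim passes As Integer = 0
        Do Until passes >= 0
            passes += 1
        Loop
        Return passes
    End Function

    ' Post-test: the body always runs at least once.
    Function PostTestPasses() As Integer
        Dim passes As Integer = 0
        Do
            passes += 1
        Loop Until passes >= 0
        Return passes
    End Function

    Sub Main()
        Console.WriteLine(PreTestPasses())    ' prints 0
        Console.WriteLine(PostTestPasses())   ' prints 1
    End Sub
End Module
```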

For, Each, In

I think the VB iteration syntax is nearly perfect.

For Each x In y

is not so different from C#:

foreach (x in iterable)

but the required parentheses seem to break up the statement in a weird place.

Java is worse, because it uses an arbitrary ‘:’ symbol that you have to train yourself to read as “in”.

for (x : iterable)

For, To, Step

The counted loop is also fairly simple, and I think much easier to read than C#, Java, C++ equivalents.

For c = 0 To 9 
For n = 1 To 20 Step 2

as compared to :

for (c = 0; c < 10; ++c) 
for (n = 1; n <= 20; n += 2)

Summary

The above is certainly not an exhaustive list of the well-named features in VB, and I certainly don’t imply that all VB keywords have good names. (Later, I’ll explore in more detail just which keywords I think should be renamed, and I even advocate removing almost 90 keywords.) However, overall I find most of the VB keywords to be more intuitive than alternative languages, especially for those who don’t have preconceived notions of what a concept should be named. (e.g. virtual) This makes the language more approachable for beginners, but also easier for experienced programmers.

Line Oriented

Another major feature of VB is its preference for having one statement per line. This is a common practice by many programmers in Java, C#, and C++ as well, but VB has features to encourage the practice.

No Semi-colon necessary

In C-based languages you are required to put a semi-colon at the end of each statement, and an end-of-line has no special meaning. This allows a single statement to span multiple lines, and multiple statements to reside on a single line. I find that both of these practices tend to make code difficult to read. Tools such as Eclipse/Java have built-in code formatters to ensure that each statement resides on a single line, and to ensure that statements that span multiple lines have a consistent format.

In VB, the default is for each statement to be terminated at the end of the line unless an underscore ‘_’ is placed at the end of the line, allowing a single statement to span multiple lines. VB also allows a colon ‘:’ between statements to force multiple statements onto one line. As you’ll see later, I’d actually like these exceptions removed to force the programmer to find a more readable line-oriented style.

Assignment Side Effects

In VB, assignment is a statement, not an expression, and has no return value.

In C++ terms, the VB assignment operator would be declared like this:

void T::operator=(const T& rhs);

If I have an example Java program like this:

if (x == 1000) { 
  x = y = z = 0; 
}

Then in VB, I must write it as:

If x = 1000 Then 
  x = 0 
  y = 0 
  z = 0 
End If

Alternatively, I could use the colon:

If x = 1000 Then 
  x = 0: y = 0: z = 0 
End If

One other side-effect is that VB is able to disambiguate assignment from equality checking. The common problem of typing “if (x = y)” instead of “if (x == y)” in C-based languages is eliminated, because “If x = y Then” always refers to equality checking and “x = y” always refers to assignment. (Later, I discuss why I think VB should automatically display a different operator for assignment.)
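A small sketch shows the disambiguation at work. The names are illustrative; the point is that the same symbol is read as assignment in statement position and as comparison everywhere else:

```vb
Imports System

Module AssignDemo
    Function Compare(ByVal b As Integer, ByVal c As Integer) As Boolean
        Dim a As Boolean
        ' The first "=" is the assignment; the second "=" is a comparison.
        a = b = c
        Return a
    End Function

    Sub Main()
        Console.WriteLine(Compare(5, 5))   ' prints True
        Console.WriteLine(Compare(5, 6))   ' prints False
    End Sub
End Module
```

So “a = b = c” never chains assignments the way it would in a C-based language; it stores the result of comparing b with c.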

Syntax Validation

When you press the Enter key, VB knows that the statement is likely complete, and usually does syntax validation as well as possible automatic syntax cleanup. For example, if you type the first line of a multi-line expression such as “If x Then”, VB will automatically insert a closing “End If”, and put the insertion point at the correct spot, formatting your code as necessary. In practice this makes it feel as if VB is really aware of what the programmer is trying to do, and the IDE is an active partner in code construction. Other tools have some of these features, but I’ve never seen another tool be as unintrusive and natural about it as VB, although with VB.NET some things aren’t quite as smooth as in the past, which I’ll discuss later.

Case Insensitive

It’s pretty well known that VB is a case-insensitive language, but some may not realize just how much of a benefit this is to the programmer.

Faster Typing

Despite its propensity for fairly verbose keywords, I’ve always found that actually typing in VB programs is often faster than with other languages. This is partly because you rarely need to use two-key combinations. The IDE will automatically capitalize symbols to match the declaration. My educated guess is that VB will be 10-20% faster to type than an equivalent language like C# or Java. The key is to allow the IDE to share the load, and when switching to VB from another language I usually need some time to train myself not to capitalize symbols, not to add unnecessary parentheses, and to use caps-lock effectively.

Here’s an example. Type in:

mustinherit class Foo<Enter>public mustoverride function Count as integer<Enter><Delete Line><End><Enter>class Bar<Enter>inherits foo<Enter><Down><Down>return 0<Down>

And what you will see is:

MustInherit Class Foo 
  Public MustOverride Function Count() As Integer 
End Class 

Class Bar 
  Inherits Foo 
  Public Overrides Function Count() As Integer 
    Return 0 
  End Function 
End Class

Only 3 two-key combinations are required for the above in VB. (The initial declarations of Foo, Count, and Bar.) For comparison, the same code requires at least 16 two-key combinations in C#. Furthermore, the VB code doesn’t require cleanup for formatting and indentation, making it even faster to write equivalent code.

Better Naming

By disallowing multiple symbols with names differing only by case, VB forces you to pick less ambiguous names. This leads to better readability than common code in other languages. For example, I’ve often seen code like the following:

Foo foo = new Foo();

If you read this aloud then both “foo” symbols sound exactly alike (which is the definition of less-readable).

To make the best use of VB, you should also pick a naming convention that works well. For example, by always prepending or appending an underscore to private fields, you are increasing the number of two-key combinations required. A better convention is to prepend “my” or “m” to private fields, and to use camel-case for all other names.

Other

Concatenation Operator

Using the same operator for addition and concatenation leads to less readable code. By having a separate concatenation operator, the following code can compile.

Dim s As String = 123 & "456"

Note: If the + operator were used instead, then this would be a compile error with Option Strict On.
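Here is the same idea as a complete program, assuming Option Strict On so that no implicit narrowing conversions are allowed. The module and function names are made up for the example.

```vb
Option Strict On
Imports System

Module ConcatDemo
    ' & always treats both operands as strings, even under Option Strict On.
    Function Joined() As String
        Return 123 & "456"
    End Function

    Sub Main()
        Console.WriteLine(Joined())   ' prints 123456
        ' Dim s As String = 123 + "456"   ' rejected by the compiler under Option Strict On
    End Sub
End Module
```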

Optional Duck Typing

VB supports optional relaxing of type rules. When “Option Strict Off” is specified for a file, then you no longer have to explicitly specify the type when declaring variables. VB will also automatically convert types. For example:

Console.WriteLine(123 + "456") 

will print “579”, not “123456”.

VB will also allow access to any public member of a type as long as it can be found at runtime. This is what’s known in Python as “Duck Typing”, because "If it walks like a duck and quacks like a duck, it must be a duck". VB is the only language I know which allows selectively enabling this feature.
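As a rough sketch of what this looks like in practice, the following file turns Option Strict off and calls a method on a plain Object. The Duck and Robot classes are invented for the example and share no base class or interface; the call is resolved by name at runtime.

```vb
Option Strict Off
Imports System

Public Class Duck
    Public Function Speak() As String
        Return "Quack"
    End Function
End Class

Public Class Robot
    Public Function Speak() As String
        Return "Beep"
    End Function
End Class

Module DuckDemo
    ' thing is only known to be an Object; Speak is looked up at runtime.
    Function MakeItSpeak(ByVal thing As Object) As String
        Return thing.Speak()
    End Function

    Sub Main()
        Console.WriteLine(MakeItSpeak(New Duck()))    ' prints Quack
        Console.WriteLine(MakeItSpeak(New Robot()))   ' prints Beep
    End Sub
End Module
```

Passing an object without a public Speak method would still compile, and would fail only when the call is reached at runtime.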

Optional Explicit Declaration

Completely separate from the question of type safety is whether to require variable declarations. Even when duck typing is desired, often it’s nice to still enforce variable declaration. This can eliminate some common typos, but can be selectively disabled just as with type safety.

Declarative Events

VB supports the same events and delegates as C#, but also provides the ability to specify event handlers in the signature of a method.

Class Foo 
  WithEvents myCon As Bar.Connection 
  
  Sub OnConnected() Handles myCon.ConnectedEvent 
  End Sub 
End Class

In this example, the OnConnected method is automatically “wired-up” to the myCon Connection object. This can often be more readable than the alternatives.
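A self-contained version of the same pattern might look like the sketch below. The Connection and Client classes and the Connected event are stand-ins I made up so the example can run on its own; they are not the Bar.Connection type from the snippet above.

```vb
Imports System

Public Class Connection
    Public Event Connected()

    Public Sub Open()
        RaiseEvent Connected()
    End Sub
End Class

Public Class Client
    ' WithEvents makes the field's events available to Handles clauses.
    Private WithEvents myCon As New Connection()
    Public ConnectedCount As Integer = 0

    ' Wired up declaratively; no AddHandler call is needed.
    Private Sub OnConnected() Handles myCon.Connected
        ConnectedCount += 1
    End Sub

    Public Sub Connect()
        myCon.Open()
    End Sub
End Class

Module EventDemo
    Sub Main()
        Dim c As New Client()
        c.Connect()
        Console.WriteLine(c.ConnectedCount)   ' prints 1
    End Sub
End Module
```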

Reference Parameters

Occasionally it’s useful to be able to pass arguments by reference. This is supported by most (all?) .NET languages.

Sub ChangeTwoInts(ByRef n As Int32, ByRef m As Int32) 
  n += 1 
  m += 2 
End Sub

Named Parameters

VB allows specifying explicit names for parameters. One place this can be useful is when calling a function that takes a Boolean.

Public Sub DoSomething(ByVal isFinished As Boolean) 
End Sub 
b.DoSomething(isFinished:=True)

This makes the code much more readable, and eliminates the need for a comment or temporary variable to clarify the code.

Optional Parameters

VB also allows optional parameters, although using this feature for public methods is discouraged because doing so makes the code non-portable. For example, C# doesn’t support code with optional parameters. Still, optional parameters can eliminate code duplication that would be required with method overloading, and shouldn’t be avoided for internal functionality or when portability isn’t important.
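For illustration, here is a minimal sketch of an optional parameter replacing a pair of overloads. The Greet function and its default value are hypothetical.

```vb
Imports System

Module OptionalDemo
    ' One declaration covers both the one-argument and two-argument calls.
    Function Greet(ByVal name As String, Optional ByVal greeting As String = "Hello") As String
        Return greeting & ", " & name
    End Function

    Sub Main()
        Console.WriteLine(Greet("Ann"))              ' prints Hello, Ann
        Console.WriteLine(Greet("Ann", "Goodbye"))   ' prints Goodbye, Ann
    End Sub
End Module
```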

Parameter Arrays

VB supports creating methods that take 0 or more optional arguments of a single type. For example:

Public Sub Foo(ByVal ParamArray args() As String)

This function accepts Foo(), Foo(“a”), Foo(“a”, “b”), etc.
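A runnable sketch of the same idea, using a made-up JoinAll function that returns a value so the effect of each call is visible:

```vb
Imports System

Module ParamArrayDemo
    ' args arrives as an ordinary array, empty when no arguments are passed.
    Function JoinAll(ByVal ParamArray args() As String) As String
        Return String.Join(",", args)
    End Function

    Sub Main()
        Console.WriteLine(JoinAll().Length)    ' prints 0
        Console.WriteLine(JoinAll("a"))        ' prints a
        Console.WriteLine(JoinAll("a", "b"))   ' prints a,b
    End Sub
End Module
```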

Powerful Select

The VB Select statement is similar to the switch statement provided in C-like languages, except that it’s more flexible.

Here are examples that use the advanced VB features.

Select Case age 
Case 0, 5, 10 
  Console.WriteLine("0,5,10") 
Case Is > 50 
  Console.WriteLine(">50") 
Case Is < 10 
  Console.WriteLine("<10") 
Case 16 To 19, Is > 30 
  Console.WriteLine("16-19 or > 30") 
Case Else 
  Console.WriteLine("else") 
End Select 

Dim name As String = "Justin" 
Select Case name 
Case "Abe" To "Barney" 
  Console.WriteLine("a-b") 
Case "Boris" To "Kevin" 
  Console.WriteLine("b-k") 
Case "Kevin" To "Vincent" 
  Console.WriteLine("k-v") 
Case Else 
  Console.WriteLine("else") 
End Select

Full Featured Exception Catching

The .NET framework supports an optional filter for catch statements. VB exposes this functionality, allowing you to catch exceptions only when some criterion is met. For example:

Try 
  ' do something 
Catch ex As Exception When LoggingIsEnabled 
  Log(ex) 
End Try
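To see the filter actually decide whether a handler runs, here is a small sketch. LoggingIsEnabled is modeled as a plain module-level flag and the nested Try is only there to show where a rejected exception ends up; both are assumptions made for the example.

```vb
Imports System

Module FilterDemo
    Public LoggingIsEnabled As Boolean = False

    Function Run() As String
        Try
            Try
                Throw New InvalidOperationException("boom")
            Catch ex As Exception When LoggingIsEnabled
                ' Entered only when the filter evaluates to True.
                Return "handled by filtered catch"
            End Try
        Catch ex As InvalidOperationException
            ' Reached when the inner filter rejected the exception.
            Return "handled by outer catch"
        End Try
    End Function

    Sub Main()
        Console.WriteLine(Run())   ' prints handled by outer catch
        LoggingIsEnabled = True
        Console.WriteLine(Run())   ' prints handled by filtered catch
    End Sub
End Module
```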

Refactoring

VB has some basic Refactoring support built in, and a license for Refactor! which provides even more. I’m generally not a big fan of refactoring tools, but the VB stuff is pretty intuitive and unintrusive. The important ability to rename things is what I use most.

Powerful Imports

VB can use the Imports statement for more than just namespaces. For example, if I include “Imports System.Console” at the top of a file, then I can call “WriteLine()” directly without having to specify the Console class.
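A short sketch of importing classes this way, using System.Console and System.Math. The Hypotenuse function is just an example of mine.

```vb
Imports System.Console
Imports System.Math

Module ImportsDemo
    ' Sqrt resolves to System.Math.Sqrt because the Math class is imported.
    Function Hypotenuse(ByVal a As Double, ByVal b As Double) As Double
        Return Sqrt(a * a + b * b)
    End Function

    Sub Main()
        ' WriteLine resolves to System.Console.WriteLine.
        WriteLine(Hypotenuse(3, 4))   ' prints 5
    End Sub
End Module
```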

Summary

Despite the length of the above list, I’m sure I’ve forgotten some features, but maybe it will entice you to download the free Express version and try it yourself. Overall I think that VB is the best language for most programming, and it should get even better in the future. The next version will likely introduce Type Inference, Lambda Expressions, Closures, LINQ, and other enhancements. In my next post I’ll discuss a large list of new features and changes that I personally would like to see in a future version of VB.

January 16

Visual Basic

I’ve been wanting to write some posts about Visual Basic for a long time. It’s one of the reasons I finally started blogging. While I haven’t been able to use VB for much beyond personal projects for the last 8 years, I had used it pretty extensively in the past. More recently, I’ve been using VB.NET for various personal projects, and I thought I’d share some of my thoughts and experiences.

It seems that VB has always been misunderstood. Versions <= 6 have an undeserved reputation as a “toy” language, and this has carried over to some extent for VB.NET, even though it’s now almost identical to Java or C#. I feel VB was also misunderstood as a tool only suitable for database front-end applications. I actually never felt it was a very good tool for writing these types of programs. MS Access (which I suppose can be considered a VB dialect) was a far better choice, because of its improved support for bound controls and built-in reporting tools. For years VB was the most logical choice for many types of applications, but my problem was that it wasn’t terribly suitable for the kinds of programming that I prefer.

With the advent of VB for .NET, I think VB may actually be the best choice for writing many applications that have been traditionally developed in C++. The Java community has proved that it’s at least possible to write Databases, Compilers, Development Tools, and other systems programs using a similar platform, and from a purely technical perspective, I think Visual Basic is a better choice than C# or Java for many of these kinds of applications. In coming posts, I’ll detail my reasons for believing so as well as a pretty exhaustive list of what I think could/should be done to make VB even better in the future.

In my next post, I’ll discuss what I think are the best features of VB.

December 12

Lambda Expressions In Visual Basic

Paul Vick had a post last week, where he solicited feedback on syntax for this feature that will likely be included in a future version of Visual Basic.

http://www.panopticoncentral.net/archive/2006/12/08/18587.aspx#FeedBack

I originally posted my opinion in a comment on his blog, but it apparently didn't make it past the censors. He proposed 3 possible syntax variations, but I'm not fond of any of them. My proposal is a fourth style, and I think it has a lot of merit.

What's a Lambda Expression?

This is probably better explained elsewhere, but basically it is an anonymous function.

For example, given the following code, ...

Class Test
  Private myNames As List(Of Name)
  ...
  Public Sub PrintLegalNames()
    For Each n As Name In myNames
      Console.WriteLine(n.ToString())
    Next
  End Sub
End Class

we could replace the explicit loop using List.ForEach.

Public Sub PrintLegalNames()
  myNames.ForEach(AddressOf WriteName)
End Sub

However, for this to work, we have to provide a suitable WriteName function.

Public Sub WriteName(ByVal n As Name) 
  Console.WriteLine(n.ToString()) 
End Sub

My proposal is to support the following syntax.

Public Sub PrintLegalNames()
  myNames.ForEach(Console.WriteLine(ByVal(0).ToString()))
End Sub

The lambda expression is just a normal expression that uses parameters and return statements inline. Here are some more examples that show multiple parameters, functions, and ByRef parameters compared to roughly equivalent code.

Dim ary(0 to 9) As Integer

Array.Find(ary, Return ByVal(0) = 2) 
Function Find2(ByVal x As Integer) As Boolean
  Return x = 2
End Function
Array.Find(ary, AddressOf Find2)


Array.ForEach(ary, ByRef(0) += 1) 
Sub AddOne(ByRef x As Integer)
  x += 1
End Sub
Array.ForEach(ary, AddressOf AddOne)


Array.Sort(ary, Return ByVal(0) < ByVal(1))
Function Less(ByVal lhs As Integer, ByVal rhs As Integer) As Boolean
  Return lhs < rhs
End Function
Array.Sort(ary, AddressOf Less)
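For comparison, here is a rough sketch of the same three operations in C++11, which already has lambda expressions. It only illustrates what the anonymous functions do and is not the proposed VB syntax. The helper names (find_two, add_one, sort_ascending) are made up for this example, and find_two returns a position rather than the element itself.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helpers mirroring the three examples above.

// Like the Find example: position of the first element equal to 2, or -1.
int find_two(const std::vector<int>& v) {
  auto it = std::find_if(v.begin(), v.end(), [](int x) { return x == 2; });
  return it == v.end() ? -1 : static_cast<int>(it - v.begin());
}

// Like the ForEach example: the reference parameter lets the lambda
// modify each element in place.
void add_one(std::vector<int>& v) {
  std::for_each(v.begin(), v.end(), [](int& x) { x += 1; });
}

// Like the Sort example: the lambda is the "less than" comparison.
void sort_ascending(std::vector<int>& v) {
  std::sort(v.begin(), v.end(), [](int lhs, int rhs) { return lhs < rhs; });
  assert(std::is_sorted(v.begin(), v.end()));  // postcondition
}
```

In each case the lambda replaces a named one-line function that would otherwise have to be declared somewhere else in the class.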

What about closures?

It wasn't mentioned in Paul's post, but I think closures are also a necessary feature so that the following code will work.

Sub SendPartyInvitation(ByVal state As State)
  ...
  Dim legal As Integer = state.LegalDrinkingAge
  Dim names As List(Of Name) = CreateList()
  ...
  tmp = List.FindAll(names, Return ByVal(0).Age >= legal)
  ...
End Sub

The key feature is that the "legal" local variable is available for use within the lambda expression, which would not be the case if each expression were simply an anonymous function.
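As an illustration of what a closure buys you, here is a minimal C++11 sketch of the same filter. The Name struct and the find_legal function are hypothetical stand-ins for the types used above. The point is the capture list [legal], which copies the local variable into the lambda.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical stand-in for the Name class used above.
struct Name {
  std::string text;
  int age;
};

// Returns the entries whose age is at least `legal`.
// The lambda captures the local `legal` by value; that capture is what
// makes it a closure rather than a plain anonymous function.
std::vector<Name> find_legal(const std::vector<Name>& names, int legal) {
  assert(legal >= 0);  // ages are assumed to be non-negative
  std::vector<Name> result;
  std::copy_if(names.begin(), names.end(), std::back_inserter(result),
               [legal](const Name& n) { return n.age >= legal; });
  return result;
}
```

Without the capture, the predicate would have no way to see "legal" short of a global variable or a hand-written function object.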

November 18

Creating Text Files - Downloads

The code (create_text_file.zip) can now be found in my SkyDrive public folder.
 
To build the C++ samples, you'll need to create project files using MPC. The C++/CLI and VB.NET projects use Visual Studio 2005 (The free Express edition will probably work.) The Java projects use Eclipse and Java 1.5.
 
MPC is a program that a coworker and I designed to make working with C++ projects much easier. Most of the C++ code that we write needs to run on a wide variety of platforms, and supporting all the different build tools can be a pain. Rather than including project files or makefiles for VC++, nmake, bmake, gmake, etc, we just create simple text files that can be used to generate any of the above.
 
How simple?
 
Here's the complete set of MPC files for all seven C++ projects:
 
create_text_file/create_text_file.mpb -- This is a Base project that is used to set defaults.
project {
  after += timer
  libs += timer
  libpaths += ../timer
  includes += ../timer
}
This basically says that all projects that derive from this base are dependent on a timer library found at ../timer.
 
create_text_file/create_text_file.mwc -- This is a workspace file used to group related projects
workspace {
  cmdline += -static
  implicit = create_text_file
  exclude {
    FileStreamDotNet
  }
}
This says to automatically create a project in any suitable subdirectories containing either C++ source code or .mpc files. Each implicitly created project will automatically inherit from the .mpb above. We also exclude one directory, because it contains a C++/CLI .NET project.
 
create_text_file/timer/timer.mpc -- This is an explicit project file
project {
  staticname = timer
}
 
This file causes a project to be created with the name timer. There are staticname and sharedname keywords, allowing you to specify different names depending on whether you generate a project to create a dynamic or static library.  If only one name is specified then the other defaults to the same name.
Actually, the timer.mpc file is totally unnecessary, because it merely specifies the same behavior as the default.
 
I also explicitly specify that a static library will be created by including a cmdline += -static in the .mwc.
 
 
You can read more about MPC and download it here.
To run MPC, you'll also need Perl.
 
To create the Visual Studio 2005 version of these projects unzip create_text_file.zip to a directory, and run mwc.pl like so:
 
c:\create_text_file>mwc.pl -type vc8
 
This should create a .SLN file which can then be used to build and run the projects.
 
 
November 17

Creating Text Files - Coming Soon...

I forgot to include the source code in my previous posts, but now I can't figure out an easy way to do it. Why would my blogging server allow posting pictures, but not other files? It looks like I'm going to have to find some other server to hold the files. What a hassle.

November 16

Creating Text Files - Conclusion

SCSI Blues

On the one SCSI machine, the write speed was 2-5MB/s regardless of implementation or configuration options if WriteThrough was enabled. This probably just means that the drive does a more thorough job of honoring the WriteThrough setting. However the SCSI drive should have been able to achieve something close to its 125MB/s specified transfer rate.

 

One thing I tried to improve the performance on the SCSI machine was to install Windows Server 2003 R2. This had no benefit until I enabled a new “Enable advanced performance” option for the disk driver. This seemed to help quite a bit with the slowest tests now in the 9MB/s range. This is still quite a bit slower than the fastest system (#1) however.

results (win32 2000MB)

Write Through + Disable Caching + Defrag = 9MB/s @ <1% CPU

As above with 16MB buffer = 53MB/s @ <1% CPU

Write Through + Defrag + 16MB buffer = 58MB/s @ 28% CPU

Defrag = 55MB/s @ 20% CPU

Defrag + 16MB buffer = 47MB/s @ 21% CPU

compared to system #1

Write Through + Disable Caching + Defrag = 55MB/s @ <2% CPU

As above with 1MB buffer = 68MB/s @ <1% CPU (>1MB made no difference)

Write Through + Defrag + 1MB buffer = 68MB/s @ 8% CPU

Defrag = 90MB/s @ 10% CPU

Defrag + 1MB buffer = 56MB/s @ 6% CPU

A Grain of Salt

The performance measurements above paint a certain picture which can be a little misleading. One thing they don’t show is subjective disk thrashing that seemed to occur without Defragmenting. They also don’t show how repeatable the results were from run to run. The numbers only represent the best result I was able to achieve with the given options, but some tests were very consistent, while others varied widely. For instance the .NET results seemed to vary more than the others.

I also found that the Defrag option made a bigger difference on my older home machine #2, than it did on my work machine #1.  I suspect this was due to improvements in the newer version of the hard disk.

Before putting too much stock in these measurements, you should run them using your own platforms and compilers.

Conclusions

  • Measure performance using a simple application that has similar disk access patterns to your real application. (Or a subset of your application.)
  • Use the mechanisms above to help ensure that files are written defragmented.
  • If you have to use std::ofstream then performance may be greatly improved by providing a larger buffer, using ios::binary, preventing excessive calls to tellp(),  writing string::c_str() instead of string, and using ostream::write() instead of operator<<().
  • If you must use std::ostream, but need more powerful features such as writing encrypted or compressed files, automatically deleting a file when it’s closed, specifying caching hints to the OS, or just want better performance, then write a custom std::streambuf implementation similar to std::filebuf.
  • For the ultimate performance consider bypassing the std::ostream facility entirely.
  • SCSI drive systems behave differently, and must be tested separately. When using SCSI, you can get the best performance by using a very large buffer, and specifying WriteThrough and DisableCaching.
  • Let your ears guide you. A noisy disk drive may indicate fundamental problems with your design.
  • DotNet provided pretty good performance and features for very little effort, although some features such as Disable Buffering are inexplicably missing.
  • Java was not very performant or feature filled despite the NIO implementation. It was also very difficult to write the NIO version.

System 1 Results for 100MB

Test                    MB/s  CPU%  Defrag  WriteThrough  NoCache
ofstream_unopt            11   100       0             0        0
ofstream                  40    50       1             0        0
                          80   100       0             0        0
fast_ofstream             30    42       0             1        0
                          40    50       1             1        0
                          43    65       1             0        1
                          75   100       0             0        0
                          75   100       1             0        0
Custom FileStream        255   100       0             0        0
                         267    96       1             0        0
                          55    12       0             1        0
                          68    19       1             1        0
                          53    50       0             0        1
                          77    36       1             0        1
                          71    37       1             1        1
VB 2005                  206   100       0             0        0
                         213   100       1             0        0
                          59    40       0             1        0
                          63    38       1             1        0
C++ CLI                  228   100       0             0        0
                         237   100       1             0        0
                          53    32       0             1        0
                          66    38       1             1        0
Java FileOutputStream     26   100       0             0        0
                          20   100       1             0        0
Java NIO                  22   100       0             0        0
                          24   100       1             0        0

 


Creating Text Files - Other Languages and Platforms

Some of the C++ programs above (fast_ofstream in particular) were fairly difficult to implement, and I wanted to see what could be achieved using a more rapid development tool. So far I’ve implemented the test in Visual Basic 2005, and Java (2 ways). Adam Mitz (a coworker) also contributed a C++/CLI implementation for comparison.

I also tested many of the programs on Macintosh as well as Linux, and these platforms achieved similar benefits.

In the future I’d like to try the tests under Mono on Linux, and possibly implement a Java version using FileOutputStream, since NIO didn’t seem to provide an advantage.

I ran tests on multiple systems.

  1. WinXP Core 2 Duo 6700 w/ 10Krpm drive and 2GB RAM
  2. WinXP Athlon64 3400 w/ 10Krpm drive and 1GB RAM
  3. WinXP Dual AthlonMP 2800 w/ 15Krpm SCSI drive and 1.5GB RAM
  4. Linux Athlon64 3500 w/ 10Krpm drive and 1GB RAM
  5. OSX Mac Mini Core Duo
  6. WinXP Mac Mini Core Duo

The results quoted throughout this article are those from the first configuration above, and this system unsurprisingly also gave the best performance.

All of these systems gave very similar relative results with the exception of #3, which seemed due primarily to the use of a SCSI disk drive.

Visual Basic 2005

I chose to implement the test in VB, because I find the language and IDE allow me to write more quickly than other .NET alternatives. I think the VB language has a lot of problems (such as some of the keyword names), but I find that case insensitivity, line orientation, English keywords, and other features make it much easier to work with than the C-like alternatives.

In the end it only took a few hours to create the VB version of the program, and it was actually this program that led me to create the fastest C++ programs. I was pretty satisfied with C++ fast_ofstream when it was running at 40MB/s, but the VB program was much faster with much less development effort. This led me to develop the C++ FileStream program which then led to further improvements in fast_ofstream.

No Options = 206MB/s @ 100% CPU

Defrag Only = 213MB/s @ 100% CPU

Write Through = 59MB/s @ 40% CPU

Write Through + Defrag = 63MB/s @ 38% CPU

One problem with .NET is that the System.IO.FileStream object doesn’t support disabling the system cache. I was therefore unable to measure that option.

Another problem is that all Strings are Unicode, and there is some small overhead outside the main loop to encode them as ASCII. While this may affect this test, it also would be trivial to change the VB program to write out UTF8, UTF16, UTF32, and other formats. I wouldn’t even want to try this with the C++ version.

For what it’s worth the VB program, despite the apparent verbosity of the language, was actually shorter than most of the C++ versions.

Sub WriteSampleFile()
  Dim pt As New PerfTimer
  Dim fname As String = OUT_DIR & FILESIZE_MB & "MB.txt"
  Dim fs As FileStream = CreateFile(fname)
  Dim numLines As Long
  Dim totalBytes As Int64 = FILESIZE_MB * 1024L * 1024L
  Using fs
    Console.WriteLine("Creating {0}MB file.", FILESIZE_MB)
    If DEFRAGMENTED Then
      Dim reserve As Long = ((totalBytes \ BUFSIZE) + 1) * BUFSIZE
      fs.SetLength(reserve)
    End If
    Dim enc As Text.Encoding = Text.Encoding.ASCII
    Dim msg1() As Byte = enc.GetBytes("All work and no play makes ")
    Dim msg2() As Byte = enc.GetBytes(" a dull boy." & Environment.NewLine)
    Dim tabs() As Byte = {9, 9, 9, 9, 9, 9, 9, 9}
    Dim names As List(Of Byte()) = CreateNames()
    Do
      numLines += 1
      fs.Write(tabs, 0, CInt(numLines Mod tabs.Length))
      fs.Write(msg1, 0, msg1.Length)
      Dim name() As Byte = names.Item(CInt(numLines Mod names.Count))
      fs.Write(name, 0, name.Length)
      fs.Write(msg2, 0, msg2.Length)
    Loop Until fs.Position >= totalBytes
    fs.SetLength(totalBytes)
  End Using
  pt.PrintElapsed("Done. ")
  Dim tp As Double = CalcThroughput(FILESIZE_MB, pt.ElapsedWall)
  Console.WriteLine("Wrote {0} lines at {1} MB/s", numLines, tp)
End Sub

C++/CLI

I want to thank my coworker Adam Mitz for contributing a C++/CLI version of the program. This version was a little faster, but .NET performance varies quite a bit, and the VB version and C++/CLI version have about the same performance overall.

No Options = 228MB/s @ 100% CPU

Defrag Only = 237MB/s @ 100% CPU

Write Through = 53MB/s @ 32% CPU

Write Through + Defrag = 66MB/s @ 38% CPU

int main(array<String^>^)
{
  using IO::FileStream;
  PerfTimer pt;
  String^ fname = gcnew String(OUT_DIR);
  fname += FILESIZE_MB;
  fname += "MB.txt";
  long long totalBytes = FILESIZE_MB * 1024 * 1024;
  Console::WriteLine("Creating {0}MB file.", FILESIZE_MB);
  FileStream^ fs = CreateFile(fname);
  if(DEFRAGMENTED)
  {
    long long reserve = static_cast<long long>(((totalBytes / (double)BUFSIZE) + 1) * BUFSIZE);
    fs->SetLength(reserve);
  }
  array<unsigned char>^ msg1 = Text::Encoding::ASCII->GetBytes("All work and no play makes ");
  array<unsigned char>^ msg2 = Text::Encoding::ASCII->GetBytes(" a dull boy.\n");
  array<unsigned char>^ tabs = {9, 9, 9, 9, 9, 9, 9, 9};
  array<array<unsigned char>^>^ names =
  {
    Text::Encoding::ASCII->GetBytes("Jack"),
    Text::Encoding::ASCII->GetBytes("Justin Michel"),
    Text::Encoding::ASCII->GetBytes("Fred Flintstone"),
    Text::Encoding::ASCII->GetBytes("Barney Rubble"),
    Text::Encoding::ASCII->GetBytes("Homer J. Simpson"),
    Text::Encoding::ASCII->GetBytes("John Jacob Jingleheimer Schmidt")
  };
  long numLines(0);
  do
  {
    ++numLines;
    fs->Write(tabs, 0, numLines % tabs->Length);
    fs->Write(msg1, 0, msg1->Length);
    array<unsigned char>^ name = names[numLines % names->Length];
    fs->Write(name, 0, name->Length);
    fs->Write(msg2, 0, msg2->Length);
  }
  while(fs->Position < totalBytes);
  fs->SetLength(totalBytes);
  pt.PrintElapsed("Done. ");
  double tp = CalcThroughput(FILESIZE_MB, pt.ElapsedWall);
  Console::WriteLine("Wrote {0} lines at {1} MB/s", numLines, tp);
  return 0;
}

Java NIO

I first implemented the Java version using NIO rather than the much simpler FileOutputStream, because the latter didn’t appear to support the necessary features to create defragmented files, and I thought that NIO should give the best potential performance. Later, I rewrote the program using FileOutputStream, because the complexity of using NIO didn’t seem to have any perceivable benefit.

The first thing I noticed is that this implementation of the test program was far more difficult than any of the others. This is despite the fact that I have the most recent experience with Java, and had even implemented a project using NIO within the last six months. The biggest problems involved the complication of dealing with ByteBuffer and String encoding. Compare the code needed to send a String encoded as ASCII in VB.NET.

Dim enc As Text.Encoding = Text.Encoding.ASCII
Dim buf() As Byte = enc.GetBytes("This is a test.")
fs.Write(buf, 0, buf.Length)

Here is the equivalent using Java NIO.

Charset cs = Charset.forName("US-ASCII");
CharsetEncoder enc = cs.newEncoder();
enc.onMalformedInput(CodingErrorAction.REPLACE);
enc.onUnmappableCharacter(CodingErrorAction.REPLACE);
String str = "This is a test.";
// ASCII needs one byte per char, so the string length is enough capacity.
ByteBuffer buf = ByteBuffer.allocateDirect(str.length());
enc.reset();
CharBuffer cb = CharBuffer.wrap(str);
CoderResult cr = enc.encode(cb, buf, true);
if (cr == CoderResult.OVERFLOW) {
  throw new Exception("WTF");
} // CoderResult.UNDERFLOW ignored
cr = enc.flush(buf);
if (cr == CoderResult.OVERFLOW) {
  throw new Exception("WTF");
} // CoderResult.UNDERFLOW ignored
buf.flip(); // switch the buffer from filling to draining before writing
channel.write(buf);

Another problem with the Java implementation is that I couldn’t find a good way to implement most of the features. There’s no capability to specify Write Through, Disable Caching, or Sequential options. Furthermore the FileChannel.truncate() method doesn’t support extending a file. I had to resort to using the same Defragment method that I used for ofstream.

No Options = 22MB/s

Defrag Only = 24MB/s

I didn’t know a way to measure CPU usage in Java, but it seemed to be about 100%.

The much simpler FileOutputStream implementation gave similar performance.

No Options = 26MB/s

Defrag Only = 20MB/s

Next, I'll try to wrap this all up with some conclusions and a summary of what we've learned.


Creating Text Files - fast_ofstream and Custom FileStream

fast_ofstream

As we saw with the analysis of the ofstream program we can achieve close to optimal throughput, but not with advanced features such as defragmenting that we require. One solution that doesn’t involve starting from scratch is to write a custom std::streambuf implementation that can be used with the standard C++ io streams. The application using fast_ofstream was able to write a defragmented file with a single loop using tellp() at 75MB/s. If we enable Write Through then the Defragment option becomes important.

reserve

By using our own streambuf implementation, we can implement a better Defragment feature. We use an interface similar to std::vector in which we call a reserve() method passing the desired size in bytes. This is much better, because although we allocate space for the reserved size, we don’t actually have to use the space or know for sure how big the file will be. Any unused space will automatically be reclaimed when the stream is closed.
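The reserve-then-trim idea can be sketched portably with C++17's std::filesystem. This is only an illustration under stated assumptions, not the fast_ofstream implementation: the function name write_with_reserve is made up, and whether resize_file really allocates contiguous disk space (rather than a sparse file) depends on the OS and file system.

```cpp
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <string>

// Hypothetical helper: reserve `reserve_bytes` up front, write `data`,
// then trim the file to the bytes actually written.
// Returns the final file size.
std::uintmax_t write_with_reserve(const std::string& path,
                                  const std::string& data,
                                  std::uintmax_t reserve_bytes) {
  namespace fs = std::filesystem;
  assert(!path.empty());
  {
    // Make sure the file exists and starts out empty.
    std::ofstream create(path, std::ios::binary | std::ios::trunc);
  }
  fs::resize_file(path, reserve_bytes);  // grow the file in one step
  {
    // in|out opens the existing file without truncating it.
    std::fstream out(path, std::ios::in | std::ios::out | std::ios::binary);
    out.write(data.data(), static_cast<std::streamsize>(data.size()));
  }  // stream is closed here
  fs::resize_file(path, data.size());    // reclaim the unused space
  return fs::file_size(path);
}
```

The caller only needs an estimate of the final size. Guessing too high costs nothing once the file is trimmed, and guessing too low simply lets the file grow past the reservation.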

tellp()

Since we’re implementing our own buffer, we can make tellp() more efficient, thereby allowing us to use the simpler code from the original example while still retaining most of the performance benefits.
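One way to make the position cheap is to keep a running byte count inside the stream buffer, so asking for it never touches the file. The sketch below is a hypothetical wrapper (counting_buf and sample_line_bytes are invented names) that forwards to another streambuf. It is not the fast_ofstream code, just the counting technique in isolation.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <streambuf>

// Hypothetical wrapper: forwards all output to another streambuf and keeps
// a running byte count, so the write position is known without a seek.
class counting_buf : public std::streambuf {
  std::streambuf* sink_;
  std::size_t count_ = 0;
public:
  explicit counting_buf(std::streambuf* sink) : sink_(sink) {
    assert(sink != nullptr);
  }
  std::size_t count() const { return count_; }
protected:
  // Called for single characters, because no put area is set up.
  int_type overflow(int_type ch) override {
    if (traits_type::eq_int_type(ch, traits_type::eof()))
      return traits_type::not_eof(ch);
    int_type r = sink_->sputc(traits_type::to_char_type(ch));
    if (traits_type::eq_int_type(r, traits_type::eof()))
      return traits_type::eof();
    ++count_;
    return ch;
  }
  // Called for blocks of characters, e.g. from ostream::write.
  std::streamsize xsputn(const char* s, std::streamsize n) override {
    std::streamsize written = sink_->sputn(s, n);
    count_ += static_cast<std::size_t>(written);
    return written;
  }
  int sync() override { return sink_->pubsync(); }
};

// Usage sketch: bytes produced by one sample line written via the wrapper.
std::size_t sample_line_bytes() {
  std::ostringstream target;
  counting_buf buf(target.rdbuf());
  std::ostream out(&buf);
  out << "\t\tAll work and no play makes " << "Jack" << " a dull boy.\n";
  return buf.count();
}
```

With something like this in place, the writing loop can compare count() against the target size on every line instead of calling tellp().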

final results (100MB)

Write Through = 30 MB/s @ 42% CPU

Write Through + Defragment = 40 MB/s @ 50% CPU

DisableCaching = 40 MB/s @ 65% CPU

DisableCaching + Defragment = 43 MB/s @ 65% CPU

No Options = 75MB/s @ 100% CPU

Defragment Only = 75MB/s @ 100% CPU

The speed is close to what we observed with the best case using the default ofstream, but we’re able to use more advanced features, the most important being Defragment. Defragment is crucial to avoid overall file system fragmentation and to ensure optimal read speed. The current fast_ofstream implementation is not very robust, and it’s possible that even faster results could be achieved.

The biggest remaining problems are:

1. We’re still using 100% CPU at least when running at full speed.
2. We’re still not achieving the best performance.

Custom FileStream

It should be possible to achieve even better performance by avoiding the standard C++ iostream interfaces completely. I created a simple class as a test case. It supports all the options identified previously, and has just enough operations to implement the test program.

enum FS_OPTIONS { 
  fsNone = 0, 
  fsDisableCaching = 1, 
  fsWriteThrough = 2, 
  fsSequential = 4 
}; 
class FileStream { 
  FileStreamImpl* impl_; 
public: 
  FileStream(const std::string& fname, FS_OPTIONS opts); 
  ~FileStream(); 
  void write(const std::string& s); 
  void write(const std::string& s, 
  std::string::size_type offset, 
  std::string::size_type length); 
  unsigned long long size() const; 
  void reserve(unsigned long long bytes); 
private: 
  FileStream(const FileStream&); 
  FileStream& operator=(const FileStream&); 
};

final results (100MB)

No Options = 255MB/s @ 100% CPU

Defrag Only = 267MB/s @ 96% CPU

Write Through = 55MB/s @ 12% CPU

Write Through + Defrag = 68 MB/s @ 19% CPU

Disable Cache = 53MB/s @ 50% CPU

Disable Cache + Defrag = 77MB/s @ 36% CPU

Disable Cache + Write Through + Defrag = 71MB/s @ 37% CPU

For many applications this could be worth the effort of implementing a custom FileStream.
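To give a feel for how little code such a class needs, here is a portable sketch built on the C stdio functions. It is a hypothetical stand-in (SimpleFileStream is an invented name), not the FileStream measured above: it does its own batching and tracks its own size, and it leaves out the OS-specific options entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical, portable stand-in for a custom file stream.
// Writes are batched in a private buffer and handed to fwrite in large
// chunks. Write-through and cache control are deliberately omitted.
class SimpleFileStream {
  std::FILE* file_;
  std::vector<char> buffer_;
  unsigned long long size_ = 0;  // total bytes accepted so far
public:
  explicit SimpleFileStream(const std::string& fname,
                            std::size_t bufsize = 64 * 1024)
      : file_(nullptr) {
    assert(!fname.empty());
    file_ = std::fopen(fname.c_str(), "wb");
    if (!file_) throw std::runtime_error("cannot open " + fname);
    std::setvbuf(file_, nullptr, _IONBF, 0);  // we do our own buffering
    buffer_.reserve(bufsize);
  }
  ~SimpleFileStream() {
    flush();
    std::fclose(file_);
  }
  SimpleFileStream(const SimpleFileStream&) = delete;
  SimpleFileStream& operator=(const SimpleFileStream&) = delete;

  void write(const std::string& s) {
    if (buffer_.size() + s.size() > buffer_.capacity()) flush();
    if (s.size() > buffer_.capacity()) {
      std::fwrite(s.data(), 1, s.size(), file_);  // too big to buffer
    } else {
      buffer_.insert(buffer_.end(), s.begin(), s.end());
    }
    size_ += s.size();
  }
  // Known without asking the OS, so it is cheap to call in a loop.
  unsigned long long size() const { return size_; }
  void flush() {
    if (!buffer_.empty())
      std::fwrite(buffer_.data(), 1, buffer_.size(), file_);
    buffer_.clear();
  }
};
```

A real implementation would check the fwrite return values and add the platform calls for the options discussed earlier.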

Next I'll discuss the use of other languages and platforms.


Creating Text Files - Optimizing std::ofstream

Where to start?

Just to make sure I was on the right track, I decided to write a quick proof-of-concept Windows program to verify that I could come close to the desired performance. This program works by creating a 64K buffer filled with ‘X’ characters, and writes it repeatedly until it gets a file of the desired size. The program supports the following options.

  1. Sequential Scan
    Windows supports opening a file with a special flag to hint that the file will be accessed sequentially.
  2. Write Through
    This flag is supposed to ensure that any files are written directly to disk instead of being held in a cache and flushed later.
  3. Disable Caching
    This flag disables the Windows system cache for the file. In theory you could enable write-through, but still have the data in the system cache for reading. If the data is not expected to be read again, then this flag prevents filling the cache with unwanted data.
  4. Defragment
    Windows and Unix both support increasing the size of a file in one step. This allows the file to be written in as few fragments as possible.
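The core of such a proof-of-concept is small. Below is a portable sketch of the write loop and throughput measurement using standard C++ only. The function name write_x_file is made up, and the four Windows options listed above are left out because they require platform-specific calls.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical, portable version of the proof-of-concept: fill a 64K buffer
// with 'X' and write it repeatedly until the file holds total_bytes.
// Returns the measured throughput in MB/s.
double write_x_file(const std::string& path, std::size_t total_bytes) {
  assert(!path.empty());
  const std::size_t kChunk = 64 * 1024;
  const std::vector<char> chunk(kChunk, 'X');
  const auto start = std::chrono::steady_clock::now();
  {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    for (std::size_t left = total_bytes; left > 0;) {
      const std::size_t n = left < kChunk ? left : kChunk;
      out.write(chunk.data(), static_cast<std::streamsize>(n));
      left -= n;
    }
  }  // closing the stream flushes it, so the flush is part of the timing
  const std::chrono::duration<double> secs =
      std::chrono::steady_clock::now() - start;
  const double mb = static_cast<double>(total_bytes) / (1024.0 * 1024.0);
  return secs.count() > 0 ? mb / secs.count() : 0.0;
}
```

Note that with the system cache enabled this measures how fast the OS accepts the data, not how fast it reaches the disk, which is why small files can report speeds far above what the drive can do.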

With this test program I was able to achieve 77MB/s with caching disabled and 145MB/s with caching enabled for 1000MB test files. Smaller files were even faster, sometimes approaching 1200MB/s effective speeds with caching enabled.

For this test, the Sequential Scan option didn’t seem to make much difference. This makes sense, because I’m already accessing the file in fairly large chunks. It’s possible that this flag is of primary benefit when reading smaller amounts of data, or that the hint is ignored by my drivers and/or hardware.

The Write Through and Disable Caching options seem to have about the same effect. Without these options the program is obviously faster than the disk can physically handle.

The Defragment option was a little less conclusive. When I had all the above options turned on, I was only able to write at 65MB/s, but turning off Defragment only slowed it down to 63MB/s. However, without Write Through the Defragment option was good for 8MB/s difference. Despite this, it was very apparent when the Defragment option was enabled from a disk noise perspective, and as I mentioned previously, file fragmentation is of paramount importance when reading the files.

Optimizing std::ofstream

Now that I had some idea of what was possible given the underlying OS, I decided to attempt to speed up the original ofstream program. Some of the changes that made the most difference were surprising, and should be helpful when optimizing any program that uses ofstream.

defragment

ofstream does not provide an option to directly create a large file, but by using the following statements I was able to create a defragmented file.

out.rdbuf()->pubseekpos(num_bytes);
out.write("\n", 1);
out.rdbuf()->pubseekpos(0);

Unfortunately, this also slowed the program to 5.5MB/s from the original 11MB/s. For some applications this might be a worthwhile tradeoff, but hopefully this optimization will work better when combined with others.

buffer size

One problem with the existing program is that it takes 1200+ writes per MB to build the file. We should be able to improve performance by telling the ofstream to use a larger buffer.

char buffer[BUFSIZE];
out.rdbuf()->pubsetbuf(buffer, BUFSIZE);

When I ran this program with BUFSIZE=64K I was surprised to find that it slowed from 11MB/s to 1MB/s. What I discovered is that smaller BUFSIZE values actually increased performance. By using 1K I was able to achieve 15MB/s. This is a good example of why you should always test for performance before making design decisions based on faulty assumptions. This is not premature optimization. Addressing these issues early in a project before it’s too late is just good engineering.

Interestingly, BUFSIZE=64K resulted in 1037 writes/MB while BUFSIZE=1K resulted in 2000 writes/MB. Also, turning on the defragment option had no additional negative performance impact when using BUFSIZE=64K.
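One detail worth knowing when repeating this experiment: the standard leaves most of pubsetbuf's behavior on file streams implementation-defined, and libraries differ on when the call is honored (some only before the file is opened, some only once it is open). The sketch below, with the invented name write_with_buffer, uses the set-before-open ordering. Treat that ordering as an assumption to verify against your own library.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: write `text` through an ofstream that has been
// offered a caller-supplied buffer of `bufsize` bytes.
void write_with_buffer(const std::string& path, const std::string& text,
                       std::size_t bufsize) {
  assert(!path.empty());
  std::vector<char> buffer(bufsize);  // must outlive the stream
  std::ofstream out;
  // Offer the buffer before opening; some libraries ignore a later call.
  out.rdbuf()->pubsetbuf(buffer.data(),
                         static_cast<std::streamsize>(buffer.size()));
  out.open(path, std::ios::binary | std::ios::trunc);
  out.write(text.data(), static_cast<std::streamsize>(text.size()));
}  // `out` is destroyed (and flushed) before `buffer`
```

Because the library may silently ignore the request, counting the actual disk writes, as done above, is the only reliable way to tell whether the larger buffer took effect.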

use ios::binary

Opening the stream in binary mode results in a speed increase to 23.5MB/s. This prevents automatic translation of ‘\n’ to “\r\n”, but has no other effect as far as I can tell. Furthermore, the Defragment option only slows this to 19MB/s, and the Buffer Size option now works as originally expected, decreasing the number of disk writes to 16 per MB (1MB/64K=16) resulting in 26MB/s. Combining Defragment with Buffer Size was still 19MB/s however.

writing single chars

Changing the loop that writes out tabs to write ‘\t’ instead of “\t” results in a 0.7MB/s speedup.

better indentation

Although writing ‘\t’ instead of “\t” was an improvement, we can do better by creating a string of tab characters, and then writing a variable number of them.

const string tabs(MAX_TABS, '\t');
…
int indent = num_lines % MAX_TABS;
if (indent > 0) {
  out.write(tabs.c_str(), indent);
}

This results in 32MB/s with all options except Defragment enabled.

eliminate tellp()

When I first ran the (so I thought) fully optimized version of the program on my Mac Mini under OSX I was surprised to find that I could only get 0.8MB/s. A little tinkering with the code revealed that the call to tellp() was taking most of the time. I eliminated tellp() and the Mac Mini jumped to 40MB/s.

const string tabs(MAX_TABS, '\t');
const string msg1 = "All work and no play makes ";
const string msg2 = " a dull boy.\n";
// Calculate num_lines in separate loop first
for (int bytes = 0; bytes < num_bytes;) {
  ++num_lines;
  bytes += num_lines % MAX_TABS;
  const string& name = names[num_lines % names.size()];
  bytes += name.length();
  bytes += msg1.length();
  bytes += msg2.length();
}
for (unsigned line = 1; line <= num_lines; ++line) {
  int indent = line % MAX_TABS;
  if (indent > 0) {
    out.write(tabs.c_str(), indent);
  }
  const string& name = names[line % names.size()];
  out << msg1 << name << msg2;
}

The first loop duplicates the logic in the original loop without actually writing anything. This allows the second loop to use a simple for() loop, because we now know exactly how many lines to write. This is the first change that really complicated our code, but it also has a big performance benefit. The program now writes at 60MB/s.

writing strings

One surprising thing I found was that writing std::string can be significantly slower than writing const char*. Simply changing the above code to write:

out << msg1.c_str() << name.c_str() << msg2.c_str();

resulted in a speedup to 75MB/s. Making this change before the tellp() change did not have any significant effect.

ostream::write

Looking at the source code for operator<<(ostream& out, const char* str) showed that it essentially calls “out.write(str, strlen(str))”. We already know the length of each string, so we should be able to write faster using:

out.write(msg1.c_str(), msg1.length());
out.write(name.c_str(), name.length());
out.write(msg2.c_str(), msg2.length());

This resulted in a speedup to 80MB/s.

defragment revisited

With all other options in place writing the file defragmented now writes at 40MB/s.

final results (100MB)

No Options = 80MB/s @ 100% cpu

Defrag Only = 40MB/s @ 50% cpu

The biggest remaining problems are:

  1. We’re still using 100% CPU at least when running at full speed.
  2. We’re not able to write defragmented files without halving our performance.
  3. We have no mechanism for specifying advanced options.
  4. Considering that caching is enabled, we’re still not very fast.


Performance Quiz - Creating Text Files

I have an interesting programming challenge. Write a program to create a sample text file of a specified size in megabytes. The program should write out as many lines as possible of the format “[tabs]All work and no play makes [name] a dull boy.”, which you may recognize as the text that Jack Torrance types repeatedly in 1980’s The Shining. It seems appropriate. The program should provide a little variation to the lines by starting each line with a varying number of tabs, and alternating the name in the sentence.

Here’s a C++ program that does the job:  

#include <iostream>
#include <fstream>
#include <string>
using namespace std;
const int FILE_BYTES = 100 * 1024 * 1024;
const int MAX_INDENT = 8;
const int NUM_NAMES = 6;
int main(int argc, char* argv[]) {
  const string names[NUM_NAMES] = {
    "Jack", 
    "Justin Michel",
    "Fred Flintstone",
    "Barney Rubble",
    "Homer J. Simpson",
    "John Jacob Jingleheimer Schmidt"
  };
  ofstream out("sample.txt", ios_base::out);
  for (unsigned num_lines = 1; out.tellp() <= FILE_BYTES; ++num_lines)
  {
    for (unsigned int i = 0; i < num_lines % MAX_INDENT; ++i) {
      out << "\t";
    }
    const string& name = names[num_lines % NUM_NAMES];
    out << "All work and no play makes " << name << " a dull boy.\n";
  }
  return 0;
}

When I run this program, I find that it’s able to generate a 100MB file at almost 11 MB/s. However, I noticed several issues that indicate this could be improved.

  1. It’s completely CPU-bound.
  2. There is a lot of noisy disk-activity.
  3. The transfer rate does not come close to posted test results.
  4. No control over I/O options.

It was surprising to me that the above program was CPU-bound. After all, there’s not much apparent processing going on. Few calculations are done within the loop, and it writes the same short strings over and over. If this simple program uses 100% of the CPU then any less trivial program is only going to use more. Ideally we’d like to have close to 0% CPU usage so that the program could make use of the CPU for other functionality. For example, an application with a logging facility using code similar to that above would spend all the limited computer resources on logging with nothing left over for running the actual application.

The noisy disk activity provides a clue to the biggest single problem. C++ ofstream will automatically grow the file as needed, but if you think about it this can only lead to excessive file fragmentation. The OS has no idea how big the file is going to get, so it will allocate small chunks as needed. What we’d like is some way to tell the OS that we’re preparing to write 1GB so that it can allocate one nice consecutive chunk. This should allow optimal write speed assuming we write everything sequentially, but more importantly it will help prevent disk fragmentation and allow the file to be read in much faster. (Previous testing has shown that a fragmented file can take 10-15 times longer to read.)

Another reason for the excessive disk activity is that ofstream seems to use a small buffer. I know this because my real test program has instrumentation that measures Disk, CPU, and Memory usage, and it shows that a 100MB file is written using 125,820 disk writes.

I found the following review of my hard disk, which indicates that I should be able to achieve about 75MB/s presumably with caching disabled.

http://techreport.com/reviews/2006q2/raptor-wd1500/index.x?pg=12

I also checked the specs on the manufacturer's website, which claim that 84 MB/s should be possible.

Western Digital WD1500ADFD

The other problem with using ofstream is that you don’t have any mechanism for specifying advanced features. For example, in Windows you can create a file with flags that specify Compression, Encryption, Temporary (Tells the OS to avoid writing to a physical file if possible), DeleteOnClose, DisableSystemCache, Sequential/Random access (A hint to the cache), DisableDriveCache, and many other options.

In my next post, I'll attempt to optimize the program while still using std::ofstream.