Normally I really dislike being told "you are doing it wrong", especially when it is true, but sometimes that's the only conclusion you can reach.

Today I booked a bunch of things for a trip to the South Island in January. Payment for everything online is ubiquitously via credit card (or PayPal, which I avoid due to their poor treatment of customers, or Amazon, which I avoid because of patent abuse). Which meant that I got to fight with a bunch of input validation for credit card information.

As others have observed, many of these web forms are written by programmers who think that the only way to get the data in the format that they want is to insist that the user types it in exactly as they had in mind (sometimes not even revealing what they had in mind until the error message, or in a few particularly blatant instances not revealing it at all, turning it into a text based puzzle). While it might have been true in the 1980s that programmer time was more valuable than user time, or that CPU time was more valuable than user time, it's certainly not true now. A little effort and a trivial amount of CPU time allows reformatting the input when necessary to the format required.

For instance:

  • Credit card numbers are most easily entered in groups of 4 separated by spaces (or hyphens) because the human brain can remember a small number of small chunks more easily than a large set of things, and it's much easier to check all the digits are present and accounted for. But the backend algorithms want a raw string of numbers. The wrong (but alas common) solution is to insist that the user types in a string of 16 digits without any grouping to match the backend. The right solution is to use a tiny bit of CPU to transform the input, eg:

    $cc =~ s/\D//g;     # Perl: strip out anything not a digit
    

    and then do your other validation.

  • Some sites try to avoid the "way too many digits" problem by insisting that the digits be entered into separate boxes for each group. Right idea, wrong implementation. They either end up trying to auto-tab into the next box for the user, frustrating users who are used to form-based UIs and are tabbing over themselves, or they don't auto-tab, frustrating users who are just used to typing it all in. (The correct solution if you must go down this path is to eat the tab to next field if it happens immediately after you've auto-jumped to the new field, and to properly handle backspacing back across the field boundaries. Few if any sites get this part right.) But even if you solve that problem, splitting into lots of separate boxes defeats cut'n'pasting the whole number. In theory that too could be solved, but I've yet to see a site get that part right.

  • There is an international standard for telephone numbers and "+" is widely recognised as the prefix for fully qualified phone numbers (and adopted by the GSM standards; see also World Telephone Number Guide). If you want a phone number from the user, you should accept it in fully qualified International format, starting "+CC" (where "+" is a literal, and "CC" is the country code), followed by "AAA" (area code, if any) and then the digits for the local number, eg +64-4-XXX-YYYY. The number of digits to follow the area code depends on the digits that lead up to it, as most PBXes know. If you don't want to embed all the phone numbering formats of the world then you should just accept up to 15 digits, the maximum required by E.164. (And as above you should accept the phone number with or without spaces or hyphens, and just spend some CPU time getting it into the format you want.) In particular assuming that the whole world uses the NANP is just wrong, even if you make a token acceptance of the rest of the world by accepting a country code that is not 1.

  • Area codes do not start with a "0". The "0" when you dial without the country code is to signal "hey, here comes an area code first" not part of the area code. If you accept a country code (and you should, see above) then the area code after it shouldn't start with a 0. Insisting that it should (as one website I dealt with today did) is just wrong. Writing +64-04-XXX-YYYY is just wrong. If you accept the area code in its own box, don't insist that it starts with a "0"; if you desperately want the "0" there in your data store, you can always add it in yourself. (In general none of the components should start with a 0, but poor number planning by various telcos means that some local numbers do in order to get one more Hail Mary pass for twice as many numbers. Unfortunately they're not alone in doing it wrong.)

  • Credit card numbers already include an indication of what type of card (CC anatomy). You don't need to force the user to pick what type of card they are using; it's nice to show a list or logos of the card types you accept so they know what can be used, but if they don't fill it in, you can just assume that it is the type of card that the number indicates it is! (And you can sanity check the number entered via the Luhn mod 10 algorithm, by using a little bit of CPU time.)

  • There's an international standard for dates, and it specifies that dates should be listed in big endian, which means year first, then month, then day. It's not only internationally recommended, it's unambiguous. If you accept dates in a free text field and don't accept ISO 8601 formatted dates you are doing it wrong. (It's also easy to detect that the date is in that format, since the first component is the year which is always bigger than a day or a month.) And if you expect your dates in mid endian format (month, day, year) like the USA you are really doing it wrong: it not only doesn't sort correctly (which unfortunately also rules out the otherwise sensible little endian format, of day/month/year), it is also ambiguous for 132 days out of the year.

  • People know how to spell and format their own names. And are rather attached to that formatting. If the user enters their name correctly and you reformat it to be incorrect you are doing it wrong. Typically the excuse for this is to ensure that names are properly mixed case. Aside from the fact that some people like their names in all lower case, if the user entered the name in mixed case their idea of how to mix the case is certainly more correct than your idea of mixed case -- so enforcing first character upper case, rest lower case on the name is doing it wrong. (Simple algorithm: if there is at least one upper case character and at least one lower case character in the name, assume it's been entered correctly and leave it alone. Even simpler algorithm: leave it alone -- if it's not formatted the way that the user wanted it, they can always re-enter it.) For extra fail today, one website managed to preserve the formatting of my surname ("McNeill") in one place and get it wrong ("Mcneill") in another place. (And, no, you can't just assume that the letter after "Mc" will be upper case; some people don't do that. Really, the user knows how to enter their name.) (Also if you are converting everything to upper case for your COBOL system and trying to reformat to be presentable afterwards you are doing it wrong. Lower case has been widely available for computer storage since the 1960s.)
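To make the credit card points in the list above a bit more concrete, here is a rough sketch in Python (rather than the Perl one-liner earlier). The function names are my own invention for illustration, and the prefix table in guess_card_type is deliberately tiny and incomplete; a real one would need the full issuer ranges.

```python
import re


def normalise_card_number(raw: str) -> str:
    """Strip spaces, hyphens and anything else that is not an ASCII digit."""
    return re.sub(r"[^0-9]", "", raw)


def luhn_ok(digits: str) -> bool:
    """Luhn mod 10 check: double every second digit, counting from the right."""
    if not digits:
        return False
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # every second digit from the right
            d *= 2
            if d > 9:       # same as adding the two digits of the product
                d -= 9
        total += d
    return total % 10 == 0


def guess_card_type(digits: str) -> str:
    """Guess the card type from the leading digits.

    Assumption: an illustrative, incomplete prefix table, not a full list.
    """
    if digits.startswith("4"):
        return "Visa"
    if digits[:2] in {"51", "52", "53", "54", "55"}:
        return "MasterCard"
    if digits[:2] in {"34", "37"}:
        return "American Express"
    return "unknown"
```

With that, the user can type "4111 1111 1111 1111" or "4111-1111-1111-1111" (a widely published test number, not a real card) and the backend still sees sixteen bare digits, a passing checksum and a card type, without the user ever touching a "card type" drop-down.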
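For the date point above, a small Python sketch shows both halves of the argument: ISO 8601 dates parse and sort trivially, and the "132 days" figure can be checked by brute force. The helper names are mine, and parse_date only handles the plain YYYY-MM-DD calendar form.

```python
from datetime import date, timedelta


def parse_date(text: str) -> date:
    """Parse an ISO 8601 calendar date of the form YYYY-MM-DD."""
    return date.fromisoformat(text.strip())


def count_ambiguous_days(year: int) -> int:
    """Count days where swapping day and month gives a different valid date.

    That happens exactly when the day is 12 or less and differs from the month.
    """
    count = 0
    d = date(year, 1, 1)
    while d.year == year:
        if d.day <= 12 and d.day != d.month:
            count += 1
        d += timedelta(days=1)
    return count


# Big endian dates sort chronologically even as plain strings.
bookings = ["2010-01-15", "2009-12-31", "2010-01-02"]
in_order = sorted(bookings)
```

Each month contributes 12 candidate days minus the one where day and month are equal, so 12 x 11 = 132, which is what count_ambiguous_days returns for any year.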
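And the "simple algorithm" for names, again as a hedged Python sketch. The function name tidy_name is made up, and the fallback for single-case input (capitalise each word) is my assumption about what a site that insists on reformatting would do; the even simpler algorithm is to return the name untouched in every case.

```python
def tidy_name(name: str) -> str:
    """Leave mixed case names alone; only touch names entered in a single case."""
    has_upper = any(c.isupper() for c in name)
    has_lower = any(c.islower() for c in name)
    if has_upper and has_lower:
        # The user has already made a choice about case: trust it.
        return name
    # Assumption: naive fallback, first character upper case, rest lower case.
    # It is lossy by design and will still get some names wrong.
    return " ".join(word.capitalize() for word in name.split(" "))
```

Note that tidy_name("McNeill") comes back unchanged, while tidy_name("MCNEILL") comes back as "Mcneill", which is exactly why upper-casing everything first and trying to recover afterwards cannot work.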

And that's without even getting into physical addresses (hint: not all countries have states, postal codes can be different lengths and formats, etc -- typically the user knows better than you what is required) or the many other ways that seeking user input can be done wrong. Or the use of am/pm on booking forms (the Hamming Distance is just too small between "am" and "pm"); use the 24 hour clock like most of Europe. (Also Common Locale Data Repository, which alas -- like many computer OS default setups -- reveals that New Zealand is doing it wrong, at least for dates and times.)

(And as Michael points out: Parsing HTML with regex considered harmful.)

And now for something completely different: Wired writer Evan Ratliff tried to vanish for a month, but was ultimately caught because of (a) not following his own rules (accessing the network without Tor and his relay machines) and (b) running short on money (and hence taking "risky" challenges to win spot prizes).

His stated reason for stopping using Tor was speed, which the Tor developers are aware of and trying to do something about (the major issue seems to be prioritising bandwidth over latency). But there also seems to be an element of overconfidence due to not having been caught. The money issue seems to be simply a matter of underestimating the amount of cash required, plus spending a bunch of it to continue carrying on parts of his normal life (eg, going to a sports match) and not working.

But even with those quibbles, it's still interesting reading covering some of the traces that we leave behind. Steven Rambam covered much of what was done to find Evan in another challenge, which is becoming the book Stealing Your Own Identity due out in December (he also spoke at The Last Hope based on the same challenge).