[Libvir] [PATCH] Enhance virBuffer code

Fri Dec 14 13:17:31 UTC 2007

"Richard W.M. Jones" <rjones at redhat.com> wrote:
> Jim Meyering wrote:
>> "Richard W.M. Jones" <rjones at redhat.com> wrote:
>>
>>> Jim Meyering wrote:
>>>> "Richard W.M. Jones" <rjones at redhat.com> wrote:
>>>>> Jim Meyering wrote:
>>>>>> What do you think of using this?
>>>>>>
>>>>>>   isascii (*p) && isalnum (*p)
>>>>> I'm not sure I'm qualified to say what this does on EBCDIC, but quite
>>>>> likely lots of other code breaks there too anyway.  This is nicely
>>>>> self-documenting anyway.
>>>> As Daniel suggested, isalnum is locale-sensitive.
>>>> If there's a locale with an alphabetic byte that is outside
>>>> the logical a-zA-Z range, yet still within 0..127, then the above
>>>> expression will give a false-positive for that byte.
>>>>
>>>> I've been inclined to stop worrying about EBCDIC for years, but a quick
>>>> search on the web finds that people are still stuck using it, and do
>>>> report bugs in ASCII-assuming code.
>>>>
>>>> This is why autoconf goes to the trouble of doing this:
>>>>   tr abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ
>>>> not this:
>>>>   tr a-z A-Z
>>>> to convert to upper case.
>>> Another factor to consider here is that it doesn't matter if this
>>> function over-escapes, but it does matter if the function
>>> under-escapes. That is to say, it could escape every character as a
>>> %xx hex code, which would be ugly and slightly inefficient but not
>>> wrong.
>>
>> IMHO, if you don't use the all-enumerating switch-based code that Daniel
>> objects to, it'd be good to document (in both loops) that the test is
>> inaccurate with EBCDIC, and explain why it's ok to get false positives.
>>
>> Without comments, people might be tempted to use a similar test in a
>> context where the differences matter.
>
> OK, how about this?
>
> Rich.
>
> +    for (p = str; *p; ++p) {
> +        /* Want to escape only A-Z and 0-9.  This may not work on
> +           EBCDIC. */
> +        if (isascii (*p) && isalnum (*p))

Actually, with that, the code is at the mercy of locale definitions
(which are notoriously unreliable), and it probably works with EBCDIC.

I wrote the following and tested a few systems:

#include <ctype.h>
#include <stdio.h>
#include <locale.h>
int
is_alphanum (char c)
{
  switch (c)
  {
    /* generated by LC_ALL=C perl -e \
       "print map {qq(case '\$_': )}('a'..'z', 'A'..'Z', '0'..'9')"|fmt */
    case 'a': case 'b': case 'c': case 'd': case 'e': case 'f': case 'g':
    case 'h': case 'i': case 'j': case 'k': case 'l': case 'm': case 'n':
    case 'o': case 'p': case 'q': case 'r': case 's': case 't': case 'u':
    case 'v': case 'w': case 'x': case 'y': case 'z': case 'A': case 'B':
    case 'C': case 'D': case 'E': case 'F': case 'G': case 'H': case 'I':
    case 'J': case 'K': case 'L': case 'M': case 'N': case 'O': case 'P':
    case 'Q': case 'R': case 'S': case 'T': case 'U': case 'V': case 'W':
    case 'X': case 'Y': case 'Z': case '0': case '1': case '2': case '3':
    case '4': case '5': case '6': case '7': case '8': case '9':
      return 1;
    default:
      return 0;
  }
}

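int
main (void)
{
  /* Use whatever locale the environment (LC_ALL etc.) selects.  */
  setlocale (LC_ALL, "");
  /* Print each byte that the locale-sensitive test accepts but that
     is not one of a-z, A-Z, 0-9.  */
  for (unsigned int i = 0; i < 256; i++)
    if (isalnum (i) && isascii (i) && !is_alphanum (i))
      printf ("%d: %c\n", i, i);

  return 0;
}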
-------------------------------
I compiled and ran it against all installed locales like this:

gcc -o k k.c && for i in $(locale -a); do \
  test "$(LC_ALL=$i ./k|wc -l)" = 0 || echo $i;done

On RHEL4, RHEL5, and rawhide, it finds this exception:

  vi_VN.tcvn

Running manually in that locale suggests something is fishy:

  $ LC_ALL=vi_VN.tcvn ./k
  1:
  2:
  4:
  5:
  6:
  17:
  18:
  19:
  20:
  21:
  22:
  23:

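Surprise, surprise...
Every one of those bytes is below 128, so isascii accepts it, and this
locale's isalnum claims it is alphanumeric.  So in this locale, using
"isascii (*p) && isalnum (*p)" would *under*quote: those bytes would
be passed through unescaped.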

I didn't expect to find such a convincing argument.
I stand by my suggestion to use the switch statement.
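
For illustration only, here is a rough sketch of what an escaping loop
with a locale-independent test might look like.  It is not the actual
virBuffer patch: escape_sketch and its fixed-size output buffer are
hypothetical, and to keep it short the test is written as a strchr
lookup in an enumerated string rather than the switch (same set of
characters, same independence from locale and from character-code
ordering).  Anything outside that set is written as %xx, so the loop
can over-escape but should never under-escape.

```c
#include <stddef.h>
#include <string.h>

/* Locale-independent: true only for the 62 characters listed here.
   The explicit c != '\0' check is needed because strchr would
   otherwise match the terminating NUL of the set.  */
static int
is_alphanum (char c)
{
  static const char set[] =
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789";
  return c != '\0' && strchr (set, c) != NULL;
}

/* Hypothetical helper, not part of libvirt: copy STR into OUT (of
   size OUTLEN), replacing every byte that is not in the set above
   with %xx.  Returns 0 on success, -1 if OUT is too small.  */
static int
escape_sketch (const char *str, char *out, size_t outlen)
{
  static const char hex[] = "0123456789abcdef";
  const char *p;
  size_t n = 0;

  for (p = str; *p; ++p) {
    if (is_alphanum (*p)) {
      /* One byte plus room for the trailing NUL.  */
      if (n + 1 >= outlen)
        return -1;
      out[n++] = *p;
    } else {
      /* Go through unsigned char so high bytes are not sign-extended.  */
      unsigned char uc = (unsigned char) *p;
      /* Three bytes plus room for the trailing NUL.  */
      if (n + 3 >= outlen)
        return -1;
      out[n++] = '%';
      out[n++] = hex[uc >> 4];
      out[n++] = hex[uc & 0xf];
    }
  }
  if (n >= outlen)
    return -1;
  out[n] = '\0';
  return 0;
}
```

With a 64-byte buffer, escape_sketch ("a b\001", buf, sizeof buf) on an
ASCII system leaves "a%20b%01" in buf.  Byte 1 is escaped no matter
what the current locale says about it, which is the case that went
wrong with vi_VN.tcvn above.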