Re: [Libguestfs] hivex: some issues (key encoding, ...) and suggested fixes

On 26/02/11 18:56, Török Edwin wrote:

libhivex seems to do a great job at parsing hives most of the time, but
there are some issues with a few registry keys.

These can be worked around in the application that uses libhivex, but I
think it'd be better if libhivex handled these itself.

1. UTF16 string in REG_SZ that has garbage after the \0\0

There is code in hivex.c to handle this already but I think it has a typo:

   /* Deal with the case where Windows has allocated a large buffer
    * full of random junk, and only the first few bytes of the buffer
    * contain a genuine UTF-16 string.
    * In this case, iconv would try to process the junk bytes as UTF-16
    * and inevitably find an illegal sequence (EILSEQ).  Instead, stop
    * after we find the first \0\0.
    * (Found by Hilko Bengen in a fresh Windows XP SOFTWARE hive).
    */
   size_t slen = utf16_string_len_in_bytes_max (data, len);
   if (slen > len)
     len = slen;

   char *ret = windows_utf16_to_utf8 (data, len);

slen is only ever used to increase the length of the data, but I think
it should be decreasing it (to stop earlier).

Yup, that certainly looks like a bug.
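
For illustration, here is a minimal sketch of what the corrected clamp
might look like. It is not the actual hivex patch: utf16_len_in_bytes_max
below is a hypothetical stand-in for utf16_string_len_in_bytes_max (whose
body isn't quoted in this thread), and clamp_to_first_terminator is a name
made up for the example. The only point is that the comparison should be
'<', so len can shrink but never grow.

```c
#include <assert.h>
#include <stddef.h>

/* Length in bytes of the UTF-16 string at `data`, not counting the
 * terminating \0\0 and never exceeding `len`.  Hypothetical stand-in
 * for hivex's utf16_string_len_in_bytes_max.
 */
static size_t
utf16_len_in_bytes_max (const char *data, size_t len)
{
  size_t n = 0;

  assert (data != NULL || len == 0);

  /* Step one 16-bit code unit at a time until \0\0 or end of buffer. */
  while (n + 1 < len && (data[n] != 0 || data[n + 1] != 0))
    n += 2;
  return n;
}

/* Number of bytes that should be handed to the UTF-16 -> UTF-8
 * converter: everything up to, but not including, the first \0\0.
 */
static size_t
clamp_to_first_terminator (const char *data, size_t len)
{
  size_t slen = utf16_len_in_bytes_max (data, len);

  if (slen < len)               /* '<', so the length can only shrink */
    len = slen;
  return len;
}
```

The result of clamp_to_first_terminator would then be passed as the
length argument to windows_utf16_to_utf8, so iconv never sees the junk
after the terminator.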

2. Non-ascii node names

I found a node with a \xDC (Ü) in it:

hivex.c has a comment like this:
   /* AFAIK the node name is always plain ASCII, so no conversion
    * to UTF-8 is necessary.  However we do need to nul-terminate
    * the string.
    */

I think hivex should convert the node names from CP1252 (or is it
ISO-8859-1?) to UTF-8.

Workaround: I do the CP1252 -> UTF-8 conversion myself for now.
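
A rough sketch of that kind of workaround, assuming the names really are
ISO-8859-1 (the open question above). latin1_to_utf8 is a hypothetical
helper, not part of hivex. CP1252 differs from ISO-8859-1 only in the
0x80-0x9F range, which this sketch does not special-case; \xDC is Ü in
both.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Convert `len` bytes of ISO-8859-1 into a newly allocated,
 * nul-terminated UTF-8 string.  Returns NULL if malloc fails.
 * The caller must free the result.
 */
static char *
latin1_to_utf8 (const char *in, size_t len)
{
  char *out, *p;
  size_t i;

  assert (in != NULL || len == 0);

  /* Each input byte becomes at most two output bytes. */
  out = malloc (2 * len + 1);
  if (out == NULL)
    return NULL;

  p = out;
  for (i = 0; i < len; i++) {
    unsigned char c = (unsigned char) in[i];

    if (c < 0x80)
      *p++ = (char) c;                      /* ASCII passes through */
    else {
      *p++ = (char) (0xC0 | (c >> 6));      /* two-byte UTF-8 lead */
      *p++ = (char) (0x80 | (c & 0x3F));    /* continuation byte */
    }
  }
  *p = '\0';
  return out;
}
```

This works because the Latin-1 byte values coincide with Unicode code
points U+0000 to U+00FF. A proper CP1252 conversion would need a lookup
table (or iconv) for 0x80-0x9F.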

3. node_get_child is slow

This is a documentation issue: the docs should say that node_get_child
is slow (the registry has no index, so each call does a linear search).

Workaround: I build a map from node names to the children of a node; a
lookup in that map is faster than calling node_get_child repeatedly.
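
One way such a map could look, as a sketch only: a sorted array that is
searched with bsearch. struct child_entry, child_index_build and
child_index_lookup are hypothetical names, and the size_t handle merely
stands in for hivex's hive_node_h. In a real program the entries would
be filled once from the node's children and their names. The comparison
ignores case on the assumption that lookups should behave like registry
key names, which are case-insensitive; returning 0 for "not found" is
this sketch's own convention.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <strings.h>            /* strcasecmp (POSIX) */

/* One index entry: a child's name and its node handle. */
struct child_entry {
  const char *name;
  size_t handle;                /* stand-in for hive_node_h */
};

static int
cmp_entry (const void *a, const void *b)
{
  const struct child_entry *ea = a, *eb = b;

  return strcasecmp (ea->name, eb->name);
}

/* Sort the entries once so later lookups can use binary search. */
static void
child_index_build (struct child_entry *entries, size_t n)
{
  assert (entries != NULL || n == 0);
  if (n > 0)
    qsort (entries, n, sizeof entries[0], cmp_entry);
}

/* Return the handle of the child called `name`, or 0 if there is none. */
static size_t
child_index_lookup (const struct child_entry *entries, size_t n,
                    const char *name)
{
  struct child_entry key = { name, 0 };
  const struct child_entry *hit;

  if (n == 0)
    return 0;
  hit = bsearch (&key, entries, n, sizeof entries[0], cmp_entry);
  return hit ? hit->handle : 0;
}
```

Building the index costs one sort per parent node; after that each
lookup is O(log n) instead of a linear walk over the children.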

4. hivexml output is not well-formed XML

See problems #1 and #2. If value_string and node_name are fixed so that
they don't dump the binary garbage and just return UTF-8, then I think
hivexml's output would pass xmllint.

As it happens, I opened a BZ on this just the other day. I think there's
an additional element here: it seems that sometimes a registry key
genuinely contains non-text data. An example is
HKLM/SOFTWARE/Microsoft/MSDTC/Security/XAKey, which I'm guessing is a
cryptographic key. This would require a CDATA section. However, it's not
clear to me how the tool can reliably infer that a value is binary data
without specific knowledge of the schema.

Matthew Booth, RHCA, RHCSS
Red Hat Engineering, Virtualisation Team

GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490

