[Freeipa-devel] [PATCH] 971 detect binary LDAP data

Tue Feb 28 20:50:29 UTC 2012

Petr Viktorin wrote:
> On 02/28/2012 04:45 PM, Rob Crittenden wrote:
>> Petr Viktorin wrote:
>>> On 02/28/2012 04:02 AM, Rob Crittenden wrote:
>>>> Petr Viktorin wrote:
>>>>> On 02/27/2012 05:10 PM, Rob Crittenden wrote:
>>>>>> Rob Crittenden wrote:
>>>>>>> Simo Sorce wrote:
>>>>>>>> On Mon, 2012-02-27 at 09:44 -0500, Rob Crittenden wrote:
>>>>>>>>> We are pretty trusting that the data coming out of LDAP matches
>>>>>>>>> its
>>>>>>>>> schema but it is possible to stuff non-printable characters into
>>>>>>>>> most
>>>>>>>>> attributes.
>>>>>>>>>
>>>>>>>>> I've added a sanity checker to keep a value as a python str type
>>>>>>>>> (treated as binary internally). This will result in a base64
>>>>>>>>> encoded
>>>>>>>>> blob be returned to the client.
>>>>>>>>
>>>>>>>> Shouldn't you try to parse it as a unicode string and catch
>>>>>>>> TypeError to
>>>>>>>> know when to return it as binary ?
>>>>>>>>
>>>>>>>> Simo.
>>>>>>>>
>>>>>>>
>>>>>>> What we do now is the equivalent of unicode(chr(0)) which returns
>>>>>>> u'\x00' and is why we are failing now.
>>>>>>>
>>>>>>> I believe there is a unicode category module, we might be able to
>>>>>>> use
>>>>>>> that if there is a category that defines non-printable characters.
>>>>>>>
>>>>>>> rob
>>>>>>
>>>>>> Like this:
>>>>>>
>>>>>> import unicodedata
>>>>>>
>>>>>> def contains_non_printable(val):
>>>>>> for c in val:
>>>>>> if unicodedata.category(unicode(c)) == 'Cc':
>>>>>> return True
>>>>>> return False
>>>>>>
>>>>>> This wouldn't have the exclusion of tab, CR and LF like using ord()
>>>>>> but
>>>>>> is probably more correct.
>>>>>>
>>>>>> rob
>>>>>>
>>>>>> _______________________________________________
>>>>>> Freeipa-devel mailing list
>>>>>> Freeipa-devel at redhat.com
>>>>>> https://www.redhat.com/mailman/listinfo/freeipa-devel
>>>>>
>>>>> If you're protecting the XML-RPC calls, it'd probably be better to
>>>>> look
>>>>> at the XML spec directly: http://www.w3.org/TR/xml/#charsets
>>>>>
>>>>> Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
>>>>> [#x10000-#x10FFFF]
>>>>>
>>>>> I'd say this is a good set for CLI as well.
>>>>>
>>>>> And you can trap invalid UTF-8 sequences by catching the
>>>>> UnicodeDecodeError from decode().
>>>>>
>>>>
>>>> Replace my ord function with a regex that looks for invalid characters.
>>>> Now catching that exception too and leaving as a str type.
>>>>
>>>> rob
>>>
>>> You're matching an Unicode regex against a bytestring. It'd be better to
>>> match after the decode.
>>>
>>
>> The unicode regex was force of habit. I purposely do this comparison
>> before the decoding to save decoding something that is binary. Would it
>> be acceptable if I just remove the u'?
>>
>> rob
>
> No: re.search('[\uDFFF]', u'\uDFFF'.encode('utf-8')) --> None
> Actually I'm not sure what encoding is used behind the scenes here; I
> wouldn't recommend mixing Unicode and bytestrings even if it did work on
> my machine.

The data coming out of LDAP is a plain string. We encode to utf-8 if is 
not a binary as defined in _SYNTAX_MAPPING in ipaserver/plugins/ldap2.py

> Also, match() only looks at the beginning of the string; use search().
> Sorry for not catching that earlier.

good point.

>
>
> OCD notes:
> - the "matches" variable name is misleading, there's either one match or
> None.
> - use `is not None` instead of `!= None` as per PEP8.
> - The contains_non_printable method seems a bit excessive now that it
> essentially just contains one call. Unless it's meant to be subclassed,
> in which case the docstring should probably look different.
>

It doesn't need to be a public method. I broke it out mostly because 
embedding the documentation on why it is there overwhelmed encode().

rob