[Freeipa-devel] [PATCH] 971 detect binary LDAP data

Jan Cholasta jcholast at redhat.com
Thu Mar 29 12:07:52 UTC 2012


On 29.3.2012 00:20, Rob Crittenden wrote:
> Jan Cholasta wrote:
>> On 29.2.2012 15:45, Rob Crittenden wrote:
>>> Jan Cholasta wrote:
>>>> On 28.2.2012 18:58, Rob Crittenden wrote:
>>>>> Jan Cholasta wrote:
>>>>>> On 28.2.2012 18:02, Petr Viktorin wrote:
>>>>>>> On 02/28/2012 04:45 PM, Rob Crittenden wrote:
>>>>>>>> Petr Viktorin wrote:
>>>>>>>>> On 02/28/2012 04:02 AM, Rob Crittenden wrote:
>>>>>>>>>> Petr Viktorin wrote:
>>>>>>>>>>> On 02/27/2012 05:10 PM, Rob Crittenden wrote:
>>>>>>>>>>>> Rob Crittenden wrote:
>>>>>>>>>>>>> Simo Sorce wrote:
>>>>>>>>>>>>>> On Mon, 2012-02-27 at 09:44 -0500, Rob Crittenden wrote:
>>>>>>>>>>>>>>> We are pretty trusting that the data coming out of LDAP
>>>>>>>>>>>>>>> matches
>>>>>>>>>>>>>>> its
>>>>>>>>>>>>>>> schema but it is possible to stuff non-printable characters
>>>>>>>>>>>>>>> into
>>>>>>>>>>>>>>> most
>>>>>>>>>>>>>>> attributes.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I've added a sanity checker to keep a value as a python str
>>>>>>>>>>>>>>> type
>>>>>>>>>>>>>>> (treated as binary internally). This will result in a base64
>>>>>>>>>>>>>>> encoded
>>>>>>>>>>>>>>> blob be returned to the client.
>>>>>>
>>>>>> I don't like the idea of having arbitrary binary data where unicode
>>>>>> strings are expected. It might cause some unexpected errors (I have a
>>>>>> feeling that --addattr and/or --delattr and possibly some plugins
>>>>>> might
>>>>>> not handle this very well). Wouldn't it be better to just throw away
>>>>>> the
>>>>>> value if it's invalid and warn the user?
>>>>>
>>>>> This isn't for user input, it is for data stored in LDAP. Users are
>>>>> going to have no way to provide binary data to us unless they use the
>>>>> API themselves, in which case they have to follow our rules.
>>>>
>>>> Well my point was that --addattr and --delattr cause an LDAP search for
>>>> the given attribute and plugins might get the result of an LDAP
>>>> search in
>>>> their post_callback and I'm not sure if they can cope with binary data.
>>>
>>> It wouldn't be any different than if we had the value as a unicode.
>>
>> Let's see what happens if the mail attribute of a user contains invalid
>> UTF-8 (ff 62 30 72 6b 65 64):
>>
>> $ ipa user-find jdoe
>> --------------
>> 1 user matched
>> --------------
>> User login: jdoe
>> First name: John
>> Last name: Doe
>> Home directory: /home/jdoe
>> Login shell: /bin/sh
>> Email address: /2IwcmtlZA==
>> UID: 526
>> GID: 526
>> Account disabled: False
>> Password: False
>> Kerberos keys available: False
>> ----------------------------
>> Number of entries returned 1
>> ----------------------------
>>
>> $ ipa user-mod jdoe --addattr mail=jdoe at example.com
>> ipa: ERROR: an internal error has occurred
>>
>> The internal error is:
>> Traceback (most recent call last):
>> File "/usr/lib/python2.7/site-packages/ipaserver/rpcserver.py", line
>> 302, in wsgi_execute
>> result = self.Command[name](*args, **options)
>> File "/usr/lib/python2.7/site-packages/ipalib/frontend.py", line 438, in
>> __call__
>> ret = self.run(*args, **options)
>> File "/usr/lib/python2.7/site-packages/ipalib/frontend.py", line 696, in
>> run
>> return self.execute(*args, **options)
>> File "/usr/lib/python2.7/site-packages/ipalib/plugins/baseldap.py", line
>> 1217, in execute
>> ldap, dn, entry_attrs, attrs_list, *keys, **options
>> File "/usr/lib/python2.7/site-packages/ipalib/plugins/user.py", line
>> 532, in pre_callback
>> entry_attrs['mail'] = self.obj._normalize_email(entry_attrs['mail'])
>> File "/usr/lib/python2.7/site-packages/ipalib/plugins/user.py", line
>> 338, in _normalize_email
>> norm_email.append(m + u'@' + config['ipadefaultemaildomain'][0])
>> UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0:
>> invalid start byte
>>
>> $ ipa user-mod jdoe --delattr mail=/2IwcmtlZA==
>> ipa: ERROR: mail does not contain '/2IwcmtlZA=='
>>
>> $ ipa user-mod jdoe --delattr mail=`echo 'ff 62 30 72 6b 65 64' | xxd -p
>> -r`
>> ipa: ERROR: UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in
>> position 5: invalid start byte
>> Traceback (most recent call last):
>> File "/usr/lib/python2.7/site-packages/ipalib/cli.py", line 1242, in run
>> sys.exit(api.Backend.cli.run(argv))
>> File "/usr/lib/python2.7/site-packages/ipalib/cli.py", line 1024, in run
>> kw = self.parse(cmd, argv)
>> File "/usr/lib/python2.7/site-packages/ipalib/cli.py", line 1049, in
>> parse
>> return dict(self.parse_iter(cmd, kw))
>> File "/usr/lib/python2.7/site-packages/ipalib/cli.py", line 1058, in
>> parse_iter
>> yield (key, self.Backend.textui.decode(value))
>> File "/usr/lib/python2.7/site-packages/ipalib/cli.py", line 136, in
>> decode
>> return value.decode(encoding)
>> File "/usr/lib64/python2.7/encodings/utf_8.py", line 16, in decode
>> return codecs.utf_8_decode(input, errors, True)
>> UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 5:
>> invalid start byte
>> ipa: ERROR: an internal error has occurred
>>
>> I'm sure there are a lot more places in the code where things will break
>> when you feed them arbitrary data.
>>
>>>
>>> We treat the python type str as binary data. Anything that is a str gets
>>> based64 encoded before json or xml-rpc transmission.
>>>
>>> The type unicode is considered a "string" and goes in the clear.
>>>
>>> We determine what this type should be not from the data but from the
>>> schema. This is a big assumption. Hopefully this answers Petr's point
>>> as well.
>>>
>>> We decided long ago that str means Binary and unicode means String. It
>>> is a bit clumsy perhaps, but Python handles it well. It will be clearer
>>> when we switch to Python 3.0 and we'll have bytes and str instead as
>>> types.
>>
>> Well, this is all super-obvious and I'm not really sure why you bring
>> it up here...
>>
>> My concern is that with your approach, you can no longer be sure that an
>> attribute value is a valid unicode string. This creates a
>> maintainability burden, as you *always* have to keep in mind that you
>> first have to check whether the value is valid or not before touching
>> it. You can easily forget to include the check in new code and there are
>> a lot of places in old code that need to be changed because of this (as
>> you can see in the mail example above).
>>
>> Let me quote John Dennis
>> (<http://www.redhat.com/archives/freeipa-devel/2012-January/msg00075.html>):
>>
>>
>>> I'm very interested in this work. I too have recognized that we have a
>>> very real structural problem with how we handle the types we use
>>> internally, especially when it corresponds to an external type and
>>> requires conversion. In fact we do a remarkably bad job in this area, we
>>> have conversions which are done ad hoc all over the place. There is no
>>> well established place for the conversions to occur and it's difficult
>>> to know at any point in the code what type to expect.
>>
>> Your patch adds yet another occurrence of "it's difficult to know at any
>> point in the code what type to expect". IMO this is very bad and we
>> should not do this at all in new code.
>>
>> I'm not sure if just dropping bad values (as I have suggested before) is
>> the best idea, but IMHO anything is better than letting these bad values
>> into the bowels of IPA.
>>
>>>
>>>>> We are trusting that the data in LDAP matches its schema. This is just
>>>>> belt and suspenders verifying that it is the case.
>>>>
>>>> Sure, but I still think we should allow any valid unicode data to come
>>>> from LDAP, not just what is valid in XML-RPC.
>>>
>>> This won't affect data coming out of LDAP, only the way it is
>>> transmitted to the client.
>>
>> It will affect the data flowing through IPA, after we get it from LDAP,
>> before it is transmitted through XML-RPC.
>>
>> In other words, I think it is more correct to do this:
>>
>> LDAP -> check unicode validity -> (rest of IPA) -> check XML-RPC
>> validity -> XML-RPC
>>
>> than this:
>>
>> LDAP -> check unicode validity -> check XML-RPC validity -> (rest of
>> IPA) -> XML-RPC
>>
>> (Let's keep things isolated please.)
>>
>>>
>>> rob
>>
>> Honza
>>
>
> I'm using a much narrower scope. I'm not trying to make it easy to
> manage non-printable characters, just not blow things up if they exist.
> Limiting to the XML-RPC supported set is for convenience; these are
> unprintable characters in any context. This is just a two-fer.

The fact that the characters are non-printable doesn't really mean anything 
until they are actually printed. Characters U+007F to U+009F are 
non-printable as well, yet you treat them as printable in the patch. 
What makes them so special?
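
For reference, XML-RPC inherits the XML 1.0 valid-character range, which
rejects most C0 controls but happily allows the C1 controls. A quick sketch
of that check (the function is my own, written for illustration, not taken
from the patch):

```python
def is_valid_xml10_char(c):
    # XML 1.0 "Char" production: #x9 | #xA | #xD | [#x20-#xD7FF]
    # | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    n = ord(c)
    return n in (0x09, 0x0A, 0x0D) \
        or 0x20 <= n <= 0xD7FF \
        or 0xE000 <= n <= 0xFFFD \
        or 0x10000 <= n <= 0x10FFFF

# U+0001 is rejected, but the equally non-printable U+0085 and U+007F pass.
```

So filtering on the XML-RPC character set is not the same thing as filtering
on printability.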

>
> Petr was right, I need to encode to unicode before doing this comparison
> in order to compare invalid unicode characters so I moved that first.
>
> I added a very basic unit test.
>
> If you're wondering where this data might come from, two ways are via AD
> sync and migration.

Slightly unrelated, but is it possible that strings in an encoding other 
than UTF-8 might come from there?
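
For example (my own illustration, not IPA code): bytes that are perfectly
good Latin-1 text need not be valid UTF-8, so migrated data that was "text"
in its source directory can trip the same decode errors as truly binary
values:

```python
# A Latin-1 encoded name from an old directory is not valid UTF-8.
name = u'M\u00fcller'.encode('latin-1')      # b'M\xfcller'
try:
    name.decode('utf-8')                     # fails: 0xfc is not valid here
except UnicodeDecodeError:
    recovered = name.decode('latin-1')       # succeeds: u'M\u00fcller'
```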

>
> Yes, the user will be left in a situation where they'll need to use
> --setattr or ldapmodify to manage the data in the field. The UI doesn't
> show the value at all but instead shows [object Object] (no errors in
> console, not sure where this is coming from). It is possible to
> highlight this and insert a new value though.
>
> rob

Again, things will break if you let arbitrary binary strings in IPA. 
Consider this:

 >>> u'unicode' + '\xffgarbage'
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: 
ordinal not in range(128)

When you do an operation on unicode and str operands, the str operand is 
first UTF-8 decoded. This will fail if the str operand doesn't contain 
valid UTF-8, as you can see in the example.
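
The defensive alternative is to decide up front whether a value is text at
all (sketch only; the helper is my own, not part of the patch):

```python
def decode_or_none(value, encoding='utf-8'):
    """Return the value as text, or None if the bytes don't decode."""
    try:
        return value.decode(encoding)
    except UnicodeDecodeError:
        return None
```

But having to remember to call such a check everywhere a value is touched is
exactly the maintainability burden I described above.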

There are many places in the code where such operations take place and 
so far we assumed they won't fail, because there was no way invalid 
UTF-8 strings could reach them. Your patch changes that, so you have to 
fix this problem and make sure new code doesn't suffer from it as well. 
This might turn out to be hard, hence my reluctance about your approach.

I might be repeating myself, but IMO printing a warning to the user that 
a bad value was encountered and ignoring the value in the rest of the 
framework should work for everyone. By that I mean something like this:

$ ipa user-find jdoe
ipa: WARNING: invalid 'email' value encountered: 'ff6230726b6564'
--------------
1 user matched
--------------
   User login: jdoe
   First name: John
   Last name: Doe
   Home directory: /home/jdoe
   Login shell: /bin/sh
   UID: 526
   GID: 526
   Account disabled: False
   Password: False
   Kerberos keys available: False
----------------------------
Number of entries returned 1
----------------------------
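
In code, the filtering could look roughly like this (the function name and
warning wording are mine, only meant to illustrate the idea):

```python
import binascii

def filter_text_values(attr, values, warn):
    """Keep only values that decode as UTF-8; warn about the rest."""
    kept = []
    for value in values:
        try:
            kept.append(value.decode('utf-8'))
        except UnicodeDecodeError:
            warn("invalid '%s' value encountered: '%s'"
                 % (attr, binascii.hexlify(value).decode('ascii')))
    return kept
```

This way nothing downstream of the LDAP layer ever sees a non-unicode value.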

Honza

-- 
Jan Cholasta
