Request for test data based off of obfuscated live data

Toshio Kuratomi a.badger at gmail.com
Wed Nov 19 18:18:17 UTC 2008


John Palmieri wrote:
> ----- "Toshio Kuratomi" <a.badger at gmail.com> wrote:
> 
> <snip>
>> Getting koji data munged and transferred may be a problem as it is just
>> so darn big.  If we don't have to make changes to the data in koji, just
>> get it distributed, then we could give access to a backup... but that's
>> still a lot of information to transfer.
> 
> We would only need a portion of the data.  Ideally everything since the last supported version of each distribution (or one after so we get obsolete data to test against) but in reality the last month of activity should be suitable.
> 
This gets us into the realm of figuring out what we can delete from the
entire koji data store, which seems like a big can of worms.  Some things,
like usernames, have to be included in their entirety.  Other things, like
builds, can be a subset, but since there are dependencies between builds it
wouldn't be a simple "remove everything before this timestamp".

It gets us back into munging the koji data which is what I think we
should be avoiding.
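
To illustrate why a timestamp cutoff alone is not enough, here is a minimal
Python sketch of the selection step.  It assumes builds and their
dependencies have already been exported into plain dictionaries; the names
`builds_to_keep`, `builds`, `deps` and `cutoff` are hypothetical and do not
reflect the real koji schema or API.

```python
from collections import deque


def builds_to_keep(builds, deps, cutoff):
    """Return ids of builds at or after `cutoff` plus everything they depend on.

    builds: dict mapping build id -> completion timestamp (assumed shape)
    deps:   dict mapping build id -> iterable of build ids it depends on
    """
    # Start with every build that is recent enough on its own.
    keep = {build_id for build_id, ts in builds.items() if ts >= cutoff}
    queue = deque(keep)
    # Walk the dependency graph so older builds that recent ones
    # still need are retained as well.
    while queue:
        current = queue.popleft()
        for dep in deps.get(current, ()):
            if dep not in keep:
                keep.add(dep)
                queue.append(dep)
    return keep
```

Even in this toy form, an old build survives the cut whenever a recent build
reaches it through the dependency chain, so the exported subset can end up
much larger than "the last month of activity".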

>> pkgdb, fas, and bodhi are relatively small.
>>
>> fas is where we'd have our major security problems.  We can't give the
>> information out unmunged.  I've munged it before, though, so it's
>> doable.  How strict we need to be is an issue, though.  If we remove all
>> the identifying information in the people table except for the userid,
>> is that sufficient?  *Note: We probably also need to munge data in the
>> configs table.
> 
> As long as we randomly generate data for that (well, username at least).  Note that UIDs are easily mapped back to usernames so you might want to randomize those too.  Also I believe packagedb and bodhi use usernames as the key instead of UIDs so those would have to match accounts in the munged FAS db.  I would suggest generating a list of names from a dictionary and using that list to randomize names in the other services.  Of course the names need to correspond to group permissions so some logic would be needed to make sure records associated with a given name are valid.  However, having the ability to recreate the associated user names may not be an issue since all of that data is public.  More importantly we need to make sure we aren't giving out addresses, phone numbers, password hashes and other such keys.
>
pkgdb uses userids in the db.  Bodhi and koji use usernames.  I'm
migrating pkgdb to usernames (internally right now; the db and
public-facing APIs for 0.4).

If we have to munge usernames, that makes things harder: we can't just
dump the koji and bodhi dbs but also have to post-process them.  (Note:
usernames are another thing that the privacy policy allows us to give out.)
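
For reference, the kind of post-processing John describes (one name list
generated from a dictionary, reused across every service) could look
roughly like the Python sketch below.  It is only an illustration under
stated assumptions: rows are plain dictionaries, and the function names and
column names (`username`, `submitter`, `telephone`, `password`) are
hypothetical, not the actual fas, pkgdb, bodhi or koji schemas.

```python
import random


def build_pseudonyms(usernames, wordlist, seed=None):
    """Map each real username to a unique fake name drawn from `wordlist`."""
    real = sorted(set(usernames))
    words = sorted(set(wordlist))
    if len(words) < len(real):
        raise ValueError("word list too small for the number of users")
    rng = random.Random(seed)
    # sample() picks without replacement, so no two users share a fake name.
    fake = rng.sample(words, len(real))
    return dict(zip(real, fake))


def munge_rows(rows, mapping, user_columns, drop_columns=()):
    """Return copies of `rows` with usernames replaced and sensitive columns dropped."""
    munged = []
    for row in rows:
        clean = {k: v for k, v in row.items() if k not in drop_columns}
        for col in user_columns:
            if col in clean:
                # A KeyError here means a service references a user that
                # is missing from the account list; fail loudly.
                clean[col] = mapping[clean[col]]
        munged.append(clean)
    return munged
```

The mapping would be built once from the account list and then applied to
each dump in turn, which is what keeps a given person's records lined up
across services:

```python
mapping = build_pseudonyms(["alice", "bob"], ["apple", "birch", "cedar"], seed=1)
people = munge_rows(
    [{"username": "alice", "telephone": "555-0100", "password": "hash"}],
    mapping, ["username"], drop_columns=("telephone", "password"))
updates = munge_rows([{"submitter": "alice", "title": "foo-1.0"}],
                     mapping, ["submitter"])
```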

>> pkgdb and bodhi don't have information that is privacy policy sensitive.
>> (Which doesn't mean that some users won't like it... just that I think
>> we're covered.)
> 
> Mike's suggestion of running it by legal sounds like the best route. 
>  
Running it by legal just to be sure we're doing the right thing is good,
although we do have a list of things that we are allowed to make public
per the privacy policy and pretty good criteria for deciding on other
data.  I'm commenting more on the perception aspect than the pure
legal obligation.  I'm not saying I think it's going to be a problem,
just that we should be prepared for a few complaints even if it's
perfectly legal.

-Toshio

