Re: [T(A)ILS-dev] Another GSoC proposal for the meta-data anonymizing toolkit

Author: intrigeri
Date:
To: julien.voisin
CC: tor-assistants, The T(A)ILS public development discussion list
Subject: Re: [T(A)ILS-dev] Another GSoC proposal for the meta-data anonymizing toolkit

Hi Julien, hi fellow Tails developers!

Thanks for your quick reply. I think the next thing to do is to update
your proposal on the Google Melange website according to the results
of this discussion... and to reply to the next batch of questions I am
asking below.

> I haven't given any thought to secure removal,
> because it's heavily hardware/filesystem dependent.
> But I think a binding to srm (Secure Remove) or shred (GNU Core Utilities)
> would be fine.

I'll let you decide whether you want to make this feature an
{important, optional} part of your proposal. Anyhow, I'd be glad to
see it mentioned at least as a desirable future improvement, just to
make sure it is taken into account when designing the lib and apps.
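
To make the "binding to shred" idea concrete, here is a minimal sketch of
what such a wrapper could look like. The function names are hypothetical,
and the choice of three passes is only an assumption. As noted above, how
effective overwriting is depends heavily on the hardware and filesystem,
so this should not be read as a guarantee.

```python
import subprocess


def build_shred_command(path, passes=3):
    """Return the argv list that overwrites and unlinks `path` with shred."""
    return [
        "shred",
        "--iterations=%d" % passes,  # number of overwrite passes
        "--zero",                    # add a final pass of zeros
        "--remove",                  # unlink the file after overwriting
        "--",                        # end of options: odd file names are safe
        path,
    ]


def secure_remove(path, passes=3):
    """Run shred on `path`; raises CalledProcessError if shred fails."""
    subprocess.check_call(build_shred_command(path, passes))
```

Keeping the command construction separate from its execution makes the
wrapper easy to test without touching the disk.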

> But, I don't plan to implement the "Proprietary file needing
> conversion", since I don't think that it's the job of a
> "metadata-cleaner".

Fair enough.

>>Tails being based on Debian Squeeze, would you mind making
>>"compatibility with Debian Squeeze + official squeeze-backports" a
>>formal goal of your GSoC project?
>>
>Since the app will be coded in pure python,
>it will run on any platform with Python.
>But yes, I can do more intensive testing to ensure the compatibility.

I see your proposal now explicitly states the "run on Debian Squeeze"
requirement, which is great. Beware: using "pure Python" won't give
you this for free. Let me explain my point a bit.

You've mentioned you intend to use some libraries, such as Hachoir.

As you know, both Python and its libraries tend to add features in
every new release. Some of the newest releases may not be available in
the Debian Squeeze environment I described => in order to make your
results usable in Tails, you'll need to make sure you don't use any
feature that appeared too late to make it into Squeeze.

Also, not every Python library is packaged into Debian. For
maintainability reasons, I would not like to make Tails dependent on a
Python library that is not part of Debian => in order to make your
results usable in Tails, you'll need to make sure you only use
libraries that are part of Debian Squeeze.
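
One cheap way to notice such problems early is a small self-check, run
inside the target environment, that reports the interpreter version and
any dependency that cannot be imported. The sketch below is only an
illustration: the baseline version (2, 6) and the helper names are
assumptions, and the check deliberately avoids recent syntax. It does not
replace running the test suite on Squeeze itself.

```python
import sys

# Assumption: the oldest interpreter version the target system ships.
BASELINE_PYTHON = (2, 6)


def interpreter_ok(baseline=BASELINE_PYTHON, running=None):
    """True if the running interpreter is at least the baseline version."""
    if running is None:
        running = tuple(sys.version_info[:3])
    return tuple(running[:len(baseline)]) >= tuple(baseline)


def missing_modules(names):
    """Return the entries of `names` that cannot be imported here."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing
```

For instance, calling missing_modules() with the module names of the
libraries the tool depends on, on a system with only Debian packages
installed, lists the ones that would have to be dropped or replaced.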

End of explanation. No answer expected.

> I can learn git, it's not a big deal.

Great. Maybe this could happen during the Community Bonding Period so
that you don't spend too much time getting up to speed with the tools
during the actual coding time.

>>Can we by chance see this code?

>>Wouldn't it be better to add support to Hachoir itself (possibly
>>using such an external library) for file formats you want to support
>>but not supported yet?

> It's not exclusive. If someone has already done a lot of work in
> another library than hachoir, with this design, it would be easy to
> add it into the tool.

Ok.
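
For readers wondering what a design that makes it easy to plug in
another library might look like, here is a minimal sketch of a backend
registry keyed by MIME type. All class and function names are
hypothetical; nothing here is taken from the proposal or from Hachoir.

```python
class Backend(object):
    """Hypothetical interface that every format handler implements."""

    mimetypes = ()

    def strip(self, data):
        """Return `data` (bytes) with its metadata removed."""
        raise NotImplementedError


REGISTRY = {}


def register(backend_cls):
    """Class decorator: map each MIME type a backend declares to it."""
    for mimetype in backend_cls.mimetypes:
        REGISTRY[mimetype] = backend_cls
    return backend_cls


@register
class PlainTextBackend(Backend):
    """Toy handler: plain text carries no embedded metadata."""

    mimetypes = ("text/plain",)

    def strip(self, data):
        return data


def backend_for(mimetype):
    """Return a backend instance, or None if the format is unsupported."""
    backend_cls = REGISTRY.get(mimetype)
    return backend_cls() if backend_cls else None
```

With such a layout, a handler built on Hachoir and a handler built on
some other library would simply be two registered classes.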

>>I would prefer the whole meta-data fieldset to be replaced by a set of
>>data that would be common to every user of the meta-data anonymizing
>>toolkit.
>>
>>What do you think?

> It makes more sense than my idea: there are too many different fields,
> and too many interactions between them to do a nice "fields-analyser".

Ok.

> But I think that the good option would be to let the user choose between
> "I'm mister nobody"/"I don't want any meta" and a custom scheme.

> I don't really like the principle of "all or none".

> But, after all, the batch-mode "make my meta common" is a priority,
> and the "make me custom data" is more accessory.

Indeed.

IMHO this should appear clearly in your proposal, and not only in the
title (which is "Meta-data anonymizing toolkit" rather than "Meta-data
customization toolkit" for a reason).
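
The three behaviours discussed above (a common set shared by every user,
no meta-data at all, a custom scheme) can be sketched as follows. The
field names and values are purely illustrative assumptions. The point is
that the original fieldset is replaced wholesale and never merged with
the new one.

```python
# Assumption: illustrative field names and values only.
COMMON_METADATA = {
    "author": "Anonymous",
    "software": "",
    "created": "1970-01-01T00:00:00Z",
}


def anonymize(metadata, mode="common", custom=None):
    """Return the fieldset that replaces `metadata` entirely.

    mode "common": the set shared by every user of the toolkit (default)
    mode "none":   no meta-data at all
    mode "custom": a user-supplied set
    """
    if mode == "common":
        return dict(COMMON_METADATA)
    if mode == "none":
        return {}
    if mode == "custom":
        return dict(custom or {})
    raise ValueError("unknown mode: %r" % (mode,))
```

Making "common" the default reflects the priority agreed on above, with
customization available but secondary.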

>>If you don't mind, using PyGTK would avoid adding a dependency on
>>pyQT in Tails, that already depends on the former but does not ship
>>the latter. What do you think?

> I haven't looked a lot into GUI stuff, so I don't have any
> preference.

I see you switched your proposal to PyGTK. Thanks.

>>How do you plan to make it easy for us to test your code (say, every
>>week or two) in Real World conditions, i.e. in Tails? If you have any
>>experience in Debian packaging, this would be the way to go. Else,
>>please let us know and we'll deal with this part of the work.

It seems to me you did not answer this question of mine.

>>> Timeline:
>>
>>> * first three weeks :

>>
[...]
>>Ok. Only dark spot: "begin of the cmdline tool" seems quite vague to
>>me... especially since no other period is scheduled later to finish
>>this task.

> Since I'm planning to develop the tool feature by feature, I can't
> guarantee that the cmdline tool will be completed at the end of the
> first three weeks.

Right. So you need to schedule some time later to finish it, don't
you?

> "Beginning" is not the right word, "coding the essential features"
> fits better.

Well, I feel the need to insist a bit. I understand you'll begin your
implementation with the essential features, which seems fine to me.
Which command-line tool features do you consider essential, then, and
which ones are non-essential?

> I think I'll focus on :
> - pdf
> - mpeg audio
> - ogg
> - bmp/gif/jpeg/png
> - exe
> - archives (bzip2, zip, tar)

Fine with me.

>>> * 1 week : emphasis on the unit test
>>>    o For such a critic tool (the smallest crack could
>>>      compromise the user), the testing should be bulletproof !
>>>      So I’m planing to focus on it one week long.

>>
>>I'm not sure about this one, although I do like your emphasis on
>>robustness and unit testing.
>>
>>On the one hand the first three weeks schedule seems to indicate you
>>intend to implement the tests roughly at the same time as the tested
>>code ("implement the first tests (for EXIF)"), which I like very much.
>>On the other hand you schedule one full week dedicated to unit testing
>>at two-third of the coding period. It seems to me you do not need to
>>spend one week implementing tests at this point, if they have been
>>properly written {before, while, soon after} implementing the tested
>>code. What do you think?

It seems to me you did not answer this question of mine.
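
As an illustration of tests written at the same time as the code they
test, here is a small self-contained sketch. It uses PNG text chunks
rather than EXIF only because they can be handled with the standard
library alone. The function names are hypothetical, the list of chunk
types is not exhaustive, and CRCs are not verified.

```python
import struct
import unittest
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# Ancillary PNG chunks that carry text or a timestamp (not exhaustive).
METADATA_CHUNKS = (b"tEXt", b"zTXt", b"iTXt", b"tIME")


def make_chunk(chunk_type, data):
    """Serialize one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return (struct.pack(">I", len(data)) + chunk_type + data
            + struct.pack(">I", crc))


def strip_png_metadata(png):
    """Return `png` (bytes) without its text and timestamp chunks."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    kept = [PNG_SIGNATURE]
    pos = len(PNG_SIGNATURE)
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        chunk_type = png[pos + 4:pos + 8]
        end = pos + 12 + length  # 4 (length) + 4 (type) + data + 4 (CRC)
        if chunk_type not in METADATA_CHUNKS:
            kept.append(png[pos:end])
        pos = end
    return b"".join(kept)


class StripPngMetadataTest(unittest.TestCase):
    # Header of a 1x1, 8-bit greyscale image; pixel data omitted for brevity.
    IHDR = make_chunk(b"IHDR", struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0))
    IEND = make_chunk(b"IEND", b"")

    def test_text_chunk_is_removed(self):
        text = make_chunk(b"tEXt", b"Author\x00Julien")
        dirty = PNG_SIGNATURE + self.IHDR + text + self.IEND
        clean = PNG_SIGNATURE + self.IHDR + self.IEND
        self.assertEqual(strip_png_metadata(dirty), clean)

    def test_clean_file_is_unchanged(self):
        clean = PNG_SIGNATURE + self.IHDR + self.IEND
        self.assertEqual(strip_png_metadata(clean), clean)

    def test_non_png_is_rejected(self):
        self.assertRaises(ValueError, strip_png_metadata, b"GIF89a")
```

Each new format handler would come with a test case of this kind, so
that no separate testing week is needed to catch up.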

>>
>>> * Remaining weeks: cleanup, bugfixing, integration work, final
>>>   documentation

>>
>>Three weeks seem like quite long for this, but this might be because I
>>failed to see the exact scope meant behind your words. Mainly "integration
>>work" and "final documentation" may require very little or very much
>>time, depending on what you mean:
>>
>> - end-user and/or design documentation? (I'd rather design
>>   documentation to be written before/while/soon-after every step.)
>> - integration work == ? Packaging for Debian (very useful for Tails)
>>   and other distributions? Packaging for foreign operating systems
>>   (read: Windows, OSX; beware that this does not take much more time
>>   than expected, to the detriment of other planned tasks)? Did you
>>   mean anything else than that?

> I'd like to keep the three remaining week in case of problems,
> or missings features, so I am sure I'll be able to deliver a

Seems like you did not finish your sentence. I presume that's why
I can find no answer to my question about integration work.
I can find your answer about documentation a few lines below, though.

>>> Every Week-end : documentation time!
>>
>>End-user documentation and/or design documentation?

> Both.

> -design :
> I'd like to review my code frequently to document it,
> and to correct typos and other dumb stuff, in order to produce
> clean and readable code.

> -end-user :
> It's easier to document the code after being sure it works (so, not
> while developing/testing it), but not too late either.

Great. Considering you'll review and clean up your code + write
documentation in such an incremental way, it's likely you won't need
that much time for those tasks at the end of the coding period.

So this is still the part of your schedule that is the most unclear to
me. I appreciate your wish to plan time "in case of problems, or
missings features", but then please state it clearly in your proposal,
rather than listing tasks that will already be done by then,
according to the development process you are describing. If needed,
please clarify what part of these tasks you plan to do incrementally,
and what part you plan to do during these 3 weeks.

>>You need to know I spend most of my time offline, especially in the
>>summer. Therefore I mostly communicate over asynchronous media such
>>as email, which I generally read and reply every day => round-trip
>>time generally less than 24h. Knowing this, how do you see things?
>>What can be a reasonable way to make your "using irc quite a lot"
>>and my "mostly using email" fit together joyfully? E.g. we could
>>additionally formally schedule IRC or XMPP meetings on a regular
>>basis (say 45min every week) so that we can discuss things more
>>smoothly.

??

>>As a final question, what are your plans for the "Community Bonding
>>Period" [8] (April 25 - May 23)? Google describes this as "Students
>>get to know mentors, read documentation, get up to speed to begin
>>working on their projects."
>>
>> [8] http://googlesummerofcode.blogspot.com/2007/04/so-what-is-this-community-bonding-all.html

> I'd like to begin reading some documentation about GUI in Python,
> and to do some preliminary work:
> - thinking about problems I may encounter
> - doing prototypes/testing of my implementation ideas
> - exchanging some ideas with the community/my mentor
> - getting in touch with Hachoir lib

Ok. This is a great amount of work you plan to do during this time.
Maybe you would benefit from prioritizing these items. E.g. I see you
mention Python GUI programming as something you intend to spend some
time on during the Community Bonding Period. Reading other parts of
this discussion, it seems to me this is entirely novel to you, so
maybe you should consider making it your main technical topic for this
period, vs. other kinds of tasks you already feel quite comfortable
with. What do you think?

Other questions:

You mentioned "batch mode to handle a whole directory (or set of
directories)" in your list of deliverables, but I did not see it
anywhere in your schedule. Do you still plan to implement this? If you
do, when?
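
For reference, the core of such a batch mode could be fairly small. The
sketch below is only an assumption about its shape: the function names
are hypothetical, and `clean_file` stands for whatever per-file cleaning
routine the library ends up providing.

```python
import os


def iter_files(paths):
    """Yield the given files, and every file under the given directories."""
    for path in paths:
        if os.path.isdir(path):
            for dirpath, dirnames, filenames in os.walk(path):
                dirnames.sort()  # visit sub-directories in a stable order
                for name in sorted(filenames):
                    yield os.path.join(dirpath, name)
        else:
            yield path


def clean_batch(paths, clean_file):
    """Apply `clean_file` to each file; return the paths that failed."""
    failures = []
    for filename in iter_files(paths):
        try:
            clean_file(filename)
        except Exception:
            # Never fail silently: the caller must report these files.
            failures.append(filename)
    return failures
```

Returning the list of failures matters for this kind of tool: a file
that could not be cleaned must be reported to the user, not skipped
quietly.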

When you state in your proposal "I prefer to focus on the
algorithmes/efficacity of the tools and the CLI, instead of the GUI.",
are you expressing your personal preference in general, or your
feeling about what is more or less important for this specific
project?

In any case: I do agree the backend library's quality is very
important for any one of the other tools to work properly and be easy
to maintain. On the other hand, I think the GUI is at least as
important as the CLI. I think people used to CLI and manpage reading
are probably *already* able to anonymize meta-data in files, by using
exiv2, pdftk and friends; a consistent CLI for doing this
would clearly be most welcome, but please consider the case of other
(!CLI-friends) people who currently have no way to anonymize meta-data
in files.

> I think I could deliver a finished product after the GSoC.
> And I'd like to stay around and contribute more after that!

Glad to hear this :)

Bye,
--
intrigeri <intrigeri@???>
| GnuPG key @ https://gaffer.ptitcanardnoir.org/intrigeri/intrigeri.asc
| OTR fingerprint @ https://gaffer.ptitcanardnoir.org/intrigeri/otr.asc
| Did you exchange a walk on part in the war
| for a lead role in the cage?
