Author: intrigeri
Date:  
To: garrett.f.robinson
CC: tech, Tor Assistants, tails-dev
Subject: Re: [T(A)ILS-dev] Third GSoC proposal for the meta-data anonymizing toolkit project
Hi Garrett, hi Tails/Tor/EFF folks,

Garrett, here is my first batch of questions for you. I aim to help
you improve your proposal in the very short time that's left.

>    2. Research existing tools the are used to edit metadata. There are
>       numerous proprietary tools designed for lawyers, law enforcement
>       and so on. Some open source tools that I found through research
>       are Phil Harvey’s ExifTool,


Sounds like you did not finish this sentence.
What do you think of the Hachoir [0] library?

[0] http://bitbucket.org/haypo/hachoir/wiki/hachoir-parser

>    3. Identify a set of file formats the will be processed by the
>       project by the end of the summer. Current ideas are: Office
>       document formats, PDF, Image Formats (jpg, png, gif, tiff,
>       raw) and Camera Formats (many are proprietary).


I mostly agree with this list of file formats; can you please state in
what order you plan to implement those? I'd like to read your goals as
part of your proposal.

>    4. Design an API or similar framework to allow new file formats to
>       be supported by the program without needing to add to the source
>       code.


I do appreciate that you take the future extensibility of the program
to heart.

But... I am unsure about the "without needing to add to the source
code" part. I might be misunderstanding what you mean. On the one
hand, you're talking of an API; which is by definition intended to be
used by other programs. On the other hand, I more or less understand
an implicit "adding features without programming" behind some parts of
your proposal.

Can you please expand a bit on this topic?

In case you really mean "adding features without programming", as the
"specification for file metadata" + "rules" parts of your proposal
seem to indicate, I think it is a bit over-engineered. Don't you fear
designing and implementing this can eat a bit too much of the three
months? Do you really think you can achieve that during the first
three weeks? Your code samples don't display the kind of experience I
think is needed to attack this problem in such an abstract way with
confidence that the results will be practically useful.
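
For comparison, here is a minimal sketch of what I would understand by
"an API for adding formats": support for a new format is still code,
but it lives in its own small class and plugs into a registry. All
names below (HANDLERS, register, handler_for, PlainTextHandler) are
made up for illustration; this is not a proposal for the actual design.

```python
# Hypothetical registry mapping a file extension to a handler class.
HANDLERS = {}


def register(extension):
    """Class decorator that records a handler for one file extension."""
    def decorator(cls):
        HANDLERS[extension] = cls
        return cls
    return decorator


@register(".txt")
class PlainTextHandler:
    """Trivial example handler: plain text has no metadata container."""

    def clean(self, data):
        return data


def handler_for(filename):
    """Return a handler instance for `filename`, or raise LookupError."""
    lowered = filename.lower()
    for extension, cls in HANDLERS.items():
        if lowered.endswith(extension):
            return cls()
    raise LookupError("no handler registered for %s" % filename)
```

Adding a format then means writing one more decorated class, which is
programming, but does not require touching the core of the program.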

>    5. Write the code. I envision this program being available both
>       as a command line tool and as a simple GUI. Focus on the
>       command line program first, and build the GUI on top of it.


Do you really mean building the GUI on top of the command-line
program? In my experience, doing so is pretty hard and inelegant once
you take error handling into account, considering the limitations a
command-line program encounters when it needs to express in a
structured way how/why a failure happened.

If the command-line tool already existed, I might agree that the
design you are proposing is the way to go. But this is not the case,
and you have the chance to e.g. put most of the code into a library
that could be used by both the command-line tool and the GUI, which
would probably make your code much easier to unit-test and the GUI
much easier to implement.
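
To make this concrete, here is a rough sketch of the split I have in
mind. The library reports failures as exceptions that carry structured
information; the command-line front end merely turns them into text
and an exit status, and a GUI could catch the very same exceptions.
Everything here (MetadataError, UnsupportedFormat, clean_file, the
field names returned) is hypothetical and only stubs out the real work.

```python
import sys


class MetadataError(Exception):
    """Base class for failures reported by the (hypothetical) library."""


class UnsupportedFormat(MetadataError):
    """Raised when no cleaner exists for the given file."""

    def __init__(self, path):
        super().__init__("unsupported file format: %s" % path)
        self.path = path  # structured detail a GUI can use directly


def clean_file(path):
    """Library entry point (stub): return the names of removed fields."""
    if not path.lower().endswith((".jpg", ".jpeg")):
        raise UnsupportedFormat(path)
    # A real implementation would rewrite the file here.
    return ["Author", "GPSPosition"]


def main(argv):
    """Thin command-line front end built on top of the library."""
    status = 0
    for path in argv:
        try:
            removed = clean_file(path)
            print("%s: removed %s" % (path, ", ".join(removed)))
        except MetadataError as exc:
            print("error: %s" % exc, file=sys.stderr)
            status = 1
    return status
```

A script would end with sys.exit(main(sys.argv[1:])); the GUI would
call clean_file() itself and never parse the command-line output.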

What do you think?

> Timeline:


> Pre-SoC: Research identifying issues with metadata. Pick a few key
> formats to focus on and research them, find specifications, figure
> out how they store metadata and how it can be safely altered or
> removed. Pick one random format and write a program to clear its
> metadata, to get an idea of the challenges and process involved.


Ok. I'll add some notes below on other things that you may want to
deal with at this time.

> Week 1-2: Design a specification for file metadata. How can we
> create “rules” for locating and removing metadata depending on
> filetype?


I already expressed my bad gut feelings about the "filetype
specification + rules" thing so I won't repeat myself here.

> Week 3: Begin writing a command line program that can use a filetype
> specification to clean metadata from documents of that type. Focus on
> getting this to work with one filetype specification and just a
> handful of essential command line options. Also investigate
> verification - how can I be confident there is no metadata left in the
> file, or that the program has worked?


Focusing on a single file format to start with: great.

Great to see you thought of verification; however, when do you plan to
go further than "investigate" on this topic?
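
As an example of how small a first verification step could be, here is
a sketch that re-reads a PNG file and lists the metadata-carrying
chunks still present. It relies only on the PNG chunk layout (4-byte
length, 4-byte type, data, 4-byte CRC). The list of chunk types is my
own selection and is not exhaustive, so an empty result is a sanity
check, not a proof; make_chunk is a helper for building test input.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# Ancillary chunk types that commonly carry metadata (not exhaustive).
METADATA_CHUNKS = {b"tEXt", b"zTXt", b"iTXt", b"eXIf", b"tIME"}


def make_chunk(ctype, payload):
    """Build one PNG chunk; only used here to create test input."""
    crc = zlib.crc32(ctype + payload)
    return struct.pack(">I", len(payload)) + ctype + payload + struct.pack(">I", crc)


def leftover_metadata_chunks(data):
    """Return the metadata chunk types still present in PNG bytes."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    found = []
    offset = len(PNG_SIGNATURE)
    while offset + 8 <= len(data):
        length, ctype = struct.unpack(">I4s", data[offset:offset + 8])
        if ctype in METADATA_CHUNKS:
            found.append(ctype.decode("ascii"))
        offset += 12 + length  # length field + type + data + CRC
    return found
```

Running such a check on every cleaned file, e.g. from the test suite,
would at least catch regressions in the cleaning code.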

> Week 4: Expand command line options to include useful features like
> batch processing, verbose/force options, and viewing of metadata as
> well as deleting it.


Deleting or replacing it with a common data set shared by all users of
this toolkit, so that the biggest possible Anonymity Set is created?
See the tails-dev mailing-list archives in the last few days for
recent discussion about this topic.
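
One way to read the "common data set" idea, sketched on a plain
dictionary rather than on a real file: fields for which a shared value
exists get that value, and every other field is dropped, so that two
users end up with indistinguishable metadata. The field names and
values in COMMON_VALUES are invented for the example.

```python
# Hypothetical shared values: every user of the toolkit gets the same.
COMMON_VALUES = {"Author": "Anonymous", "Software": "unknown"}


def normalize(metadata):
    """Return new metadata containing only shared, user-independent values.

    Fields without an agreed common value are removed entirely.
    """
    return {key: COMMON_VALUES[key] for key in metadata if key in COMMON_VALUES}
```

With this approach normalize() returns the same result for any two
files that carry the same set of field names, whatever the original
values were.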

> Week 5-6: Add more filetype specifications to the program, and test
> them on a variety of inputs. Focus on a few core document types:
> Office Documents, PDF, and Image Files seem to be the most important.


> Week 7-8: Develop a GUI for the software that will make it accessible
> to those not working on the command line. The GUI should allow files
> to be selected, metadata to be viewed, and then cleared, with various
> options such as choosing a location for the cleared copy, etc.


> Week 10: Address portability and packaging.


I'm not comfortable with addressing such matters this late.

Portability (quoting myself - tails-dev, few days ago): "[...] for the
Windows and OSX support: in my humble opinion, if you really want this
requirement to be part of your project, you need to carefully choose
the libs you use with this in mind right from the start. Starting to
port the code to these platforms after 1.5 months spent writing it is,
IMHO, very much optimistic."

Anyway. I'd like to read what your portability and packaging plans
are.

I'll quote myself on the packaging topic too, since I explained this
to another student on the tails-dev mailing list a few days ago.
Please note that this GSoC project idea was submitted by Tails; if
your proposal is accepted, your mentor will be a Tails developer; a
goal could be to write a tool that we can install into Tails at the
end of the summer. So:

  1. In order to be properly installable in Tails, your set of tools
     will need to be packaged for Debian, i.e. we should at least be
     able to prepare custom .deb packages from your code. You do not
     necessarily need to plan preparing the Debian packages yourself,
     especially if you don't know Debian packaging yet (do you?). Nor
     do you need to care that much about having the software uploaded
     to Debian: filing an RFP (Request For Package) bug should be
     enough from your side.
     Well, I'd be delighted if you would integrate these tasks as part
     of your summer schedule if you feel this is realistic, but don't
     worry too much if you feel you can't.


  2. In order to make sure your code can be used in Tails, making sure
     you pick tools and libraries that are available in Debian Squeeze
     (+ squeeze-backports) is a must and should not be a late
     requirement. Also, when do you plan to start testing your code
     inside of Tails? I suggest doing this on a regular basis, e.g.
     once a week or once every two weeks, to make sure you don't go
     the wrong way for too long.


> Week 11: Hunt down bugs and squash them.


> Week 12: Documentation, specification, plan for future maintenance.


Design and/or end-user documentation?

Specification == ??? (the only way I understand it, it does not make
sense, so I am probably misunderstanding).

>    1. Project status after the summer


> My goal for the summer is to have the command line version of the
> program complete and ready to use. I also plan to have made serious
> progress on the GUI - an ideal goal would be to complete one, but in
> my experience there are a lot of kinks to be worked out when
> designing effective GUIs for cross-platform applications.


I'd rather see you give up a few tasks (e.g. supporting fewer file
formats initially, fewer operating systems, or giving up the "adding
support for new formats without programming" feature) in order to have
enough time to deliver a GUI we can ship in Tails at the end of the
summer. To be honest, a common pitfall of GSoC projects is to be left
in a promising but not practically usable state at the end of the
summer. While you seem eager to go on working on this toolkit after
the GSoC period is over, I'd rather not depend on this.

> After the summer, the program will probably benefit from
> improvements to the GUI and adding support for additional file types
> (via the API, so anybody can easily contribute). I would be happy to
> continue to work on and maintain the project for as long as it is
> useful - I really want to get it out there and into the community if
> there is a desire to see something like this!


Glad to hear this!

>    1. Mentoring and Management


> I forsee this project as being a relatively independent endeavor. Of
> course I would highly value the guidance of a mentor, but I do not
> think I would need to be closely supervised. I am highly motivated,
> a dedicated worker, and a careful and thorough problem solver. There
> is also no existing code base, which would ease some of the
> difficulties and requirements for oversight related to working on an
> existing project.


I fully understand you were not aware this was a project idea proposed
by Tails. I'd now like to know how interested you are in making the
"be installable in Tails at the end of the summer" goal part of your
proposal, as I have been suggesting in several places in this email.


Time for random questions.

What language do you intend to use to implement this project? All of
your code samples are in Python, so I guess this is it, but IMHO it
should be made explicit. Is this the language you are the most at ease
with? Have you ever written (or participated in writing) bigger
Python programs than the ones you mentioned? (I could understand many
reasons why you could be unwilling / unable to show such programs to
us, but I'd like to know a bit more about your past programming
experience if you don't mind.)

What toolkit do you intend to use for the GUI? Please note that using
PyGTK would ease integrating it into Tails.

You don't mention l10n/i18n as part of your proposal. Could you please
consider making it explicit the code will be l10n/i18n-ready?
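
In Python, being "l10n/i18n-ready" mostly means routing every
user-visible string through gettext from day one, roughly as in the
sketch below. The domain name "metadata-toolkit" and the summary()
function are placeholders; fallback=True makes the code return the
untranslated strings when no message catalogue is installed.

```python
import gettext

# Hypothetical translation domain; no catalogue needs to exist yet.
_translation = gettext.translation("metadata-toolkit", fallback=True)
_ = _translation.gettext
ngettext = _translation.ngettext


def summary(count):
    """Return a translatable, plural-aware status message."""
    return ngettext("%d field removed", "%d fields removed", count) % count


def unsupported(path):
    """Return a translatable error message for an unsupported file."""
    return _("Unsupported file format: %s") % path
```

Translators can then be given a catalogue extracted from the source
without any further change to the code.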

Do you plan to write unit tests during the development process? If
so, when?

Bye,
--
intrigeri <intrigeri@???>
| GnuPG key @ https://gaffer.ptitcanardnoir.org/intrigeri/intrigeri.asc
| OTR fingerprint @ https://gaffer.ptitcanardnoir.org/intrigeri/otr.asc
| So what?