Re: [T(A)ILS-dev] Another GSoC proposal for the meta-data anonymizing toolkit

Author: julien.voisin
Date:
To: tails-dev, tor-assistants
Subject: Re: [T(A)ILS-dev] Another GSoC proposal for the meta-data anonymizing toolkit

>Hi Julien, hi fellow Tails developers!
>
>Foreword: I'm intrigeri, one of the core Tails [0] developers.
>We are participating in this year's GSoC under the EFF/Tor umbrella.
>The project you proposed for GSoC is one of the project ideas we
>had added [1] to the Tor Project's "Volunteer" page.
>
> [0] https://tails.boum.org/
> [1] https://www.torproject.org/getinvolved/volunteer.html#project-tails
>
>I am one of the Tails developers willing to be a mentor for
>Tails-related GSoC projects, and am likely to be your mentor in case
>your proposal is accepted by Google (as you probably know, their
>choice is heavily hinted by us actually).
>
>Welcome aboard!
>
>I've carefully read your proposal, and am now starting the proposal /
>feedback cycle we usually go through with students before the proposal
>submission period is over. As you submitted your proposal to a
>non-public place, I am not replying to it on a public place either:
>the amnesia@??? mailing-list I am sending this email to (and
>using as a remailer, by the way) is the private, OpenPGP-encrypted
>mailing-list [2] whose members are core, trusted Tails developers. On
>the other hand, I'd prefer to have this discussion on our public
>mailing-list (tails-dev@??? [3]) so that the Tails "community"
>can be involved in this process; additionally Cc'ing other involved
>people (mostly tor-assistants@???) would be great as well.
>If you don't mind, please let us know and reply there; if you do mind,
>I'm perfectly fine with it.

Thank you for your encrypted reply.
Since the 8th is approaching,
I no longer have any qualms about posting on the mailing list.

>
> [2] https://tails.boum.org/talk-dev/
> [3] https://boum.org/mailman/listinfo/tails-dev
>
>> I am interested to work on the “Meta-data anonymizing toolkit for
>> file publication” project. I know there is already a student
>> interested in by this project, but I really want to do it : I needed
>> it for my own and have already thought about a potential design some
>> time ago.
>
>Glad to read this. Have you read the discussion (archives: [4] and
>[5]) we already had with the other student?
>
> [4] https://boum.org/pipermail/tails-dev/2011-March/000222.html
> [5] https://boum.org/pipermail/tails-dev/2011-April/000228.html

Yes, I did

>
>Also, we've written a specification for this tool some time ago.
>You'll find it as an attachment to this message. You don't have to
>consider this specification as The Rule you Have To Obey, but it might
>help you understand the rationale and practical requirements we had in
>mind when adding this project to the Tor "Volunteer" page.
>
>Your proposal seems pretty close to this specification, but it misses
>the "secure-deletion after cleaning" feature. What do you think of it?
>Do you think you could implement the lib/cmd-line/GUI with this
>(possibly future) feature in mind? (Please don't feel compelled to add
>it to your proposal.)

I hadn't thought about secure removal,
because it's heavily hardware/filesystem dependent.
But I think a binding to srm (Secure Remove) or shred (GNU Core Utilities)
would be fine.
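
A minimal sketch of what such a binding could look like, assuming GNU shred
is installed and on the PATH. The names build_shred_command and secure_remove
are hypothetical, not part of any existing library:

```python
import shutil
import subprocess


def build_shred_command(path, passes=3):
    """Build the argument list for GNU shred (hypothetical helper)."""
    # --iterations: number of overwrite passes.
    # --remove: unlink the file once it has been overwritten.
    # "--" ends option parsing, so a file name starting with "-" is safe.
    return ["shred", "--iterations=%d" % passes, "--remove", "--", path]


def secure_remove(path, passes=3):
    """Overwrite and delete `path` by delegating to GNU shred."""
    if shutil.which("shred") is None:
        raise RuntimeError("shred not found; secure removal unavailable")
    subprocess.check_call(build_shred_command(path, passes))
```

Even with such a wrapper, the overwrite is only as reliable as the underlying
storage allows (journaling filesystems and flash media may keep old copies),
which is the hardware/filesystem dependency mentioned above.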

However, I don't plan to implement the "Proprietary file needing conversion"
part, since I don't think it's the job of a "metadata-cleaner".

>> I would like to work for the EFF, because I am very concerned about
>> privacy issues on the Internet. I think privacy is an essential
>> right, and not just an option. Especially, I would really enjoy to
>> work for the Tor project. I am using it for quite some time and
>> would like to get more involved and contribute back!
>
>If your proposal is accepted, you'll mainly work with Tails folks. We
>are now almost-officially considered to be a Tor project [6], so I
>guess this won't make that much difference for you, but I feel the
>need to make it clear soon enough. Is this a problem for you?
>
> [6] https://www.torproject.org/projects/projects.html.en

Nope, it isn't, but thank you for the clarification.

>> I use F/OSS on a daily basis (Ubuntu, Debian, Archlinux and Gentoo).
>
>Great.
>
>Tails being based on Debian Squeeze, would you mind making
>"compatibility with Debian Squeeze + official squeeze-backports" a
>formal goal of your GSoC project?
>
Since the app will be coded in pure Python,
it will run on any platform with Python.
But yes, I can do more intensive testing to ensure compatibility.
>
>> So far my major contributions were the writing of documentations for
>> archLinux, openmw, xda-forum and Ubuntu. Recently I have released a
>> little matrix manipulation library written in C, originally for an
>> academic project (http://dustri.org/lib/).
>
>I'm no expert at C, so I've asked other Tails developers to review
>this code.
>
>> I do not have any major plan for this summer (but my holidays only
>> begins the june 4th), so I can fully focus on the project and
>> reasonably think that I could commit 6 hours per day on it.
>
>Fair enough.
>
>> Requirement/Deliverables:
>[...]
>> o Let the user delete/modify a specific meta

>
>Are you sure about this one? My intuition tells me it could make the
>implementation much harder and the GUI much more complex.

You're right: actually, it doesn't seem like a good idea anymore:
too unpredictable and time-consuming.

>> I’d like to do this project in Python, because I already have done
>> some personal projects whith it
>
>Fine with me. I'm no Python expert but I feel competent enough to be a
>proper mentor.
>
>> (for which I also used subversion) :
>
>We use Git for Tails, but I would not mind using git-svn to fetch and
>review your work. On the long run, maintenance could more easily be
>shared using Git's distributed features if you can deal with it, but
>well, if you don't know Git yet, feel free to forget about it for the
>time being.

I can learn git, it's not a big deal.

>> [...] a battery monitor, a simple search engine indexing FTP
>> servers, ...
>
>Can we by chance see this code?

I have lost the code of the FTP crawler, and the battery monitor
was pretty small (<100 lines), so I don't think it's really relevant.

>Do I understand clearly you have never been producing and maintaining
>Free Software yet? It's no blocker, but being aware of it can help you
>being clear with your starting-point, and us be better mentors.

You're right.

>> Meta reading/writting library :
>
>> A library to read and write metas for various file formats. The main
>> goal is to provide an abstraction interface (for the file format and
>> for the underlying libraries used). At first it would only wrap
>> Hachoir. Why hachoir :
>[...]
>> But we could also wrap other libraries to support a particular file
>> format.
>
>Wouldn't it be better to add support to Hachoir itself (possibly
>using such an external library) for file formats you want to support
>but not supported yet?

The two are not mutually exclusive. If someone has already done a lot of
work in a library other than Hachoir, this design would make it easy to
add it to the tool.
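
To illustrate the idea, here is a rough sketch of such an abstraction layer.
All names (MetadataBackend, register, backend_for) are hypothetical, and the
backend shown is a stand-in that does no real parsing; a real one would wrap
Hachoir or another library behind the same interface:

```python
class MetadataBackend(object):
    """Common interface every wrapped library has to provide (hypothetical)."""

    extensions = ()  # file extensions this backend handles

    def read(self, path):
        raise NotImplementedError

    def strip(self, path):
        raise NotImplementedError


class DummyJpegBackend(MetadataBackend):
    """Stand-in for a real wrapper; returns placeholder data only."""

    extensions = (".jpg", ".jpeg")

    def read(self, path):
        return {"comment": "placeholder"}

    def strip(self, path):
        return True


_BACKENDS = []


def register(backend):
    """Make a backend available to the tool."""
    _BACKENDS.append(backend)


def backend_for(path):
    """Return the first registered backend handling this file, or None."""
    lowered = path.lower()
    for backend in _BACKENDS:
        if lowered.endswith(backend.extensions):
            return backend
    return None


register(DummyJpegBackend())
```

Adding support for another library would then only mean writing one more
subclass and registering it; the command line tool and the GUI would not
need to change.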

>> Or write ourself the support for a format, although this should be
>> avoided if possible (it looks simple at first, but supporting
>> different versions of the format and maintaining the thing over time
>> is extremely time consuming)
>
>Full ACK.
>
>> Hazard identification library:
>
>> The aim is to categorise the dangers associated to a metadata. Every
>> meta can pose a different level of hazard. Some fields are of
>> absolutely no threat to the anonymity, some might contain hazardous
>> and some fields can for sure compromise anonymity (ex : GPS
>> coordinates in EXIF). Based on that we want to inform the user as
>> best as possible.
>
>> When asked about one file, the lib would return the tree of metadata
>> coming from the meta reading library with a flag for each field,
>> possibly :
>
>> SAFE : no danger
>
>> HAZARDOUS : this field might contain dangerous informations
>
>> EVIL : this field does contain for sure dangerous informations
>
>> UNKNOWN : we dont know this field

>
>I'm unsure about this design's practical usefulness.
>
>First, it can be hard (from the application developer point-of-view)
>to guess what kind of data can be a threat to a given user's
>anonymity. Simple and stupid example from the top of my head: let us
>say there is a "Operating System" field in meta-data; it generally can
>be considered as not-that-much dangerous, but could reveal itself to
>be very dangerous in case the user is running a unusual OS such as a
>rare GNU/Linux distribution.
>
>Second, data in one particular field may not be a problem in itself,
>while the combination of this data with the one found in another field
>can be. You probably should have a look to the EFF Panopticlick
>website to see how a combination of individually harmless information
>can indeed be pretty much identifying of a given user.
>
>Generally, I'm very doubtful about the "analyze every field
>separately" way of doing, and even more doubtful about providing a
>user interface that would allow editing each individual field.
>
>IMHO, the goal of an anonymity tool shall be to build the biggest
>possible anonymity set, so that every user of this tool is hard to
>distinguish from others. You might want to read a bit about such
>concepts and the theory behind them; if you feel more like reading
>practical stuff, the Torbutton Design Documentation [7] may give you
>clues on this topic and be quite less abstract. Providing ways to
>treat every meta-data field separately would help every user to create
>her own "meta-data fieldset" that is likely to be different from other
>users' ones.
>
> [7] https://www.torproject.org/torbutton/en/design/index.html.en
>
>I would prefer the whole meta-data fieldset to be replaced by a set of
>data that would be common to every user of the meta-data anonymizing
>toolkit.
>
>What do you think?
>
>(The deadline being quickly approaching, let's try to find a common
>ground for agreement on the general idea, rather than focusing on
>details.)
>

It makes more sense than my idea: there are too many different fields,
and too many interactions between them, to build a good "fields-analyser".

But I think the best option would be to let the user choose between
"I'm mister nobody"/"I don't want any meta" and a custom scheme.

I don't really like the principle of "all or nothing".

But, after all, the batch mode "make my meta common" is the priority,
and "make me custom data" is more of an extra.
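
A small sketch of the two batch modes, under the assumption that "mister
nobody" means replacing everything with shared values and "no meta" means
stripping everything. The field names and the values in COMMON_FIELDS are
made up for the example:

```python
# Hypothetical shared values: every user of the tool ends up with the same set.
COMMON_FIELDS = {"author": "anonymous", "software": "unknown"}


def anonymize(metadata, mode="strip"):
    """Return new metadata for one of the two batch modes.

    "strip":  drop every field ("I don't want any meta").
    "common": replace the whole field set with values shared by all users.
    """
    if mode == "strip":
        return {}
    if mode == "common":
        # The input fields are ignored on purpose: keeping any of them
        # would make one user's output distinguishable from another's.
        return dict(COMMON_FIELDS)
    raise ValueError("unknown mode: %r" % mode)
```

With this approach two users with very different original metadata produce
identical output, which is the anonymity-set property asked for above.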

>> * Two tools using thoses libraries :
>
>> o A GUI (pyQT seems nice)

>
>If you don't mind, using PyGTK would avoid adding a dependency on pyQT
>in Tails, that already depends on the former but does not ship the
>latter. What do you think?

I haven't looked much into GUI stuff, so I don't have any preference.

>> However for this project, developing feature by feature seems more
>> appropriate : Starting by a skeleton implementing a thin slice of
>> functionality that traverses most of the layers. For example, I
>> could start by focussing on EXIF metas : make sure that the meta
>> reading/writing library supports EXIF, then do the hazard
>> identification for EXIF, then make the command line tool using the
>> previous libraries.
>
>Full ACK.
>
>> This allows a more incremental development flow and after only a few
>> weeks I would be able to deliver a working system.
>
>Sounds great to my ear.
>
>How do you plan to make it easy for us to test your code (say, every
>week or two) in Real World conditions, i.e. in Tails? If you have any
>experience in Debian packaging, this would be the way to go. Else,
>please let us know and we'll deal with this part of the work.
>
>> Timeline:
>
>> * first three weeks :

>
>>     o skeleton with support for EXIF :
>>       create the meta read/write lib using Hachoir, create the
>>       threat indentification library interface and add EXIF
>>       support, begin of the cmdline tool using the libraries.
>>     o implement the first tests (for EXIF)
>>     o create the structure in the repository (directories,
>>       README, ..)

>
>Ok. Only dark spot: "begin of the cmdline tool" seems quite vague to
>me... especially since no other period is scheduled later to finish
>this task.

Since I'm planning to develop the tool feature by feature, I can't guarantee
that the cmdline tool will be finished at the end of the first three weeks.
"Beginning" is not the right word; "coding the essential features" fits
better.

>> * 2 weeks: support of other metadata

>
>I think this shall be detailed a bit more. Four words are a bit quick
>way to describe 2 weeks of work, he.
>
>What file formats do you intend to support initially?
>I think PDF, images, audio and video files are the most important to
>support to start with. What do you think?
>

I think I'll focus on :
- pdf
- mpeg audio
- ogg
- bmp/gif/jpeg/png
- exe
- archives (bzip2, zip, tar)
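
As a first rough sketch, the tool could map a file to one of these format
families by extension. The table and the name format_family are hypothetical,
and a real implementation should probably inspect the file contents rather
than trust the extension:

```python
import os

# Hypothetical mapping from file extension to a format family.
SUPPORTED = {
    ".pdf": "pdf",
    ".mp3": "mpeg audio",
    ".ogg": "ogg",
    ".bmp": "image",
    ".gif": "image",
    ".jpg": "image",
    ".jpeg": "image",
    ".png": "image",
    ".exe": "exe",
    ".bz2": "archive",
    ".zip": "archive",
    ".tar": "archive",
}


def format_family(path):
    """Return the format family for `path`, or None if unsupported."""
    # splitext only looks at the last extension, so "x.tar.bz2" yields ".bz2".
    ext = os.path.splitext(path)[1].lower()
    return SUPPORTED.get(ext)
```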

>> * 3 weeks :Starting of the implementation of the GUI tool
>>     o at this stage I will have the knowledge from the command
>>       line tools and enough experience, I can apply it to
>>       develop the GUI

>
>Same here, I don't like "starting ..." that much. I prefer you to
>commit yourself to well defined tasks of the size and difficulty you
>think you are able to achieve, rather than expressing your goals in
>such an ambiguous way.
>
>And anyway, a more detailed plan for these three weeks would be much
>welcome.

Since I don't have much experience with GUIs (I prefer designing algorithms
to HCI), I don't know precisely how to detail this part.

Maybe something like:
- Design the interface
- Create the interface
- Implement the features the interface still needs in the
meta reading/writing lib
- Link the interface to the lib
>> * 1 week : emphasis on the unit test
>>     o For such a critic tool (the smallest crack could
>>       compromise the user), the testing should be bulletproof !
>>       So I’m planing to focus on it one week long.

>
>I'm not sure about this one, although I do like your emphasis on
>robustness and unit testing.
>
>On the one hand the first three weeks schedule seems to indicate you
>intend to implement the tests roughly at the same time as the tested
>code ("implement the first tests (for EXIF)"), which I like very much.
>On the other hand you schedule one full week dedicated to unit testing
>at two-third of the coding period. It seems to me you do not need to
>spend one week implementing tests at this point, if they have been
>properly written {before, while, soon after} implementing the tested
>code. What do you think?
>
>> * Remaining weeks: cleanup, bugfixing, integration work, final
>>   documentation

>
>Three weeks seem like quite long for this, but this might be because I
>failed to see the exact scope meant behind your words. Mainly "integration
>work" and "final documentation" may require very little or very much
>time, depending on what you mean:
>
> - end-user and/or design documentation? (I'd rather design
>   documentation to be written before/while/soon-after every step.)
> - integration work == ? Packaging for Debian (very useful for Tails)
>   and other distributions? Packaging for foreign operating systems
>   (read: Windows, OSX; beware that this does not take much more time
>   than expected, to the detriment of other planned tasks)? Did you
>   mean anything else than that?

I'd like to keep the three remaining weeks in case of problems
or missing features, so I am sure I'll be able to deliver a

>> Every Week-end : documentation time!
>
>End-user documentation and/or design documentation?

Both.

- design:
I'd like to review my code frequently to document it,
and to correct typos and other silly mistakes, in order to produce
clean and readable code.

- end-user:
It's easier to document the code once I'm sure it works (so, not
while developing/testing it), but not too late either.

>> and a blog-post about what I have done in the week.
>
>Great. An email to tails-dev would be most welcome as well
>(copy-pasting your blog post would probably be enough).
>
>> As for what I expect from my mentor, I think he should try to be
>> available when I need him specifically (e.g. technical questions no
>> one else on IRC can answer) but he doesn’t need to check on me all
>> the time.
>
>You need to know I spend most of my time offline, especially in the
>summer. Therefore I mostly communicate over asynchronous media such as
>email, which I generally read and reply every day => round-trip time
>generally less than 24h. Knowing this, how do you see things? What can
>be a reasonable way to make your "using irc quite a lot" and my
>"mostly using email" fit together joyfully? E.g. we could additionally
>formally schedule IRC or XMPP meetings on a regular basis (say 45min
>every week) so that we can discuss things more smoothly.
>
>As a final question, what are your plans for the "Community Bonding
>Period" [8] (April 25 - May 23)? Google describes this as "Students
>get to know mentors, read documentation, get up to speed to begin
>working on their projects."
>
> [8] http://googlesummerofcode.blogspot.com/2007/04/so-what-is-this-community-bonding-all.html

I'd like to begin reading some documentation about GUI in Python,
and to do some preliminary work:
- thinking about problems I may encounter
- doing prototypes/testing of my implementation ideas
- exchanging some ideas with the community/my mentor
- getting familiar with the Hachoir lib

>Bye, take care!
>--
> intrigeri <intrigeri@???>
> | GnuPG key @ https://gaffer.ptitcanardnoir.org/intrigeri/intrigeri.asc
> | OTR fingerprint @ https://gaffer.ptitcanardnoir.org/intrigeri/otr.asc
> | Every now and then I get a little bit restless
> | and I dream of something wild.
>
>Hi again,
>
>Julien, I forgot to ask:
>
>Will your project need more work and/or maintenance after the summer
>ends? What are the chances you will stick around and help out with
>that and other related projects?

I think I could deliver a finished product after the GSoC.
And I'd like to stay around and contribute more after that!

>Bye,
>--
> intrigeri <intrigeri@???>
> | GnuPG key @ https://gaffer.ptitcanardnoir.org/intrigeri/intrigeri.asc
> | OTR fingerprint @ https://gaffer.ptitcanardnoir.org/intrigeri/otr.asc
> | So what?
>
>
>--

I had a really long day today, so please forgive the mistakes if there are
any.

Have a nice day,

VOISIN Julien
