[T(A)ILS-dev] Third GSoC proposal for the meta-data anonymiz…

Author: intrigeri
Date:  
To: tails-dev
CC: tech, Tor Assistants, garrett.f.robinson
Subject: [T(A)ILS-dev] Third GSoC proposal for the meta-data anonymizing toolkit project

Hi Tails developers and other Tor/EFF involved people,

A third student has submitted a proposal aimed at implementing our
meta-data anonymizing toolkit during GSoC. Here it is. I am going to
review it and ask questions to Garrett ASAP. I doubt we'll have time
for many proposal / feedback cycles, considering this proposal arrived
very late :/ I'm nevertheless forwarding his proposal here as Garrett
agreed on IRC to discuss his proposal over public channels such as the
tails-dev mailing-list.

Meta-data anonymizing toolkit for file publication
Garrett Robinson

Email: garrett.f.robinson@???

Short description: A tool that helps users remove identifying metadata
from a variety of commonly used document formats, with an API that
makes it possible to add support for more filetypes easily.

Abstract

I propose to create a tool that assists users in removing identifying
metadata from files they wish to send anonymously over the Tor
Network.

1. The Project

The basis for this project is section m. of the “Project Ideas”
page. This project will be focused on creating a cross-platform,
accessible, and extensible tool to remove identifying metadata from
various file formats before transmitting them over the network.
Multimedia submissions from activists, such as cell phone/digital
camera photographs, audio and video, as well as common document
formats like Microsoft Office/OpenOffice/LibreOffice and Adobe PDF can
contain information about the devices and identities of those who have
created, edited, or possibly even just viewed them. This information
could theoretically be used, sometimes even quite easily (see the
recent cree.py project), to compromise the anonymity of these parties.
Although information concerning this issue has been mentioned all over
the web, there are only a handful of open-source tools available for
editing metadata, and none dedicated to this purpose.



Tasks:

   1. Research methods of identifying individuals from file metadata.
      The viewing and editing of some are obvious, such as the
      “Author” and “Last Saved by” fields in Microsoft Office
      metadata. Some are less so, such as in MP3 watermarking.
   2. Research existing tools that are used to edit metadata. There are
      numerous proprietary tools designed for lawyers, law enforcement
      and so on. Some open source tools that I found through research
      are Phil Harvey’s ExifTool,
   3. Identify a set of file formats that will be processed by the
      project by the end of the summer. Current ideas are: Office
      document formats, PDF, Image Formats (jpg, png, gif, tiff, raw)
      and Camera Formats (many are proprietary).
   4. Design an API or similar framework to allow new file formats to
      be supported by the program without needing to add to the source
      code.
   5. Write the code. I envision this program being available both as
      a command line tool and as a simple GUI. Focus on the command
      line program first, and build the GUI on top of it.
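
As a rough illustration of task 4, the sketch below shows one way new
file formats could be plugged in without touching the core program: a
small registry keyed by file extension. This is only an assumption
about how such an API might look. The proposal does not fix a language
or a design, and every name here (register, clean, the ".toy" format)
is hypothetical.

```python
from typing import Callable, Dict

# Hypothetical registry: maps a lowercase file extension to a cleaner.
# A cleaner takes the raw bytes of a file and returns cleaned bytes.
Cleaner = Callable[[bytes], bytes]
_CLEANERS: Dict[str, Cleaner] = {}


def register(*extensions: str) -> Callable[[Cleaner], Cleaner]:
    """Decorator that registers a cleaner for one or more extensions."""
    def decorator(func: Cleaner) -> Cleaner:
        for ext in extensions:
            _CLEANERS[ext.lower().lstrip(".")] = func
        return func
    return decorator


def clean(filename: str, data: bytes) -> bytes:
    """Dispatch to the cleaner registered for the file's extension."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    try:
        cleaner = _CLEANERS[ext]
    except KeyError:
        raise ValueError(f"unsupported file type: {filename!r}")
    return cleaner(data)


# Toy format, for illustration only: lines starting with "#author:"
# are treated as metadata and dropped.
@register("toy")
def clean_toy(data: bytes) -> bytes:
    kept = [line for line in data.splitlines(keepends=True)
            if not line.startswith(b"#author:")]
    return b"".join(kept)
```

With this shape, supporting another format means writing one function
and decorating it; the dispatch code in clean() stays unchanged.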




Timeline:

Pre-SoC: Research identifying issues with metadata. Pick a few key
formats to focus on and research them, find specifications, figure out
how they store metadata and how it can be safely altered or removed.
Pick one random format and write a program to clear its metadata, to
get an idea of the challenges and process involved.

Week 1-2: Design a specification for file metadata. How can we create
“rules” for locating and removing metadata depending on filetype?
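
To make the idea of per-filetype "rules" concrete, here is a minimal
sketch, assuming a rule is plain data that names where the metadata
lives and what to blank. It uses the fact that .docx files are zip
archives whose docProps/core.xml part holds fields such as the creator
and "last modified by" name. The RULES layout and apply_rule function
are hypothetical, not part of the proposal.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Hypothetical rule format: for each file type, which archive member
# holds metadata and which XML elements (by local name) to blank.
RULES = {
    "docx": {
        "member": "docProps/core.xml",
        "blank_elements": {"creator", "lastModifiedBy"},
    },
}


def apply_rule(data: bytes, rule: dict) -> bytes:
    """Return a copy of a zip-based document with the listed elements blanked."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(data)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            payload = src.read(info.filename)
            if info.filename == rule["member"]:
                root = ET.fromstring(payload)
                for elem in root.iter():
                    local = elem.tag.rsplit("}", 1)[-1]  # strip XML namespace
                    if local in rule["blank_elements"]:
                        elem.text = ""
                payload = ET.tostring(root, encoding="utf-8")
            dst.writestr(info.filename, payload)
    return out.getvalue()
```

Note that this only covers the fields named in the rule. The zip
container itself stores per-member timestamps, and other parts of the
document may carry identifying data, so a real rule set would need to
be considerably more thorough.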

Week 3: Begin writing a command line program that can use a filetype
specification to clean metadata from documents of that type. Focus on
getting this to work with one filetype specification and just a
handful of essential command line options. Also investigate
verification - how can I be confident there is no metadata left in the
file, or that the program has worked?
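
One possible approach to the verification question, sketched here for
PNG only: strip the chunks known to carry text, timestamps, or Exif
data, then re-parse the output and list any such chunks that remain.
The function names are hypothetical, and the chunk list is an
assumption about what counts as metadata, not an exhaustive one.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# Ancillary chunk types that commonly carry text, timestamps, or Exif data.
METADATA_CHUNKS = {b"tEXt", b"zTXt", b"iTXt", b"tIME", b"eXIf"}


def iter_chunks(png: bytes):
    """Yield (type, data) for each chunk in a PNG byte string."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos < len(png):
        length, ctype = struct.unpack(">I4s", png[pos:pos + 8])
        yield ctype, png[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC


def make_chunk(ctype: bytes, data: bytes) -> bytes:
    """Serialize one chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(ctype + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + ctype + data + struct.pack(">I", crc)


def strip_metadata(png: bytes) -> bytes:
    """Rebuild the file without the chunk types listed in METADATA_CHUNKS."""
    kept = [make_chunk(t, d) for t, d in iter_chunks(png)
            if t not in METADATA_CHUNKS]
    return PNG_SIGNATURE + b"".join(kept)


def remaining_metadata(png: bytes) -> list:
    """Verification pass: list metadata chunk types still present."""
    return [t for t, _ in iter_chunks(png) if t in METADATA_CHUNKS]
```

An empty result from remaining_metadata() only shows that the known
chunk types are gone. It cannot prove the absence of identifying data
in private chunks or in the image content itself, which is why
verification deserves its own investigation.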

Week 4: Expand command line options to include useful features like
batch processing, verbose/force options, and viewing of metadata as
well as deleting it.
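
The options listed for week 4 could map onto a command line interface
roughly as follows. This is a sketch using Python's standard argparse
module; the program name "mdclean" and the exact flag names are
placeholders, not decisions made in the proposal.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the hypothetical 'mdclean' command."""
    parser = argparse.ArgumentParser(
        prog="mdclean",  # placeholder name
        description="Remove or display file metadata (illustrative sketch).",
    )
    # Accepting several paths at once gives simple batch processing.
    parser.add_argument("files", nargs="+",
                        help="one or more files to process")
    parser.add_argument("-l", "--list", action="store_true",
                        help="display metadata instead of removing it")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="report each action taken")
    parser.add_argument("-f", "--force", action="store_true",
                        help="overwrite existing output files")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    for path in args.files:
        if args.verbose:
            action = "listing" if args.list else "cleaning"
            print(f"{action} {path}")
```

For example, "mdclean -v --list a.pdf b.jpg" would select viewing mode
for both files with verbose output.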

Week 5-6: Add more filetype specifications to the program, and test
them on a variety of inputs. Focus on a few core document types:
Office Documents, PDF, and Image Files seem to be the most important.

Week 7-8: Develop a GUI for the software that will make it accessible
to those not working on the command line. The GUI should allow files
to be selected, metadata to be viewed, and then cleared, with various
options such as choosing a location for the cleared copy, etc.

Week 10: Address portability and packaging.

Week 11: Hunt down bugs and squash them.

Week 12: Documentation, specification, plan for future maintenance.

1. Code Sample

The code from many of my recent projects can be found on my github at
https://github.com/handsomeransoms. walkproj is a good example of my
Python (Django) coding. A good example of a complete Python project of
mine is XenLabs, which may be found here:
http://www.cs.oberlin.edu/~xenlabs/

1. Why Tor/EFF?

Tor really demonstrates the incredible opportunity for open source
software to enable activism and powerful social change. I have been
working on a project related to secure, anonymous whistle blowing for
several months, and one of the key elements of our framework to
protect sources is their use of the Tor software. The Tor Project has
helped not only my project, but many others around the world with an
essential task. I would like to give back to the program by using my
skills to contribute to Tor.

1. Experiences

I spent the last summer working on XenLabs, an educational tool for
computer security classes. I inherited the project and some code from
previous students, and communicated with them via e-mail throughout the
summer to discuss design and resolve some tricky issues. I also worked
with my professor, Benjamin Kuperman, to design a program that would
be useful from the perspectives of both students and faculty. We also
worked on debugging together, testing the software in various
scenarios and hunting down bugs. The software was written in Python
and uses Xen, an open source virtualization program. I joined the Xen
mailing lists and IRC channel, asked a lot of questions and even
contributed a few solutions to common problems, although no code. The
Xen community was extremely helpful and my project would likely have
been derailed several times without their wisdom. The software is
available under the GPL here: We have been using it successfully in a
class of 25 students this semester. It will be used in future classes at
Oberlin College and is currently being implemented by SUNY Oswego as
well.

1. Summer Commitments

I am committed to working part time on the whistle blowing project
mentioned earlier this summer. I am one of a group of 3-4 developers,
so my time commitment will not be severe. I am anticipating working
approximately 20 hours a week on this project. I am also planning to
travel some later in the summer, and take up to two weeks off at
the end of August to go backpacking with my brother.

1. Project status after the summer

My goal for the summer is to have the command line version of the
program complete and ready to use. I also plan to have made serious
progress on the GUI - an ideal goal would be to complete one, but in
my experience there are a lot of kinks to be worked out when designing
effective GUIs for cross-platform applications. After the summer, the
program will probably benefit from improvements to the GUI and adding
support for additional file types (via the API, so anybody can easily
contribute). I would be happy to continue to work on and maintain the
project for as long as it is useful - I really want to get it out
there and into the community if there is a desire to see something
like this! I believe that my C skills would need improvement before I
could make contributions of value to Tor itself, but I would love to
be at that point someday. I am also going to continue working on the
whistle blower project mentioned earlier, and Tor will undoubtedly
continue to be an important component of our work.

1. Mentoring and Management

I foresee this project as being a relatively independent endeavor. Of
course I would highly value the guidance of a mentor, but I do not
think I would need to be closely supervised. I am highly motivated, a
dedicated worker, and a careful and thorough problem solver. There is
also no existing code base, which would ease some of the difficulties
and requirements for oversight related to working on an existing
project.

1. Education

I attend Oberlin College, where I am majoring in Computer Science. I
am graduating at the end of May.

1. Contact Information

I can be reached by:

E-mail: garrett.f.robinson@???

IRC: grobin, I’ll be idling in #tor and #tor-dev

Bye,
--
intrigeri <intrigeri@???>
| GnuPG key @ https://gaffer.ptitcanardnoir.org/intrigeri/intrigeri.asc
| OTR fingerprint @ https://gaffer.ptitcanardnoir.org/intrigeri/otr.asc
| Do not be trapped by the need to achieve anything.
| This way, you achieve everything.