[Tails-ux] Report on Piwik prototype

Author: sajolida
Date:  
To: Tails user experience & user interface design, Tails system administrators
Subject: [Tails-ux] Report on Piwik prototype
Hi,

NB: I'm putting tails-sysadmins@??? in copy because I'm talking
about infrastructure here and feel too lazy to write two different
reports.

As part of #12562: "Have a web analytics platform" I started playing
with Piwik to do web analytics on the activity on our website.

Installation
============

I first tried the "official" Debian package [1], which is official for
Piwik but is not part of Debian. It didn't install on Stretch.

[1]: https://wiki.debian.org/Piwik

So I fell back on the ZIP installation, which was extremely easy: install
Apache, create a MariaDB database, and unzip. All the rest is done
through the web interface.

General impression
==================

Piwik seems to be very mature, professional, well designed, well
documented, etc. They have a solid business model and seem to make good
use of it.

It almost doesn't feel like free software. I'm jealous, ha ha!

I don't see ourselves using anything else.

Logs vs JavaScript
==================

Piwik usually gathers data through a JavaScript script embedded in the
pages but you can also import Apache logs into it [2]. You do so through
a Python script that generates HTTP requests to Piwik to emulate the
activity described in the logs.

[2]: https://piwik.org/log-analytics/
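
To make the idea of "replaying" logs more concrete, here is a minimal
sketch of turning one Apache log line into one tracking request. It is
not the code of the actual import script. The log format (Apache
"combined"), the Piwik URL, the site URL, and the sample line are all
assumptions made for illustration. The query parameters (idsite, rec,
url, urlref, ua, cdt) follow my understanding of Piwik's tracking HTTP
API and should be checked against its documentation.

```python
import re
from datetime import datetime, timezone
from urllib.parse import urlencode

# Apache "combined" log format (assumption: the real logs may differ).
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def tracking_url(line,
                 piwik="http://localhost/piwik/piwik.php",  # hypothetical
                 site="https://example.org",                # hypothetical
                 idsite=1):
    """Build the tracking request that emulates one logged page view.

    Returns None when the line does not match the expected format.
    """
    m = LOG_RE.match(line)
    if m is None:
        return None
    when = datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z")
    params = {
        "idsite": idsite,
        "rec": 1,
        "url": site + m["path"],
        "urlref": "" if m["referrer"] == "-" else m["referrer"],
        "ua": m["agent"],
        # Date of the original hit, expressed in UTC.
        "cdt": when.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S"),
    }
    return piwik + "?" + urlencode(params)


# Made-up log line, with the IP address already anonymized.
sample = ('0.0.0.0 - - [01/Jan/2000:12:00:00 +0000] '
          '"GET /install/index.en.html HTTP/1.1" 200 1234 "-" "Mozilla/5.0"')
print(tracking_url(sample))
```

Backdating hits like this normally requires an authentication token
on the Piwik side, which the sketch leaves out.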

As a start I imported the logs that we download from boum.org. These
logs have no IP addresses.

I missed this explanation in the Piwik doc so I'll tell you a bit more
about how I understand that it works. Piwik gathers raw data from the
web activity. It doesn't care much about single "hits" (single HTTP
requests) and almost only manipulates data as part of "visits" (what a
single user did on the website).

A single user is defined by:

    (OS + browser + browser plugins + IP address + browser language)


Then it periodically processes this data to generate "reports" on this
activity (daily, weekly, and monthly reports). Reports are an aggregate
of useful information extracted from the raw visits: statistics on the
page views, the devices, the browser vendors and versions, the operating
systems, languages, etc.

This doesn't really work with the logs we get from boum.org: without IP
addresses, all Tor Browser users look the same, so they are all
considered as a single visit and all the subsequent stats are deeply
screwed. For example, you can't study the path of a single Tor Browser
(thus Tails) user on the website.
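
The collapse described above can be illustrated with a small sketch.
This is not Piwik's actual algorithm, only a simplified model of "a
single user is defined by these fields". The hashing scheme, the
function name visitor_id, and the sample hits are all made up for
illustration.

```python
import hashlib
from collections import Counter


def visitor_id(os_name, browser, plugins, ip, language):
    """Derive a visitor identifier from the fields that define a user."""
    key = "|".join([os_name, browser, ",".join(sorted(plugins)), ip, language])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


# Hypothetical hits: (OS, browser, plugins, IP, language, path).
# The IP is always 0.0.0.0 because the logs carry no IP addresses.
hits = [
    # Three hits that could come from three different Tor Browser users:
    ("Windows", "Firefox 52", [], "0.0.0.0", "en-US", "/install/"),
    ("Windows", "Firefox 52", [], "0.0.0.0", "en-US", "/doc/"),
    ("Windows", "Firefox 52", [], "0.0.0.0", "en-US", "/support/"),
    # One hit from a browser with a different configuration:
    ("Linux", "Chrome 58", ["pdf"], "0.0.0.0", "fr", "/install/"),
]

visitors = Counter(visitor_id(*hit[:5]) for hit in hits)
print(len(visitors), "distinct visitors for", len(hits), "hits")
```

Since Tor Browser is designed to look the same for everybody, the
first three hits get the same identifier whoever sent them.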

If we agree that this is not good enough, I see two ways of singling out
visits of Tor Browser users:

a. Ask boum.org to deactivate IP anonymization. We could import the logs
daily and then get rid of the original logs and rely instead on the IP
anonymization feature of Piwik [3]. It's not a hack but serious stuff
built for users with legal requirements so I expect it to be well
integrated and to do what it should.

Downsides:

- I'm not sure boum.org will be able or ready to have IP addresses in
their logs.
- Relying on logs of activity done through the Tor network might not
provide a perfect way of singling out people. For example, I expect
people using the same exit node to visit our website to still be
considered as a single visit.
- We might still want to keep and analyze the raw logs for some
data that Piwik wouldn't provide us. For example, so far I haven't
found how to replace our boot statistics: the hits on
security/index.en.atom by libwww-perl only. It's probably possible
but I can't say yet. Another example is counting the hits on the hash
tags that I used to flag the activity related to the donation
campaign, although Piwik has other mechanisms to do this even better
next year.

[3]: https://piwik.org/docs/privacy/

b. Rely on the JavaScript. Again we could rely on the IP anonymization
feature of Piwik to keep sleeping at night. It's not clear to me whether
people using the same exit node would be singled out with this technique
(relying on some cookie maybe).

Downsides:

- We won't have analytics from people without JavaScript.
- The JavaScript might not give us all the analytics we need.
For example the hits on the security upgrade feed by Tails Upgrader.
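
Both options leave open how to count the hits on the security feed
made by libwww-perl. As a fallback, a few lines run on the raw logs
would be enough. This is a minimal sketch, not our current metrics
script. The log format (Apache "combined"), the leading slash in the
path, and the sample lines are assumptions.

```python
import gzip
import re

FEED = "/security/index.en.atom"

# Request, status, size, referrer and user agent of a "combined" log line.
REQ_RE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def count_feed_hits(lines, agent_prefix="libwww-perl"):
    """Count requests for the security feed made by the given user agent."""
    total = 0
    for line in lines:
        m = REQ_RE.search(line)
        if m and m["path"] == FEED and m["agent"].startswith(agent_prefix):
            total += 1
    return total


def count_feed_hits_in_gzip(path):
    """Same count, reading directly from a gzipped log file."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        return count_feed_hits(f)


# Made-up log lines for illustration.
sample_lines = [
    '0.0.0.0 - - [01/Jan/2000:00:00:00 +0000] '
    '"GET /security/index.en.atom HTTP/1.1" 200 512 "-" "libwww-perl/6.15"',
    '0.0.0.0 - - [01/Jan/2000:00:00:01 +0000] '
    '"GET /security/index.en.atom HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '0.0.0.0 - - [01/Jan/2000:00:00:02 +0000] '
    '"GET /install/index.en.html HTTP/1.1" 200 512 "-" "libwww-perl/6.15"',
]
print(count_feed_hits(sample_lines))
```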

I'm not sure what's best and it would anyway involve a more political
discussion about what information we want from our users. I'm happy to
gather impressions and hints on what such a discussion would imply but
I'm not sure this is the right place to have it.

Resources needed
================

Piwik is quite heavy on resources:

- I'm running it on a dedicated X200 with a Core 2 Duo P8700 and 4 GiB
of RAM

- Importing the logs for a full day of our website's activity takes
about 1.5 hours at full CPU.

- I imported logs from November 10 to January 19 (71 days) and the
database is now 13.0 GiB (from 1.1 GiB of gzipped Apache logs).
My understanding is that I generated the reports for this period but
didn't get rid of the raw data from the database. This is possible to
do but is a different process, I think.

- Processing all reports for all this data takes several hours,
maybe almost a day. I didn't try to process a report for a single day
only.
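
To put these numbers in perspective, here is a back-of-the-envelope
calculation based only on the figures above. The yearly projection
assumes that the database grows linearly and that the raw data is
never purged, which is a pessimistic assumption rather than a
measurement.

```python
# Figures reported above.
days = 71
db_gib = 13.0              # database size after the import
logs_gib = 1.1             # size of the gzipped Apache logs
import_hours_per_day = 1.5

# How much bigger the database is than the compressed logs.
growth = db_gib / logs_gib

# Average database growth per day of imported logs, in MiB.
db_mib_per_day = db_gib * 1024 / days

# CPU time spent importing the whole period on this machine.
total_import_hours = days * import_hours_per_day

# Naive linear projection over one year, keeping all raw data.
db_gib_per_year = db_gib / days * 365

print(f"database / gzipped logs: {growth:.1f}x")
print(f"database growth: {db_mib_per_day:.0f} MiB per day")
print(f"import time for {days} days: {total_import_hours:.1f} hours")
print(f"projected database size after a year: {db_gib_per_year:.0f} GiB")
```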

Lovely sysadmins
================

I want guidance from the sysadmins team on how to move this forward and
integrate it into our official infrastructure. The sysadmin work I had
to do here was very little:

- Default Apache configuration
- Create a MariaDB database
- Figure out a regex that works to import our logs with import_logs.py
- Learn some Piwik command line to trigger the archiving of the reports

But in the long run there might be work needed to monitor performance
issues, tweak for better performance, etc. There is some doc about that
on their website [4].

[4]: https://piwik.org/docs/optimize-how-to/

Wanna try it?
=============

I could give people accounts on the prototype. My Internet connection is
not super fast but you could at least give it a try and see how it looks.

Next steps
==========

- I'll try to do useful stuff through Piwik for #12082 "Analyze the
results of the donation campaign" over the summer.

- I'll see how we could replace our current metrics with Piwik. [#12728]

- I'll lead the discussion on how to single out Tor Browser. [#12729]

- I'll fix the analysis of search engine keywords which needs some
configuration to work with our ikiwiki logs. [#11649]

- I'll try to identify an example of smart insight we could get from
Piwik, starting from more abstract goals and seeing how this gets down
to being analyzed in Piwik. I'll start by finishing reading [5].

[5]: https://www.nngroup.com/articles/ux-goals-analytics/