Re: [Tails-ux] Report on Piwik prototype

Author: intrigeri
Date:  
To: Tails user experience & user interface design, Tails system administrators
Subject: Re: [Tails-ux] Report on Piwik prototype
Hi,

thanks a lot for working on this!

sajolida:
> A single user is defined by:


>     (OS, browser + browser plugins + IP address + browser language)


Did you really mean plugins, or are add-ons taken into account as well?

> - I'm not sure boum.org will be able or ready to have IP in their logs.


It's worth asking. We could suggest they store these logs in RAM only,
and either help them get enough RAM to handle it, or retrieve the logs
as often as needed so they don't need too much memory.

> - We won't have analytics from people without JavaScript.


I suspect that's a pretty small portion of our website visitors, but
it would be sad to simply ignore them.

> - The JavaScript might not give us all the analytics we need.
> For example the hits on the security upgrade feed by Tails Upgrader.


Can we do both, i.e. import the logs *and* set up the JS?
Will Piwik be able to de-duplicate hits?

> Resources needed
> ================


> - I'm running it on a dedicated X200 with a Core 2 Duo P8700 and 4 GiB
> of RAM


What kind of storage hosted the raw logs and DB?

> - I imported logs from November 10 to January 19 (71 days) and the
> database is now 13.0 GiB (from 1.1 GiB of gzip Apache logs).


Wow! Note that this impacts not only storage, but probably also RAM
requirements to handle the dataset efficiently.
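
For a rough idea of scale, here is a back-of-the-envelope sketch based
only on the figures quoted above (71 days, 1.1 GiB of gzipped logs,
13.0 GiB of database). It assumes linear growth and that no raw data is
ever purged, which may well be too pessimistic; the function names are
made up for illustration.

```python
# Figures quoted in the report: 71 days of logs, 1.1 GiB gzipped,
# 13.0 GiB once imported into the database.
DAYS_IMPORTED = 71
GZIP_LOGS_GIB = 1.1
DB_SIZE_GIB = 13.0


def expansion_ratio(db_gib, logs_gib):
    """How many times bigger the DB is than the gzipped logs."""
    return db_gib / logs_gib


def daily_growth_gib(db_gib, days):
    """Average DB growth per day of imported logs."""
    return db_gib / days


def projected_size_gib(db_gib, days, horizon_days):
    """DB size after horizon_days.

    Assumption: growth stays linear and raw data is never purged.
    """
    return daily_growth_gib(db_gib, days) * horizon_days


if __name__ == "__main__":
    ratio = expansion_ratio(DB_SIZE_GIB, GZIP_LOGS_GIB)
    per_day = daily_growth_gib(DB_SIZE_GIB, DAYS_IMPORTED)
    one_year = projected_size_gib(DB_SIZE_GIB, DAYS_IMPORTED, 365)
    two_years = projected_size_gib(DB_SIZE_GIB, DAYS_IMPORTED, 730)
    print(f"expansion ratio: {ratio:.1f}x")
    print(f"growth per day:  {per_day:.3f} GiB")
    print(f"after 1 year:    {one_year:.1f} GiB")
    print(f"after 2 years:   {two_years:.1f} GiB")
```

Under these assumptions that is roughly 12x the size of the gzipped
logs, about 0.18 GiB per day, and on the order of 67 GiB after one year
if nothing is ever deleted.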

> My understanding is that I generated the reports for this period but
> didn't get rid of the raw data from the database. This is possible to
> do but is a different process I think.


Good to know. We'll need to learn more about this whenever we
seriously think of deploying this in production.

> - Processing all reports for all this data takes several hours,
> maybe almost a day. I didn't try to process a report for a single day
> only.


I'm curious what the bottleneck was:

* Were all CPU cores used during this process?
* Was I/O a blocker, i.e. were processes blocked waiting for I/O?
* Was all available memory used by this process?
* Did you configure MariaDB in any way to optimize for large DBs?
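
In case it helps answer the I/O question, here is a minimal sketch,
assuming the prototype runs on Linux (I have not checked this). It
compares two samples of the aggregate "cpu" line of /proc/stat and
reports the share of CPU time spent waiting for I/O in between. The
function names are hypothetical, and this is not something that was run
on the prototype.

```python
import time


def parse_cpu_line(line):
    """Return the counters of the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def iowait_fraction(before, after):
    """Share of CPU time spent in iowait between two 'cpu' lines.

    Counter order on Linux: user, nice, system, idle, iowait, irq,
    softirq, steal, guest, guest_nice. Guest time is already counted
    in user and nice, so only the first eight counters are summed.
    """
    old = parse_cpu_line(before)
    new = parse_cpu_line(after)
    deltas = [n - o for n, o in zip(new, old)]
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return deltas[4] / total


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat (Linux only)."""
    with open(path) as stat_file:
        return stat_file.readline()


def measure(interval=5.0):
    """Sample /proc/stat twice, interval seconds apart."""
    first = read_cpu_line()
    time.sleep(interval)
    second = read_cpu_line()
    return iowait_fraction(first, second)
```

Calling measure() a few times while the reports are being processed
would give a first hint: a consistently high value suggests that
processes are mostly waiting for storage rather than for CPU.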

> Lovely sysadmins
> ================


> I want guidance from the sysadmins team on how to move this forward and
> be integrated in our official infrastructure.


Now is a good time to ask, since we'll likely be upgrading our
hardware later this year.

To start with, we need the list of package dependencies, what access
you need besides a shell (e.g. write access to file X, ability to run
command Y as root), the list of DBs and directories to back up, and
resource requirements (ideally: current needs & what you'll need in
2 years).

We can discuss later the specifics of where to draw the line between
managing the other bits of the setup with Puppet vs. managing things
by hand. Each has serious pros & cons.

> The sysadmin work I had to do here was very little:


> - Default Apache configuration


Thankfully, it seems that nginx is supported as well.

> But on the long run there might be work needed to monitor the
> performance issues, tweak for better performance, etc. There is some doc
> about that on their website [4].


> [4]: https://piwik.org/docs/optimize-how-to/


Good to know. FWIW, what I found somewhat concerning at first glance:

* The part about Redis and Queued Tracking (which is currently
"BETA"), which would require us to dive into yet another technology
we don't know.

* We have no expertise internally wrt. efficiently handling large
datasets in a SQL database, nor wrt. hosting a high-traffic PHP
webapp, so the learning process will be slow. Let's keep in mind
that we have no such time allocated at the moment.

Cheers,
--
intrigeri