datanommer: Making Fedora metrics more transparent

I kind of surprised myself when I realized I hadn’t blogged about this yet. I talked about it with Max, I talked about it with folks in #fedora-infrastructure, and I’m giving a talk at SELF that circles around this very project.

Ever since it began collecting statistics about itself, the Fedora Project has been open and transparent about the numbers we get and how we get them.

There’s just one problem with that: a lot of the actual raw data isn’t publicly available.

Of course, we don’t want to go about publishing raw httpd access logs to public locations. We don’t want everybody to be able to see the IP addresses that visit fedoraproject.org. But we do want people to be able to come up with a number for themselves that answers questions like “how many distinct IP addresses visited fedoraproject.org between January 4 at 4:32 a.m. and February 2 at 6:28 p.m.?” without giving everybody access to our log servers.

Or, even if the data is publicly available, it’s difficult to get that data because the application doesn’t provide an API of sorts (Mailman, for example). Writing a screen scraper for Mailman is non-trivial.

What if there was a central API that held raw data about the everyday activity of the Fedora community?

I plan to write that. And it shall be called “datanommer.” It’ll use the TG2 (TurboGears 2) stack, at the request of Infrastructure. Although it will be designed around Fedora’s existing infrastructure, it will be agnostic enough that other free software projects can use it right out of the box.

Here’s a quick summary of how it’ll work.

  • Applications that already make log files will have those transferred to our log servers by normal means. Applications that don’t already make log files will either use an extension, module or the like to write a log file, or an external script will create a log file, which will then be transferred to the log servers.
  • A cron job will populate a database used for datanommer based on those log entries.
  • The TG2 front end of datanommer will provide a RESTful API to access the data in the database. Applications that provide data and what data they provide to datanommer will be automatically documented for maximum usability.
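
The cron job step above can be sketched roughly as follows. This is only an illustration, not datanommer's actual design: the log format, the `events` table, and the helper names are all made up, and SQLite stands in for whatever database ends up being used.

```python
import sqlite3

# Hypothetical log format: "<ISO timestamp> <application> <event>", one entry per line.
SAMPLE_LOG = """\
2010-06-01T04:32:10 wiki page_edit
2010-06-01T04:33:55 mailman post
this line is malformed
2010-06-01T05:01:02 wiki page_view
"""


def parse_line(line):
    """Split one log line into (timestamp, app, event), or return None if malformed."""
    parts = line.split()
    if len(parts) != 3:
        return None
    return tuple(parts)


def count_events(conn):
    """Return the number of rows currently stored in the events table."""
    return conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]


def ingest(conn, log_text):
    """Load parsed log entries into the database and return how many rows were added."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "timestamp TEXT, app TEXT, event TEXT, "
        "UNIQUE (timestamp, app, event))"
    )
    before = count_events(conn)
    rows = [row for row in map(parse_line, log_text.splitlines()) if row is not None]
    # INSERT OR IGNORE skips entries that are already stored, so re-running
    # the job over the same log file does not create duplicates.
    conn.executemany("INSERT OR IGNORE INTO events VALUES (?, ?, ?)", rows)
    conn.commit()
    return count_events(conn) - before


conn = sqlite3.connect(":memory:")
added_first = ingest(conn, SAMPLE_LOG)   # first hourly run
added_again = ingest(conn, SAMPLE_LOG)   # same log picked up again
```

Making the load step safe to repeat is one way to get the "it doesn't matter if something was down" behavior: a missed or doubled run just gets sorted out the next time the job fires.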

At first glance, this may seem like a lot of hoops just to get some data. But here are some reasons we’re doing it this way, specifically:

  • Less load on the app servers. If we programmed datanommer to collect data from each application about once per hour, the app servers and databases would be under somewhat heavy load while that data is generated. Reading log files the applications already write avoids that.
  • If datanommer is down for some reason, it doesn’t matter, because the cron job writes data directly to the database rather than going through datanommer.
  • If the database is down for some reason, it doesn’t matter. The cron job will just wait another hour to populate the database.
  • If the log servers are down for some reason, it doesn’t matter. Logs are generated locally on each app server, much like httpd. The log servers will go through and pick up the logs when they get around to it.
  • If the applications are down for some reason, they won’t be generating any data anyway, so it doesn’t matter. :)

For the end user, accessing the data will be extremely easy. Since a REST API is just based on query parameters, you don’t have to be an expert to download data. It’ll be encoded in JSON so it’s easy to use in any language (especially Python, the lingua franca of Fedora Infrastructure).
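
To give a feel for what that could look like from the user's side, here is a small sketch. The API doesn't exist yet, so the base URL, the parameter names (`site`, `start`, `end`), the dates, and the JSON fields are all invented for illustration.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint; datanommer's real URLs are not decided yet.
BASE_URL = "https://example.org/datanommer/api/pageviews"


def build_query_url(base_url, **params):
    """Turn keyword arguments into a URL with an encoded query string."""
    return base_url + "?" + urlencode(sorted(params.items()))


url = build_query_url(
    BASE_URL,
    site="fedoraproject.org",
    start="2010-01-04T04:32",   # made-up example range
    end="2010-02-02T18:28",
)

# A made-up response body, standing in for what the server might send back.
SAMPLE_RESPONSE = '{"site": "fedoraproject.org", "distinct_visitors": 1234}'
data = json.loads(SAMPLE_RESPONSE)
visitors = data["distinct_visitors"]
```

Against a live server, the sample string would be replaced by the body of an HTTP GET on `url`, for instance via `urllib.request.urlopen(url)`.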

Of course, your thoughts about this process are definitely wanted. You can comment on this blog post to leave your suggestions.

Edit: I forgot to include a bit about privacy. Information that shouldn’t be publicly available, such as IP addresses or email addresses, will be stored in the database as UUIDs. Another table in the database will relate UUIDs to their original values, so that statistics such as pageviews from distinct IP addresses can still be computed. Privacy is a top priority in this project, and if we feel like we’re infringing on the privacy of our users and contributors too much, we will not report that information through this system.
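
A minimal sketch of the UUID idea, assuming an in-memory dictionary in place of the private lookup table (the `Pseudonymizer` class and its method name are hypothetical, and the IP addresses are reserved documentation addresses, not real visitors):

```python
import uuid


class Pseudonymizer:
    """Maps sensitive values to random UUIDs, reusing the UUID for repeat values."""

    def __init__(self):
        # Private mapping of original value -> UUID string. In the design
        # described above this would be a separate, non-public database table.
        self._by_value = {}

    def token_for(self, value):
        """Return the UUID for a value, creating one the first time it is seen."""
        if value not in self._by_value:
            self._by_value[value] = str(uuid.uuid4())
        return self._by_value[value]


pseudo = Pseudonymizer()

# Example hits: the first address visits twice.
raw_hits = ["192.0.2.10", "192.0.2.11", "192.0.2.10", "198.51.100.7"]

# Only the UUIDs would ever be exposed through the public API.
public_hits = [pseudo.token_for(ip) for ip in raw_hits]

# Distinct visitors can still be counted without knowing any real address.
distinct_visitors = len(set(public_hits))
```

Because the same address always maps to the same UUID, counts of distinct visitors stay accurate, while the UUIDs themselves reveal nothing about the original addresses.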

3 thoughts on “datanommer: Making Fedora metrics more transparent”

  1. A database is actually really not a good fit at all for holding a ton of access log type information. The number of rows gets big fast and then your queries start to get really, really expensive. Especially as a lot of the interesting information is various aggregates.

    That said, there is a lot of cool stuff out there which can help. It’s basically the problem that MapReduce exists to solve. And so looking at using Hadoop and Hive might make sense. But you’re now into a whole new world of infrastructure type stuff :-)

  2. Pingback: Ian Weller’s free software blog » Blog Archive » SouthEast LinuxFest: Growth is a good thing
