Main Page

From StatsCollect

Revision as of 16:34, 14 March 2012 by Muriel (Talk | contribs)



Overview

StatsCollect is a minimalistic framework that receives orders to gather statistics from external sources, stores them in backends and sends notifications about its execution. It is built around a plugin architecture in which backends, external sources and notifiers can be extended at the user's will.

The main practical goal of StatsCollect is to scrape websites (Facebook, Youtube, Twitter...) containing statistical data (number of friends, number of video plays, number of followers...) and to store this data in a common database.

Statistics

The statistics are represented as a collection of tuples containing:

  • A time stamp (when the statistic was taken)
  • A location (e.g. MySpace, Youtube...)
  • An object (e.g. a given video or song name)
  • A category (e.g. number of plays, number of likes)
  • A value (e.g. 42)
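
As an illustration only, one such tuple could be modelled with a Ruby Struct. The name Statistic and the field names below are assumptions made for this sketch, not StatsCollect's actual classes.

```ruby
# Hypothetical sketch: Statistic and its field names are illustrative,
# they are not taken from the StatsCollect source code.
Statistic = Struct.new(:timestamp, :location, :object, :category, :value)

# One statistic: a given video on a given location had 42 plays at that time.
stat = Statistic.new(Time.utc(2012, 3, 14, 16, 34), 'Youtube', 'My video', 'Number of plays', 42)
puts "#{stat.location} / #{stat.object}: #{stat.category} = #{stat.value}"
```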

A statistics order represents a range of statistics to gather. It contains:

  • A time stamp (when the order was issued)
  • A list of locations to get statistics from
  • A list of objects to get statistics from
  • A list of categories
  • A status (indicating whether it has been processed, is in error...)
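
A statistics order could be sketched in the same spirit. The struct name, the field names and the status symbols (:pending, :processed, :error) are all assumptions chosen for illustration.

```ruby
# Hypothetical sketch: StatsOrder, its fields and its status values are
# assumptions, not StatsCollect's real data model.
StatsOrder = Struct.new(:timestamp, :locations, :objects, :categories, :status) do
  # True once the order has been handled successfully.
  def processed?
    status == :processed
  end

  # True if the order failed and may need to be issued again.
  def in_error?
    status == :error
  end
end

order = StatsOrder.new(Time.now, ['MySpace', 'Youtube'], ['My song'], ['Number of plays'], :pending)
order.status = :processed
puts "Order processed? #{order.processed?}"
```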

Main flow

Here is StatsCollect's main flow:

  1. It is invoked from the command line.
  2. It queries the Backend to get a statistics order.
  3. It processes the order by querying the different Locations and gathering statistics.
  4. It stores the gathered statistics in the Backend, along with new orders in error if needed.
  5. It queries the Backend for a new order and goes back to step 3, unless no order remains.
  6. It sends notifications if there was any activity (at least 1 order processed, or errors encountered).

This architecture is particularly well suited to a daemon. Each stats order can be processed as a transaction (if the Backend plugin implements it).
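
The loop described in steps 2 to 6 can be sketched roughly as follows. This is not StatsCollect's code: the method names next_order, stats_for, store and notify are hypothetical, and the way failed locations are handed back to the backend is an assumption about how "new orders in error" might be recorded.

```ruby
# Hypothetical sketch of the main flow. All method names called on the
# backend, location and notifier objects are assumptions for illustration.
def run_collect(backend, locations, notifier)
  processed = 0
  errors = []
  # Steps 2 and 5: keep asking the backend for orders until none is left.
  while (order = backend.next_order)
    stats = []
    failed_locations = []
    # Step 3: query each location named in the order.
    order[:locations].each do |name|
      begin
        stats.concat(locations.fetch(name).stats_for(order))
      rescue StandardError => e
        failed_locations << name
        errors << "#{name}: #{e.message}"
      end
    end
    # Step 4: store the results, passing along the locations that failed
    # (assumed here to let the backend record a new order for them).
    backend.store(order, stats, failed_locations)
    processed += 1
  end
  # Step 6: notify only if something happened.
  notifier.notify(processed, errors) if processed > 0 || !errors.empty?
  processed
end
```

Passing the backend, locations and notifier as plain objects keeps the loop independent of any particular plugin, which matches the plugin architecture described in the overview.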

Command line usage

Here is the command line usage:

StatsCollect.rb [--help] [--debug] --backend <Backend> --notifier <Notifier> --config <ConfigFile>
       --backend <Backend>          <Backend>: Backend to be used. Available backends are: MySQL, Terminal
                                    Specify the backend to be used
       --notifier <Notifier>        <Notifier>: Notifier used to send notifications. Available notifiers are: None, SendMail
                                    Specify the notifier to be used
       --config <ConfigFile>        <ConfigFile>: The configuration file
                                    Specify the configuration file
       --help                       Display help
       --debug                      Activate debug logs
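
For readers who want to see how such a usage could be parsed, here is a minimal sketch using Ruby's standard OptionParser. Only the flag names come from the usage text above; the parse_args helper, the returned hash keys and the check for missing options are assumptions, and StatsCollect's actual parsing may differ.

```ruby
require 'optparse'

# Hypothetical sketch: parse_args and the option hash layout are assumptions.
# OptionParser provides --help by itself (it prints the usage and exits).
def parse_args(argv)
  options = { :debug => false }
  parser = OptionParser.new do |opts|
    opts.banner = 'StatsCollect.rb [--help] [--debug] --backend <Backend> --notifier <Notifier> --config <ConfigFile>'
    opts.on('--backend BACKEND', 'Specify the backend to be used') { |v| options[:backend] = v }
    opts.on('--notifier NOTIFIER', 'Specify the notifier to be used') { |v| options[:notifier] = v }
    opts.on('--config CONFIGFILE', 'Specify the configuration file') { |v| options[:config] = v }
    opts.on('--debug', 'Activate debug logs') { options[:debug] = true }
  end
  parser.parse(argv)
  # The three plugin-related options are treated as mandatory in this sketch.
  missing = [:backend, :notifier, :config].reject { |key| options.key?(key) }
  raise ArgumentError, "Missing options: #{missing.join(', ')}" unless missing.empty?
  options
end
```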

Configuration file

The configuration file is a normal Ruby file evaluating to a hash map. It contains 3 sections (Backends, Notifiers and Locations), each storing the specific parameters of each plugin. Here is an example of a configuration file:

   {
     # Configuration for Backends
     :Backends => {
       'MySQL' => {
         :DBHost => 'localhost',
         :DBUser => 'mydbuser',
         :DBPassword => '*****',
         :DBName => 'mydb'
       },
       'Terminal' => {}
     },
     # Configuration for Notifiers
     :Notifiers => {
       'SendMail' => {
         :SMTP => {
           :address => "",
           :port => 25,
           :domain => '',
           :user_name => 'smtpuser',
           :password => '*****',
           :authentication => nil,
           :enable_starttls_auto => false
         },
         # The From field
         :From => '',
         # To whom the notifications are sent
         :To => ''
       },
       'None' => {}
     },
     # Configuration for Locations
     :Locations => {
       'MySpace' => {
         :LoginEMail => 'MySpaceLogin',
         :LoginPassword => '****',
         # This is the last part of the profile URL
         :MySpaceName => 'myspace_user',
         # List of the blog IDs. They can be taken from their respective URLs.
         :BlogsID => [
           # ...
         ]
       },
       'Facebook' => {
         :LoginEMail => 'FacebookLogin',
         :LoginPassword => '*****'
       },
       'FacebookArtist' => {
         :LoginEMail => 'FacebookLogin',
         :LoginPassword => '*****',
         # URL of the page (after the /pages sub-directory) to fetch stats from
         :PageID => 'ArtistName/123456789012345'
       },
       'ReverbNation' => {
         :LoginEMail => 'ReverbNationLogin',
         :LoginPassword => '*****'
       },
       'AddThis' => {
         :Login => 'AddThisLogin',
         :Password => '*****',
         # List of objects for which we retrieve the AddThis stats
         :Objects => [
           # ...
         ]
       },
       'FacebookLike' => {
         # List of objects for which we retrieve the Facebook likes
         :Objects => [
           # ...
         ]
       },
       'Tweets' => {
         # List of objects for which we retrieve the tweets
         :Objects => [
           # ...
         ]
       },
       'Twitter' => {
         :Name => 'TwitterID'
       },
       'GoogleSearch' => {
         # List of objects for which we will query Google search
         :Objects => [
           # ...
         ]
       }
     }
   }
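
Because the configuration is plain Ruby evaluating to a hash, reading it can be as simple as evaluating the file. The sketch below shows one possible way to do it; load_config is a hypothetical helper name and the section check is an assumption, not necessarily how StatsCollect reads its configuration. Since eval runs arbitrary Ruby code, this should only be used on trusted files.

```ruby
# Hypothetical sketch: load_config is an assumed helper, not StatsCollect's API.
# Warning: eval executes the file's content, so only load trusted files.
def load_config(path)
  config = eval(File.read(path), TOPLEVEL_BINDING, path)
  raise ArgumentError, "#{path} does not evaluate to a Hash" unless config.is_a?(Hash)
  # Check that the 3 sections described above are present.
  [:Backends, :Notifiers, :Locations].each do |section|
    raise ArgumentError, "Missing section #{section} in #{path}" unless config.key?(section)
  end
  config
end
```

A plugin's parameters would then be reachable with something like config[:Backends]['MySQL'][:DBHost].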

Current plugins

Here is a list of the currently implemented plugins.

Backends

MySQL

Connects to a MySQL DB. It retrieves orders from it and also stores stats in it. It uses a transaction for each stats order. It stores statistical maps (lots of data) in a compressed, differential way. That is:

  • It stores the complete stat in a compressed way (using Zlib).
  • It later stores only stat differences, based on the most recent complete stat image.
  • It stores another complete image if too many differences have already been stored.
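
The following sketch illustrates the idea of compressed differential storage with Ruby's standard Zlib library. It is not the MySQL plugin's code: the DiffStore class, the in-memory records array, the use of Marshal for serialization and the MAX_DIFFS threshold are all assumptions made for the example.

```ruby
require 'zlib'

# Hypothetical sketch: DiffStore, MAX_DIFFS and the record layout are
# assumptions; only the full-image-plus-differences idea comes from the text.
class DiffStore
  MAX_DIFFS = 3 # assumed threshold before a new complete image is stored

  attr_reader :records

  def initialize
    @records = [] # each record is [:full, blob] or [:diff, blob]
    @last_full = nil
    @diffs_since_full = 0
  end

  # Store a statistical map, either as a complete image or as a difference
  # from the most recent complete image.
  def store(map)
    if @last_full.nil? || @diffs_since_full >= MAX_DIFFS
      @last_full = map.dup
      @diffs_since_full = 0
      @records << [:full, pack(map)]
    else
      changed = map.reject { |key, value| @last_full.key?(key) && @last_full[key] == value }
      removed = @last_full.keys - map.keys
      @diffs_since_full += 1
      @records << [:diff, pack([changed, removed])]
    end
  end

  # Rebuild the most recently stored map from the records.
  def latest
    return nil if @records.empty?
    kind, blob = @records.last
    return unpack(blob) if kind == :full
    full = unpack(@records.reverse.find { |record_kind, _| record_kind == :full }[1])
    changed, removed = unpack(blob)
    result = full.merge(changed)
    removed.each { |key| result.delete(key) }
    result
  end

  private

  def pack(data)
    Zlib::Deflate.deflate(Marshal.dump(data))
  end

  def unpack(blob)
    Marshal.load(Zlib::Inflate.inflate(blob))
  end
end
```

Computing each difference against the last complete image, rather than against the previous difference, means any stored map can be rebuilt from at most two records.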

Terminal

Dumps all operations to the terminal, and provides just 1 order, which collects all stats from everywhere. This plugin is mainly used for debugging purposes.

Notifiers

None

Does nothing. This is used when testing or debugging StatsCollect.

SendMail

Sends a mail to a given recipient. SMTP parameters are fully configurable.

Locations

AddThis

Gets statistics from objects indexed by an AddThis account.

CSV

Gets statistics from a CSV file.

Facebook

Gets statistics from a Facebook user account (friends).

FacebookArtist

Gets statistics from a Facebook artist account (likes).

FacebookLike

Gets statistics from a Facebook like button (likes). No Facebook account is needed to get these stats.

GoogleGroup

Gets statistics from a Google group.

GoogleSearch

Gets statistics from a Google search query. Counts the number of pages returned by Google.

MySpace

Gets statistics from a MySpace artist account. Includes friends, visits, comments, videos and songs statistics.

RB

Gets statistics from a Ruby file.

ReverbNation

Gets statistics from a ReverbNation account. Includes friends, visits, comments, videos and songs statistics.

Tweets

Gets statistics from a Tweet button. No Twitter account is needed.

Twitter

Gets statistics from a Twitter account (followers, following, lists, number of tweets).

Youtube

Gets statistics from a Youtube account. Includes video statistics.

Contact

If you wish to contribute to this project, send plugins, comment or ask for more details, don't hesitate to email me.

