Project 5 - Data mining module, finding frequent network-itemsets

Student: Zaccone
Primary mentor: Mario Karuza
Backup mentor: Jeff Nathan

Google Melange: http://www.google-melange.com/gsoc/project/google/gsoc2012/zaccone/21002

Project Overview:
The project will still apply Data Mining/Machine Learning solutions, but the idea has evolved slightly after discussions with the project mentors, Mario and Jeff.
The main idea is to build a tool for statistical/data mining analysis of data from dionaea sensors. This may mean correlating source/destination addresses with ports and different attack IDs by applying the Apriori algorithm. The data will be stored in a database (a schema will be designed) so that other tools (not part of GSoC) will be able to visualize the results.
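A minimal sketch of the Apriori idea mentioned above, assuming each connection is reduced to a set of tagged items. The function name `apriori`, the item format, and the sample addresses and ports are illustrative assumptions, not taken from the project (whose modules are written in C++).

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets found in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count every single item.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step: every (k-1)-subset of a candidate must be frequent.
            if all(frozenset(s) in current for s in combinations(union, k - 1)):
                candidates.add(union)

        # Count step: one pass over the transactions per level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical connections (made-up documentation addresses, not sensor data).
SAMPLE = [
    {"src=10.0.0.5", "dst=192.0.2.1", "dport=445"},
    {"src=10.0.0.5", "dst=192.0.2.1", "dport=445"},
    {"src=10.0.0.7", "dst=192.0.2.1", "dport=445"},
    {"src=10.0.0.5", "dst=192.0.2.2", "dport=80"},
]
RESULT = apriori(SAMPLE, min_support=3)
```

With a minimum support of 3, the only frequent pair in this sample is the destination address together with port 445. The pairs involving the source address occur only twice and are dropped.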
The application will, however, be built on a light framework that lets users plug in external modules and, by writing configuration files, create dynamic 'workflows'. Most of the loaded components can be reused many times in different configurations: channels that read and 'understand' data, processors that convert the input into a format the algorithm understands and afterwards convert the results back, and loggers that write data to the proper places such as stdout, a database, or a file. This is the general case, because by design every component needs to 'understand' its adjacent neighbour. I would also like to provide a couple of 'ready to use' components, for example channels reading data from hpfeeds channels and some generic loggers that write to stdout with coloring and standard formatting.
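The workflow idea can be illustrated with a small sketch. It assumes a plain dictionary in place of the real configuration file and an in-process registry in place of loading shared modules. All class names, method names, and config keys here are hypothetical and do not reflect the project's actual interfaces.

```python
# Registry of available components, keyed by the name used in the config.
REGISTRY = {}


def component(name):
    """Class decorator that makes a component selectable by name."""
    def register(cls):
        REGISTRY[name] = cls
        return cls
    return register


@component("list_channel")
class ListChannel:
    """Channel: supplies raw records (here from an in-memory list)."""
    def __init__(self, records):
        self.records = records

    def read(self):
        return list(self.records)


@component("split_processor")
class SplitProcessor:
    """Processor: converts records for the algorithm, then converts results back."""
    def preprocess(self, records):
        return [tuple(r.split()) for r in records]

    def postprocess(self, results):
        return [" ".join(key) + ": " + str(n) for key, n in results]


@component("count_algorithm")
class CountAlgorithm:
    """Algorithm stub: counts identical rows."""
    def run(self, rows):
        counts = {}
        for row in rows:
            counts[row] = counts.get(row, 0) + 1
        return sorted(counts.items())


@component("memory_logger")
class MemoryLogger:
    """Logger: keeps output lines in memory instead of stdout or a database."""
    def __init__(self):
        self.lines = []

    def log(self, lines):
        self.lines.extend(lines)


def build_workflow(config):
    """Instantiate one component per role, as named in the config."""
    parts = {}
    for role in ("channel", "processor", "algorithm", "logger"):
        spec = config[role]
        parts[role] = REGISTRY[spec["type"]](**spec.get("args", {}))
    return parts


def run_workflow(parts):
    """Pass data through channel -> processor -> algorithm -> processor -> logger."""
    raw = parts["channel"].read()
    rows = parts["processor"].preprocess(raw)
    results = parts["algorithm"].run(rows)
    parts["logger"].log(parts["processor"].postprocess(results))
    return parts["logger"]


CONFIG = {
    "channel": {"type": "list_channel",
                "args": {"records": ["10.0.0.5 445", "10.0.0.5 445", "10.0.0.7 80"]}},
    "processor": {"type": "split_processor"},
    "algorithm": {"type": "count_algorithm"},
    "logger": {"type": "memory_logger"},
}
LOGGER = run_workflow(build_workflow(CONFIG))
```

Swapping a component then only means changing a `type` entry in the configuration, provided the replacement 'understands' what its neighbours produce and expect.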
By design, the application should do its job periodically, and be able to collect data 'online' and on demand.

Project Plan:

  • April 23rd - May 20th: Community Bonding Period
    Student keeps in touch with his mentors, discussing a wide range of issues, from the high-level architecture and project functionality to the technology used for the project.
  • May 21st - July 1st
    First version of the framework, able to dynamically (un)load channels/modules/loggers and read parameters from the configuration file. This means the framework should be ready and robust enough to build modules doing the real work.
  • July 2nd - July 9th
    First versions of channels reading various data. I would like one channel to be used in the final configuration of the project and some others to serve just as extensions to the project.
  • July 9th - July 13th: Mid Term Assessments
  • July 14th - August 10th:
    Work on the Apriori data mining module. This also includes the processor algorithm.
  • August 13th: Suggested "pencils down" date, coding close to done
  • Working on project documentation.

  • August 20th: Firm "pencils down" date, coding must be done
  • August 24th - August 27th: Final Assessments
  • August 31st - Public code uploaded and available to Google

Project Source Code Repository:
Quechua

Student Weekly Blog: https://www.honeynet.or/blog/341

Project Useful Links:
The main idea about mining frequent itemsets was first proposed here

More detailed specification can be found here

Project Updates:
21 May 2012

Done last week:

  • Read autoconf/automake docs,
  • Read glibmm docs and wrote a few example programs,
  • Drafted a concept of the high-level architecture

28 May 2012

Done last week:

  • The project now compiles with autoconf/automake/libtool. Modules are compiled as .so shared files,
  • Main class structure,
  • Configuration file structure described and partially implemented (using libconfig++),
  • Basic debug system,
  • Loading shared modules based on data from the configuration file

Plans for next week:

  • Finish full config file parsing
  • Implement "interface" for smart fetching functions and class definitions from modules
  • Implement "dynamic" workflow building based on the configuration file (loading the proper algorithm class, channel, and processor)

Issues: none so far

05 June 2012

Done last week:

  • Fixed configuration file format
  • Loading channels and loggers and creating objects from shared modules
  • Some code refactoring

Plans for next week:

  • Make the config parsing and module loading more robust and less error-prone
  • Finish last week's task: instantiating Algorithm/Processor objects and making dynamic workflows

Issues: Had some problems with creating classes from shared modules, but everything works fine now (it took a couple of hours, however, so not 100% of the plan was fulfilled). I will be offline for the weekend and hence cannot work on the GSoC project.

11 June 2012 and 18 June 2012

My involvement was much smaller during these two weeks, as I had to finish my school tasks.

  • Made a channel module cooperating with the main libev loop, listening on the local port and passing data to the internal application structures.
  • Did proper channels/loggers loading based on configuration file.
  • The code now should work properly when some components are missing (e.g. due to intentional configuration)
  • The code compiles under the C++0x standard [changed configure.ac/Makefiles]

25 June 2012

Done last week:

  • Added more modules to test the whole workflow,
  • Added functions and structures representing commands for running some tasks in separate threads,
  • Crucial objects that are passed between certain components are now wrapped in shared pointers,
  • Added a standardised definition of a "Data package" carrying data between certain components.

Plans for next week:

  • Add code that 'cleans up' when the application is killed/signalled (freeing memory, closing sockets, stopping libev watchers),
  • Add signal handlers
  • Think about more general interfaces between 'adjacent' components (like channel<->processor, processor<->algorithm, processor<->logger etc.)

2 July 2012

Done last week:

  • Code stopping and destroying components
  • Added acinclude.m4 fixing ACX_PTHREADS error in configure.ac

Plans for next week:

  • Link Python library
  • Split interface.h files into multiple interface-*.h files
  • Embed python scripts calculating and converting dates and time

9 July 2012

Done last week:

  • Changed default config file name; installing conf in etc/, scripts in sbin/, default prefix set to /opt/quechua/*
  • interface.h split into interface-*.h files
  • Added logging functionalities (the -l switch)
  • Added simple Algorithm stub
  • Added time_converter.py file and methods in C++ for embedding Python scripts

Plans for next week:

  • Add error and sanity check in core modules
  • Add daemon mode
  • Improve logging mechanism

16 July 2012

Done last week:

  • Error checks in core modules
  • The program can be run as a daemon
  • Improved logging mechanism
  • Blog post

Plans for next week:

  • Design and implement a channel module reading data from the dionaea.connections tables
  • Design a channel for storing results

23 July 2012

Done last week:

  • Added DionaeaHarvester channel
  • Stub for ipproc processor
  • Stub for logging module

Plans for next week:

  • Implement ipproc processor and its internals
  • Finish logger module

30 July 2012

Done last week:

  • Fixed IpProcessor and PostProcessorLogger: both components are now more memory efficient
  • IpProcessor::ipIndex can handle ipIndex::ip by itself, or just store pointer (automatic memory management)
  • IpProcessor::postprocess(): copying dbresult_t object holding all the data
  • Added bootstrap.sh script for pre auto* tools

Plans for next week:

  • Implement Apriori algorithm
  • IpProcessor fixups and bug fixes

6 August 2012

Done last week:

  • Beta version of the Apriori algorithm module
  • Minor fixups in IpProcessor (former ipproc) module
  • Added extra logging information for cases when a module wasn't prepared/started correctly

Plans for next week:

  • Work on Apriori algorithm module
  • Improve hash tree in Apriori algorithm module
  • Add extra libev watcher for triggering iteration process in every Workflow

13 August 2012

Done last week:

  • Finished the Apriori algorithm module
  • Changed DionaeaHarvester::indexedTransactions so they are more compatible with the itemsets class
  • Adapted logging module to the whole workflow
  • Added Stamp class for transporting custom information between modules (like extra information, objects or references/pointers)
  • Added Python script calculating range dates for SQL query in DionaeaHarvester module
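The date-range calculation mentioned in the last item could look roughly like the sketch below. This is not the project's script: the function name `date_range`, the seven-day window, and the table and column names in the query are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def date_range(end, days):
    """Return (start, end) as Unix timestamps for the `days` days before `end`."""
    start = end - timedelta(days=days)
    return start.timestamp(), end.timestamp()


# Parameterized query; table and column names are assumed, not confirmed
# against the schema the DionaeaHarvester module actually reads.
QUERY = (
    "SELECT remote_host, local_port FROM connections "
    "WHERE connection_timestamp >= ? AND connection_timestamp < ?"
)

# Example: the week ending at midnight UTC on 13 August 2012.
END = datetime(2012, 8, 13, tzinfo=timezone.utc)
START_TS, END_TS = date_range(END, 7)
PARAMS = (START_TS, END_TS)
```

Using a half-open range (start inclusive, end exclusive) means consecutive periodic runs neither skip nor double-count a connection that falls exactly on a boundary.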

Plans for next week:

  • Find and fix minor bugs
  • Start writing some docs
  • Move application to the new repository (Google Code)
  • Polish the code