Project 2 - HonEeeBox Data Management Interface

Primary mentor: David Watson (UK)
Student: György Kohut

Project Overview:
Creating a central public web-based malware information service based on continuous data collection from the Honeynet Project sensor network. The system will serve as a central repository for sensor-collected data, with mechanisms to enrich the data by invoking related services (dynamic malware analysis - CWSandbox/Anubis, virus scanning - VirusTotal, geo-IP lookup) and by capturing/generating statistics from the collection process and the collected data. The resulting data will be exposed through a rich web-based interface that aims to allow easy exploration of the presumably large set of information, as well as to offer an overview of high-level trends.

Project Plan:
The general architecture of the system consists of three main components:

  1. Data collection back-end
  2. Web front-end
  3. Relational database back-end (shared by component 1 and 2)

Components 1 and 2 will be developed from scratch and represent the two main deliverables of the project.
Component 3 will be an actual stable release of PostgreSQL, optimally PostgreSQL 9+, to leverage the new replication features for load-balancing read-only queries from the web front-end or when running presumably expensive statistics queries.
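
Assuming PostgreSQL 9.0's streaming replication is used for the read-only load balancing mentioned above, the relevant configuration is roughly the following (host names, addresses and the role name are illustrative):

```ini
# postgresql.conf on the primary
wal_level = hot_standby        # emit enough WAL for a hot standby
max_wal_senders = 3            # allow replication connections

# pg_hba.conf on the primary: let the standby connect for replication
# host  replication  repl  192.0.2.10/32  md5

# postgresql.conf on the standby
hot_standby = on               # accept read-only queries during recovery

# recovery.conf on the standby
standby_mode = 'on'
primary_conninfo = 'host=primary.example.org port=5432 user=repl'
```

The web front-end would then direct its read-only and statistics queries at the standby, keeping the primary free for the collection workload.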

1. Data collection back-end

This component interfaces with the Honeynet Project sensor network and further processes the sensor-collected data, mostly by invoking external/third-party services. The resulting data is stored in the database back-end.

For minimum functionality, the following interfaces will be provided:

  • Interface for Dionaea submissions (Honeynet pub/sub network TODO: clarify)
  • Third party:
      • CWSandbox web service
      • Anubis web service
      • VirusTotal
      • Geo-IP lookup service (if not going with offline geo-IP lookups TODO: clarify)
    The component has a message-driven architecture. Sub-components responsible for processing or for providing interfaces are largely self-contained and are connected to each other by message queues to form the component's workflow.

    Generally, the workflow is triggered immediately by the arrival of a new Dionaea submission; however, the design should allow a fair amount of control and extensibility.
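
    Although the actual implementation will be in Java/JMS, the queue-wired workflow pattern can be sketched with Python's standard library; the stage name, message fields and handler below are made up for illustration:

```python
import queue
import threading

def stage(name, inbox, outbox, handler):
    """Generic worker stage: consume messages from inbox, push results downstream."""
    def run():
        while True:
            msg = inbox.get()
            if msg is None:          # sentinel: shut the stage down
                inbox.task_done()
                break
            outbox.put(handler(msg))
            inbox.task_done()
    t = threading.Thread(target=run, name=name, daemon=True)
    t.start()
    return t

# Two queues wire a single enrichment stage into a workflow; a real
# deployment would chain more stages (sandbox, AV scan, geo-IP, ...).
submissions = queue.Queue()
enriched = queue.Queue()
stage("enrich", submissions, enriched,
      lambda s: dict(s, geo="lookup-pending"))

submissions.put({"md5": "d41d8cd98f00b204e9800998ecf8427e"})
submissions.put(None)            # sentinel
submissions.join()               # wait until the stage has drained the queue
result = enriched.get()
```

Because each stage only sees its inbox and outbox, stages can be rearranged or added without touching the others, which is the same property the JMS design aims for.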

    The component will be implemented in Java, largely within the semantics of the JMS API. Durable messaging and distributed transaction support to coordinate message queue and database access will be used for robustness.

    Which third-party components (most notably the JMS provider and the transaction manager) will be used for the implementation will be decided before coding starts. The Spring framework will probably be used to wire the parts together. Optionally, the GlassFish 3.1 application server could be considered as the runtime environment, as it provides a JMS provider and a transaction manager out of the box.

    2. Web front-end

    This component presents the collected data and generated statistics through a rich HTML/JavaScript interface. This includes at least the following:

  • Dionaea submission data
  • Sandbox analysis results
  • Antivirus engine scan results
  • Real-time statistics of component 1
  • Visualized statistics
    Generating statistics (and possibly visualizations) is considered to be part of this component. These operations are carried out in a batch-like manner and the results are cached.
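
    A minimal sketch of the batch-and-cache approach; the statistic, the TTL value and the record shape are invented for illustration:

```python
import time

_stats_cache = {}   # key -> (computed_at, value)
CACHE_TTL = 3600    # regenerate at most once an hour (illustrative)

def submissions_per_day(records):
    """Batch job: count submissions per calendar day."""
    counts = {}
    for r in records:
        day = r["timestamp"][:10]        # 'YYYY-MM-DD' prefix of an ISO timestamp
        counts[day] = counts.get(day, 0) + 1
    return counts

def cached(key, compute, *args):
    """Return a cached result, recomputing only when the entry has expired."""
    now = time.time()
    hit = _stats_cache.get(key)
    if hit is not None and now - hit[0] < CACHE_TTL:
        return hit[1]
    value = compute(*args)
    _stats_cache[key] = (now, value)
    return value

records = [{"timestamp": "2011-08-01T10:00:00"},
           {"timestamp": "2011-08-01T11:30:00"},
           {"timestamp": "2011-08-02T09:15:00"}]
daily = cached("per_day", submissions_per_day, records)
```

Serving requests from the cache keeps the presumably expensive aggregation queries off the request path.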

    The component will be implemented in Python using the Django framework on the back-end and the Ext JS library on the front-end.

    The use of additional or alternative third-party components will be decided before the implementation of this component starts (see the timeline).

    Project Timeline:

    - May 23 (Community bonding period)
    - planning
    - evaluating/choosing/setting up tools/third party components/libraries

    May 24 - July 10 (First interim period)
    May 24 - June 12
    - writing code for component 1
    June 13 - June 19
    - writing code for component 1
    - begin to move gradually towards real world testing component 1
    June 20 - June 26
    - testing and patching of component 1
    - writing documentation of component 1
    June 27 - July 3
    - testing and patching of component 1
    - begin to gradually change focus to component 2
    July 4 - July 10
    - component 1 considered "finished for now"
    - focus changed to component 2, begin to write code
    July 11 - July 15 (Mid-term evaluation period)
    - writing code for component 2
    - mid-term evaluation
    July 18 - August 15 (Second interim period)
    July 18 - July 31
    - writing code for component 2
    August 1 - August 7
    - writing code for component 2
    - begin to move gradually towards real world testing component 2
    August 8 - August 15
    - testing and patching of component 2
    - writing documentation of component 2
    August 16 - August 21 ("Pencils down" Period)
    - testing and patching of component 2
    - general review of code and documentation
    August 22 - Firm "pencils down" date
    August 26 - Final evaluation deadline


    Project Updates:

    - Firm "pencils down" - final update (for now)

      Further work on the "REST back-end", in particular reworking it with caching in mind. Built a quick test page with flot and Django for charting the current test case: submissions per minute for a given MD5.
      So the web front-end isn't really there at this point; rather, it's a proof of concept with lots of TODOs. But it's generic enough and will hopefully be a good foundation to build on for other interesting in-browser visualizations of larger amounts of data with quick response times.
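
      flot plots time series as arrays of [x, y] points with x as a millisecond timestamp, so the Django side essentially has to emit JSON of that shape; a sketch, where the minute-keyed input format is an assumption:

```python
import json
from datetime import datetime, timezone

def to_flot_series(per_minute_counts):
    """Convert {ISO-minute: count} into flot's [[ms_timestamp, y], ...] format."""
    points = []
    for minute, count in sorted(per_minute_counts.items()):
        dt = datetime.strptime(minute, "%Y-%m-%dT%H:%M").replace(tzinfo=timezone.utc)
        points.append([int(dt.timestamp() * 1000), count])
    return points

counts = {"2011-08-20T12:00": 3, "2011-08-20T12:01": 5}
payload = json.dumps(to_flot_series(counts))
```

The browser can then hand the decoded array straight to flot with its time-mode axis enabled.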

      Of course, there are not only charts to show on the web front-end, but seen as a whole it's probably the hardest part, so I wanted to start with it and get it as complete as possible. I'm planning to work on the project post-GSoC, to finish up what's already there in some form and to add what was imagined but didn't fit into the timeframe (more about that later in a separate post).

    - August 15

      Further work (mostly) on time-series data. Finally chose to store it in InfiniDB, which seems to perform substantially better for this scenario than querying the back-end's "master" database directly. Building a RESTful back-end that serves data from InfiniDB encoded as JSON for charting in the browser.

    - August 8

      Working on component 2. Implementing incremental generation of time-series (charting) data over submission record contents (MD5, attacker IP, geolocation and AS resolved from the IP, as well as combinations of these, e.g. individual geolocations for a given MD5) for fast retrieval through the web front-end.
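
      The incremental part amounts to folding each arriving record into a counter keyed by (dimension value, minute); a sketch for the MD5 dimension, with an invented record shape:

```python
from collections import defaultdict

# (md5, minute) -> submission count; updated incrementally as records arrive,
# so a chart query is a lookup rather than a scan over all submissions
series = defaultdict(int)

def ingest(record):
    """Fold one submission record into the per-minute, per-MD5 time series."""
    minute = record["timestamp"][:16]     # truncate ISO timestamp to the minute
    series[(record["md5"], minute)] += 1

for rec in [
    {"md5": "abc", "timestamp": "2011-08-08T10:00:12"},
    {"md5": "abc", "timestamp": "2011-08-08T10:00:45"},
    {"md5": "def", "timestamp": "2011-08-08T10:01:02"},
]:
    ingest(rec)
```

The same fold works for the other dimensions (attacker IP, geolocation, AS) by changing the key.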

    Planned for this week:

      Finish the above mentioned and build a web front-end that can show the data.

    - July 25

      Planning of component 2. Did some research on appropriate analysis tools to complement Postgres; Hadoop with Hive/Pig seems to be a good fit. As submission records will be the main source of statistics (and a fast, ever-growing one), Hadoop tools may provide more efficient means for some tasks. Further evaluation is planned.

    Planned for this week:

      Further planning.
      Implement pushing some (near) real-time statistics from the back-end to the browser.

    - July 15 Mid-Term

      Still working on the back-end. Cleaned up and released code for testing for the mid-term assessment.
      Implemented module for Shadowserver DNS origin lookups.

    Planned for next week:

      Back to working on interfacing with Anubis and CWSandbox.
      Start moving over to component 2.

    - July 4

      Implemented module for VirusTotal lookups using their JSON API. Working on interfacing with Anubis and CWSandbox.

    Planned for this week:

      General code cleanup, documentation.
      More testing.

    - June 27

      Implemented module for Team Cymru's whois service. Working on the interface for VirusTotal.
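
      For reference, a sketch of parsing the service's pipe-delimited reply lines. The verbose field layout assumed here (AS | IP | BGP Prefix | CC | Registry | Allocated | AS Name) should be checked against Team Cymru's documentation, and the sample line is illustrative:

```python
def parse_cymru_line(line):
    """Split one pipe-delimited whois response line into named fields.

    Assumes Team Cymru's verbose bulk output layout; verify against the
    service documentation before relying on the field order.
    """
    fields = [f.strip() for f in line.split("|")]
    keys = ["asn", "ip", "prefix", "cc", "registry", "allocated", "as_name"]
    return dict(zip(keys, fields))

sample = ("23028   | 216.90.108.31    | 216.90.108.0/24 "
          "| US | arin | 1998-09-25 | TEAM-CYMRU, US")
rec = parse_cymru_line(sample)
```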

    Planned for this week:

      Finish the interface for VirusTotal.
      Implement interfaces for Anubis and CWSandbox.
      Begin planning component 2.

    - June 13

      The submit_http interface is almost finished. It needs to be (stress-)tested on longer runs with live honeypot submissions.
      Otherwise, working on the interfaces for VirusTotal, Anubis, CWSandbox.

    Planned for this week:

      Set up a testing environment with access to higher traffic honeypots and start testing the submit_http interface with live data.
      Start implementing modules for interfacing with third party services.
      Finish one of the three above mentioned interfaces completely.
      ...alternatively, work on the interface for the hpfeeds submission method.

    - June 6

      Laid out the fundamentals for component 1 (the back-end).
      The initial development/testing/deployment environment will be Glassfish 3.1.1 (currently pre-release). The back-end will be composed of a set of EJBs and OSGi bundles and/or inbound JCA connectors.
      Set up Glassfish and PostgreSQL 9 for development and worked out some administration practices that will be relevant for later deployments.
      Laid out code project structure and buildfile producing the above types of modules. Wrote test code to evaluate and learn details on the capabilities of the above setup (regarding JMS, OSGi, distributed transactions, threading).
      Currently implementing the interface for the submit_http submission method for Dionaea (and Nepenthes) submissions.

    Planned for this week:

      Finish the submit_http interface and start testing it with real data, ideally, with real honeypots.
      Start implementing modules for interfacing with third party services.

    Source Code
    HonEeeBox development is still a work in progress, but you can find snapshots of the GSoC project code here: