Cuckoo Sandbox ML

Project Name: Project 1 - CuckooML
Mentor: Hugo Gascon (ES)
Backup mentor: Jurriaan Bremer (NL)
Skills required: Python; familiarity with Scikit-Learn and machine learning concepts. Knowledge of Theano and Keras is also welcome.
Project type: New technology in existing tool.
Project goal: Implement a new machine learning module in Cuckoo to perform clustering, anomaly detection and classification of existing and new behavioral analyses.
Cuckoo Sandbox (developed during GSoC 2010-2015 with The Honeynet Project [1]) has evolved to become the de facto open-source standard for malware analysis systems. It can analyze malware in various Windows, Android [2] and Apple [3] environments, and it has a clean architecture and an easy-to-use UI. It is used by many open-source and commercial sandboxing efforts, including Google's own VirusTotal infrastructure.

The goal of this project is to develop a machine learning module for Cuckoo, using Scikit-Learn [4], that should be able to cluster all reports according to similar behaviors. For each class, the module will be able to find the most representative element (prototype). Once a clustering exists and a new sample is analyzed, the new report can be assigned to one of the clusters and compared with similar samples. The module should also be able to perform anomaly detection: if no similar behavior is observed, a new cluster should be created. It should be possible to choose among several methods to do this. For example, the distance to the clusters could be measured, or an SVM could be trained on existing data using the cluster labels. After the functionality based on stored analysis data from Cuckoo Sandbox is implemented, the module will be integrated into Cuckoo for command-line and web-based interaction.
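
The distance-based option described above can be illustrated with a minimal sketch in plain Python. It is not the CuckooML implementation: the feature vectors (imagined here as counts of three behavior categories per report), the threshold value, and the function names `prototype` and `assign` are all hypothetical. The prototype is taken to be the medoid of a cluster, which is one possible reading of "most representative element". In a Scikit-Learn based module, estimators such as `sklearn.cluster.KMeans` or `sklearn.svm.SVC` could play the clustering and classification roles instead.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def prototype(members):
    """Return the medoid: the member with the smallest total distance to the others."""
    return min(members, key=lambda m: sum(euclidean(m, o) for o in members))


def assign(vector, clusters, threshold):
    """Add `vector` to the nearest cluster, or open a new cluster if it is too far.

    `clusters` maps an integer label to a list of feature vectors and is
    modified in place. Returns (label, distance to the nearest prototype).
    """
    best_label, best_dist = None, float("inf")
    for label, members in clusters.items():
        d = euclidean(vector, prototype(members))
        if d < best_dist:
            best_label, best_dist = label, d
    if best_dist > threshold:
        # No similar behavior observed: treat as an anomaly and start a new cluster.
        best_label = max(clusters) + 1 if clusters else 0
        clusters[best_label] = []
    clusters[best_label].append(vector)
    return best_label, best_dist


# Hypothetical feature vectors standing in for existing behavioral reports.
clusters = {
    0: [(10, 0, 1), (12, 1, 0), (11, 0, 0)],
    1: [(0, 9, 8), (1, 10, 9), (0, 11, 9)],
}

# A report close to cluster 0, then one unlike anything seen so far.
print(assign((11, 1, 1), clusters, threshold=5.0))
print(assign((50, 50, 50), clusters, threshold=5.0))
print(sorted(clusters))
```

The threshold controls how readily new clusters are created, so in practice it would need tuning against real analysis data. Training an SVM on the cluster labels, the other option mentioned, would replace the nearest-prototype lookup with a learned decision boundary.
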
All code written by Kacper can be found on GitHub. He also kept a dedicated blog with weekly updates. The final achievements have been published on the Honeynet blog.