Introduction to CuckooML: Machine Learning for Cuckoo Sandbox
26 Aug 2016 Roberto Tanara cuckoo gsoc
CuckooML is a GSOC 2016 project by Kacper Sokol that aims to deliver the possibility to find similarities between malware samples based on static and dynamic analysis features of binaries submitted to Cuckoo Sandbox. By using anomaly detection techniques, such mechanism is able to cluster and identify new types of malware and can constitute an invaluable tool for security researchers.
It’s all about data..
Malware datasets tend to be relatively large and sparse. They are mostly made of categorical and string data, hence there is a strong need for good feature extraction approaches to obtain numerical vectors that can be feed into machine learning algorithms [e.g. Back to the Future: Malware Detection with Temporally Consistent Labels; Miller B., et al.]. Another common problem is concept drift, the continuous variation of malware statistical properties caused by never ending arms race between malware and antivirus developers. Unfortunately, this makes fitting the clusters even harder and requires the chosen approach to be either easy to re-train or be adaptable to the drift, with the latter option being more desirable.