How machine learning could thwart malware attacks
By Juston Moore
Malware is a favorite tool among cybercriminals for a broad spectrum of cyberattacks, from large-scale banking trojans that steal money from individual accounts to ransomware attacks that destroy data and have led to IT shutdowns at hospitals around the world. Large-scale theft of intellectual property is often accomplished by sophisticated, targeted malware attacks on organizations, such as the 2016 attack on the Democratic National Committee. But malware can also target individuals, as the VPNFilter attack did when it infected an estimated half-million home routers this summer.
Understanding the capabilities and intent of malware, a task known as reverse engineering, is difficult manual work that can take an expert analyst days or even weeks. But at Los Alamos National Laboratory, we have found that expert intuition can be augmented by machine-learning tools that rapidly identify patterns across large sets of related malware, collected over time.
Our work is loosely based on a biological analogy of code evolution. Functional software — even malicious software — is difficult and expensive to create. Malware developers, like all software engineers, create their programs iteratively by incorporating existing code and refining existing malware to meet their objectives.
Once malware is detected by cyber defenses, attackers make only small changes to circumvent existing detection mechanisms — similar to the small mutations a biological virus develops to avoid destruction by the human immune system. For cyber defenders, it is critical to track these iterative refinements in malware because it allows them to compare new threats to previously analyzed attacks.
Defenders ask: is this new malware sample simply a cosmetic change to hide old code, or could the small change be a significant new strategy on the part of the attacker?
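To make the idea of comparing a new sample against previously analyzed ones concrete, here is a minimal sketch. It is not the deep-learning method described in this article; it swaps in a much simpler stand-in, Jaccard similarity over byte n-grams, purely for illustration. The function names, the sample names, and the toy byte strings are all hypothetical.

```python
def byte_ngrams(data: bytes, n: int = 4) -> set[bytes]:
    """Return the set of all length-n byte windows in data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def jaccard_similarity(a: bytes, b: bytes, n: int = 4) -> float:
    """Shared n-grams divided by all distinct n-grams, from 0.0 to 1.0."""
    grams_a, grams_b = byte_ngrams(a, n), byte_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two inputs too short to have n-grams count as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def closest_known_sample(
    new_sample: bytes, known: dict[str, bytes], n: int = 4
) -> tuple[str, float]:
    """Return the name and score of the most similar known sample.

    Assumes `known` is non-empty.
    """
    scores = {
        name: jaccard_similarity(new_sample, blob, n)
        for name, blob in known.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy byte strings standing in for real binaries (made up for illustration).
known_samples = {
    "family_a_v1": b"connect;download payload;encrypt files;demand ransom",
    "family_b_v1": b"log keystrokes;capture screen;upload to server",
}
new_sample = b"connect;download payload;encrypt files;demand bitcoin"

name, score = closest_known_sample(new_sample, known_samples)
print(name, round(score, 2))
```

A high score suggests the new sample mostly reuses old code. A raw byte comparison like this is easily defeated by the disguises attackers apply, which is one reason to look for more robust, learned measures of similarity.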
Code writers have a style, or voice, similar to writers who have recognizable ways of arranging their words. So the coder leaves fingerprints on the bits of malware code that remain unchanged, leaving a trail back to the source of the threat. This broad evolutionary analysis of malware, especially with an interest in source attribution, distinguishes our research from anti-malware efforts in industry that focus largely on blocking malware rather than studying it.
Our newest research is based on a kind of machine learning called “deep learning,” which is used to compute the similarities between related malware samples that have been disguised by attackers.
We take the same approach used in state-of-the-art language translation systems, such as Google Translate. In language translation, these novel deep learning methods summarize a sentence or paragraph in a language-agnostic, computerized representation. This representation then becomes the key to decode the sentence or paragraph into other languages. Importantly, these language translation approaches are trained in a statistical manner, requiring only translated pairs of training documents in different languages. Similarly, we use sets of related malware code, collected over time, to learn a “translation” that allows us to track adversaries better than existing anti-virus tools.
Keeping up with innovative adversaries means we have to anticipate new types of threats and more sophisticated versions of existing ones. Malware analysis won’t prevent all cyberattacks, though. The future of cybersecurity might instead depend on analyzing the behavior of an already-infected machine rather than just screening for malware as it arrives. While biological viruses operate according to their own objectives, computer viruses often facilitate remote control of their host by an attacker. The real signature of a cyberattack, therefore, is left by the actions of the attacker.
We hope, with our ongoing research on advanced user-behavior analysis, to uncover these patterns of attacks in real time. Whatever the future holds, cyberattacks will only grow increasingly sophisticated with each passing year — so must our ability to stop them.
Juston Moore is a data scientist and project leader in the Advanced Research in Cyber Systems group at Los Alamos National Laboratory.