The last decade’s growing interest in deep learning was triggered by the proven capacity of neural networks in computer vision tasks. Train a neural network on enough labeled photos of cats and dogs, and it will find the recurring patterns in each category and classify unseen images with reasonable accuracy.
What else can you do with an image classifier?
In 2019, a group of cybersecurity researchers wondered if they could treat security threat detection as an image classification problem. Their intuition proved well placed: they created a machine learning model that could detect malware using images generated from application files. A year later, the same technique was used to develop a machine learning system that detects phishing websites.
The combination of binary visualization and machine learning is a powerful technique that can offer new solutions to old problems. It shows promise in cybersecurity, but it could also be applied in other domains.
Detecting malware with deep learning
The traditional method of detecting malware is to search files for known signatures of malicious payloads. Malware detectors maintain a database of virus definitions that includes code snippets and opcode sequences, and they search new files for these signatures. However, malware developers have many ways to circumvent this kind of detection, including obfuscating their code or using polymorphism to modify it at runtime.
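The weakness of signature matching can be shown with a minimal sketch. The signature names and byte patterns below are made up for illustration, and the single-byte XOR stands in for far more sophisticated real-world obfuscation.

```python
# Hypothetical signature database: name -> byte pattern (invented for this demo).
SIGNATURES = {
    "demo-dropper": bytes.fromhex("deadbeef"),
    "demo-macro": b"AutoOpen_EvilPayload",
}


def scan_bytes(data: bytes, signatures: dict[str, bytes] = SIGNATURES) -> list[str]:
    """Return the names of all signatures that occur somewhere in data."""
    return [name for name, pattern in signatures.items() if pattern in data]


def xor_obfuscate(data: bytes, key: int) -> bytes:
    """Toy obfuscation: XOR every byte with a single-byte key."""
    return bytes(b ^ key for b in data)


sample = b"header" + bytes.fromhex("deadbeef") + b"trailer"
print(scan_bytes(sample))                       # ['demo-dropper']
print(scan_bytes(xor_obfuscate(sample, 0x5A)))  # []
```

The payload is the same in both cases, but once its bytes are transformed the stored signature no longer matches.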
Dynamic analysis tools attempt to detect malicious behavior at runtime. However, they are slow and need to be set up in a sandbox environment for testing suspicious programs.
In recent years, researchers have also tried a range of machine learning techniques to detect malware. These ML models have been able to overcome some of the problems of malware detection, such as code obfuscation. However, they present new challenges of their own.
Binary visualization can redefine malware detection by turning it into a computer vision problem. In this method, files are run through algorithms that convert binary and ASCII values into color codes.
In a paper published in 2019, researchers at the University of Plymouth and the University of Peloponnese showed that when benign and malicious files were visualized using this method, new patterns emerged that separated malicious files from safe ones. These differences would have gone unnoticed with traditional malware detection methods.
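A rough sketch of the byte-to-color idea follows. The palette (one color per class of byte) and the row-by-row layout are assumptions made for illustration; real binary visualization tools may use different colors and may order pixels differently, for example along a space-filling curve.

```python
def byte_to_rgb(b: int) -> tuple[int, int, int]:
    """Map one byte value to an RGB color by byte class (assumed palette)."""
    if b == 0x00:
        return (0, 0, 0)        # null byte: black
    if b == 0xFF:
        return (255, 255, 255)  # 0xFF: white
    if 0x20 <= b <= 0x7E:
        return (0, 0, 255)      # printable ASCII: blue
    if b < 0x20 or b == 0x7F:
        return (0, 255, 0)      # ASCII control characters: green
    return (255, 0, 0)          # extended (non-ASCII) bytes: red


def bytes_to_image(data: bytes, width: int = 16) -> list[list[tuple[int, int, int]]]:
    """Lay the colored pixels out row by row, padding the last row with black."""
    pixels = [byte_to_rgb(b) for b in data]
    pixels.extend([(0, 0, 0)] * ((-len(pixels)) % width))
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]


image = bytes_to_image(b"MZ\x00\x00\xff\x90Hi\n", width=4)
for row in image:
    print(row)
```

A file dominated by one class of byte produces large single-color regions, while a file mixing many classes produces a noisier, more colorful picture.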
According to the paper, “Malicious files have a tendency for often including ASCII characters of various categories, presenting a colorful image, while benign files have a cleaner picture and distribution of values.”
When you have such detectable patterns, you can train an artificial neural network to tell the difference between malicious and safe files. The researchers created a dataset of visualized binary files that included both benign and malicious samples. The malicious samples contained a variety of payloads, such as viruses, trojans, and rootkits, and the dataset covered several file types (.exe, .doc, .pdf, .txt, etc.).
The researchers then used the images to train a classifier neural network. The architecture they used is the self-organizing incremental neural network (SOINN), which is fast and handles noisy data well. They also used an image preprocessing technique to shrink the binary images into 1,024-dimension feature vectors, which makes it easier and more compute-efficient to learn patterns in the input data.
The resulting neural network was efficient enough to process a training dataset of 4,000 samples in 15 seconds on a personal workstation with an Intel Core i5 processor.
Experiments by the researchers showed that the deep learning model was especially good at detecting malware in .doc and .pdf files, which are the preferred medium for ransomware attacks. The researchers suggested that the model could be improved by adjusting it to take file type as one of its learning dimensions. Overall, the algorithm achieved an average detection rate of around 74 percent.
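The article does not say which preprocessing technique the researchers used. As one plausible illustration, an image can be reduced to exactly 1,024 numbers by converting it to grayscale and average-pooling it down to a 32 x 32 grid; both the method and the function names below are assumptions.

```python
def to_grayscale(image):
    """Convert rows of (r, g, b) pixels into rows of 0-255 intensity values."""
    return [[(r + g + b) // 3 for (r, g, b) in row] for row in image]


def downsample(gray, size=32):
    """Average-pool a square grayscale image to size*size values in [0, 1]."""
    n = len(gray)
    if n == 0 or n % size != 0 or any(len(row) != n for row in gray):
        raise ValueError("expected a square image whose side is a multiple of size")
    block = n // size
    features = []
    for by in range(size):
        for bx in range(size):
            total = sum(
                gray[by * block + dy][bx * block + dx]
                for dy in range(block)
                for dx in range(block)
            )
            features.append(total / (block * block * 255))
    return features


# Synthetic 64x64 RGB image standing in for a visualized binary file.
image = [[(x * 4, y * 4, 128) for x in range(64)] for y in range(64)]
features = downsample(to_grayscale(image))
print(len(features))  # 1024
```

Whatever the actual technique, the point is the same: a fixed-length, compact vector is far cheaper to learn from than the raw pixels of images of varying size.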
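One simple way to make file type a learning dimension is to append a one-hot encoding of the extension to each feature vector. This is only a sketch of the suggestion, not something the researchers are reported to have implemented; the function name and encoding are assumptions.

```python
FILE_TYPES = [".exe", ".doc", ".pdf", ".txt"]  # the types named in the article


def add_file_type(features, extension, file_types=FILE_TYPES):
    """Return a new vector: the image features plus a one-hot file-type code."""
    one_hot = [1.0 if extension.lower() == t else 0.0 for t in file_types]
    return list(features) + one_hot


vector = add_file_type([0.0] * 1024, ".PDF")
print(len(vector))   # 1028
print(vector[-4:])   # [0.0, 0.0, 1.0, 0.0]
```

An extension outside the list yields four zeros, so unknown types remain representable.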
Detecting phishing websites with deep learning
Phishing attacks have become a problem for both individuals and organizations. Many phishing attacks lure victims into clicking links that lead to malicious websites. These sites pose as legitimate services and ask for sensitive information, such as login credentials or financial details.
Traditional methods for detecting phishing sites revolve around blacklisting malicious domains and whitelisting safe domains. Blacklisting misses new phishing websites until someone falls victim to them, while whitelisting is too restrictive and requires significant effort to provide access to all safe domains.
Other detection methods rely on heuristics. Although these methods are more accurate than blacklists, they still fall short of optimal detection.
In 2020, a group of researchers at the University of Plymouth and the University of Portsmouth used binary visualization and deep learning to develop a novel method for detecting phishing websites.
This technique uses binary visualization libraries that transform source code and website markup into color values.
As is the case with benign and malicious application files, visualizing websites reveals unique patterns that separate safe sites from malicious ones. According to the researchers, a legitimate site would have a higher RGB value because of the additional characters that come from licenses, hyperlinks, and detailed data entry forms. Its phishing counterpart would typically have a single CSS reference or none, multiple images instead of forms, and one login form without security scripts, which would result in a shorter data input string when scraped.
The researchers illustrate the difference by comparing the visual representation of the code of a legitimate PayPal login page with that of a fake PayPal phishing page.
The researchers created a dataset of images representing the code of legitimate and malicious websites and used it to train a classification machine learning model.
The architecture they used is MobileNet, a lightweight convolutional neural network (CNN) that is optimized to run on user devices instead of high-capacity cloud servers. CNNs are especially suited for computer vision tasks including image classification and object detection.
Once the model has been trained, it is plugged into a phishing detection tool. When the user lands on a new website, the tool first checks whether the URL is included in its database of malicious domains. If the domain is new, the page is transformed by the visualization algorithm and run through the neural network to see whether it matches the patterns of malicious websites. This two-step architecture ensures that the system benefits from both the speed of blacklist databases and the smart detection of the neural network-based phishing detection technique.
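The two-step flow can be sketched as follows. Every name here is hypothetical, and the `visualize` and `model` callables are toy stand-ins for the real visualization algorithm and trained network; the "short input means phishing" rule only exists to make the demo runnable.

```python
from urllib.parse import urlparse


def check_url(url, blacklist, fetch_markup, visualize, model, threshold=0.5):
    """Return (is_phishing, decided_by) using a blacklist first, then the model."""
    domain = (urlparse(url).hostname or "").lower()
    if domain in blacklist:
        return True, "blacklist"   # fast path: no download, no inference
    image = visualize(fetch_markup(url))
    return model(image) >= threshold, "model"


# Toy stand-ins (assumptions, not the researchers' components).
pages = {
    "http://new-site.example/login": "<form><input name='pw'></form>",
    "http://shop.example/": (
        "<link rel='stylesheet' href='a.css'><form><input name='pw'></form>"
        "<script src='s.js'></script>"
    ),
}
blacklist = {"known-bad.example"}
visualize = lambda markup: list(markup.encode("utf-8"))
model = lambda image: 0.9 if len(image) < 40 else 0.1  # placeholder score

for url in ["http://known-bad.example/x", *pages]:
    print(url, check_url(url, blacklist, pages.get, visualize, model))
```

Because the blacklist check returns early, known domains never incur the cost of fetching and classifying the page.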
The researchers’ experiments showed that the technique could detect phishing websites with 94 percent accuracy. “Visual representation techniques allow for a deeper understanding of the structural differences between legitimate and fraudulent web pages. The method is capable of quickly detecting phishing attackers with high accuracy, based on our initial experiments. The method also learns from misclassifications and improves efficiency,” the researchers wrote.