Jack Cavar's QR code scanner


Whoami?

About Me

Hi! My name is Jack. My hobbies include reading copious amounts of Sci-Fi books, playing games, tennis and swimming.
Outside of these, I thoroughly enjoy learning about new technologies and how they can be used to help people.
This website focuses on the work I did for my dissertation as part of my 4th year Ethical Hacking degree at Abertay University.

Download CV


Project summary

A QR code scanner was created which could scan QR codes and determine if they were malicious or not. Most importantly, it tells a user why the code was malicious or not.
The project was split into two parts. The first part was to create a machine learning model to determine if a QR code was malicious or not. The second part was to create an Android app to scan the QR codes and display the results.
View the results and steps of the project below!

The problem

Phishing websites are becoming more and more prominent on the internet. Some attackers embed links to these malicious websites within QR codes, creating a phish which cannot be identified at a glance.

Most people trust a QR code without a second thought about what could be embedded within. QR code scanners on the market don’t do a good job of protecting the end user.

What if there was an app made to fix this problem while still being as fast as a normal QR code scanner?

Project tools

To create this project, two main tools were used:
1. Python - For the machine learning model creation
2. Android Studio - For the Android app creation

Dataset Collection

A suitable database was required to efficiently collect the data needed for training the ML models. Phishing URL databases such as OpenPhish, EasyDMARC and an open-source phishing domain database were considered for gathering the phishing dataset. These were not used because their databases had few entries and were updated very slowly. When the providers were contacted about accessing their more frequently updated databases, no response was received.

A new dataset was created to support this feature extraction methodology, because existing sources did not contain any redirect or WHOIS information for the URLs they listed. This information couldn’t be added to an existing dataset afterwards, as phishing URLs are removed quickly from the internet once they are found.

To create this dataset, phishing websites were scraped from PhishStats and PhishTank, and benign websites were scraped from the Alexa top 1 million websites. Multiple sources were used to reduce potential bias and to avoid very closely related URLs being gathered repeatedly for the dataset. PhishStats gathers its entries automatically, while PhishTank relies on user submissions.

For each website scraped, the URL, number of redirects, WHOIS information and the HTML code of the first webpage of the URL were recorded. No more than one page was recorded per website due to project time constraints and to save time when processing websites. Additionally, dead links were filtered out, as too little information could be extracted from them and they would have reduced the quality of the dataset.
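The per-website record described above can be sketched roughly as follows. This is a minimal illustration, not the project's scraper: the names SiteRecord, record_from_response and fetch are hypothetical, the use of the Requests library here is an assumption, and the WHOIS lookup is left out.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Optional


@dataclass
class SiteRecord:
    url: str        # final URL after any redirects
    redirects: int  # number of redirect hops that were followed
    html: str       # HTML of the first page only


def record_from_response(resp) -> Optional[SiteRecord]:
    """Build a record from a requests-style response, or None for a dead link."""
    if not resp.ok:  # status codes of 400 and above are treated as dead links
        return None
    # resp.history holds one entry per redirect that was followed
    return SiteRecord(url=resp.url, redirects=len(resp.history), html=resp.text)


def fetch(url: str, timeout: float = 10.0) -> Optional[SiteRecord]:
    """Request a single page; network errors also count as dead links."""
    import requests  # third-party, imported lazily so the helpers above work without it

    try:
        resp = requests.get(url, timeout=timeout, allow_redirects=True)
    except requests.RequestException:
        return None
    return record_from_response(resp)


# Offline demonstration with stand-in response objects (no network needed)
live = SimpleNamespace(ok=True, url="https://example.com/login",
                       history=[object(), object()], text="<html></html>")
dead = SimpleNamespace(ok=False, url="https://example.org", history=[], text="")
live_record = record_from_response(live)
dead_record = record_from_response(dead)
```

In real use, fetch would be called once per scraped URL and any None result dropped before the record is written to the dataset.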

20,175 phishing (malicious) websites and 49,524 benign websites were scraped over a 38-day period. This was done mainly using the Python libraries Scrapy (to collect specific elements) and Requests (to send HTTP requests and receive the responses), allowing targeted scrapes to be run against each active website. Download the dataset using the button below!

Feature extraction

56 different features were extracted from each website to be used as prediction factors when training the ML model.

The features fell into four separate categories: URL, HTML, JavaScript and Other (request information).
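As a rough idea of what extraction in the URL category can look like, here is a small sketch using only the Python standard library. The function name and the feature names are illustrative assumptions and are not the 56 features used in the project.

```python
import re
from urllib.parse import urlparse


def url_features(url: str) -> dict:
    """Return a few simple URL-based features (hypothetical examples)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "num_dots": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        # crude check for a dotted IPv4 address used in place of a domain name
        "has_ip_host": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
        "uses_https": parsed.scheme == "https",
        "has_at_symbol": "@" in url,
    }


example = url_features("http://192.168.0.1/login.php?id=42")
```

Each value would become one column of the feature vector. HTML, JavaScript and request features would be produced in a similar way from the scraped page and response.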

A full list of the features extracted can be found in the table below.

Model Training

Six different ML models were chosen for development using the curated dataset to determine the best performing model. Each model was chosen for its fast training time, high prediction accuracy when trained correctly and ability to handle both categorical and numerical data. The models were Decision Tree (DT), K Nearest Neighbours (KNN), Random Forest (RF), LightGBM, XGBoost and Naïve Bayes (NB). Additionally, all of these models can perform binary classification. This was used because the model only needs to give one of two answers: malicious or benign.

To ensure that the models created during development were appropriately constructed and evaluated, the benign and malicious website data was combined to construct a full dataset. Care was given to ensure that there were no duplicate entries within the data. This dataset was randomised and given an 80:20 split into training and validation datasets.
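The de-duplicate, shuffle and 80:20 split steps can be sketched with the standard library alone. This is only an illustration under assumptions: split_dataset and the example rows are hypothetical, and in a scikit-learn workflow the train_test_split helper would normally do the shuffling and splitting.

```python
import random


def split_dataset(rows, train_fraction=0.8, seed=42):
    """Drop duplicate rows, shuffle, then split into training and validation sets."""
    # dict.fromkeys keeps the first occurrence of each row (rows must be hashable)
    unique_rows = list(dict.fromkeys(rows))
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    rng.shuffle(unique_rows)
    cut = int(len(unique_rows) * train_fraction)
    return unique_rows[:cut], unique_rows[cut:]


# Ten made-up (url, label) rows plus one deliberate duplicate
rows = [(f"https://site{i}.example", i % 2) for i in range(10)]
rows.append(rows[0])
train, validation = split_dataset(rows)
```

Removing duplicates before the split matters, as a URL that appears in both sets would make the validation scores look better than they really are.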

To carry out an extensive evaluation to determine the highest accuracy for each model, multiple training and validation sets were made with the feature parameters concatenated in two different ways:

- Rates, URL sequencing, TF-IDF vector

- URL sequencing, Rates, TF-IDF vector

This was done because preliminary tests confirmed that changing the feature concatenation order affected the accuracy of the model. This is assumed to be because the feature importance is affected by what appears first in the feature order and by what information is most readily available within the dataset provided to the ML model. Testing two orderings allowed at least some consideration of this effect on model accuracy within the timeframe for the project.
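The two orderings amount to joining the same three feature blocks in a different sequence. A minimal sketch, where the block names and all of the numbers are placeholder assumptions:

```python
def concatenate(blocks: dict, order: list) -> list:
    """Join the named feature blocks into one flat vector in the given order."""
    vector = []
    for name in order:
        vector.extend(blocks[name])
    return vector


# Placeholder feature blocks for a single website
blocks = {
    "rates": [0.12, 0.40],
    "url_sequence": [104, 116, 116, 112],
    "tfidf": [0.0, 0.31, 0.07],
}

ORDER_A = ["rates", "url_sequence", "tfidf"]  # Rates, URL sequencing, TF-IDF vector
ORDER_B = ["url_sequence", "rates", "tfidf"]  # URL sequencing, Rates, TF-IDF vector

vector_a = concatenate(blocks, ORDER_A)
vector_b = concatenate(blocks, ORDER_B)
```

Both vectors hold the same values, only the column positions differ, so each ordering needs its own training and validation sets.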

Training of each model was carried out using Python’s scikit-learn library. Binary classification was used so that each model learned to classify between two classes. During training, each model used only the training data to learn the classification. This ensured that the validation data was completely unseen by the model, helping to provide an accurate measure of the model’s performance.

Model Results

The accuracy of each model was used to find the best performing configuration for each model type. This was done by modifying the respective model’s hyperparameters and recording the results. As the validation data could contain data similar to the training data, the highest performing model of each type was then evaluated against 1000 random URLs from a dataset which the model wasn’t familiar with. This reduced bias in the results. As with training, to ensure that the model gained all the information it could, only URLs which returned a successful response were used in the evaluation on unseen data.

Each model was evaluated on its accuracy, precision, recall and F1-score. Each of the six models was tested against the validation data and the 1000 random URLs. Each model’s hyperparameters were changed to get the highest scores possible.
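For reference, the four scores can be worked out by hand from the counts of true and false positives and negatives. The sketch below is a plain Python illustration with made-up labels (1 = malicious, 0 = benign). The function name is hypothetical, and scikit-learn’s metrics module offers equivalent functions.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels: 3 true positives, 3 true negatives, 1 false positive, 1 false negative
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
scores = binary_metrics(y_true, y_pred)
```

For a phishing detector, recall shows how many malicious sites were caught, while precision shows how often a malicious verdict was correct.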

The validation scores for each model could be gathered within the same Python program which trained the model. Scikit-learn’s built-in metric functions were used to calculate the validation scores for each model. The original set of models was trained with the features concatenated in the order Rates, URL, TF-IDF.

The results for each model trained with this ordering can be found in the table below.

LightGBM was found to be the best performing model against the validation data.

A separate Python program was used to test each created model against 1000 random, unseen URLs. Using an external dataset, a URL would be selected and a feature vector would be formed which was then predicted on. To ensure a fair test and chance for each feature to be analysed, the 1000 URLs were only recorded if a successful request could be made to the website.

App Design

Simple interfaces with smooth transitions and animations were created to allow for an interactive and usable experience. Only two main windows were used, along with easy-to-navigate popup windows built with Jetpack Compose’s dialog boxes. This provided an easy way to display information which may not be as pertinent to the user while still allowing a user to view it if they so desire.

Scanning page:

Results page:

User design was considered during development to ensure all elements within the app were easy to use and view for an end user. A traffic light system of colours and simple interfaces has been implemented. New elements and features have also been implemented.

The app displays the result of the ML model to the user through three different identifiers: Malicious, meaning the content is most likely a phishing website; Benign, meaning the content is most likely safe based on the prediction by the model; and Unknown, meaning that there was an error with the extraction process, that the website didn’t send back a positive response so an accurate prediction could not be made, or that the content scanned wasn’t a URL.
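The three identifiers can be thought of as a small decision function. The sketch below shows one way to express that logic. The names Verdict, is_url and verdict_for are hypothetical, and the convention that a prediction of None stands for failed feature extraction (or no positive response from the website) is an assumption.

```python
from enum import Enum
from typing import Optional
from urllib.parse import urlparse


class Verdict(Enum):
    MALICIOUS = "Malicious"
    BENIGN = "Benign"
    UNKNOWN = "Unknown"


def is_url(content: str) -> bool:
    """True when the scanned text looks like an http(s) link."""
    parsed = urlparse(content.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def verdict_for(content: str, prediction: Optional[int]) -> Verdict:
    """Map scanned content and a model output (1 = malicious, 0 = benign) to a label."""
    # Non-URL content, or no usable prediction, cannot be judged either way
    if not is_url(content) or prediction is None:
        return Verdict.UNKNOWN
    return Verdict.MALICIOUS if prediction == 1 else Verdict.BENIGN
```

For example, a Wi-Fi configuration code such as "WIFI:S:Home;;" would come back as Unknown, because it is not a link that the model can assess.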

Colourful, clear indicators have been used throughout the app to ensure the user gains a strong, quick understanding of the content embedded within the code. This is especially evident in the definitions of malicious, benign and unknown, which are colour coded throughout the app.

A wheel of options was created to give users the option to look at sections. This contains pertinent information which is contained within the website and URL. The overall goal of this was to provide subtle suggestions to a user or to help someone who wants to know more information about the content within to make their decision about whether to access the content. This information is extracted as part of the ML model’s feature gathering process to make a prediction, so it requires limited extra resources in processing on device.

The names of the categories have been modified to provide more meaning to an end user who may not be as familiar with specific programming terms. Website Link refers to features affecting the URL of the site, Website Content refers to the HTML aspects of the page, Functionality refers to the JavaScript features of the page and Other refers to the redirect and WHOIS information of the page.

Horizontal pagers have been utilised to split up the information within each category. This allows each page in the box to present one specific piece of information at a time about a potentially positive or negative element of the content embedded in the scanned code. Each page is colour matched to keep everything in the flow. The content of each page changes depending on a modifier which determines whether the element is suspicious or not. The modifier values were determined from the numerical averages of the feature rates in the dataset collected for this project, as discussed in Dataset Collection and Feature extraction.

Users can browse to the page if the QR code contains a link. As this is a suggestive application, a user can still view the content of a page flagged as malicious. There is, however, an additional popup warning the user of the potential risk beforehand.

App Privacy

The user permissions asked for include access to the front camera of the device. These have been kept as limited as possible to ensure that the app doesn’t take any more permissions than are required and that user data is kept as anonymous as possible. The Android internet permission has also been included within the code, to cater to older Android devices which may require it. Internet permissions are considered default permissions on all Android devices past Android 11, hence a notification is not provided when using the app.

Machine learning app integration

Two methods were considered which could have been used to integrate the ML model within the app, namely API (External calls) or local inclusion. Local inclusion was chosen to reduce the potential permissions a user may need to provide to use the app. By choosing local implementation, the model is more efficient due to no external communication wait times.

One drawback of choosing a local implementation of an ML model is that a model created on a desktop must be converted to a suitable format for a mobile device. Luckily, there are various Python libraries available to convert models into a format capable of running on a mobile device. Each takes its own approach, because the structure of a model differs for each algorithm used. For Android applications, specific model formats are supported which allow an ML model to run locally, as they are optimised for that platform. The three main options are TensorFlow Lite, ONNX and PyTorch Lite.

During preliminary tests, it was found that the only way to get the model to run within the app environment was to use TensorFlow Lite, hence a method to convert each model to a suitable format was needed. The Python tool onnx2tf was used to convert the produced ONNX model to a TFLite format capable of running locally on the phone. By finding a suitable method of converting each model to ONNX, it was possible to convert each model type. For example, the Python Hummingbird library could be used to convert a LightGBM model to a PyTorch model. The model could then be converted to ONNX using PyTorch’s built-in ONNX export function.

Chaquopy was used to ensure that the feature vector values collected within the app were identical to those produced when running the model logic within a Python program on a computer. Chaquopy allows a full Python interpreter to run within an Android application. This allowed near-identical code to run smoothly on both devices.

Overall Results

A survey was created to gather results for the application created as part of the project for scanning QR codes using the designed model. Each participant was given a Redmi 9AT phone with the app installed for testing the app. 21 participants responded to the survey.

The System Usability Scale (SUS) gives each question an individual score, which helps to show whether a function of the application is usable or is something to work on in the future. For each odd-numbered question, 1 is subtracted from the response. For each even-numbered question, the response is subtracted from 5. These adjusted scores are added together and the total is multiplied by 2.5 to get the SUS score. A SUS score is considered good if it is above 68. The overall SUS score from the survey was 93 out of 100, which is excellent and can be perceived as an A+ system for usability as the score is above 85.
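The SUS calculation for a single participant can be sketched as below, assuming the standard ten-question SUS form with each response on a 1 to 5 scale. The function name and the example responses are made up and are not taken from the survey.

```python
def sus_score(responses):
    """Compute the SUS score (0 to 100) for one participant's ten responses."""
    if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("SUS needs ten responses, each between 1 and 5")
    total = 0
    for number, response in enumerate(responses, start=1):
        if number % 2 == 1:
            total += response - 1  # odd-numbered question: response minus 1
        else:
            total += 5 - response  # even-numbered question: 5 minus response
    return total * 2.5


best = sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1])     # most favourable answers
neutral = sus_score([3, 3, 3, 3, 3, 3, 3, 3, 3, 3])  # all neutral answers
```

An overall survey score is then typically the mean of the individual participants’ scores.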

Overall feedback on the application was that it was an excellently made product and that, with some tweaks, it could be made to work for any person’s interests and needs.

Overall the machine learning model integration and feature selection provides a unique method of displaying a wide range of information and further improving people’s lives by providing a safe option for scanning a QR code.

Download the Dataset and App!

The dataset and app are available as free open source downloads for your use!
Please note that the app is a prototype and may suddenly break without warning or not function at all.

Download App APK