Machine learning is becoming a bigger part of everyday life, and its range of applications keeps growing: it helps us make decisions, detect hidden patterns, and identify various types of anomalies. It is used across industries: in medicine for more precise diagnostics, in retail for personalized offers and sales forecasts, and in banking for credit scoring. In this article I propose an approach to using machine learning to increase authentication security.
Authentication is used in almost every organization, from verifying the identity of an employee who needs access to a corporate information system to identifying a client who receives a personal service, such as online banking or a social network. Because authentication is so important, it is a frequent target for attackers trying to gain access. Machine learning can help detect potentially dangerous authentication attempts and either act to prevent them or alert administrators.
For example, a system administrator may receive an alert about attacks from a specific IP address, or about password guessing against a specific account. Alternatively, the authentication system can automatically require an additional authentication step from the user, such as entering a one-time confirmation code sent via SMS or solving a CAPTCHA. In this way the authentication system adapts to the user's behavior (becomes adaptive): if the behavior is suspicious, it can make the authentication process more demanding or deny authentication altogether.
Machine Learning Implementation
There is no single established methodology for building a machine learning model for adaptive authentication. It is rather a creative process: each organization's data is different, and some domain knowledge is required. Below is a general sequence of steps for applying machine learning to authentication.
First, you need to collect the data that is available during the authentication process and that will be used for machine learning. For example, it could be:
- IP address;
- User Agent header of a user’s device;
- a flag indicating whether the device is known or unknown (based on the presence of a cookie from a previous authentication);
- authentication time;
- previous authentication time;
- time that has passed since the last authentication;
- whether authentication is performed from a trusted network or from the external Internet;
- and so on.
The raw features may not be enough to train the model, so you need to create new features derived from the existing ones. For example, from the IP address you can determine the user's approximate location and the distance from the location of the previous authentication. By combining that distance with the time since the last authentication, you can calculate the user's apparent travel speed. From the authentication time, you can calculate the day of the week, the time of day (morning, day, evening, night), and whether the authentication falls on a holiday.
For example, if users usually authenticate on workdays during working hours, then an authentication at night on a weekend will look suspicious. Likewise, consecutive authentication attempts from different parts of the globe within a short time could be suspicious for a machine learning model.
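The derived features described above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a reference implementation: the event structure, the field names, the time-of-day boundaries, and the sample coordinates are all assumptions made for the example. It also assumes the IP address has already been resolved to a latitude and longitude by some geolocation source. The holiday check is left out because it needs a calendar for the relevant country.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def time_of_day(ts):
    """Map a timestamp to one of four buckets (boundaries are an assumption)."""
    hour = ts.hour
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "day"
    if 18 <= hour < 23:
        return "evening"
    return "night"

def derive_features(prev, curr):
    """prev and curr are dicts with 'time' (datetime), 'lat' and 'lon'."""
    distance_km = haversine_km(prev["lat"], prev["lon"], curr["lat"], curr["lon"])
    hours = (curr["time"] - prev["time"]).total_seconds() / 3600
    if hours > 0:
        speed_kmh = distance_km / hours
    else:
        # Same timestamp: any movement at all implies an impossible speed.
        speed_kmh = 0.0 if distance_km == 0 else float("inf")
    return {
        "distance_km": distance_km,
        "hours_since_last": hours,
        "speed_kmh": speed_kmh,
        "weekday": curr["time"].weekday(),          # 0 = Monday
        "is_weekend": 1 if curr["time"].weekday() >= 5 else 0,
        "time_of_day": time_of_day(curr["time"]),
    }

# Hypothetical pair of events: Moscow, then New York two hours later.
prev = {"time": datetime(2024, 3, 2, 1, 0), "lat": 55.7558, "lon": 37.6173}
curr = {"time": datetime(2024, 3, 2, 3, 0), "lat": 40.7128, "lon": -74.0060}
features = derive_features(prev, curr)
print(features)
```

A travel speed of several thousand kilometres per hour, as in this made-up pair of events, is exactly the kind of value that a model (or even a simple rule) can treat as a strong signal.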
Machine learning models work only with numbers, so you need to convert all text and boolean data to numeric values. A known or unknown device becomes 1 or 0. The time of day should be split into four columns, each holding 1 or 0: one for morning, one for day, one for evening, and one for night. If some data entries are invalid, these entries should be excluded to avoid introducing noise while training the model.
If you have events that have already been identified as fraud, you can use them with supervised learning algorithms. In this case, the algorithm learns how the features relate to the result (whether the event is fraud or not). For supervised learning, you can use the following algorithms:
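As a rough sketch of this conversion, the function below turns one event into a row of numbers and rejects invalid entries. The field names `known_device` and `time_of_day` and the sample events are hypothetical, chosen only for this illustration.

```python
TIMES_OF_DAY = ["morning", "day", "evening", "night"]

def encode_event(event):
    """Convert one event (a dict) to a list of numbers, or None if it is invalid."""
    known = event.get("known_device")
    bucket = event.get("time_of_day")
    if not isinstance(known, bool) or bucket not in TIMES_OF_DAY:
        return None  # invalid entry: excluded from training
    row = [1 if known else 0]
    # One column per time of day; exactly one of them is 1.
    row += [1 if bucket == name else 0 for name in TIMES_OF_DAY]
    return row

# Hypothetical events; the last one has an unrecognised time of day.
events = [
    {"known_device": True, "time_of_day": "morning"},
    {"known_device": False, "time_of_day": "night"},
    {"known_device": True, "time_of_day": "noon"},
]
encoded = [row for row in map(encode_event, events) if row is not None]
print(encoded)
```

Each resulting row has five columns: the device flag followed by the four time-of-day columns.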
- Logistic Regression
- Neural Networks
- Decision Trees
- Gradient Boosting
- Random Forest
To train a model, you should randomly split your events into two data sets: a training set and a test set. The training set should contain approximately 70-80% of the data and the test set the remaining 20-30%. Then train the algorithm on the training data and estimate the model's quality on the test data. To achieve a better result, you need to adjust the parameters of the model and remove features that do not correlate with the result and degrade the quality of the model. Additionally, you can combine models to achieve the best result in detecting fraud.
If you do not have events that have already been identified as fraud, you can use unsupervised learning algorithms. This means the algorithm learns from data that has not been labeled. These algorithms cannot tell whether an authentication is fraud or not, but they can detect anomalies that could potentially be fraud.
For unsupervised learning, you can use the following algorithms:
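The split, train, and evaluate cycle can be illustrated with a small logistic regression written in plain Python. Everything here is an assumption made for the sake of the example: the data is synthetic (an event is labeled as fraud when it comes from an unknown device at night), the three features are hypothetical, and the learning rate and number of epochs are arbitrary. In practice you would more likely use an existing machine learning library than hand-written training code.

```python
import math
import random

def train_test_split(rows, labels, test_share=0.3, seed=0):
    """Shuffle the events and cut them into a training part and a test part."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    n_test = int(round(len(rows) * test_share))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return ([rows[i] for i in train_idx], [rows[i] for i in test_idx],
            [labels[i] for i in train_idx], [labels[i] for i in test_idx])

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(rows, labels, epochs=200, lr=0.1):
    """Fit weights and bias with stochastic gradient descent on the log loss."""
    weights, bias = [0.0] * len(rows[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            error = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            weights = [w - lr * error * v for w, v in zip(weights, x)]
            bias -= lr * error
    return weights, bias

def predict(model, x):
    weights, bias = model
    return 1 if sigmoid(bias + sum(w * v for w, v in zip(weights, x))) >= 0.5 else 0

# Synthetic events: [unknown_device, is_night, is_weekend], label 1 = fraud.
rng = random.Random(42)
rows, labels = [], []
for _ in range(200):
    unknown_device, is_night, is_weekend = (rng.randint(0, 1) for _ in range(3))
    rows.append([unknown_device, is_night, is_weekend])
    labels.append(1 if unknown_device and is_night else 0)

x_train, x_test, y_train, y_test = train_test_split(rows, labels, test_share=0.3)
model = train_logistic_regression(x_train, y_train)
accuracy = sum(predict(model, x) == y for x, y in zip(x_test, y_test)) / len(y_test)
print(f"test accuracy: {accuracy:.2f}")
```

The important point is that accuracy is measured only on events the model did not see during training. The `is_weekend` feature has no relation to the label in this synthetic data, so it is the kind of feature you would consider removing.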
- Local Outlier Factor
- One-Class Support Vector Machine
- Isolation Forest
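To give a feel for how such algorithms score events without labels, here is a simplified sketch of the Local Outlier Factor idea in plain Python. It compares the local density around each point with the density around its nearest neighbours; a score well above 1 marks a point that is much more isolated than its neighbours. The simplifications are assumptions of this sketch: it uses exactly `k` neighbours and ignores ties, and the two features (hour of login, distance from the previous login in hundreds of kilometres) and all values are made up. Real features should first be scaled to comparable ranges.

```python
import math

def lof_scores(points, k=3):
    """Simplified Local Outlier Factor: higher score = more anomalous."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Indices of the k nearest neighbours of each point (excluding itself).
    neighbours = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for i in range(n)
    ]
    # Distance from each point to its k-th nearest neighbour.
    k_distance = [dist[i][neighbours[i][-1]] for i in range(n)]

    def reach(i, j):
        # Reachability distance of point i from its neighbour j.
        return max(k_distance[j], dist[i][j])

    # Local reachability density; the small constant guards against
    # division by zero when several points are identical.
    lrd = [k / (sum(reach(i, j) for j in neighbours[i]) + 1e-10) for i in range(n)]
    return [sum(lrd[j] for j in neighbours[i]) / (k * lrd[i]) for i in range(n)]

# Hypothetical events: (hour of login, distance from previous login / 100 km).
events = [
    (9.0, 0.1), (10.0, 0.0), (9.5, 0.2), (11.0, 0.1),
    (10.5, 0.0), (8.5, 0.3), (9.2, 0.1), (10.8, 0.2),
    (3.0, 75.0),  # a login at 3 a.m. from 7,500 km away
]
scores = lof_scores(events, k=3)
for event, score in zip(events, scores):
    print(event, round(score, 2))
```

The algorithm does not say that the last event is fraud. It only reports that this event is far less typical than the others, which is a reason to require an additional authentication step or to alert an administrator.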
This article describes the most general approach to applying machine learning to the authentication process, without going into the technical details of algorithms or using specific software or libraries.
Machine learning cannot prevent every attack (for example, the use of a stolen password), but it can identify suspicious user behavior, recognize common fraud patterns, and increase the overall resistance of the authentication system to attacks. Of course, with deep domain expertise it is possible to build an adaptive authentication system from empirical rules alone, but machine learning helps to uncover hidden patterns that may not be apparent to a human analyst.