AI-Driven Methods for Detecting and Preventing Online Fraud
Contributor: David J Klein, PhD, Chief Scientist, 2predict, Inc.
Today’s fraudsters leverage increasingly sophisticated techniques to infiltrate networks and cause harm. From distributed DNS attacks to fake accounts, account takeovers, credential stuffing, and card cracking, their methods are varied and hard to detect, let alone prevent.
Fraud can have a tremendous financial impact on an organization in any industry. Bot attacks can function at extremely high speeds, undetected, and because they behave in a similar manner to humans, they’re difficult to spot. Financial organizations may be an obvious target, but businesses in retail and travel are equally vulnerable.
For example, a rising form of fraud in the travel industry involves purchasing requests. Browser-based bots can flood a travel site with reservation requests and lock up prices and inventory, which results in real buyers turning to competitors to reserve rooms and flights. And bots come in different forms — one bot can hit a site multiple times, or many bots can perform single actions simultaneously. Even worse, bots can be programmed to perform sequences of events just like a real user would, which makes them very difficult to detect.
Being able to distinguish between bot-driven and human behaviors is critical to an organization’s strategy for preventing online fraud. AI-driven techniques and solutions are particularly effective for this use case. By analyzing the volume of data, the traffic patterns, and the way requests arrive, they can determine whether actions are being performed by a bot or a human. Better yet, the machine learning algorithms behind the AI can improve over time as they are trained on more data.
Let’s look at three AI-driven approaches that can be used separately or, better yet, together to identify malicious bots.
1) Supervised Machine Learning
Traditional machine learning is supervised. In supervised machine learning, a model is trained on a labeled dataset. This means a human developer with domain expertise must label sample data and select what kind of input and output sample data will feed the algorithm. The trained model is then used to make predictions about unavailable, future, or unseen data based on the labeled sample data.
The two most common types of supervised machine learning are classification, where incoming data is assigned to categories learned from labeled past samples, and regression, where the algorithm identifies patterns and predicts continuous outcomes. Decision trees, linear and logistic regression, and support vector machines are examples of supervised machine learning algorithms.
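To make classification concrete, here is a minimal sketch of a logistic regression classifier trained on labeled sessions. It is hand-rolled with plain gradient descent for illustration rather than taken from any particular library, and the two features (requests per second, fraction of repeated URLs), the toy data, and all function names are assumptions, not part of the original article.

```python
import math

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_logistic_regression(rows, labels, lr=0.1, epochs=2000):
    """Fit weights and bias with plain batch gradient descent."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = sigmoid(z) - y  # gradient of log loss w.r.t. z
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias

def predict_bot_probability(weights, bias, x):
    """Return the model's estimated probability that session x is a bot."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

# Hypothetical labeled sessions: [requests per second, fraction of repeated URLs].
# Label 1 = bot, 0 = human. Real training sets would be far larger.
TRAIN_X = [[0.2, 0.1], [0.5, 0.2], [0.3, 0.15], [0.4, 0.3],
           [8.0, 0.9], [6.5, 0.8], [9.0, 0.95], [7.0, 0.85]]
TRAIN_Y = [0, 0, 0, 0, 1, 1, 1, 1]

weights, bias = train_logistic_regression(TRAIN_X, TRAIN_Y)
```

Calling `predict_bot_probability(weights, bias, [7.5, 0.9])` scores an unseen session. The model only knows the patterns present in its labels, which is the limitation this section goes on to discuss.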
Because each of the fields in a dataset is different — for example, some contain a single number while others a text-based description — each field must be turned into a feature. This takes manual work and requires developers with the domain expertise to understand the properties of each field and engineer the features.
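The manual step described above might look like the following sketch, which maps one raw request record to a fixed-length numeric vector. The field names (`requests_last_minute`, `user_agent`, `method`) and the chosen signals are hypothetical; in practice a domain expert would decide which properties of each field matter.

```python
import math

def engineer_features(record):
    """Map one raw request record to a fixed-length numeric feature vector."""
    user_agent = record.get("user_agent", "")
    method = record.get("method")
    return [
        # Numeric field: compress a heavy-tailed count with log1p.
        math.log1p(record.get("requests_last_minute", 0)),
        # Text field: reduce free text to simple hand-picked signals.
        float(len(user_agent)),
        1.0 if "headless" in user_agent.lower() else 0.0,
        # Categorical field: one-hot encode a small known vocabulary.
        1.0 if method == "GET" else 0.0,
        1.0 if method == "POST" else 0.0,
    ]

# Hypothetical record for illustration only.
example = {
    "requests_last_minute": 120,
    "user_agent": "Mozilla/5.0 HeadlessChrome/90.0",
    "method": "POST",
}
features = engineer_features(example)
```

Every line in the returned list encodes a human decision about what to keep from a field, which is why this step demands domain expertise.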
Additionally, supervised machine learning requires a large set of labeled data, and someone must identify different kinds of threats as malicious or benign in advance. In many cases, not enough sample data exists, rendering the algorithm ineffective for detecting today’s sophisticated and rapidly changing attacks.
Supervised machine learning is good for learning patterns, but if a pattern changes, the model may not recognize a threat. Engineers have to continually monitor whether things have changed, keep labeling data, and then retrain and retest the models. It’s a big loop that’s always turning.
2) Unsupervised Machine Learning
In unsupervised machine learning, the algorithms are not dependent on a human domain expert labeling the data. The algorithms are trained to look for anomalies in the data and trigger alerts to notify fraud operations teams to investigate potential threats. Once the anomaly is flagged, teams can label the data and use it to feed supervised machine learning algorithms. This accelerates the process of training models and improves their effectiveness.
Unsupervised machine learning is often used to segment data into clusters that can be further analyzed to identify patterns. Two well-known applications of unsupervised machine learning include segmenting markets for targeting customers and anomaly/fraud detection.
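As one simple illustration of label-free anomaly detection, the sketch below flags clients whose request volume is a robust statistical outlier, using the modified z-score based on the median absolute deviation. This is a deliberately basic stand-in for the unsupervised methods discussed above, and the client names, counts, and the 3.5 threshold are assumptions.

```python
from statistics import median

def flag_anomalies(counts_by_client, threshold=3.5):
    """Return clients whose request count is a robust outlier.

    Uses the modified z-score built on the median absolute deviation (MAD),
    so no labels are needed and a few extreme values do not distort the baseline.
    """
    values = list(counts_by_client.values())
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    flagged = []
    for client, count in counts_by_client.items():
        score = 0.6745 * (count - center) / mad
        if abs(score) > threshold:
            flagged.append(client)
    return flagged

# Hypothetical requests per minute for six clients.
traffic = {
    "client_a": 12, "client_b": 9, "client_c": 15,
    "client_d": 11, "client_e": 14, "client_f": 480,
}
suspects = flag_anomalies(traffic)
```

Flagged clients would go to a fraud operations team for review, and their verdicts could then serve as labels for a supervised model.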
3) Using Natural Language Processing (NLP) to Learn Features from Request Headers
Natural Language Processing (NLP) is the study of programming computers to process and analyze large amounts of natural-language text. NLP models can be used to automatically learn features from request headers, reducing the amount of input required from security domain experts.
Using recent ML-driven advances in NLP, you can avoid turning fields into features and instead treat the headers as sentences or paragraphs, then bundle them all together into a single feature vector. This requires far less collaboration with domain experts and can increase performance, as the model takes all features into consideration simultaneously.
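The idea of treating headers as text and producing a single vector can be sketched as follows. Note the loud assumption: a real system would use a learned language model as the encoder, whereas this example substitutes a simple hashed bag-of-words so that it stays self-contained. The header values and function names are hypothetical.

```python
import re
import zlib

def headers_to_sentence(headers):
    """Serialize all headers into one text string, preserving their order."""
    return " ".join(f"{name}: {value}" for name, value in headers.items())

def embed_text(text, dim=64):
    """Hash word tokens into a fixed-size, L2-normalized count vector.

    Stand-in for a learned NLP encoder: no per-field feature engineering,
    and every header contributes to the same vector.
    """
    vector = [0.0] * dim
    for token in re.findall(r"[a-z0-9.]+", text.lower()):
        vector[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = sum(v * v for v in vector) ** 0.5
    return [v / norm for v in vector] if norm else vector

def cosine_similarity(a, b):
    # Plain dot product; equals cosine similarity when both inputs are unit length.
    return sum(x * y for x, y in zip(a, b))

# Hypothetical request headers for illustration only.
request_headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) HeadlessChrome/90.0",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html",
}
header_vector = embed_text(headers_to_sentence(request_headers))
```

The resulting `header_vector` can be passed to a classifier or an anomaly detector as a single input, without anyone deciding field by field what to extract.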
Traditional machine learning limits the algorithm’s performance; any time you develop a feature, you have to decide what information to keep and what information to do without. Feeding a model a field that spans a range, such as elapsed time, requires normalizing the data it contains. This reduces accuracy because you can’t represent unbounded ranges. By training a model to treat headers as natural language, you don’t have to omit information.
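A small sketch shows how normalizing a ranged field can discard information. It assumes min-max scaling with clipping and a hypothetical field (seconds since the previous request) capped at one hour; other normalization schemes lose information in different ways.

```python
def min_max_scale(value, low, high):
    """Scale value into [0, 1], clipping anything outside [low, high]."""
    clipped = max(low, min(high, value))
    return (clipped - low) / (high - low)

# Hypothetical field: seconds since the previous request, with an assumed cap of one hour.
CAP_SECONDS = 3600

half_hour = min_max_scale(1800, 0, CAP_SECONDS)
two_hours = min_max_scale(7200, 0, CAP_SECONDS)
two_days = min_max_scale(172800, 0, CAP_SECONDS)
```

Here `two_hours` and `two_days` both scale to the same value, so the model can no longer tell a two-hour gap from a two-day gap.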
Advances in NLP are being driven by OpenAI, with models such as GPT-2, and by other credible organizations.