https://mapr.com/blog/real-time-credit-card-fraud-detection-apache-spark-and-event-streaming/
Editor's Note: Have questions about the topics discussed in this post? Search for answers and post questions in the Converge Community.
In this post we are going to discuss building a real-time solution for credit card fraud detection.
There are two phases to real-time fraud detection:
- The first phase involves analysis and forensics in historical data to build the machine learning model.
- The second phase uses the model in production to make predictions on live events.
Building the Model
Classification
Classification is a family of supervised machine learning algorithms that identify which category an item belongs to (for example, whether a transaction is fraud or not fraud), based on labeled examples of known items (for example, transactions known to be fraud or not). Classification takes a set of data with known labels and pre-determined features and learns how to label new records based on that information. Features are the "if questions" that you ask. The label is the answer to those questions. For example, if it walks, swims, and quacks like a duck, then the label is "duck".
Let's go through an example of car insurance fraud:
- What are we trying to predict?
- This is the label: the amount of fraud.
- What are the "if questions" or properties that you can use to predict?
- These are the features. To build a classifier model, you extract the features of interest that most contribute to the classification.
- In this simple example, we will use the claimed amount.
Linear regression models the relationship between the Y "label" and the X "feature", in this case the relationship between the amount of fraud and the claimed amount. The coefficient measures the impact of the feature, the claimed amount, on the label, the fraud amount.
Multiple linear regression models the relationship between two or more "features" and a response "label". For example, if we wanted to model the relationship between the amount of fraud and the age of the claimant, the claimed amount, and the severity of the accident, the multiple linear regression function would look like this:
AmntFraud = intercept + coeff1 * age + coeff2 * claimedAmnt + coeff3 * severity + error
The coefficients measure the impact on the fraud amount of the features.
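To make the role of the coefficient concrete, here is a minimal sketch of ordinary least squares for the single-feature case (fraud amount as a function of claimed amount). The function name and the data points are made up for illustration; a production model would normally be fit with a library such as Spark MLlib rather than by hand.

```python
# Ordinary least squares for one feature:
#   fraud_amount ~ intercept + coeff * claimed_amount
# The data below is invented purely for illustration.

def fit_simple_linear_regression(xs, ys):
    """Return (intercept, coeff) minimizing squared error for y = intercept + coeff * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    coeff = sxy / sxx
    intercept = mean_y - coeff * mean_x
    return intercept, coeff

claimed_amounts = [1000.0, 2000.0, 3000.0, 4000.0]
fraud_amounts = [150.0, 250.0, 350.0, 450.0]

intercept, coeff = fit_simple_linear_regression(claimed_amounts, fraud_amounts)

# Use the fitted line to predict the fraud amount for a new claim.
predicted = intercept + coeff * 5000.0
print(intercept, coeff, predicted)
```

The fitted coefficient is the change in predicted fraud amount per unit of claimed amount, which is what "impact of the feature on the label" means here.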
Let's take credit card fraud as another example:
- Example features: transaction amount, type of merchant, distance from and time since last transaction.
- Example label: probability of fraud
Logistic regression measures the relationship between the Y "label" and the X "features" by estimating probabilities using a logistic function. The model predicts a probability, which is used to predict the label class.
- Classification: identifies which category (e.g., fraud or not fraud)
- Linear regression: predicts a value (e.g., amount of fraud)
- Logistic regression: predicts a probability (e.g., probability of fraud)
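The step from a predicted probability to a label class can be sketched as follows. This is only an illustration: the coefficients, the intercept, the feature scaling, and the 0.5 threshold are all hypothetical values chosen for the example, not values from a trained model.

```python
import math

def logistic(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_fraud_probability(features, coeffs, intercept):
    """Linear combination of the features, passed through the logistic function."""
    z = intercept + sum(c * x for c, x in zip(coeffs, features))
    return logistic(z)

def classify(probability, threshold=0.5):
    """Turn the predicted probability into a label class."""
    return "fraud" if probability >= threshold else "not fraud"

# Hypothetical model over two features:
#   [transaction amount in thousands, distance from last transaction in hundreds of km]
coeffs = [0.8, 1.5]
intercept = -3.0

p_small_local = predict_fraud_probability([0.05, 0.0], coeffs, intercept)
p_large_far = predict_fraud_probability([2.0, 3.0], coeffs, intercept)

print(p_small_local, classify(p_small_local))
print(p_large_far, classify(p_large_far))
```

Lowering the threshold flags more transactions as fraud at the cost of more false alarms; where to set it is a business decision rather than a property of the algorithm.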
Linear and logistic regression are just a couple of the algorithms used in machine learning; there are many more, as shown in this cheat sheet.
Feature Engineering
Feature engineering is the process of transforming raw data into inputs for a machine learning algorithm. Feature engineering is extremely dependent on the type of use case and potential data sources.
(Reference: Learning Spark)
Looking in more depth at the credit card fraud example for feature engineering, our goal is to distinguish normal card usage from fraudulent card usage.
- Goal: we are looking for someone using the card other than the cardholder.
- Strategy: we want to design features to measure the differences between recent and historical activities.
For a credit card transaction, we have features associated with the transaction, features associated with the card holder, and features derived from transaction history. Some examples of each are shown below:
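As one hedged illustration of the strategy above, the sketch below derives features that compare the current transaction with the card holder's history. The class names, field names, and the specific derived features are assumptions made for this example; the original solution may use different ones.

```python
from dataclasses import dataclass

@dataclass
class Transaction:
    account: str
    amount: float
    timestamp: float          # seconds since epoch
    merchant_type: str

@dataclass
class CardHolderProfile:
    avg_amount: float                 # historical average transaction amount
    last_timestamp: float             # time of the previous transaction
    usual_merchant_types: frozenset   # merchant types seen before on this card

def derive_features(txn, profile):
    """Combine transaction fields with history to measure recent vs. historical behavior."""
    if profile.avg_amount > 0:
        amount_vs_avg = txn.amount / profile.avg_amount
    else:
        amount_vs_avg = 0.0
    return {
        # Feature associated with the transaction itself.
        "amount": txn.amount,
        # Derived: how unusual is this amount for this card holder?
        "amount_vs_avg": amount_vs_avg,
        # Derived: time since the last transaction.
        "seconds_since_last": txn.timestamp - profile.last_timestamp,
        # Derived: 1.0 if this merchant type has not been seen on this card before.
        "new_merchant_type": 0.0 if txn.merchant_type in profile.usual_merchant_types else 1.0,
    }

profile = CardHolderProfile(
    avg_amount=50.0,
    last_timestamp=1_000.0,
    usual_merchant_types=frozenset({"grocery", "fuel"}),
)
txn = Transaction(account="acct-1", amount=500.0, timestamp=1_120.0, merchant_type="electronics")

features = derive_features(txn, profile)
print(features)
```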
Model Building Workflow
A typical supervised machine learning workflow has the following steps:
- Feature engineering to transform historical data into feature and label inputs for a machine learning algorithm.
- Split the data into two parts, one for building the model and one for testing the model.
- Build the model with the training features and labels.
- Test the model with the test features to get predictions. Compare the test predictions to the test labels.
- Loop until satisfied with the model accuracy:
  - Adjust the model fitting parameters, and repeat tests.
  - Adjust the features and/or machine learning algorithm and repeat tests.
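The workflow steps above can be sketched in a few lines. This is a deliberately tiny stand-in: the "model" is a single threshold on one feature, "fitting" means picking the candidate threshold with the best training accuracy, and the labeled history is invented. All function names are hypothetical.

```python
import random

def split(data, train_fraction=0.7, seed=42):
    """Shuffle deterministically, then split into training and test rows."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

def predict(threshold, feature):
    """Toy classifier: 1 (fraud) if the feature reaches the threshold, else 0."""
    return 1 if feature >= threshold else 0

def accuracy(threshold, rows):
    """Fraction of rows whose prediction matches the known label."""
    correct = sum(1 for feature, label in rows if predict(threshold, feature) == label)
    return correct / len(rows)

def build_model(train_rows, candidate_thresholds):
    """'Fitting' here is choosing the threshold with the best training accuracy."""
    return max(candidate_thresholds, key=lambda t: accuracy(t, train_rows))

# Invented labeled history: (amount / historical average, label), 1 = fraud, 0 = not fraud.
history = [(0.5 + 0.1 * i, 0) for i in range(20)] + [(4.0 + 0.2 * i, 1) for i in range(10)]

train_rows, test_rows = split(history)
candidates = [1.0, 2.0, 3.0, 5.0]

model = build_model(train_rows, candidates)       # build with training features and labels
test_accuracy = accuracy(model, test_rows)        # compare test predictions to test labels
print(model, test_accuracy)
```

If the test accuracy is not satisfactory, the loop continues by changing the candidate parameters, the features, or the algorithm, and testing again.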
Real Time Fraud Detection Solution in Production
The figure below shows the high-level architecture of a real-time fraud detection solution, which is capable of high performance at scale. Credit card transaction events are delivered through the MapR Streams messaging system, which supports the Kafka API. The events are processed and checked for fraud by Spark Streaming using Spark machine learning with the deployed model. MapR-FS, which supports the POSIX NFS API and HDFS API, is used for storing event data. MapR-DB, a NoSQL database which supports the HBase API, is used for storing and providing fast access to credit card holder profile data.
Streaming Data Ingestion
MapR Streams is a new distributed messaging system which enables producers and consumers to exchange events in real time via the Apache Kafka 0.9 API. MapR Streams topics are logical collections of messages which organize events into categories. In this solution there are 3 categories:
- Raw Trans: raw credit card transaction events.
- Enriched: credit card transaction events enriched with card holder features, which were predicted to be not fraud.
- Fraud Alert: credit card transaction events enriched with card holder features, which were predicted to be fraud.
Topics are partitioned, spreading the load for parallel messaging across multiple servers, which provides for faster throughput and scalability.
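To illustrate how partitioning spreads load, here is a minimal sketch that assigns each event to a partition by hashing its key. This is an assumption-laden illustration, not the partitioner that MapR Streams or Kafka actually uses: the CRC32 hash, the function name, and the account keys are all chosen for the example.

```python
import zlib
from collections import defaultdict

def partition_for(key, num_partitions):
    """Pick a partition from a stable hash of the key (illustrative only)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

NUM_PARTITIONS = 4
partitions = defaultdict(list)

# Hypothetical events keyed by account number.
events = [("acct-%04d" % i, {"amount": 10.0 * i}) for i in range(12)]
for key, event in events:
    partitions[partition_for(key, NUM_PARTITIONS)].append((key, event))

for index in sorted(partitions):
    print(index, [key for key, _ in partitions[index]])
```

Because the hash is stable, all events with the same key land in the same partition, so each partition can be handled by a different server without losing per-key ordering.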
Real-Time Fraud Prediction Using Spark Streaming
Spark Streaming lets you use the same Spark APIs for streaming and batch processing, meaning that well-modularized Spark functions written for the offline machine learning can be reused for the real-time machine learning.
The data flow for the real time fraud detection using Spark streaming is as follows:
1) Raw events come into Spark Streaming as DStreams, which internally are a sequence of RDDs. RDDs are like a Java collection, except that the data elements contained in RDDs are partitioned across a cluster. RDD operations are performed in parallel on the data cached in memory, making the iterative algorithms often used in machine learning much faster for processing lots of data.
2) The credit card transaction data is parsed to get the features associated with the transaction.
3) Card holder features and profile history are read from MapR-DB using the account number as the row key.
4) Some derived features are re-calculated with the latest transaction data.
5) Features are run with the model algorithm to produce fraud prediction scores.
6) Non-fraud events enriched with derived features are published to the enriched topic. Fraud events with derived features are published to the fraud topic.
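Steps 2 through 6 can be sketched per event as below. To keep the example self-contained, it uses plain Python stand-ins instead of the real systems: a dict plays the role of the MapR-DB profile table, lists play the role of the topics, and `score` is a placeholder for the deployed model. All names, the JSON event format, and the threshold are assumptions for illustration.

```python
import json

# Stand-in for the MapR-DB profile table, keyed by account number (row key).
profiles = {"acct-1": {"avg_amount": 50.0}}

# Stand-ins for the enriched and fraud topics.
topics = {"enriched": [], "fraud": []}

def parse(raw_event):
    """Step 2: parse the raw event to get the transaction features."""
    record = json.loads(raw_event)
    return {"account": record["account"], "amount": float(record["amount"])}

def enrich(txn, profile):
    """Steps 3 and 4: add card holder features and re-calculate derived ones."""
    features = dict(txn)
    features["amount_vs_avg"] = txn["amount"] / profile["avg_amount"]
    return features

def score(features):
    """Step 5: placeholder for the model; returns a fraud score in [0, 1]."""
    return min(1.0, features["amount_vs_avg"] / 10.0)

def process(raw_event, threshold=0.5):
    """Run one raw event through steps 2 to 6 and return the topic it was published to."""
    txn = parse(raw_event)
    profile = profiles[txn["account"]]            # lookup by row key
    features = enrich(txn, profile)
    features["fraud_score"] = score(features)
    topic = "fraud" if features["fraud_score"] >= threshold else "enriched"
    topics[topic].append(features)                # step 6: publish
    return topic

normal_topic = process('{"account": "acct-1", "amount": 40}')
suspicious_topic = process('{"account": "acct-1", "amount": 900}')
print(normal_topic, suspicious_topic)
```

In the real pipeline these functions would be applied to each RDD of the DStream, which is what allows code written for offline model building to be reused.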
Storage of Credit Card Events
Messages are not deleted from topics when read, and topics can have multiple different consumers. This allows processing of the same messages by different consumers for different purposes.
In this solution, MapR Streams consumers read and store all raw events, enriched events, and alerts to MapR-FS for future analysis, model training, and updating. MapR Streams consumers read enriched events and alerts to update the card holder features in MapR-DB. Alert events are also used to update dashboards in real time.
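The property that reading does not delete messages can be modeled as an append-only log with one read offset per consumer. The class below is a hypothetical in-memory sketch of that idea, not the MapR Streams or Kafka client API.

```python
class TopicLog:
    """Append-only message log where each consumer tracks its own read position."""

    def __init__(self):
        self._messages = []
        self._offsets = {}   # consumer id -> index of the next message to read

    def publish(self, message):
        self._messages.append(message)

    def poll(self, consumer_id, max_messages=10):
        """Return the next batch for this consumer; messages stay in the log."""
        start = self._offsets.get(consumer_id, 0)
        batch = self._messages[start:start + max_messages]
        self._offsets[consumer_id] = start + len(batch)
        return batch

alerts = TopicLog()
alerts.publish({"account": "acct-1", "fraud_score": 0.97})
alerts.publish({"account": "acct-2", "fraud_score": 0.91})

# Two consumers with different purposes read the same messages independently.
archived = alerts.poll("archive-to-file-system")
dashboard = alerts.poll("dashboard")
nothing_new = alerts.poll("dashboard")
print(len(archived), len(dashboard), len(nothing_new))
```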
Rapid Reads and Writes with MapR-DB
With MapR-DB (HBase API), a table is automatically partitioned across a cluster by key range, and each server is the source for a subset of a table. Grouping the data by key range provides for really fast reads and writes by row key.
Also, with MapR-DB each partitioned subset or region of a table has a write and read cache. Recently read or written data and cached column families are available in memory; all of this provides for really fast reads and writes.
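Key-range partitioning can be sketched with a sorted list of split keys and a binary search to find the region that owns a row key. The class, the split keys, and the account key format below are hypothetical; MapR-DB manages regions and splits automatically.

```python
import bisect

class RangePartitionedTable:
    """Toy table split into regions by row key range."""

    def __init__(self, split_keys):
        # Each split key is the lowest row key of the region that follows it.
        self._split_keys = sorted(split_keys)
        self._regions = [dict() for _ in range(len(self._split_keys) + 1)]

    def region_index(self, row_key):
        """Binary search over the split keys to find the owning region."""
        return bisect.bisect_right(self._split_keys, row_key)

    def put(self, row_key, value):
        self._regions[self.region_index(row_key)][row_key] = value

    def get(self, row_key):
        return self._regions[self.region_index(row_key)].get(row_key)

table = RangePartitionedTable(split_keys=["acct-3000", "acct-6000"])
table.put("acct-1234", {"avg_amount": 50.0})
table.put("acct-7777", {"avg_amount": 120.0})

print(table.region_index("acct-1234"), table.region_index("acct-3000"), table.region_index("acct-7777"))
print(table.get("acct-1234"))
```

A lookup by row key touches only one region, which is why reads and writes by account number stay fast as the table grows.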
All of the components of the architecture we just discussed can run on the same cluster with the MapR Converged Data Platform. There are several advantages of having MapR Streams on the same cluster as all the other components. For example, maintaining only one cluster means less infrastructure to provision, manage, and monitor. Likewise, having producers and consumers on the same cluster means fewer delays related to copying and moving data between clusters and between applications.
Summary
In this blog post, you learned how the MapR Converged Data Platform integrates Hadoop and Spark with real-time database capabilities, global event streaming, and scalable enterprise storage.
References and more information:
- Free online training in MapR Streams, Spark, and HBase at learn.mapr.com
- Getting Started with MapR Streams Blog
- Ebook: New Designs Using Apache Kafka and MapR Streams
- Ebook: Getting Started with Apache Spark: From Inception to Production
- https://www.mapr.com/blog/parallel-and-iterative-processing-machine-learning-recommendations-spark
- https://www.mapr.com/blog/fast-scalable-streaming-applications-mapr-streams-spark-streaming-and-mapr-db
- https://www.mapr.com/blog/apache-spark-machine-learning-tutorial
- https://www.mapr.com/blog/life-message-mapr-streams
- https://www.mapr.com/blog/spark-streaming-hbase
- Apache Spark Streaming Programming Guide
- Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques: A Guide to Data Science for Fraud Detection (book), by Wouter Verbeke, Veronique Van Vlasselaer, Bart Baesens
- Learning Spark Book, by Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia
Real Time Credit Card Fraud Detection with Apache Spark and Event Streaming