How to use big data to train risk control mathematical models has always been PayPal's challenge in detecting fraudulent transactions. PayPal's risk control model training has gone through roughly four stages:
Decision tree: Early on, PayPal used a simple decision tree model, mainly because the earlier models were trained on small amounts of data and the results of decision tree models are easy to interpret.
Logistic regression: As PayPal's business grew more complex, so did its risk control models. Logistic regression easily handles larger amounts of data and more features, and PayPal's online risk control service could quickly implement these logistic regression models.
Neural networks: To make up for the limited number of features that logistic regression could handle, PayPal used neural networks to train mathematical models with thousands of features. However, without a distributed training framework or product, the training data remained limited to what a single machine could hold.
Distributed neural networks and logistic regression: The Hadoop iterative computing framework Guagua emerged and solved the problem of distributed training on big data, so PayPal's risk control mathematical models are no longer limited by single-machine data capacity. The largest model currently supported has more than 2,500 features.
Guagua, the Hadoop iterative computing framework, is a subproject of PayPal's open source machine learning framework Shifu and was open-sourced in April this year.
Zhang Pengshan is a research and development engineer at PayPal Risk Data Science and a lead developer of Shifu and Guagua, software that uses Hadoop for feature extraction, training, and validation of risk control mathematical models. An editor at InfoQ China recently interviewed Zhang Pengshan to learn about the framework's development and how it is being applied.
InfoQ: First, why is this framework named Guagua?
Zhang Pengshan: The name actually came about very casually. Last year, while the company's offices were being renovated and I was developing Guagua at home, I had no suitable name, so I flipped through a storybook my son liked. I saw a little duck called "Quack" ("Guagua" in Chinese) and simply used that name. Later, once Guagua had taken shape, I kept wanting to change the name, but by then Guagua was already well known inside the company and a colleague had helped me design a very nice logo. The name Guagua has been in use ever since.
InfoQ: What are the business characteristics of risk control model training?
Zhang Pengshan: The main characteristics of risk control mathematical models are large volumes of training data, many model features, and low model generality.
InfoQ: What are the characteristics of the training algorithms? Which methods are publicly available in the industry or known to you, and how do they differ?
Zhang Pengshan: The training methods are not much different from those for other classification problems. The only big difference is how to use big data to train the mathematical models. The industry has many relevant algorithms, such as decision trees, logistic regression, neural networks, and SVMs, but most are single-machine implementations. Even Apache Mahout does not support distributed classification models well (both Mahout's logistic regression and its neural networks are single-machine algorithms).
InfoQ: Why Guagua? In other words, why is Guagua a framework that better suits the characteristics of your business?
Zhang Pengshan: At PayPal, Guagua mainly solves the distributed training problem for machine learning classification models. In the past, we had no framework or product for distributed model training, so we could only sample our training data down to a size that a single machine could handle. In addition, because of the limits on single-machine computing resources and memory, training a risk control model used to take about 10 hours. With Guagua, both the data and the computation are distributed over Hadoop. We can now train on terabyte-scale data, which we could not have imagined before, and training time has dropped from about 10 hours to about 1 hour, while the resulting model shows no loss of performance compared with the single-machine one.
InfoQ: Where does Guagua currently meet your requirements, where does it fall short, and how do you plan to improve it?
Zhang Pengshan: Guagua mainly solves the distributed model training problem, and PayPal can now use big data to train risk control mathematical models quickly. At the same time, Guagua is not limited to classification models. It is an iterative computing framework based on Hadoop and can add distributed capability to almost any iteration-based algorithm. In addition, thanks to Guagua's good support for distributed computation, many tasks that we had thought of but could not do before, such as automatic selection of model features, can now be carried out.
Currently, Guagua mainly supports a synchronous Master-Workers iterative computing structure. In the future, we hope to support asynchronous computation as well. In 2012, Jeff Dean, the father of Google MapReduce, published a paper on deep neural network models that describes the DistBelief framework, which can train neural network models with on the order of 1 billion parameters. This is another direction for Guagua: supporting very large-scale deep neural network models.
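For readers unfamiliar with the synchronous Master-Workers pattern mentioned above, the following is a minimal single-process sketch in Python, using logistic regression as the iterative algorithm. It does not use Guagua's API or Hadoop: the function names (`worker_gradient`, `master_update`, `train`) and the toy data are hypothetical, and the "workers" simply run one after another in a loop. The only thing it illustrates is the shape of one iteration: each worker computes a partial result on its own data shard, and the master waits for all of them before updating the model.

```python
import math
from typing import List, Sequence, Tuple

# One training sample: (features, label), with label 0 or 1.
Sample = Tuple[Sequence[float], int]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def worker_gradient(weights: Sequence[float], shard: Sequence[Sample]) -> List[float]:
    """Worker step (hypothetical): sum of log-loss gradients over one shard."""
    grad = [0.0] * len(weights)
    for features, label in shard:
        error = sigmoid(sum(w * x for w, x in zip(weights, features))) - label
        for j, x in enumerate(features):
            grad[j] += error * x
    return grad


def master_update(weights: Sequence[float], worker_grads: Sequence[Sequence[float]],
                  n_samples: int, learning_rate: float) -> List[float]:
    """Master step (hypothetical): add up worker gradients, take one descent step."""
    total = [sum(parts) for parts in zip(*worker_grads)]
    return [w - learning_rate * g / n_samples for w, g in zip(weights, total)]


def train(shards: Sequence[Sequence[Sample]], n_features: int,
          iterations: int = 200, learning_rate: float = 0.5) -> List[float]:
    """Run synchronous iterations over all shards and return the final weights."""
    weights = [0.0] * n_features
    n_samples = sum(len(shard) for shard in shards)
    for _ in range(iterations):
        # Synchronous barrier: every worker's gradient is collected
        # before the master updates and redistributes the weights.
        grads = [worker_gradient(weights, shard) for shard in shards]
        weights = master_update(weights, grads, n_samples, learning_rate)
    return weights


if __name__ == "__main__":
    # Toy data: features are (bias, x); the label is 1 when x is positive.
    data = [((1.0, -2.0), 0), ((1.0, -1.0), 0), ((1.0, 1.0), 1), ((1.0, 2.0), 1)]
    print(train([data[:2], data[2:]], n_features=2))
```

In this sketch the sum of the per-shard gradients equals the gradient over the full data set, so splitting the data across workers gives the same weights as training on one machine, up to floating-point rounding. An asynchronous variant would drop the barrier and let the master apply each worker's gradient as it arrives.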