Training risk control models on big data has always been a challenge for PayPal in detecting fraudulent transactions. PayPal's risk control model training has gone through four stages:
Decision tree: In the early days PayPal used a simple decision tree model, mainly because the training data was relatively small and the results of a decision tree are easy to interpret.
Logistic regression: As PayPal's business grew more complex, so did its risk control models. Logistic regression easily handles more data and more features, and PayPal's online risk control service could quickly implement these logistic regression models.
Neural network: To overcome the limit on the number of features in logistic regression, PayPal used neural networks to train models with thousands of features. Because no distributed training framework or product was available, the training data was limited to what a single machine could handle.
Distributed neural network and logistic regression: Guagua, an iterative computing framework on Hadoop, solved the problem of distributed training on big data, so that none of PayPal's risk control models is bound by the single-machine data limit any longer. The maximum number of models currently supported is more than 2,500.
Guagua, the Hadoop iterative computing framework, is a subproject of Shifu, PayPal's open source machine learning framework, and was open-sourced this year.
Zhang Penshan is a research and development engineer in PayPal's data science department. At PayPal he has worked on using Hadoop for feature extraction, training, and validation of risk control models, and he is the main developer of Shifu and Guagua. InfoQ China editors recently interviewed Zhang Penshan about the background and applications of the framework.
InfoQ: First of all, why is the framework named Guagua?
Zhang Penshan: The name was chosen quite casually. Last year, while the company office was being renovated, I developed Guagua at home and had no suitable name for it. I flipped through a storybook my son likes, saw a little duck in it named "Guagua" ("quack"), and simply borrowed the name. Once Guagua took shape I kept meaning to rename it, but by then it was already well known inside the company and my colleagues had designed a very nice logo for it, so the name Guagua has stuck ever since.
InfoQ: What are the business characteristics of risk control training?
Zhang Penshan: The main characteristics of risk control models are the large volume of training data, the many features in each model, and the low generality of the models.
InfoQ: What are the characteristics of the training algorithms? What methods are public or known to you in the industry, and how do they compare?
Zhang Penshan: The training methods are not much different from those for other classification problems. The one big difference is how to train the models on big data. The industry has many relevant algorithms, such as decision trees, logistic regression, neural networks, and SVMs. Even Apache Mahout does not distribute classification models well (both logistic regression and neural networks are single-machine algorithms in Mahout).
InfoQ: Why develop Guagua? In other words, why is Guagua a better fit for your business characteristics?
Zhang Penshan: At PayPal, Guagua mainly solves the problem of distributed training for machine learning classification models. Before Guagua we had no distributed training framework or product, so we could only use sampling to shrink the training data to what a single machine could handle. In addition, training a risk control model used to take about 10 hours because of the limited computing resources and memory of a single machine. With Guagua, data and computation are distributed on top of Hadoop. Not only has the training data reached the TB scale, which we had not imagined before, but the training time has also dropped from about 10 hours to about 1 hour, and the final model shows no loss in performance compared with the single-machine version.
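As an illustration of why distributing the data need not cost model quality, here is a minimal Python sketch of synchronous data-parallel logistic regression. It is not Guagua code and does not use Guagua's API; the function names, the toy data, and the hyperparameters are all assumptions made for the example. Each "worker" computes the gradient on its own shard, and the "master" sums the partial gradients, which yields the same update a single machine would compute on the full data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def shard_gradient(weights, shard):
    """Worker step (hypothetical): summed log-loss gradient over one data shard."""
    grad = [0.0] * len(weights)
    for features, label in shard:
        error = sigmoid(sum(w * x for w, x in zip(weights, features))) - label
        for j, x in enumerate(features):
            grad[j] += error * x
    return grad

def train(shards, n_features, iterations=200, learning_rate=0.5):
    """Master loop (hypothetical): broadcast weights, sum worker gradients, update."""
    n_rows = sum(len(shard) for shard in shards)
    weights = [0.0] * n_features
    for _ in range(iterations):
        # On a real cluster each shard would be processed by a separate worker.
        partials = [shard_gradient(weights, shard) for shard in shards]
        total = [sum(column) for column in zip(*partials)]
        weights = [w - learning_rate * g / n_rows for w, g in zip(weights, total)]
    return weights

# Toy data, invented for illustration: ([bias, scaled amount], fraud label).
rows = [([1.0, 0.1], 0), ([1.0, 0.2], 0), ([1.0, 0.3], 0),
        ([1.0, 0.7], 1), ([1.0, 0.8], 1), ([1.0, 0.9], 1)]

single = train([rows], 2)                      # all data on one "machine"
sharded = train([rows[0::2], rows[1::2]], 2)   # same data split across two "workers"

# The two weight vectors differ only by floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(single, sharded))
```

Because the gradient of the loss is a sum over training rows, splitting the rows across workers changes where the sum is computed but not its value, which is consistent with the observation that the distributed model matches the single-machine one.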
InfoQ: In which respects does Guagua now meet your requirements, where does it still fall short, and what do you plan to do to improve it?
Zhang Penshan: Guagua mainly solves the distribution problem in model training, so PayPal can now train its risk control models quickly on big data. At the same time, Guagua is not confined to classification models. It is an iterative computing framework based on Hadoop, and almost any iterative algorithm can use Guagua to become distributed. In addition, because Guagua supports distributed computation well, many tasks that we wanted to do but could not, such as automatic selection of model features, are now feasible.
Guagua currently supports an iterative computing framework with a synchronous master-workers structure, and we hope to support an asynchronous mode as well. In 2012 Jeff Dean, the father of Google MapReduce, published a paper that discusses support for deep neural network models; it describes how their DistBelief framework can train neural network models with on the order of 1 billion parameters. Supporting large-scale deep neural network models is another direction for Guagua.
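The synchronous master-workers structure described above can be sketched generically. The Python below is an assumption-laden illustration, not Guagua's actual API: `run_iterations`, `kmeans_worker`, and `kmeans_master` are hypothetical names, and the data is invented. It plugs a non-classification algorithm (one-dimensional k-means) into the loop to show how an iterative algorithm can be split into a per-shard worker step and a combining master step.

```python
def run_iterations(shards, state, worker_fn, master_fn, iterations):
    """Synchronous loop: all workers finish an iteration before the master updates state."""
    for _ in range(iterations):
        # On a real cluster the worker calls would run in parallel, one per shard.
        results = [worker_fn(state, shard) for shard in shards]
        state = master_fn(state, results)
    return state

def kmeans_worker(centers, shard):
    """Worker step: per-center partial sums and counts for the points in one shard."""
    sums = [0.0] * len(centers)
    counts = [0] * len(centers)
    for x in shard:
        nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
        sums[nearest] += x
        counts[nearest] += 1
    return sums, counts

def kmeans_master(centers, results):
    """Master step: merge the workers' partial sums into new centers."""
    new_centers = []
    for k, old_center in enumerate(centers):
        total = sum(sums[k] for sums, _ in results)
        count = sum(counts[k] for _, counts in results)
        # Keep the old center if no point was assigned to it.
        new_centers.append(total / count if count else old_center)
    return new_centers

# Invented data: two shards of one-dimensional points.
shards = [[1.0, 1.2, 9.8], [0.8, 10.0, 10.2]]
centers = run_iterations(shards, [0.0, 5.0], kmeans_worker, kmeans_master, iterations=5)
```

In this synchronous design the master waits for every worker in each iteration, so one slow worker delays the whole step. An asynchronous mode would let the master apply updates as they arrive, at the cost of workers sometimes computing on slightly stale state.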