Big Data Case Study: A High-Load, Low-Latency Analysis Architecture


This article was written by Iván de Prado and Pere Ferrera, founders of Datasalt, a company focused on big data that has launched open source projects such as Pangool and Splout SQL. In the article, they illustrate a low-latency solution in the cloud using the analysis of BBVA credit card payments as an example.

The following is the full text of the article:

The amount of payments made with credit cards is huge, and it is clear that by analyzing all of those transactions we can extract real value from the data. Customer loyalty, demographics, popularity of activities, shop recommendations, and many other statistics are very useful to both merchants and banks and can improve their relationship with the market. At Datasalt, we have worked with BBVA to develop a system that can analyze years of data and serve different insights and statistics to web and mobile applications.

Besides the main challenge of handling a large volume of input data, the system also has to face a large volume of output data, which can be even bigger than the input, and that output has to be served quickly and under high load.

The solution we developed has an infrastructure cost of just a few thousand dollars per month, thanks to the use of the cloud (AWS), Hadoop, and Voldemort. In the following sections, we explain the main features of the proposed architecture.

Data, goals and primary decisions

The system uses the transactions generated by BBVA credit card payments in shops all over the world as the input source for the analysis. Obviously, to avoid any privacy concerns, the data is anonymized, kept separate from personal information, and the credit card numbers are truncated. Any resulting insight is always an aggregate, so no individual information can be derived from it.

We compute many statistics and pieces of data for each shop and for different periods of time. Here are some of them:

• Histogram of payment amounts for each shop (a minimal MapReduce sketch of this computation appears after this list)

• Client fidelity

• Client demographics

• Shop recommendations (clients who buy here also buy at ...), filtered by location and by shop category
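
To make this kind of computation concrete, here is a minimal Hadoop MapReduce sketch (not the project's actual code) that builds a per-shop histogram of payment amounts. The CSV field layout and the 10-unit bucket width are assumptions made only for illustration.

// Minimal sketch (not the production code): per-shop histogram of payment amounts.
// Assumes CSV input lines like "shopId,cardHash,amount,timestamp"; the field layout
// and the 10-unit bucket width are illustrative assumptions.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PaymentHistogram {

  public static class HistogramMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      String shopId = fields[0];
      double amount = Double.parseDouble(fields[2]);
      long bucket = (long) (amount / 10) * 10; // 10-unit buckets: 0-10, 10-20, ...
      // Key = shop plus bucket; value = one occurrence to be summed by the reducer.
      context.write(new Text(shopId + "\t" + bucket), ONE);
    }
  }

  public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
        throws IOException, InterruptedException {
      long count = 0;
      for (LongWritable v : values) {
        count += v.get();
      }
      context.write(key, new LongWritable(count));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "payment-histogram");
    job.setJarByClass(PaymentHistogram.class);
    job.setMapperClass(HistogramMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A job of this kind would simply be one of the many jobs chained together in the daily workflow described later in this article.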

The main goal of the project is to serve all of this information to the different agents (shops, clients) through low-latency web and mobile applications. So, one demanding requirement is being able to serve the results with sub-second latencies under high load. And because this is a research project, a high degree of flexibility in the code and in the requirements has to be handled.

Because updating the data at every moment is not a requirement, we chose a batch-oriented architecture (Hadoop), and we use Voldemort as a read-only store to serve the insights generated by Hadoop; Voldemort is a simple and very fast key/value store.

Platform

The system was built on Amazon Web Services. Specifically, we use S3 to store the raw input data, Elastic MapReduce (Amazon's hosted Hadoop) to analyze it, and EC2 to serve the results. Using cloud technologies allows us to iterate quickly and deliver functional prototypes fast, which is exactly what we need.
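
For example, storing one day's raw transaction file on S3 with the AWS SDK for Java might look like the following sketch; the bucket name and key layout are hypothetical and only meant to illustrate the idea.

// Minimal sketch, not the project's code: upload one day's raw transaction file to S3.
// The bucket name and the key layout ("raw/YYYY-MM-DD/transactions.csv") are hypothetical.
import java.io.File;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class UploadDailyTransactions {
  public static void main(String[] args) {
    String bucket = "bbva-card-analytics";   // hypothetical bucket name
    String day = args[0];                    // e.g. "2013-05-20"
    File transactions = new File(args[1]);   // local file with that day's transactions

    // Credentials and region come from the default provider chain / environment.
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    s3.putObject(bucket, "raw/" + day + "/transactions.csv", transactions);
    System.out.println("Uploaded " + transactions.getName() + " for day " + day);
  }
}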

Architecture

The architecture has three main sections:

• Data storage: used to keep the raw data (credit card transactions) and the resulting Voldemort stores.

• Data processing: a Hadoop workflow running on EMR that performs all of the computations and builds the data stores needed by Voldemort.

• Data serving: a Voldemort cluster that serves the precomputed data produced by the data processing layer (a minimal client read is sketched after this list).
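
To illustrate how the serving layer is queried, here is a minimal sketch based on Voldemort's standard Java client; the bootstrap URL, store name, and key format are assumptions and not the project's actual values.

// Minimal sketch using Voldemort's standard Java client; the bootstrap URL,
// store name ("shop-stats") and key format are hypothetical, not the project's.
import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;
import voldemort.client.StoreClientFactory;
import voldemort.versioning.Versioned;

public class ShopStatsReader {
  public static void main(String[] args) {
    StoreClientFactory factory = new SocketStoreClientFactory(
        new ClientConfig().setBootstrapUrls("tcp://voldemort-node:6666"));
    StoreClient<String, String> client = factory.getStoreClient("shop-stats");

    // Look up the precomputed statistics for one shop and one time period.
    Versioned<String> stats = client.get("shop:12345|2012-10");
    System.out.println(stats == null ? "no data" : stats.getValue());

    factory.close();
  }
}

Because the Voldemort stores are read-only and prebuilt by Hadoop, the serving nodes only have to answer simple key lookups like this one, which is what keeps latencies low under high load.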

Every day, the bank uploads all of the transactions that happened during that day to a folder on S3. This allows us to keep all of the historical data: the transactions performed by all of the credit cards every day. All of this data is the input to the processing layer, so we recompute everything every day. Reprocessing all of the data lets us stay very agile: if the requirements change or if we find a silly bug, we just need to update the project code, and all of the data is fixed in the next batch. This design decision gives us:

• Simplified code and infrastructure

• Flexibility and adaptability to changes

• Easy handling of human errors (just fix the bug and relaunch the process)

Every day, a controller starts a new Hadoop cluster on EMR and launches the processing flow. This flow is composed of about 16 MapReduce jobs that calculate the various insights. The last part of the flow (the Voldemort indexer) is responsible for building the data store files that will later be deployed to Voldemort. Once the flow has finished, the resulting data store files are uploaded to S3. The controller then shuts down the Hadoop cluster and sends a deployment request to Voldemort, which downloads the new data stores from S3 and performs a hot swap, completely replacing the old data.
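
A simplified version of what such a controller might do when launching the daily flow is sketched below using the AWS SDK for Java. The instance types, jar locations, step arguments, and bucket names are illustrative assumptions, and the final Voldemort fetch-and-swap request is only indicated in a comment.

// Minimal controller sketch with the AWS SDK for Java: start an EMR cluster that runs
// the daily processing flow. Instance types/counts, jar paths and step arguments are
// illustrative assumptions; the real flow chains roughly 16 MapReduce jobs.
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class DailyController {
  public static void main(String[] args) {
    AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

    // One step of the flow; the real workflow would add the remaining jobs,
    // ending with the step that builds the Voldemort store files.
    StepConfig histogramStep = new StepConfig()
        .withName("payment-histogram")
        .withActionOnFailure("TERMINATE_CLUSTER")
        .withHadoopJarStep(new HadoopJarStepConfig()
            .withJar("s3://bbva-card-analytics/jobs/analytics.jar") // hypothetical jar
            .withArgs("s3://bbva-card-analytics/raw/" + args[0],
                      "s3://bbva-card-analytics/out/" + args[0]));

    RunJobFlowRequest request = new RunJobFlowRequest()
        .withName("daily-card-analysis-" + args[0])
        .withReleaseLabel("emr-5.36.0")
        .withServiceRole("EMR_DefaultRole")
        .withJobFlowRole("EMR_EC2_DefaultRole")
        .withInstances(new JobFlowInstancesConfig()
            .withInstanceCount(10)
            .withMasterInstanceType("m5.xlarge")
            .withSlaveInstanceType("m5.xlarge")
            .withKeepJobFlowAliveWhenNoSteps(false))
        .withSteps(histogramStep);

    RunJobFlowResult result = emr.runJobFlow(request);
    System.out.println("Started job flow " + result.getJobFlowId());
    // Once the flow finishes, the controller would ask Voldemort to fetch the new
    // store files from S3 and hot-swap them in (omitted here).
  }
}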
