Big Data Case Study: A High-Load, Low-Latency Analysis Architecture


This article was written by Iván de Prado and Pere Ferrera, founders of Datasalt, a company focused on big data that has launched open source projects such as Pangool and Splout SQL. In the article, they illustrate a low-latency solution in the cloud using the analysis of BBVA credit card payments as an example.

The following is the full text of the article:

The amount of payments made with credit cards is huge, and it is clear that by analyzing all of those transactions we can extract real value from the data. Customer loyalty, demographics, popularity of activities, shop recommendations, and many other statistics are very useful to both merchants and banks and can improve their relationship with the market. At Datasalt, we have worked with BBVA to develop a system that can analyze years of data and serve different insights and statistics to web and mobile applications.

Besides the main challenge of handling a large volume of input data, the system also has to face a large volume of output data, which can be even bigger than the input, and that output has to be served quickly and under high load.

The solution we developed has an infrastructure cost of just a few thousand dollars per month, thanks to the use of the cloud (AWS), Hadoop, and Voldemort. In the following sections, we explain the main features of the proposed architecture.

Data, goals and primary decisions

The system uses the transactions generated by BBVA credit card payments in shops all over the world as the input source for the analysis. Obviously, to avoid any privacy concerns, the data is anonymized, kept separate from personal information, and the credit card numbers are truncated. Any resulting insight is always an aggregate, so no individual information can be derived from it.

We compute many statistics and pieces of data for each shop and for different periods of time. Here are some of them:

• Histogram of payment amounts for each shop (a minimal MapReduce sketch of this computation appears after this list)

• Client fidelity

• Client demographics

• Shop recommendations (clients who buy here also buy at ...), filtered by location and by shop category
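
To make this kind of computation concrete, here is a minimal Hadoop MapReduce sketch (not the project's actual code) that builds a per-shop histogram of payment amounts. The CSV field layout and the 10-unit bucket width are assumptions made only for illustration.

// Minimal sketch (not the production code): per-shop histogram of payment amounts.
// Assumes CSV input lines like "shopId,cardHash,amount,timestamp"; the field layout
// and the 10-unit bucket width are illustrative assumptions.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PaymentHistogram {

  public static class HistogramMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      String shopId = fields[0];
      double amount = Double.parseDouble(fields[2]);
      long bucket = (long) (amount / 10) * 10; // 10-unit buckets: 0-10, 10-20, ...
      // Key = shop plus bucket; value = one occurrence to be summed by the reducer.
      context.write(new Text(shopId + "\t" + bucket), ONE);
    }
  }

  public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
        throws IOException, InterruptedException {
      long count = 0;
      for (LongWritable v : values) {
        count += v.get();
      }
      context.write(key, new LongWritable(count));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "payment-histogram");
    job.setJarByClass(PaymentHistogram.class);
    job.setMapperClass(HistogramMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A job of this kind would simply be one of the many jobs chained together in the daily workflow described later in this article.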

The main goal of the project is to serve all of this information to the different agents (shops, clients) through low-latency web and mobile applications. So, one demanding requirement is being able to serve the results with sub-second latencies under high load. And because this is a research project, a high degree of flexibility in the code and in the requirements has to be handled.

Because updating the data at every moment is not a requirement, we chose a batch-oriented architecture (Hadoop), and we use Voldemort as a read-only store to serve the insights generated by Hadoop; Voldemort is a simple and very fast key/value store.

Platform

The system was built on Amazon Web Services. Specifically, we use S3 to store the raw input data, Elastic MapReduce (Amazon's hosted Hadoop) to analyze it, and EC2 to serve the results. Using cloud technologies allows us to iterate quickly and deliver functional prototypes fast, which is exactly what we need.
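
For example, storing one day's raw transaction file on S3 with the AWS SDK for Java might look like the following sketch; the bucket name and key layout are hypothetical and only meant to illustrate the idea.

// Minimal sketch, not the project's code: upload one day's raw transaction file to S3.
// The bucket name and the key layout ("raw/YYYY-MM-DD/transactions.csv") are hypothetical.
import java.io.File;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class UploadDailyTransactions {
  public static void main(String[] args) {
    String bucket = "bbva-card-analytics";   // hypothetical bucket name
    String day = args[0];                    // e.g. "2013-05-20"
    File transactions = new File(args[1]);   // local file with that day's transactions

    // Credentials and region come from the default provider chain / environment.
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    s3.putObject(bucket, "raw/" + day + "/transactions.csv", transactions);
    System.out.println("Uploaded " + transactions.getName() + " for day " + day);
  }
}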

Architecture

The architecture has three main sections:

• Data storage: used to keep the raw data (credit card transactions) and the resulting Voldemort stores.

• Data processing: a Hadoop workflow running on EMR that performs all of the computations and builds the data stores needed by Voldemort.

• Data serving: a Voldemort cluster that serves the precomputed data produced by the data processing layer (a minimal client read is sketched after this list).
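
To illustrate how the serving layer is queried, here is a minimal sketch based on Voldemort's standard Java client; the bootstrap URL, store name, and key format are assumptions and not the project's actual values.

// Minimal sketch using Voldemort's standard Java client; the bootstrap URL,
// store name ("shop-stats") and key format are hypothetical, not the project's.
import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;
import voldemort.client.StoreClientFactory;
import voldemort.versioning.Versioned;

public class ShopStatsReader {
  public static void main(String[] args) {
    StoreClientFactory factory = new SocketStoreClientFactory(
        new ClientConfig().setBootstrapUrls("tcp://voldemort-node:6666"));
    StoreClient<String, String> client = factory.getStoreClient("shop-stats");

    // Look up the precomputed statistics for one shop and one time period.
    Versioned<String> stats = client.get("shop:12345|2012-10");
    System.out.println(stats == null ? "no data" : stats.getValue());

    factory.close();
  }
}

Because the Voldemort stores are read-only and prebuilt by Hadoop, the serving nodes only have to answer simple key lookups like this one, which is what keeps latencies low under high load.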

Every day, the bank uploads all of the transactions that happened during that day to a folder on S3. This allows us to keep all of the historical data: the transactions performed by all of the credit cards every day. All of this data is the input to the processing layer, so we recompute everything every day. Reprocessing all of the data lets us stay very agile: if the requirements change or if we find a silly bug, we just need to update the project code, and all of the data is fixed in the next batch. This design decision gives us:

• Simplified code and infrastructure

• Flexibility and adaptability to changes

• Easy handling of human errors (just fix the bug and relaunch the process)

Every day, a controller starts a new Hadoop cluster on EMR and launches the processing flow. This flow is composed of about 16 MapReduce jobs that calculate the various insights. The last part of the flow (the Voldemort indexer) is responsible for building the data store files that will later be deployed to Voldemort. Once the flow has finished, the resulting data store files are uploaded to S3. The controller then shuts down the Hadoop cluster and sends a deployment request to Voldemort, which downloads the new data stores from S3 and performs a hot swap, completely replacing the old data.
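
A simplified version of what such a controller might do when launching the daily flow is sketched below using the AWS SDK for Java. The instance types, jar locations, step arguments, and bucket names are illustrative assumptions, and the final Voldemort fetch-and-swap request is only indicated in a comment.

// Minimal controller sketch with the AWS SDK for Java: start an EMR cluster that runs
// the daily processing flow. Instance types/counts, jar paths and step arguments are
// illustrative assumptions; the real flow chains roughly 16 MapReduce jobs.
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class DailyController {
  public static void main(String[] args) {
    AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

    // One step of the flow; the real workflow would add the remaining jobs,
    // ending with the step that builds the Voldemort store files.
    StepConfig histogramStep = new StepConfig()
        .withName("payment-histogram")
        .withActionOnFailure("TERMINATE_CLUSTER")
        .withHadoopJarStep(new HadoopJarStepConfig()
            .withJar("s3://bbva-card-analytics/jobs/analytics.jar") // hypothetical jar
            .withArgs("s3://bbva-card-analytics/raw/" + args[0],
                      "s3://bbva-card-analytics/out/" + args[0]));

    RunJobFlowRequest request = new RunJobFlowRequest()
        .withName("daily-card-analysis-" + args[0])
        .withReleaseLabel("emr-5.36.0")
        .withServiceRole("EMR_DefaultRole")
        .withJobFlowRole("EMR_EC2_DefaultRole")
        .withInstances(new JobFlowInstancesConfig()
            .withInstanceCount(10)
            .withMasterInstanceType("m5.xlarge")
            .withSlaveInstanceType("m5.xlarge")
            .withKeepJobFlowAliveWhenNoSteps(false))
        .withSteps(histogramStep);

    RunJobFlowResult result = emr.runJobFlow(request);
    System.out.println("Started job flow " + result.getJobFlowId());
    // Once the flow finishes, the controller would ask Voldemort to fetch the new
    // store files from S3 and hot-swap them in (omitted here).
  }
}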
