[Summary] Amazon Kinesis Real-Time Data Analytics Best Practices

Source: Internet
Author: User
Tags: dynamodb

1. Churyang's Summary
    • AWS services are all built on SOA architectures and can be called on demand
    • For real-time streaming of big data, AWS offers both a self-managed scenario and a fully hosted scenario
      • The self-managed scenario is deploying Flume (collection), Kafka (data buffering), and Storm (stream processing) on EC2
      • The fully hosted scenario is Kinesis
    • Using Kinesis still requires the user to ingest data through the API from data sources such as mobile phones, website clicks, IoT devices, and sensors
    • Users can write Kinesis workers to implement custom data-processing logic (extensibility)
    • After Kinesis processes the data, AWS recommends storing it in S3 or Redshift for subsequent use
    • The typical Kinesis pipeline is: front-end data sources → Kinesis stream processing → S3 for intermediate storage → EMR for data processing → Redshift for BI analysis. CloudWatch monitors the whole pipeline, and Auto Scaling can be enabled to scale processing capacity elastically

    • Kinesis real-time data stream application scenarios
      • Advertising platforms: a user's behavior on the internet can influence ad content in real time, so the next time the user refreshes the page, new ads are served
      • E-commerce: each favorite, click, and purchase can be fed into the user's personal model quickly, immediately refining product recommendations
      • Social networks: changes in a user's social graph and posting behavior can be quickly reflected in friend recommendations and hot-topic alerts
2. Overview

2.1. AWS cloud-based big data services
    • Collection: real-time data stream collection and processing (Kinesis)
    • Storage: large-scale storage
      • DynamoDB
      • S3
      • Glacier
    • Processing: large-cluster parallel computing
      • EMR
      • EC2
      • Redshift – MPP database
      • Data Pipeline – ETL tool

2.2. AWS big data customers

These include pharmaceutical companies, internet companies, and large enterprises.

3. Big data analysis and processing

3.1. Challenges of large-scale processing

The life cycle of big data: collection → storage → analytics → insights

Success story: Supercell, a mobile gaming company
- Collection: real-time data acquisition with Kinesis
- Storage: 4 TB/day → S3
- Long-term archive: Glacier
- Analytics: data mining with Hadoop

3.2. Real-time data stream processing use cases
    • Advertising platforms: a user's behavior on the internet can influence ad content in real time, so the next time the user refreshes the page, new ads are served
    • E-commerce: each favorite, click, and purchase can be fed into the user's personal model quickly, immediately refining product recommendations
    • Social networks: changes in a user's social graph and posting behavior can be quickly reflected in friend recommendations and hot-topic alerts
4. A typical real-time data stream processing architecture and workflow

1) Data collection: collects and processes data from each node in real time, e.g. with Flume (from Cloudera)
2) Data access: because collection speed and processing speed are not necessarily synchronized, a message middleware is added as a buffer, e.g. Apache Kafka (originally from LinkedIn)
3) Stream computation: analyzes the collected data in real time, e.g. with Apache Storm (originally from Twitter)

5. Processing on AWS (simple mode)

1) Data collection: build collectors on EC2 instances (Kafka, Fluentd, Scribe, Flume, etc.)

2) Data loading: deposit the data into S3
Local disks are not recommended, because capacity scalability is not guaranteed and durability is difficult to ensure
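As a sketch of that load step, a collector on EC2 might batch records and land them in S3 as time-partitioned objects. The bucket and key layout below are illustrative assumptions, not from the source:

```python
import json
from datetime import datetime

def batch_key(prefix, ts):
    """Time-partitioned object key, e.g. logs/2015/06/01/10.json;
    an hourly layout that downstream EMR jobs can read slice by slice."""
    return f"{prefix}/{ts:%Y/%m/%d/%H}.json"

def upload_batch(records, bucket, key):
    """Land one batch of collected records in S3 as newline-delimited JSON."""
    import boto3  # AWS SDK for Python
    body = "\n".join(json.dumps(r) for r in records).encode("utf-8")
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body)
    return len(body)
```

Writing to S3 instead of local disk gives the durability and capacity guarantees the text calls out.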

6. Processing on AWS (Kinesis mode)

Real-time data processing with Amazon Kinesis:

    • Real-time data collection, ingestion, and transmission
    • Processing of real-time dynamic data streams
    • Parallel reads and writes
    • Output of data to different storage destinations

The architecture model for Amazon Kinesis is as follows.

Operation flow:

1) Create a data stream and divide it into shards
- Shard: the shard is the basic throughput unit of a Kinesis data stream
- One shard provides 1 MB/sec of input (write) capacity at up to 1,000 records/sec, and 2 MB/sec of output (read) capacity at up to 5 read transactions/sec

2) Set the size of a single record (for example, 140 bytes per record for Twitter) and the number of writes per second (for example, 5,000 records/sec)

After the number of shards, the record size, and the write rate are specified, the required throughput is calculated automatically
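Those per-shard limits make the shard-count calculation mechanical. A minimal sketch of the arithmetic (the function name is ours):

```python
import math

def shards_needed(record_bytes, records_per_sec, read_fanout=1):
    """Size a stream from the per-shard limits quoted above:
    1 MB/s and 1,000 records/s on input, 2 MB/s on output."""
    write_mb = record_bytes * records_per_sec / 1_000_000
    read_mb = write_mb * read_fanout  # each consumer re-reads the full stream
    return max(math.ceil(write_mb / 1.0),          # write bandwidth limit
               math.ceil(records_per_sec / 1000),  # write record-rate limit
               math.ceil(read_mb / 2.0))           # read bandwidth limit

# The Twitter example: 140-byte records at 5,000 writes/sec is only
# 0.7 MB/s of bandwidth, but the record-rate limit dominates: 5 shards.
print(shards_needed(140, 5000))
```

Whichever limit is hit first dictates the shard count, which is why small, frequent records are sized by record rate rather than bandwidth.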

3) Monitor the operation in CloudWatch afterwards

Customer case: "Cartoon Farm"

    • Simply call the Put API to ingest data dynamically
    • Each shard can ingest 1 MB of data per second (up to 1,000 TPS)
    • When players surge, the number of shards can be expanded dynamically without stopping the stream
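Expanding shards without stopping the stream can be done with the UpdateShardCount API; a single call can at most double or halve the count, so larger changes are stepped. A sketch, with stream names and counts as illustrative inputs:

```python
def next_target(current, desired):
    """UpdateShardCount allows at most a 2x change per call,
    so step toward the desired count in bounded moves."""
    if desired > current:
        return min(desired, current * 2)
    return max(desired, (current + 1) // 2)

def scale_stream(stream_name, desired_shards, current_shards):
    """Reshard online; producers keep writing while the operation runs."""
    import boto3  # AWS SDK for Python
    kinesis = boto3.client("kinesis")
    kinesis.update_shard_count(
        StreamName=stream_name,
        TargetShardCount=next_target(current_shards, desired_shards),
        ScalingType="UNIFORM_SCALING",
    )
```

In practice the scaling trigger would come from CloudWatch metrics on incoming bytes or records, as the summary pipeline suggests.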

6.1. Putting data into a Kinesis data stream
    • The PutRecord API adds data to an Amazon Kinesis data stream
    • Specify the name of the data stream and a partition key
    • Partition keys are used to assign data records to the different shards of the stream
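A minimal producer sketch with boto3; the event shape and the choice of user ID as partition key are illustrative assumptions:

```python
import json

def make_record(partition_key, event):
    """Serialize one event; records with the same partition key
    hash to the same shard, which preserves their relative order."""
    return {"Data": json.dumps(event).encode("utf-8"),
            "PartitionKey": partition_key}

def put_event(stream_name, user_id, event):
    """PutRecord call: stream name plus partition key, as described above."""
    import boto3  # AWS SDK for Python
    kinesis = boto3.client("kinesis")
    rec = make_record(user_id, event)
    return kinesis.put_record(StreamName=stream_name, **rec)
```

Keying by user ID keeps each user's clickstream ordered within a shard while spreading users across shards.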

6.2. Real-time data stream processing
    • Distributed processing across multiple shards
    • Fault tolerance
    • Real-time dynamic scaling of workers
    • Focus on the data-processing logic

6.3. Processing data from Amazon Kinesis data streams

Amazon Kinesis applications (workers) can be developed by users themselves:

    • Consumers read and process data from the data stream
    • The Kinesis Client Library (KCL) can be used to build applications; it handles the tedious tasks of distributed stream processing
    • Auto Scaling groups enable real-time dynamic scaling of the workers
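The KCL itself is a Java library; purely for illustration, here is a bare-bones consumer loop over one shard using the raw API, which is roughly the work the KCL automates (on top of checkpointing, load balancing, and failover):

```python
import json

def decode(data):
    """Kinesis hands records back as raw bytes; decode the producer's JSON."""
    return json.loads(data.decode("utf-8"))

def read_shard(stream_name, shard_id, handle):
    """Poll one shard from its oldest retained record until it is closed."""
    import boto3  # AWS SDK for Python
    kinesis = boto3.client("kinesis")
    it = kinesis.get_shard_iterator(
        StreamName=stream_name, ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON")["ShardIterator"]
    while it:
        resp = kinesis.get_records(ShardIterator=it, Limit=100)
        for record in resp["Records"]:
            handle(decode(record["Data"]))
        it = resp.get("NextShardIterator")
```

A real worker fleet would run one such loop per shard and checkpoint progress (the KCL uses DynamoDB for this), which is why the document recommends the KCL over the raw API.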

6.4. Amazon Kinesis vs. Storm
    • Storm
      • Deploy collection tools, such as Flume
      • Deploy data access tools, such as Kafka
      • Deploy real-time analytics tools, such as Storm
    • Kinesis
      • Collection, access, and analytics tooling configured automatically
      • Automatic scaling and fault tolerance
      • Integrated with other AWS services, such as S3, Redshift, and DynamoDB
6.5. Real-time data stream processing & mass data storage case
    • Supercell writes the live stream of user screen taps into Kinesis
    • A worker application is responsible for processing this data
    • Aggregated, preprocessed data is written to S3
    • Real-time trend tables are produced (e.g. number of players, use of virtual props)
    • Data can be archived to Glacier
    • Hadoop does the data mining (EMR reads the data from S3)
    • The Hadoop-processed data is loaded into Redshift for BI analysis
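The worker's pre-aggregation step in that pipeline might look like this sketch; the event fields and the output layout are our assumptions:

```python
import json
from collections import Counter

def aggregate(events):
    """Roll raw tap events up into per-player, per-minute counts,
    so the objects landed in S3 stay small for the EMR jobs downstream."""
    counts = Counter()
    for e in events:
        counts[(e["player"], e["minute"])] += 1
    return counts

def flush_to_s3(counts, bucket, key):
    """Write the aggregate to S3 as newline-delimited JSON, one row per key."""
    import boto3  # AWS SDK for Python
    rows = [{"player": p, "minute": m, "taps": n}
            for (p, m), n in sorted(counts.items())]
    body = "\n".join(json.dumps(r) for r in rows).encode("utf-8")
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body)
```

Aggregating in the worker before landing the data is what keeps the S3 → EMR → Redshift stages cheap relative to storing every raw tap.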

6.6. Common CDP architectures for Kinesis on AWS

#1 Clickstream analytics

#2 Payments

7. Summary
    • Collects and processes data in real time
    • Easy to use
      • Easily build applications with Java, Python, and the KCL
      • Integrates with S3, Redshift, DynamoDB, and other services
    • Parallel processing
      • Aggregated data is sent to S3 object storage
      • Analyze logs in real time and trigger alerts when exceptions occur
      • Analyze website clickstreams in real time
    • Elastic
      • Dynamically adjust the throughput of Kinesis data streams
    • Reliable
      • Data is synchronously replicated across three AZs and retained for 24 hours, preventing data loss after an application failure

