[Summary] Amazon Kinesis Real-Time Data Analytics Best Practices

Source: Internet
Author: User
Tags: dynamodb

1. Churyang's Summary
    • AWS services are all built on SOA architectures and can be called on demand
    • For real-time streaming of big data, AWS offers both a self-managed scenario and a fully hosted scenario
      • The self-managed scenario is deploying Flume (collection), Kafka (data buffering), and Storm (stream processing) on EC2
      • The fully hosted scenario is Kinesis
    • Using Kinesis still requires the user to ingest data through the API from data sources such as mobile phones, website clicks, IoT devices, and sensors
    • Users can write Kinesis workers to implement custom data-processing logic (extensibility)
    • After Kinesis processes the data, AWS recommends storing it in S3 or Redshift for subsequent use
    • The typical Kinesis pipeline is: front-end data sources → Kinesis stream processing → S3 for intermediate storage → EMR for data processing → Redshift for BI analysis. CloudWatch monitors the whole pipeline, and Auto Scaling can be enabled to scale processing capacity elastically

    • Kinesis real-time data stream application scenarios
      • Advertising platforms: a user's behavior on the internet can influence ad content in real time, so the next time the user refreshes the page, new ads are served
      • E-commerce: each favorite, click, and purchase can be fed into the user's personal model quickly, immediately refining product recommendations
      • Social networks: changes in a user's social graph and posting behavior can be quickly reflected in friend recommendations and hot-topic alerts
2. Overview

2.1. AWS cloud-based big data services
    • Collection: real-time data stream collection and processing (Kinesis)
    • Storage: large-scale storage
      • DynamoDB
      • S3
      • Glacier
    • Processing: large-cluster parallel computing
      • EMR
      • EC2
      • Redshift – MPP database
      • Data Pipeline – ETL tool

2.2. AWS big data customers

These include pharmaceutical companies, internet companies, and large enterprises.

3. Big data analysis and processing

3.1. Challenges of large-scale processing

The life cycle of big data: collection → storage → analytics → insights

Success story: Supercell, a mobile gaming company
- Collection: real-time data acquisition with Kinesis
- Storage: 4 TB/day → S3
- Long-term archive: Glacier
- Analytics: data mining with Hadoop

3.2. Real-time data stream processing use cases
    • Advertising platforms: a user's behavior on the internet can influence ad content in real time, so the next time the user refreshes the page, new ads are served
    • E-commerce: each favorite, click, and purchase can be fed into the user's personal model quickly, immediately refining product recommendations
    • Social networks: changes in a user's social graph and posting behavior can be quickly reflected in friend recommendations and hot-topic alerts
4. A typical real-time data stream processing architecture and workflow

1) Data collection: collects and processes data from each node in real time, e.g. with Flume (from Cloudera)
2) Data access: because collection speed and processing speed are not necessarily synchronized, a message middleware is added as a buffer, e.g. Apache Kafka (originally from LinkedIn)
3) Stream computation: analyzes the collected data in real time, e.g. with Apache Storm (originally from Twitter)

5. Processing on AWS (simple mode)

1) Data collection: build collectors on EC2 instances (Kafka, Fluentd, Scribe, Flume, etc.)

2) Data loading: deposit the data into S3
Local disks are not recommended, because capacity scalability is not guaranteed and durability is difficult to ensure
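As a sketch of that load step, a collector on EC2 might batch records and land them in S3 as time-partitioned objects. The bucket and key layout below are illustrative assumptions, not from the source:

```python
import json
from datetime import datetime

def batch_key(prefix, ts):
    """Time-partitioned object key, e.g. logs/2015/06/01/10.json;
    an hourly layout that downstream EMR jobs can read slice by slice."""
    return f"{prefix}/{ts:%Y/%m/%d/%H}.json"

def upload_batch(records, bucket, key):
    """Land one batch of collected records in S3 as newline-delimited JSON."""
    import boto3  # AWS SDK for Python
    body = "\n".join(json.dumps(r) for r in records).encode("utf-8")
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body)
    return len(body)
```

Writing to S3 instead of local disk gives the durability and capacity guarantees the text calls out.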

6. Processing on AWS (Kinesis mode)

Real-time data processing with Amazon Kinesis:

    • Real-time data collection, ingestion, and transmission
    • Processing of real-time dynamic data streams
    • Parallel reads and writes
    • Output of data to different storage destinations

The architecture model for Amazon Kinesis is as follows.

Operation flow:

1) Create a data stream and divide it into shards
- Shard: the shard is the basic throughput unit of a Kinesis data stream
- One shard provides 1 MB/sec of input (write) capacity at up to 1,000 records/sec, and 2 MB/sec of output (read) capacity at up to 5 read transactions/sec

2) Set the size of a single record (for example, 140 bytes per record for Twitter) and the number of writes per second (for example, 5,000 records/sec)

After the number of shards, the record size, and the write rate are specified, the required throughput is calculated automatically
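Those per-shard limits make the shard-count calculation mechanical. A minimal sketch of the arithmetic (the function name is ours):

```python
import math

def shards_needed(record_bytes, records_per_sec, read_fanout=1):
    """Size a stream from the per-shard limits quoted above:
    1 MB/s and 1,000 records/s on input, 2 MB/s on output."""
    write_mb = record_bytes * records_per_sec / 1_000_000
    read_mb = write_mb * read_fanout  # each consumer re-reads the full stream
    return max(math.ceil(write_mb / 1.0),          # write bandwidth limit
               math.ceil(records_per_sec / 1000),  # write record-rate limit
               math.ceil(read_mb / 2.0))           # read bandwidth limit

# The Twitter example: 140-byte records at 5,000 writes/sec is only
# 0.7 MB/s of bandwidth, but the record-rate limit dominates: 5 shards.
print(shards_needed(140, 5000))
```

Whichever limit is hit first dictates the shard count, which is why small, frequent records are sized by record rate rather than bandwidth.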

3) Monitor the operation in CloudWatch afterwards

Customer case: "Cartoon Farm"

    • Simply call the Put API to ingest data dynamically
    • Each shard can ingest 1 MB of data per second (up to 1,000 TPS)
    • When players surge, the number of shards can be expanded dynamically without stopping the stream
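Expanding shards without stopping the stream can be done with the UpdateShardCount API; a single call can at most double or halve the count, so larger changes are stepped. A sketch, with stream names and counts as illustrative inputs:

```python
def next_target(current, desired):
    """UpdateShardCount allows at most a 2x change per call,
    so step toward the desired count in bounded moves."""
    if desired > current:
        return min(desired, current * 2)
    return max(desired, (current + 1) // 2)

def scale_stream(stream_name, desired_shards, current_shards):
    """Reshard online; producers keep writing while the operation runs."""
    import boto3  # AWS SDK for Python
    kinesis = boto3.client("kinesis")
    kinesis.update_shard_count(
        StreamName=stream_name,
        TargetShardCount=next_target(current_shards, desired_shards),
        ScalingType="UNIFORM_SCALING",
    )
```

In practice the scaling trigger would come from CloudWatch metrics on incoming bytes or records, as the summary pipeline suggests.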

6.1. Putting data into a Kinesis data stream
    • The PutRecord API adds data to an Amazon Kinesis data stream
    • Specify the name of the data stream and a partition key
    • Partition keys are used to assign data records to the different shards of the stream
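A minimal producer sketch with boto3; the event shape and the choice of user ID as partition key are illustrative assumptions:

```python
import json

def make_record(partition_key, event):
    """Serialize one event; records with the same partition key
    hash to the same shard, which preserves their relative order."""
    return {"Data": json.dumps(event).encode("utf-8"),
            "PartitionKey": partition_key}

def put_event(stream_name, user_id, event):
    """PutRecord call: stream name plus partition key, as described above."""
    import boto3  # AWS SDK for Python
    kinesis = boto3.client("kinesis")
    rec = make_record(user_id, event)
    return kinesis.put_record(StreamName=stream_name, **rec)
```

Keying by user ID keeps each user's clickstream ordered within a shard while spreading users across shards.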

6.2. Real-time data stream processing
    • Distributed processing across multiple shards
    • Fault tolerance
    • Real-time dynamic scaling of workers
    • Focus on the data-processing logic

6.3. Processing data from Amazon Kinesis data streams

Amazon Kinesis applications (workers) can be developed by users themselves:

    • Consumers read and process data from the data stream
    • The Kinesis Client Library (KCL) can be used to build applications; it handles the tedious tasks of distributed stream processing
    • Auto Scaling groups enable real-time dynamic scaling of the workers
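The KCL itself is a Java library; purely for illustration, here is a bare-bones consumer loop over one shard using the raw API, which is roughly the work the KCL automates (on top of checkpointing, load balancing, and failover):

```python
import json

def decode(data):
    """Kinesis hands records back as raw bytes; decode the producer's JSON."""
    return json.loads(data.decode("utf-8"))

def read_shard(stream_name, shard_id, handle):
    """Poll one shard from its oldest retained record until it is closed."""
    import boto3  # AWS SDK for Python
    kinesis = boto3.client("kinesis")
    it = kinesis.get_shard_iterator(
        StreamName=stream_name, ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON")["ShardIterator"]
    while it:
        resp = kinesis.get_records(ShardIterator=it, Limit=100)
        for record in resp["Records"]:
            handle(decode(record["Data"]))
        it = resp.get("NextShardIterator")
```

A real worker fleet would run one such loop per shard and checkpoint progress (the KCL uses DynamoDB for this), which is why the document recommends the KCL over the raw API.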

6.4. Amazon Kinesis vs. Storm
    • Storm
      • Deploy collection tools, such as Flume
      • Deploy data access tools, such as Kafka
      • Deploy real-time analytics tools, such as Storm
    • Kinesis
      • Collection, access, and analytics tooling configured automatically
      • Automatic scaling and fault tolerance
      • Integrated with other AWS services, such as S3, Redshift, and DynamoDB
6.5. Real-time data stream processing & mass data storage case
    • Supercell writes the live stream of user screen taps into Kinesis
    • A worker application is responsible for processing this data
    • Aggregated, preprocessed data is written to S3
    • Real-time trend tables are produced (e.g. number of players, use of virtual props)
    • Data can be archived to Glacier
    • Hadoop does the data mining (EMR reads the data from S3)
    • The Hadoop-processed data is loaded into Redshift for BI analysis
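The worker's pre-aggregation step in that pipeline might look like this sketch; the event fields and the output layout are our assumptions:

```python
import json
from collections import Counter

def aggregate(events):
    """Roll raw tap events up into per-player, per-minute counts,
    so the objects landed in S3 stay small for the EMR jobs downstream."""
    counts = Counter()
    for e in events:
        counts[(e["player"], e["minute"])] += 1
    return counts

def flush_to_s3(counts, bucket, key):
    """Write the aggregate to S3 as newline-delimited JSON, one row per key."""
    import boto3  # AWS SDK for Python
    rows = [{"player": p, "minute": m, "taps": n}
            for (p, m), n in sorted(counts.items())]
    body = "\n".join(json.dumps(r) for r in rows).encode("utf-8")
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body)
```

Aggregating in the worker before landing the data is what keeps the S3 → EMR → Redshift stages cheap relative to storing every raw tap.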

6.6. Common CDP architectures for Kinesis on AWS

#1 Clickstream analytics

#2 Payments

7. Summary
    • Collects and processes data in real time
    • Easy to use
      • Easily build applications with Java, Python, and the KCL
      • Integrates with S3, Redshift, DynamoDB, and other services
    • Parallel processing
      • Aggregated data is sent to S3 object storage
      • Analyze logs in real time and trigger alerts when exceptions occur
      • Analyze website clickstreams in real time
    • Elastic
      • Dynamically adjust the throughput of Kinesis data streams
    • Reliable
      • Data is synchronously replicated across three AZs and retained for 24 hours, preventing data loss after an application failure

