1. Churyang's Summary
- AWS services are built on SOA-style architectures and can be invoked on demand
- For real-time streaming of big data, AWS offers both a legacy (self-managed) option and a fully managed option
- The legacy option deploys Flume (collection), Kafka (buffering), and Storm (stream processing) on EC2
- The fully managed option is Kinesis
- Using Kinesis still requires the user to connect data sources (mobile apps, website clicks, IoT devices, sensors, etc.) through its API
- Users can write their own Kinesis workers to implement custom data-processing logic (extensibility)
- After Kinesis processes the data, AWS recommends storing it in S3 or Redshift for subsequent use
- Typical Kinesis usage: front-end data sources → Kinesis stream processing → S3 for temporary storage → EMR for data processing → Redshift for BI analysis. CloudWatch monitors the whole pipeline, and Auto Scaling can be enabled to scale processing capacity elastically
- Application scenarios for Kinesis real-time data streams
- Advertising platforms: a user's behavior on the internet can influence ad content in real time, so the next time the user refreshes the page, fresh ads are served
- E-commerce: every favorite, click, and purchase can quickly feed the user's personal model, immediately correcting product recommendations
- Social networks: changes to a user's social graph and posting behavior are quickly reflected in friend recommendations and trending-topic alerts
2. Overview
2.1. The full range of AWS cloud big data services
- Collection: real-time data stream collection and processing (Kinesis)
- Storage: large-scale storage
- Processing: large-scale cluster parallel computing
- EMR
- EC2
- Redshift (MPP database)
- Data Pipeline (ETL tool)
2.2. AWS big data customers
These include pharmaceutical companies, internet companies, and large enterprises
3. Big data analysis and processing
3.1. The challenges of large-scale processing
The big data life cycle: collection → storage → analytics → insights
Success story: Supercell, a mobile gaming company
- Collection: real-time data collection with Kinesis
- Storage: 4 TB/day → S3
- Long-term archiving in Glacier
- Analytics: data mining with Hadoop
3.2. Real-time data stream processing use cases
- Advertising platforms: a user's behavior on the internet can influence ad content in real time, so the next time the user refreshes the page, fresh ads are served
- E-commerce: every favorite, click, and purchase can quickly feed the user's personal model, immediately correcting product recommendations
- Social networks: changes to a user's social graph and posting behavior are quickly reflected in friend recommendations and trending-topic alerts
4. A typical real-time dynamic data stream processing architecture and workflow
1) Data collection: collects and processes data from every node in real time, e.g. with Flume (from Cloudera)
2) Data access: collection speed and processing speed are not necessarily synchronized, so a message middleware is inserted as a buffer, e.g. Apache Kafka (originally from LinkedIn); a minimal sketch of this buffering step follows the list
3) Stream computing: real-time analysis of the collected data, e.g. with Apache Storm (originally from Twitter)
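To make the buffering role of step 2 concrete, here is a minimal sketch using the third-party kafka-python package; the broker address and the "clickstream" topic are illustrative assumptions, not part of the original architecture.

```python
# A buffering sketch for step 2, using the third-party kafka-python package
# (pip install kafka-python). Broker address and topic name are assumptions.
import json
from kafka import KafkaProducer, KafkaConsumer

# The collector (e.g. Flume) publishes each captured event to Kafka.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("clickstream", {"user": "u-123", "action": "click"})
producer.flush()

# The stream processor (e.g. Storm) consumes at its own pace, which is what
# decouples collection speed from processing speed.
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)  # hand off to the real-time analysis logic
```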
5. Processing on AWS (Simple Mode)
1) Data collection: build collectors on EC2 instances (Kafka, Fluentd, Scribe, Flume, etc.)
2) Data loading: deposit the data into S3 (a minimal sketch follows)
Local disks are not recommended, because capacity scalability is not guaranteed and durability is hard to ensure
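A minimal sketch of step 2 with boto3; the bucket name and key layout are illustrative assumptions.

```python
# Upload a collected log batch from the EC2 collector into S3 with boto3
# (pip install boto3). Bucket and key names are illustrative assumptions.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="/var/log/collector/batch-0001.log",  # file written by the collector
    Bucket="my-bigdata-landing-bucket",
    Key="raw/2014/06/01/batch-0001.log",
)
```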
6. Processing on AWS (Kinesis mode)
Real-time data processing with Amazon Kinesis
- Real-time data collection, ingestion, and transmission
- Processing of real-time dynamic data streams
- Parallel writes and reads
- Output of data to different storage destinations
(Figure: the Amazon Kinesis architecture model)
Operation flow
1) Create a data stream (Stream) and configure its shards
- Shard: the shard is the basic throughput unit of a Kinesis data stream
- One shard provides 1 MB/sec of write capacity (up to 1,000 records/sec) and 2 MB/sec of read capacity (up to 5 read transactions/sec)
2) Specify the size of a single record (e.g., 140 bytes per Twitter record) and the number of writes per second (e.g., 5,000 records/sec, which requires 5 shards at 1,000 records/sec each)
Once the record size, write rate, and shard count are specified, the resulting throughput is calculated automatically (see the sketch after this flow)
3) Monitor the stream's operation in CloudWatch afterwards
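A sketch of steps 1 and 2 with boto3, using the worked Twitter-style numbers above; the stream name is hypothetical. It derives the shard count from the per-shard write limits (1 MB/sec or 1,000 records/sec, whichever is reached first) and then creates the stream.

```python
# Derive the shard count from the worked numbers above, then create the
# stream. "twitter-firehose" is a hypothetical stream name.
import math
import boto3

record_bytes = 140        # e.g. one Twitter record
records_per_sec = 5000    # target write rate

# A shard accepts 1 MB/sec OR 1,000 records/sec, whichever limit is hit first.
shards_by_bytes = math.ceil(record_bytes * records_per_sec / 1000000)  # 0.7 MB/s -> 1
shards_by_records = math.ceil(records_per_sec / 1000)                  # 5,000 rec/s -> 5
shard_count = max(shards_by_bytes, shards_by_records)                  # 5 shards

kinesis = boto3.client("kinesis")
kinesis.create_stream(StreamName="twitter-firehose", ShardCount=shard_count)
```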
Customer case: "Cartoon Farm"
- Data is ingested dynamically with a simple call to the Put command
- Each shard can ingest 1 MB of data per second (up to 1,000 TPS)
- When players surge unexpectedly, the number of shards can be expanded dynamically without stopping the stream (a scaling sketch follows)
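A sketch of the non-disruptive scaling described above, assuming boto3 and a hypothetical stream name. UpdateShardCount is a later one-call API; at the time of the original talk the same effect was achieved by splitting shards one by one (SplitShard).

```python
# Non-disruptive scaling with boto3; stream name and target count are
# hypothetical examples.
import boto3

kinesis = boto3.client("kinesis")
kinesis.update_shard_count(
    StreamName="game-events",
    TargetShardCount=10,            # e.g. double from 5 during a player surge
    ScalingType="UNIFORM_SCALING",  # split/merge shards evenly
)
```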
6.1. Putting data into a Kinesis data stream
- The PutRecord API adds data records to an Amazon Kinesis data stream
- The caller specifies the name of the data stream and a partition key
- Partition keys are used to distribute data records across the stream's shards (see the sketch below)
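A minimal PutRecord sketch with boto3; the stream name and payload are illustrative assumptions.

```python
# Write one record to a Kinesis stream with boto3. The partition key
# determines which shard the record lands on.
import json
import boto3

kinesis = boto3.client("kinesis")
response = kinesis.put_record(
    StreamName="clickstream",  # hypothetical stream name
    Data=json.dumps({"user": "u-123", "page": "/home"}).encode("utf-8"),
    PartitionKey="u-123",      # records with the same key land on the same shard
)
print(response["ShardId"], response["SequenceNumber"])
```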
6.2. Real-time data stream processing
- Distributed processing across multiple shards
- Fault tolerance
- Real-time dynamic scaling of workers
- Lets developers focus on the data-processing logic
6.3. Processing data from Amazon Kinesis data streams
Users develop their own Amazon Kinesis applications (workers), which
- Act as consumers that read and process record data from the stream (see the sketch after this list)
- Use the Kinesis Client Library (KCL) to build applications that take over the tedious tasks of distributed stream processing
- Scale dynamically in real time with an Auto Scaling group
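The KCL itself is a Java library (a Python binding exists via the MultiLangDaemon), so as a stand-in here is a simplified single-shard consumer built on the low-level boto3 API. It shows the work the KCL automates (shard iteration and batch fetching) but omits what the KCL adds: checkpointing, failover, and load balancing of shards across workers. The stream name and the process() handler are hypothetical.

```python
# Simplified single-shard consumer using the low-level boto3 API; real
# workers should prefer the KCL, which handles checkpointing and failover.
import time
import boto3

def process(data: bytes) -> None:
    print(data)  # placeholder for the custom data-processing logic

kinesis = boto3.client("kinesis")
stream = "clickstream"  # hypothetical stream name

shard_id = kinesis.describe_stream(StreamName=stream)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=stream, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
)["ShardIterator"]

while iterator:
    batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in batch["Records"]:
        process(record["Data"])
    iterator = batch.get("NextShardIterator")
    time.sleep(1)  # respect the 5 read transactions/sec/shard limit
```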
6.4. Amazon Kinesis vs. Storm
- Storm (self-managed stack)
- Deploy collection tools such as Flume
- Deploy data-access tools such as Kafka
- Deploy real-time analytics tools such as Storm
- Kinesis
- Collection, access, and analytics tooling is provisioned automatically
- Automatic scaling and fault tolerance
- Integrates with other AWS services such as S3, Redshift, and DynamoDB
6.5. Case: real-time data stream processing with mass data storage
- Supercell writes the live stream of users' screen taps into Kinesis
- A worker application is responsible for processing this data
- Aggregated, preprocessed data is written to S3
- Real-time trend tables are produced (e.g., number of players, usage of virtual items)
- Glacier archives the data long-term
- Hadoop does data mining (EMR reads the data from S3)
- Data processed by Hadoop is loaded into Redshift for BI analysis
6.6. Common CDP architectures for Kinesis on AWS
#1 Clickstream analytics
#2 Payments
7. Summary
- Collect and process data in real time
Easy to use
- Easily build applications with Java, Python, and the KCL
- Integrates with S3, Redshift, DynamoDB, and other services
Parallel processing
- Aggregated data is sent to S3 storage objects
- Analyze logs in real time and trigger alerts when anomalies occur
- Analyze website clickstreams in real time
Elastic scaling
- Dynamically adjust the throughput of Kinesis data streams
Reliable
- Data is synchronously replicated across three AZs and retained for 24 hours, so an application failure does not lose data
"Summarize" Amazon kinesis real-time data analytics best practices sharing