How do we collect data with Apache Flume? Before getting to that, we should be clear about what Apache Flume is.
I. What Is Apache Flume
Apache Flume is a high-performance data collection system. It began as a near-real-time log collection tool and is now widely used to collect any kind of streaming event data, aggregating data from many sources into HDFS.
Originally developed by Cloudera, Flume was contributed to the Apache Software Foundation in 2011 and became a top-level Apache project in 2012, at which point Flume OG was superseded by Flume NG.
Flume's key benefits are scalability, extensibility, and reliability.
II. Flume Architecture
Source: receives events generated by an external system
Sink: sends events to the specified destination
Channel: buffers events from the source until a sink takes them away
Agent: an independent Flume process containing the source, channel, and sink components
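The four components above are wired together in a simple properties file. Below is a minimal sketch using Flume's built-in netcat source and logger sink; the names `a1`, `r1`, `c1`, and `k1` are arbitrary identifiers chosen for illustration:

```properties
# a1 is the agent name; declare its source, channel, and sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen for newline-separated events on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer events in memory (see the reliability section below)
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Sink: log events to the console
a1.sinks.k1.type = logger

# Wire source -> channel -> sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Such a file would typically be started with something like `flume-ng agent --conf conf --conf-file example.conf --name a1`.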
III. Flume Design Objective: Reliability
Channels provide Flume's reliability guarantee. How? The default is the memory channel, which stores all data in RAM, so if the channel's node loses power, the data is lost. To solve this problem there is another channel type: the disk-based (file) channel, which uses durable on-disk queues so that data is not lost when a power outage occurs.
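Switching from the volatile memory channel to the durable file channel is a configuration change. A sketch, assuming an agent named `a1` and illustrative directory paths:

```properties
# File channel: events are persisted to disk and survive a crash or power loss
a1.channels.c1.type = file
# Where the channel keeps its checkpoint metadata (path is illustrative)
a1.channels.c1.checkpointDir = /var/flume/checkpoint
# One or more directories for the event data itself
a1.channels.c1.dataDirs = /var/flume/data
```

The trade-off is throughput: the file channel is slower than the memory channel, but it is the usual choice when losing events is unacceptable.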
In addition, data transfer between an agent and its channel is transactional: if transmission to a downstream agent fails, the transaction is rolled back and retried. The same task can also be configured with multiple agents; for example, if two agents share a data collection job and one fails, the upstream agent can fail over to the other.
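One way to express this failover in configuration is a sink group with Flume's failover sink processor. In this sketch, a hypothetical agent `a1` has two Avro sinks pointing at two downstream collector agents; the hostnames and priorities are illustrative:

```properties
# Two sinks, each targeting a different downstream collector agent
a1.sinks = k1 k2
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = collector1.example.com
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = collector2.example.com
a1.sinks.k2.port = 4141

# Group the sinks and use the failover processor:
# events go to the highest-priority live sink; if it fails,
# traffic switches to the next one
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
```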
IV. Flume Design Objective: Scalability
When the volume of collected data grows, we can add more system resources to increase performance roughly linearly. Flume scales horizontally: as load increases, more machines can simply be added to the configuration.
V. Flume Design Objective: Extensibility
Extensibility is the ability to add new functionality to the system. Flume can be extended by adding sources and sinks to an existing storage layer or data platform. Common sources include data from files, syslog, and the standard output of any Linux process; common sinks include the local file system and HDFS. Developers can also write their own sources and sinks.
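To illustrate those common sources and sinks together, here is a sketch that tails a log file with the exec source and writes to HDFS with the HDFS sink (the file path, HDFS URL, and agent/component names are illustrative assumptions):

```properties
# Source: run a command and treat each output line as an event
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log

# Sink: write events into date-partitioned HDFS directories
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
# Write plain text rather than the default SequenceFile format
a1.sinks.k1.hdfs.fileType = DataStream
# Use the agent's local clock to resolve the %Y-%m-%d escapes
a1.sinks.k1.hdfs.useLocalTimeStamp = true

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```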
VI. Common Flume Data Sources
[Figure: common Flume data sources]
VII. Large-Scale Deployment Example
Flume uses agents to collect data; an agent can receive data from many sources, including other agents. Large-scale deployments use multiple tiers for scalability and reliability, and Flume supports inspecting and modifying data in transit.
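Tiers are chained by pointing an Avro sink on one agent at an Avro source on the next. A two-tier sketch, with agent and host names chosen for illustration:

```properties
# --- First tier: edge agent forwards everything to a collector ---
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = collector.example.com
agent1.sinks.k1.port = 4141
agent1.sinks.k1.channel = c1

# --- Second tier: collector agent receives from many edge agents ---
collector.sources.r1.type = avro
collector.sources.r1.bind = 0.0.0.0
collector.sources.r1.port = 4141
collector.sources.r1.channels = c1
```

Many edge agents can fan in to one collector tier, which then aggregates and writes to HDFS; this is the pattern behind the deployment diagram below.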
[Figure: multi-tier Flume deployment]
These are some of the basics of Apache Flume; more will be shared in later posts.
This article is from the "11872756" blog; please keep this source when sharing: http://11882756.blog.51cto.com/11872756/1892824
Crawling Data using Apache Flume (1)