Collecting Data with Apache Flume (1)

Source: Internet
Author: User


How do we collect data with Apache Flume? Before we get to that, we need to be clear about what Apache Flume actually is.

I. What Is Apache Flume

Apache Flume is a high-performance data collection system. It began as a near-real-time log collection tool, and it is now widely used to collect any kind of streaming event data, aggregating data from many sources into HDFS.

Originally developed by Cloudera, Flume was contributed to the Apache Foundation in 2011 and became a top-level Apache project in 2012, at which point Flume OG was superseded by Flume NG.

Flume's main benefits are horizontal scalability, extensibility, and reliability.

II. Flume System Structure

Source: receives events generated by external systems

Sink: sends events to a specified destination

Channel: buffers events from a source until a sink takes them away

Agent: a separate Flume process that contains the source, channel, and sink components (a minimal configuration is sketched below)
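To make these roles concrete, here is a minimal sketch of a single-agent configuration in Flume's standard properties format. The agent name a1, the netcat source on port 44444, and the logger sink are illustrative choices, not anything prescribed by this article.

    # example.conf: one agent (a1) with one source, one channel, one sink

    # Name the components of agent a1
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # Source: listen for newline-separated text events on a TCP port
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444

    # Channel: buffer events in memory between source and sink
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100

    # Sink: log events at INFO level (useful for testing)
    a1.sinks.k1.type = logger

    # Wire the source and sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1

An agent with this configuration could then be started with something like: bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console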

III. Flume Design Objective: Reliability

Channels provide Flume's reliability guarantee, so how is the data actually protected? The default is the memory channel, which keeps all buffered data in RAM; if the node hosting the channel loses power, that data is lost. To solve this problem there is a second kind of channel, the disk-based file channel, which uses disk-backed queues so that data is not lost when a power outage occurs.
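For durability, the memory channel from the earlier sketch can be swapped for a file channel. A minimal sketch, where the checkpoint and data directories are hypothetical paths:

    # File channel: buffered events survive process restarts and power loss
    a1.channels.c1.type = file
    a1.channels.c1.checkpointDir = /var/flume/checkpoint
    a1.channels.c1.dataDirs = /var/flume/data

The trade-off is throughput: because every event is written to disk, the file channel is slower than the memory channel.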

In addition, data transfer between agents and channels is transactional: if delivery to a downstream agent fails, the transaction is rolled back and retried. The same task can also be configured with multiple agents; for example, if two agents are available to complete a collection job and one of them fails, the upstream agent can switch over to the other.
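This failover behavior is configured with a sink group. A sketch, assuming two sinks k1 and k2 have already been defined to point at the two downstream agents; the priorities and penalty value here are illustrative:

    # Group two sinks; route to the higher-priority one, fail over on error
    a1.sinkgroups = g1
    a1.sinkgroups.g1.sinks = k1 k2
    a1.sinkgroups.g1.processor.type = failover
    a1.sinkgroups.g1.processor.priority.k1 = 10
    a1.sinkgroups.g1.processor.priority.k2 = 5
    # Maximum backoff (ms) before a failed sink is retried
    a1.sinkgroups.g1.processor.maxpenalty = 10000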

IV. Flume Design Objective: Scalability

When the volume of collected data is very large, we can add more system resources to increase performance roughly linearly. Flume scales horizontally: as load and complexity grow, more machines can be added to the configuration.
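One concrete way to spread load across added machines is a load-balancing sink group, which fans events out over several downstream agents. A sketch, again assuming sinks k1 and k2 point at two separate collector machines:

    # Distribute events across multiple downstream agents
    a1.sinkgroups = g1
    a1.sinkgroups.g1.sinks = k1 k2
    a1.sinkgroups.g1.processor.type = load_balance
    a1.sinkgroups.g1.processor.selector = round_robin
    # Temporarily back off from sinks that fail
    a1.sinkgroups.g1.processor.backoff = true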

V. Flume Design Objective: Extensibility

Extensibility is the ability to add new functionality to the system. Flume can be extended by adding sources and sinks for an existing storage tier or data platform. Common sources include data from files, syslog, and the standard output of any Linux process; common sinks include the local file system and HDFS. Developers can also write their own sources and sinks.
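Two of those built-ins combined: an exec source tailing a log file into an HDFS sink. A sketch only; the command, HDFS path, and roll interval are assumptions to adapt:

    # Source: follow a local log file via an external command
    a1.sources.r1.type = exec
    a1.sources.r1.command = tail -F /var/log/app.log

    # Sink: write events into date-partitioned HDFS directories
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
    a1.sinks.k1.hdfs.fileType = DataStream
    # Roll to a new file every 300 seconds
    a1.sinks.k1.hdfs.rollInterval = 300
    # Use the agent's clock to resolve the %Y-%m-%d escapes
    a1.sinks.k1.hdfs.useLocalTimeStamp = true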

VI. Common Flume Data Sources

[Figure: common Flume data sources]

VII. Large-Scale Deployment Example

Flume uses agents to collect data, and an agent can receive data from many sources, including other agents. Large-scale deployments use multiple tiers for scalability and reliability, and Flume supports inspecting and modifying data in transit.
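Tiering is done by chaining agents with Avro sinks and sources, and in-flight inspection and modification is done with interceptors. A sketch of both sides of such a chain; the agent names, hostname, and port are placeholder values:

    # On each web server, a first-tier agent forwards to the collector
    tier1.sinks.k1.type = avro
    tier1.sinks.k1.hostname = collector.example.com
    tier1.sinks.k1.port = 4545

    # On the collector, a second-tier agent receives from all first-tier agents
    collector.sources.r1.type = avro
    collector.sources.r1.bind = 0.0.0.0
    collector.sources.r1.port = 4545

    # Interceptor: stamp each event with its arrival time in transit
    collector.sources.r1.interceptors = i1
    collector.sources.r1.interceptors.i1.type = timestamp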

[Figure: a large-scale, multi-tier Flume deployment]

Those are the basics of Apache Flume; more details will be shared in later posts. Big data is set to be the next big opportunity, and standing firmly in it takes continued study and effort. The public account "Big Data cn" introduces a great deal of big-data knowledge and is a good platform for anyone who wants to understand and learn the field.


This article is from the "11872756" blog; please be sure to keep this source: http://11882756.blog.51cto.com/11872756/1892824
