How do we collect data with Apache Flume? Before getting to that, we should be clear about what Apache Flume is.
I. What Is Apache Flume
Apache Flume is a high-performance data collection system. It began as a near-real-time log collection tool and is now widely used to collect any kind of streaming event data, aggregating data from many sources into HDFS.
Originally developed by Cloudera, Flume was contributed to the Apache Software Foundation in 2011 and became a top-level Apache project in 2012, at which point Flume OG was superseded by Flume NG.
Flume's key benefits are scalability, extensibility, and reliability.
II. Flume Architecture
Source: receives events generated by an external system
Sink: sends events to the specified destination
Channel: buffers events from the source until a sink takes them away
Agent: an independent Flume process containing the source, channel, and sink components
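The four components above are wired together in a simple properties file. Below is a minimal sketch using Flume's built-in netcat source and logger sink; the names `a1`, `r1`, `c1`, and `k1` are arbitrary identifiers chosen for illustration:

```properties
# a1 is the agent name; declare its source, channel, and sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen for newline-separated events on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer events in memory (see the reliability section below)
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Sink: log events to the console
a1.sinks.k1.type = logger

# Wire source -> channel -> sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Such a file would typically be started with something like `flume-ng agent --conf conf --conf-file example.conf --name a1`.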
III. Flume Design Objective: Reliability
Channels provide Flume's reliability guarantee. How? The default is the memory channel, which stores all data in RAM, so if the channel's node loses power, the data is lost. To solve this problem there is another channel type: the disk-based (file) channel, which uses durable on-disk queues so that data is not lost when a power outage occurs.
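Switching from the volatile memory channel to the durable file channel is a configuration change. A sketch, assuming an agent named `a1` and illustrative directory paths:

```properties
# File channel: events are persisted to disk and survive a crash or power loss
a1.channels.c1.type = file
# Where the channel keeps its checkpoint metadata (path is illustrative)
a1.channels.c1.checkpointDir = /var/flume/checkpoint
# One or more directories for the event data itself
a1.channels.c1.dataDirs = /var/flume/data
```

The trade-off is throughput: the file channel is slower than the memory channel, but it is the usual choice when losing events is unacceptable.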
In addition, data transfer between an agent and its channel is transactional: if transmission to a downstream agent fails, the transaction is rolled back and retried. The same task can also be configured with multiple agents; for example, if two agents share a data collection job and one fails, the upstream agent can fail over to the other.
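One way to express this failover in configuration is a sink group with Flume's failover sink processor. In this sketch, a hypothetical agent `a1` has two Avro sinks pointing at two downstream collector agents; the hostnames and priorities are illustrative:

```properties
# Two sinks, each targeting a different downstream collector agent
a1.sinks = k1 k2
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = collector1.example.com
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = collector2.example.com
a1.sinks.k2.port = 4141

# Group the sinks and use the failover processor:
# events go to the highest-priority live sink; if it fails,
# traffic switches to the next one
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
```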
IV. Flume Design Objective: Scalability
When the volume of collected data grows, we can add more system resources to increase performance roughly linearly. Flume scales horizontally: as load increases, more machines can simply be added to the configuration.
V. Flume Design Objective: Extensibility
Extensibility is the ability to add new functionality to the system. Flume can be extended by adding sources and sinks to an existing storage layer or data platform. Common sources include data from files, syslog, and the standard output of any Linux process; common sinks include the local file system and HDFS. Developers can also write their own sources and sinks.
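To illustrate those common sources and sinks together, here is a sketch that tails a log file with the exec source and writes to HDFS with the HDFS sink (the file path, HDFS URL, and agent/component names are illustrative assumptions):

```properties
# Source: run a command and treat each output line as an event
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log

# Sink: write events into date-partitioned HDFS directories
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
# Write plain text rather than the default SequenceFile format
a1.sinks.k1.hdfs.fileType = DataStream
# Use the agent's local clock to resolve the %Y-%m-%d escapes
a1.sinks.k1.hdfs.useLocalTimeStamp = true

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```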
VI. Common Flume Data Sources
[Figure: common Flume data sources]
VII. Large-Scale Deployment Example
Flume uses agents to collect data; an agent can receive data from many sources, including other agents. Large-scale deployments use multiple tiers for scalability and reliability, and Flume supports inspecting and modifying data in transit.
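Tiers are chained by pointing an Avro sink on one agent at an Avro source on the next. A two-tier sketch, with agent and host names chosen for illustration:

```properties
# --- First tier: edge agent forwards everything to a collector ---
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = collector.example.com
agent1.sinks.k1.port = 4141
agent1.sinks.k1.channel = c1

# --- Second tier: collector agent receives from many edge agents ---
collector.sources.r1.type = avro
collector.sources.r1.bind = 0.0.0.0
collector.sources.r1.port = 4141
collector.sources.r1.channels = c1
```

Many edge agents can fan in to one collector tier, which then aggregates and writes to HDFS; this is the pattern behind the deployment diagram below.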
[Figure: multi-tier Flume deployment]
These are some of the basics of Apache Flume; more will be shared in later posts.
This article is from the "11872756" blog; please keep this source when sharing: http://11882756.blog.51cto.com/11872756/1892824
Crawling Data using Apache Flume (1)