Objective
Getting the Summingbird demo to run took me down many detours. Almost no information about it is accessible in China, so it took a long time to get the demo working, and it was a genuinely painful process. If you are interested in studying Summingbird, read on and I will walk through it step by step. As a rough approximation, Summingbird can be understood as Storm + Hadoop.
A quick overview of big data processing
With the arrival of the big data era, large-scale processing has split into two directions: batch processing and real-time processing. The advantage of batch processing is good fault tolerance: because the data is kept in local or distributed storage, it can be processed again whenever needed. The disadvantage is that it is slow, since all the data has to be stored before the batch can run. Real-time processing has the opposite profile. Its advantage is speed, because results are computed as the data arrives. Its disadvantage is poor fault tolerance: data flows into memory and out again, only the useful parts are kept, and the full data set is never written to disk, so earlier data cannot be replayed once it has passed through. Neither batch nor real-time processing alone can meet increasingly diverse needs, so the two have to be combined in a way that keeps the fault tolerance of batch processing and the low latency of real-time processing. That brings us to the subject of this article, Summingbird, which seamlessly integrates batch computing and real-time computing.
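To make the idea of combining the two concrete, here is a minimal sketch in plain Python. It does not use Summingbird's API, and it is not how Summingbird is implemented; the function name `count_words` and the sample lines are made up for illustration. The point it tries to show is that a complete but stale batch result and a fresh but partial real-time result can be merged by simple addition.

```python
from collections import Counter


def count_words(lines):
    """Count the words in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


# Batch layer: the data is stored, so this can be recomputed at any time,
# but the result is only as fresh as the last batch run.
batch_lines = ["storm hadoop", "hadoop"]
batch_view = count_words(batch_lines)

# Real-time layer: covers only the events that arrived after the batch run.
realtime_lines = ["storm summingbird"]
realtime_view = count_words(realtime_lines)

# Word counts are merged by addition, which is associative, so the two
# partial results combine into the same answer a single full pass would give.
merged = batch_view + realtime_view

print(dict(merged))
```

If the real-time layer loses data, the next batch run over the stored data repairs the result, which is where the fault tolerance of the combined approach comes from.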
Second, the environment needed to learn Summingbird
My machine runs Linux. To run Summingbird, the following need to be set up on it:
1. ZooKeeper
2. Kafka
3. memcached
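Before running the demo, it can help to confirm that all three services are actually listening. The sketch below assumes they run on localhost with their usual default ports (2181 for ZooKeeper, 9092 for Kafka, 11211 for memcached); these values are assumptions, not taken from the demo's configuration, and the helper names `is_listening` and `check_services` are hypothetical.

```python
import socket

# Assumed default ports; adjust them if your services are configured differently.
DEFAULT_PORTS = {"zookeeper": 2181, "kafka": 9092, "memcached": 11211}


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


def check_services(host="localhost", ports=DEFAULT_PORTS):
    """Map each service name to whether its port accepts connections."""
    return {name: is_listening(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    for name, up in check_services().items():
        print(f"{name}: {'up' if up else 'DOWN'}")
```

An open port only shows that something is listening there; it does not prove the service is healthy or correctly configured.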
Second, the skills needed to learn Summingbird
1. Some understanding of sbt
2. Familiarity with the Scala language
3. A reasonable familiarity with how Storm and Hadoop work
Third, exploring how to run the demo
Interested readers can search for Summingbird on GitHub to get a general understanding of it. You can of course follow the official GitHub tutorial to run the demo, but it will not produce results here: because of the GFW, the program cannot access the Twitter stream that the official tutorial relies on, so it simply does not run. I tried this at the start and failed every time. I then kept searching on Google and kept asking the project's creators on Twitter, tried again, and failed again. Later I found an example on GitHub that combines Storm and Hadoop. Delighted, I resumed my research and followed it step by step, and in the end that failed too.
The error showed that some jar packages could not be obtained: once again because of the GFW, they could not be fetched from Twitter's Maven repository. Since I had never studied Maven or sbt, I started learning both, not in great depth, just enough to master basic usage and to read sbt and Maven files. After opening the project's sbt file, I found that the repository it depended on was blocked, so I changed it to the Maven repository hosted by OSChina.
Fourth, a successful run at last
The project code is hosted on GitHub. Just follow the steps there and you should get the correct results. Suggestions are very welcome. The next step is to start importing data from a local database for processing.
Fifth, lessons learned
Learning big data involves a very broad range of knowledge, and there is a lot to master, so it takes solid study. I have to say that China's firewall, the GFW, fully lives up to its bad reputation. While it protects the network, it does cause developers some unnecessary trouble. In the end, though, the demo ran successfully.
The GitHub repository is here: https://github.com/leesf/summingbird-hybrid-example-china
Readers are welcome to fork the repository and give it a star.
Demo Run for Summingbird (Storm + Hadoop)