In today's world, data is money. Companies collect as much data as they can and look for the patterns hidden inside it, hoping to turn those patterns into revenue. But collected data is worthless if you never put it to use, or if your analysis never uncovers the gems hidden within it.
When you start building big data solutions with Hadoop, one of the biggest challenges is understanding how to leverage the tools at your disposal and connect them together. The Hadoop ecosystem includes many different open source projects. How do we choose the right tools for the job?
Another data management system
Most data management systems can be divided into three modules: data ingestion, data storage, and data analysis. The flow of information between these modules can be shown in the following figure:
The ingestion system is responsible for connecting data sources to the data's static storage location. The analysis system then processes the data and delivers actionable insights. Recast in relational-database terms, we can substitute more generic labels:
We can also map this basic architecture of ingestion, storage, and processing onto the Hadoop ecosystem, as follows:
Of course, this is not the only possible Hadoop architecture. By introducing other projects from the ecosystem, we can build more complex systems. But this is the most common Hadoop architecture, and it can serve as a starting point for entering the big data world. In the remainder of this article, we'll build an example application that uses Apache Flume, Apache HDFS, Apache Oozie, and Apache Hive to design an end-to-end data-processing pipeline, which we can then use to analyze Twitter data. All the code and instructions needed to implement the system can be downloaded from the Cloudera GitHub repository:
https://github.com/cloudera/cdh-twitter-example
Motivation: Measuring Influence
Social media is popular with marketing teams, and Twitter is an effective tool for gauging public enthusiasm about a product. With Twitter it is easy to engage users and communicate with them directly, and in turn their discussion of the product generates word-of-mouth marketing. Given limited resources, and knowing we cannot talk directly to everyone in the target group, marketing is more efficient when we can identify the most influential, reachable people and focus on them.
To decide who those people are, let's look at how Twitter works. A user, say Joe, follows a set of people and has a set of followers. When Joe posts an update, all of his followers can see it. Joe can also retweet other users' updates: if Joe sees a tweet from Sue and retweets it, all of Joe's followers see Sue's tweet, even if they don't follow Sue. Through retweets, a message reaches not only the original sender's followers but potentially far beyond them. Knowing this, we want to engage the users whose updates are retweeted in the largest volumes. Since Twitter tracks retweet counts for all tweets, we can find the users we're looking for by analyzing Twitter data.
Now we know the question we want to ask: which Twitter users are retweeted the most? Who is most influential in our industry?
How do we answer these questions?
We could answer this question with SQL, ordering users by retweet count in descending order to find who drives the most retweets. However, querying Twitter data in a traditional relational database is inconvenient, because the Twitter Streaming API outputs tweets in JSON format, which can be quite complex. In the Hadoop ecosystem, the Hive project provides an interface for querying data in HDFS. Hive's query language closely resembles SQL, but it also makes it easy to model complex types, so we can readily query the kind of data we have. That looks like a good starting point. So how do we get Twitter data into Hive? First, we need to land the Twitter data in HDFS, then tell Hive where the data resides and how to read it.
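As a preview of what those two steps might look like, here is a minimal HiveQL sketch: an external table pointing Hive at JSON tweets already landed in HDFS, and the retweet query itself. It is illustrative only; the SerDe class name, the HDFS path, and the abbreviated column layout are assumptions based on the cdh-twitter-example repository and the Twitter JSON format, and only the fields needed for our question are shown.

    -- Sketch only: assumes a JSON SerDe is on Hive's classpath (the example
    -- repository ships one) and that tweets have been landed in HDFS by Flume.
    CREATE EXTERNAL TABLE tweets (
      id BIGINT,
      created_at STRING,
      text STRING,
      retweet_count INT,
      retweeted_status STRUCT<
        text: STRING,
        `user`: STRUCT<screen_name: STRING, name: STRING>>,
      `user` STRUCT<screen_name: STRING, followers_count: INT>
    )
    ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'  -- assumed class name
    LOCATION '/user/flume/tweets';                        -- assumed HDFS path

    -- Which users are retweeted the most? Take the highest retweet_count
    -- observed per original tweet, then sum those maxima per author.
    SELECT t.retweeted_screen_name,
           SUM(t.retweets) AS total_retweets,
           COUNT(*)        AS tweet_count
    FROM (
      SELECT retweeted_status.`user`.screen_name AS retweeted_screen_name,
             retweeted_status.text               AS tweet_text,
             MAX(retweet_count)                  AS retweets
      FROM tweets
      WHERE retweeted_status.`user`.screen_name IS NOT NULL
      GROUP BY retweeted_status.`user`.screen_name, retweeted_status.text
    ) t
    GROUP BY t.retweeted_screen_name
    ORDER BY total_retweets DESC
    LIMIT 10;

The inner query keeps the highest retweet count seen for each original tweet (the count grows over time as copies arrive), and the outer query sums those per author, ranking users by total retweets.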
To answer the questions above, we need to build a data pipeline. The figure below gives a high-level view of how some CDH components can be pieced together to do this.