Big data has become a defining theme of our time, and enterprises are applying it in ever greater depth. As it grows more popular, many common assumptions about big data deserve to be questioned. The first is the widespread belief that you can simply use Hadoop, and that Hadoop is easy to use.
The problem is that Hadoop is a technology, and big data is not about technology. Big data is about business requirements. In fact, big data should encompass Hadoop, relational databases, and any other technology suited to the task at hand.
For example, it makes sense to do broad, exploratory analysis of a dataset in Hadoop, but relational storage is better suited to running routine analysis on what that exploration has uncovered. Hadoop is good for working at the lowest level of detail in a dataset, while relational databases make more sense for storing transformed and summarized data. The bottom line is that you need to use the right technology for each need.
On the techniques Hadoop offers for combining and processing big data, data expert Anoop noted in another article that, in general, reaching a final result requires processing several datasets and joining them together. Hadoop offers many ways to join multiple datasets. MapReduce provides both map-side and reduce-side joins. These joins are nontrivial and can be very expensive operations. Pig and Hive can also join multiple datasets. Pig provides a replicated join, a merge join, and a skewed join, while Hive provides a map-side join and a full outer join for analyzing data.
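To make the idea of a reduce-side join more concrete, here is a minimal sketch in plain Python. It does not use the Hadoop API; it only simulates the map, shuffle, and reduce phases in a single process. The datasets (users and orders), the function names, and the tagging scheme are all assumptions made for illustration, and the result is an inner join.

```python
from collections import defaultdict

def map_users(record):
    # Tag each user record so the reducer can tell the sources apart.
    user_id, name = record
    yield user_id, ("user", name)

def map_orders(record):
    # Tag each order record with its source as well.
    user_id, amount = record
    yield user_id, ("order", amount)

def shuffle(pairs):
    # Group all values by key, as the framework would between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_join(key, values):
    # Pair every user value with every order value that shares the key.
    names = [v for tag, v in values if tag == "user"]
    amounts = [v for tag, v in values if tag == "order"]
    for name in names:
        for amount in amounts:
            yield key, name, amount

def reduce_side_join(users, orders):
    pairs = []
    for record in users:
        pairs.extend(map_users(record))
    for record in orders:
        pairs.extend(map_orders(record))
    groups = shuffle(pairs)
    joined = []
    for key in sorted(groups):
        joined.extend(reduce_join(key, groups[key]))
    return joined

# Hypothetical sample data: (user_id, name) and (user_id, amount).
users = [(1, "ann"), (2, "bob"), (3, "cy")]
orders = [(1, 30), (2, 5), (1, 12), (4, 99)]
joined = reduce_side_join(users, orders)
```

In this sketch every record crosses the shuffle step, which hints at why such joins become expensive at scale: on a real cluster that step moves data over the network.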
In the world of big data and Hadoop, some problems are not complicated and the solution is straightforward; the challenge is the volume of data. In that case, different approaches are needed to solve the problem. Typical analysis tasks include counting the number of distinct IDs in log files, transforming stored data within a specific date range, and ranking users. All of these tasks can be handled with a variety of tools and techniques in Hadoop, such as MapReduce, Hive, Pig, Giraph, and Mahout. These tools can be extended flexibly with custom routines.
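The three tasks mentioned above are simple when the data fits in memory, as this small Python sketch shows. The log format (`<date> <user_id> <action>`), the sample lines, and the function names are assumptions made for illustration. None of this is Hadoop code; the point is that the logic is easy and the difficulty lies in running it over very large volumes.

```python
from collections import Counter
from datetime import date

def parse(line):
    # Assumed log format: "<ISO date> <user_id> <action>".
    day, user_id, action = line.split()
    return date.fromisoformat(day), user_id, action

def in_range(records, start, end):
    # Keep only records whose date falls within [start, end].
    return [r for r in records if start <= r[0] <= end]

def distinct_ids(records):
    # Count the number of distinct user IDs.
    return len({user_id for _, user_id, _ in records})

def rank_users(records):
    # Rank users by event count, breaking ties by user ID.
    counts = Counter(user_id for _, user_id, _ in records)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical sample log.
LOG = [
    "2014-01-01 u1 login",
    "2014-01-01 u2 login",
    "2014-01-02 u1 view",
    "2014-01-03 u3 login",
    "2014-01-05 u1 view",
]

records = [parse(line) for line in LOG]
window = in_range(records, date(2014, 1, 1), date(2014, 1, 3))
n_distinct = distinct_ids(window)
ranking = rank_users(window)
```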
Hadoop is a framework, not a solution. When it comes to big data analysis, people mistakenly believe that Hadoop works out of the box. For simple queries, that may be true. But for difficult analysis problems, Hadoop quickly falls short, because it requires you to write map/reduce code directly. For this reason, Hadoop is more like the Java EE programming environment than a business analytics solution. "A framework means that you have to do customized, business-specific development and implementation on top of it, and all of that comes at a cost."
Hadoop is an excellent tool for some very complex data analysis. Ironically, however, it also takes a lot of programming work to get answers to those questions. This is not limited to data analysis; it reflects the trade-off that anyone adopting an open source framework faces today. When you choose an open source framework or codebase, think about how much it can help you, how much time and cost it saves, and how much efficiency it adds. Also consider the new costs it creates: engineers' learning costs, development and maintenance costs, and future scalability, including the need to train yourself and your team once the framework is adopted. Security is a consideration too, since flaws in open source frameworks are publicly known.
Comments:
In the big data age, many people think Hadoop is easy to use, and Hadoop is at the forefront of big data applications. However, Hadoop also faces unresolved problems and is not a cure-all, so users should approach it with a realistic view of what it can and cannot do.