MapReduce: Simplified Data Processing on Large Clusters
Table of contents for these book notes: http://www.cnblogs.com/mdyang/archive/2011/06/29/data-intensive-text-prcessing-with-mapreduce-contents.html
Currently, the most effective way to process large-scale data is to divide and conquer.
Divide and conquer: split a large problem into several smaller problems that are relatively independent, solve them in parallel, and then combine the partial results.
Hadoop is an open-source implementation of this computing model: programmers can use it to write programs that run on clusters of commodity machines and process massive amounts of data.
In addition, Hadoop provides a distributed file system (HDFS) and a distributed database (HBase) for storing data on, or distributing it to, the individual compute nodes. So you can think of it roughly as: Hadoop ≈ HDFS (file system, data storage) + HBase (database) + MapReduce (data processing).
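To make this concrete, here is a minimal sketch of the canonical word-count job written against Hadoop's org.apache.hadoop.mapreduce API; the class names and tokenization are illustrative, not code from the excerpted articles.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emit (word, 1) for every token of every input line.
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer: sum the partial counts collected for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        context.write(key, new IntWritable(sum));
    }
}

The framework handles splitting the input, shuffling the (word, 1) pairs to reducers, and sorting by key; the programmer writes only these two methods.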
Some computations appear to require traversing the dataset more than once (normalization, for example, needs one scan to compute the sum and a second scan to perform the division); by cleverly defining the sort order of intermediate results (the order-inversion pattern), both the time overhead of repeated traversal and the space overhead of buffering partial statistics can be avoided.
Section 3.4 introduces a general technique for secondary sorting, which can be used to control the order in which the values sharing the same key arrive at the reducer.
3.2 Pairs and stripes
A common way to achieve synchronization in MapReduce programs is to adapt the data to the execution framework by constructing complex keys and values. We saw this technique in the previous chapter, where the sum and the count were "packaged" into a composite value (a pair) that travels from the mapper through the combiner to the reducer. Building on previous publications [54, 94], this section presents the pairs and stripes patterns for computing word co-occurrence counts.
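As a sketch of the contrast (not code from the excerpted articles), here are the two mappers for word co-occurrence counting; the pair key is flattened into a tab-separated Text for brevity, and the whole line is treated as the co-occurrence window.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Pairs: emit one ((w, u), 1) pair for every co-occurring pair of words; many small
// key-value pairs, each carrying a single observation.
public class PairsMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] terms = value.toString().split("\\s+");
        for (int i = 0; i < terms.length; i++)
            for (int j = 0; j < terms.length; j++)
                if (i != j)
                    context.write(new Text(terms[i] + "\t" + terms[j]), ONE);
    }
}

// Stripes: emit one associative array ("stripe") per word, summarizing all of its
// neighbours at once; far fewer key-value pairs, but each one is heavier.
class StripesMapper extends Mapper<Object, Text, Text, MapWritable> {
    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] terms = value.toString().split("\\s+");
        for (int i = 0; i < terms.length; i++) {
            MapWritable stripe = new MapWritable();
            for (int j = 0; j < terms.length; j++) {
                if (i == j) continue;
                Text neighbour = new Text(terms[j]);
                IntWritable count = (IntWritable) stripe.get(neighbour);
                stripe.put(neighbour, new IntWritable(count == null ? 1 : count.get() + 1));
            }
            context.write(new Text(terms[i]), stripe);
        }
    }
}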
partitioning and sorting: reducers are used to generate the data that will feed a subsequent map-side join, and a reducer must not emit any key other than the one it is currently processing.
3.5.3 Memory-backed joins
In addition to the two join strategies described above, there is a third way to join related datasets in MapReduce: load the smaller dataset into memory on every mapper and probe it as records of the larger dataset stream through.
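A minimal sketch of that idea, assuming the small relation fits comfortably in memory and has already been distributed to every node; the file name and record layout are hypothetical.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Memory-backed join: the smaller relation is loaded into an in-memory hash table once
// per mapper, then every record of the larger relation is probed against it.
public class MemoryBackedJoinMapper extends Mapper<Object, Text, Text, Text> {
    private final Map<String, String> smallTable = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // Assumption: the small relation was shipped to every node (for example via the
        // distributed cache) as a local tab-separated file; name and layout are hypothetical.
        try (BufferedReader in = new BufferedReader(new FileReader("small_table.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) smallTable.put(parts[0], parts[1]);
            }
        }
    }

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t", 2);   // join key, then rest of record
        String match = parts.length == 2 ? smallTable.get(parts[0]) : null;
        if (match != null)                                   // inner join: keep matches only
            context.write(new Text(parts[0]), new Text(parts[1] + "\t" + match));
    }
}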
3.1 Local aggregation
In data-intensive distributed processing, moving intermediate results from the processes that produce them to the processes that will eventually consume them is an important aspect of synchronization. In a cluster environment, except for embarrassingly parallel problems, this data must travel over the network; in Hadoop it is also written to local disk before being transferred, so reducing the volume of intermediate data translates directly into better performance.
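One standard form of local aggregation is in-mapper combining, sketched below (an illustrative sketch, not code from the excerpted articles): counts are buffered inside the mapper and emitted only once, instead of emitting one pair per token.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// In-mapper combining: term counts are accumulated in a HashMap across every line the
// mapper sees and emitted once in cleanup(), shrinking the intermediate data.
public class InMapperCombiningMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    protected void map(Object key, Text value, Context context) {
        for (String term : value.toString().split("\\s+"))
            counts.merge(term, 1, Integer::sum);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        for (Map.Entry<String, Integer> e : counts.entrySet())
            context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
    }
}

The trade-off is that the buffered state must fit in the mapper's memory; a combiner achieves a similar effect without that constraint, but with weaker guarantees about when it runs.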
2.3 The execution framework
The greatest strength of MapReduce is that it separates the "what" of parallel algorithm design from the "how" (you only need to write the program, without worrying about how it is executed); the execution framework takes care of everything else.
2.5 The distributed file system (HDFS)
Traditionally, large-scale data processing has been approached from the perspective of separate compute and storage systems; HDFS instead spreads the data across the compute nodes themselves so that computation can be moved to where the data lives.
, Wikipedia, and so on through rivers; this feature will be highlighted in a later article. The gateway represents the persistent storage of ES indexes. By default ES holds the index in memory and persists it to disk when memory fills up; when an ES cluster is shut down and restarted, the index data is read back from the gateway. ES supports multiple gateway types, including the local file system (the default), a shared distributed file system, Hadoop HDFS, and Amazon's S3 cloud storage.
3.4 Secondary sorting
Before intermediate results reach the reducer, MapReduce first sorts them and then distributes them. This mechanism is very convenient for reduce operations that depend on the order in which their input values arrive.
, zebra) are sent to the same reducer. To obtain the expected behavior, we must define a custom partitioner that considers only the left word; that is, the partitioner should assign a pair to a reducer based solely on the hash of its left word.
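A sketch of such a partitioner, assuming (as in the pairs sketch above) that the pair key is serialized as a tab-separated Text:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Partition on the left word only, so every pair sharing a left word lands on the
// same reducer. Assumes the "left \t right" key format used above.
public class LeftWordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String leftWord = key.toString().split("\t", 2)[0];
        return (leftWord.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}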
This algorithm works, but it shares a drawback with the stripes approach: as the document collection grows, so does the vocabulary, and in some cases there may not be enough memory to store all the counts.
between tasks on the same node. Regarding the limitations of the original MapReduce architecture (2.3): the original framework is straightforward, and in its first few years it produced many success stories and earned broad industry support and recognition; but as distributed clusters and their workloads grew, the problems of the original framework gradually surfaced.
to have frameworks such as Apache Pig and Apache Hive compile query statements and scripts into efficient query-processing pipelines rather than into a series of MapReduce jobs, which are usually very slow because of the need to materialize intermediate results.
Apache Spark [10]. The Spark project is probably the most advanced and most promising unified big data processing platform.
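For contrast with the MapReduce jobs above, here is a hedged sketch of the same word count as a Spark pipeline (Spark 2.x Java API assumed; the input and output paths are hypothetical): the stages are chained as in-memory transformations instead of separate jobs that write intermediate results to disk.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

// Word count as a chained Spark pipeline.
public class SparkWordCount {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("wordcount"));
        sc.textFile("hdfs:///input/docs")
          .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())  // line -> words
          .mapToPair(word -> new Tuple2<>(word, 1))                        // word -> (word, 1)
          .reduceByKey((a, b) -> a + b)                                    // sum per word
          .saveAsTextFile("hdfs:///output/wordcount");
        sc.stop();
    }
}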
By big data processing I mean data that must be searchable while also sustaining highly concurrent inserts, deletes, and updates. I remember an earlier project at XX with millions of records, where a single search query could keep you waiting for minutes. Now I want to discuss this.
In fact, MapReduce in MongoDB is closer to GROUP BY in a relational database.
Having just run this experiment, GROUP BY (MapReduce) over a large data volume still looks reasonable; the test generates one million 3-character random strings:
for (var i = 0; i < 1000000; i++) {
    var x = "0123456789";
    var tmp = "";
    for (var j = 0; j < 3; j++) {
        tmp += x.charAt(Math.floor(Math.random() * x.length));
    }
    // collection and field names are assumed; store each random string for the group-by test
    db.test.insert({ value: tmp });
}
1. Description
In real-world scenarios, structured data files in various formats often need to be imported into HBase. Phoenix provides two ways to load CSV-formatted files into a Phoenix table: a single-threaded psql tool for small batches of data, and a MapReduce-based bulk loader for large volumes.
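A hedged sketch of driving the MapReduce-based loader from Java; the tool and option names follow the Phoenix bulk-loading documentation, and the table name, input path, and ZooKeeper quorum are hypothetical, so verify them against your Phoenix version.

import org.apache.hadoop.util.ToolRunner;
import org.apache.phoenix.mapreduce.CsvBulkLoadTool;

// Launch the Phoenix CSV bulk loader as a MapReduce job from Java. The same tool is
// usually invoked on the command line via "hadoop jar" with the Phoenix client jar.
public class PhoenixCsvBulkLoad {
    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new CsvBulkLoadTool(), new String[] {
                "--table", "EXAMPLE_TABLE",      // target Phoenix table (hypothetical)
                "--input", "/data/example.csv",  // CSV file in HDFS (hypothetical)
                "--zookeeper", "zk1:2181"        // ZooKeeper quorum (hypothetical)
        });
        System.exit(exitCode);
    }
}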
mechanisms in MapReduce: the order in which data arrives at the reducer. If the marginal count can somehow be made available to the reducer before it processes the joint counts, the reducer can simply divide each joint count by the marginal to compute the relative frequency. The notions of "before" and "after" can be captured in the ordering of the intermediate key-value pairs.
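A reducer-side sketch of that order-inversion idea, reusing the tab-separated pair keys and the left-word partitioner sketched earlier; the special right word "*" marks the marginal and, under Text's byte ordering, sorts ahead of alphanumeric words.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Order inversion: the mapper also emits (w, "*") once per co-occurrence of w, the
// left-word partitioner sends every (w, .) key to the same reducer, and because "*"
// sorts first the marginal is known before any joint count arrives.
public class RelativeFrequencyReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {
    private long marginal = 0;

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (IntWritable v : values) sum += v.get();
        String right = key.toString().split("\t", 2)[1];
        if ("*".equals(right)) {
            marginal = sum;   // remember the denominator for this left word
        } else {
            context.write(key, new DoubleWritable((double) sum / marginal));
        }
    }
}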
typical, so we describe them as canonical models, stated as an abstract problem.
The following figure shows a high-level overview of our production environment:
This is a typical big data infrastructure: applications running in multiple data centers are continuously producing data.
the data arrives at the reducer before the data that depends on it in the computation.
"Value-to-key conversion", which provides a scalability solution for secondary sorting. By moving the part score to the key, we can use the mapreduce method to sort it.
Finally, synchronization in the MapReduce programming model is controlled through the techniques discussed in this chapter: constructing complex keys and values, defining custom sort orders for intermediate results, and customizing how keys are partitioned across reducers.
I recently bought and read "Python Treasure". The book covers Python broadly but not very deeply, so it is better suited to beginners and readers looking to round out their knowledge. The chapter on big data processing with Python interested me; after reading it, I modified the example source code provided in the book and built a simulation on top of it.