Google created MapReduce in 2004; a MapReduce cluster can include thousands of computers operating in parallel. At the same time, MapReduce allows programmers to quickly transform and process data across such a large cluster.
The shift from MapReduce to Hadoop is an interesting one. MapReduce was originally created to help search-engine companies cope with the huge amounts of data involved in building indexes of the World Wide Web. Google recruited Silicon Valley's best and hired a large number of engineers to improve MapReduce, and the technology was quickly applied to related industries such as finance and retail. Google shared some details of MapReduce with the Nutch team, which set out to develop an open-source counterpart; Yahoo then took Nutch in-house and developed it into the open-source Hadoop project in 2007. Hadoop is now increasingly used as a large-scale parallel data-processing engine.
Nowadays everyone is keen on the big data field, from open-source projects such as Apache Hive and Pig to startups like MapR and Hadapt. It is well known that applications written for data analysis on MapReduce and Hadoop can be so complex that only excellent programmers can handle them, which holds back the spread of MapReduce technology. So one of the problems every Hadoop vendor needs to address today is how to make MapReduce easier to use.
Enterprise big data and agile big data
From an IT perspective, information structure types have roughly gone through three waves. It must be noted that a new wave does not replace the old one; all three types of data structure still exist and continue to evolve, but one type often dominates the others:
Structured information - This information is found in relational databases and has dominated IT applications for years. It is the information that mission-critical OLTP systems depend on, and its database structure allows it to be sorted and queried;
Semi-structured information - This is the second wave of IT, including e-mail, word-processing files, and information stored and published on the web. Semi-structured information is content-based and searchable, which is the reason for Google's existence;
Unstructured information - In its essential form, this information is basically bitmapped data. The data must be rendered in a perceptible form (for example, heard or seen in audio, video, and multimedia files). Much big data is unstructured, and its sheer size and complexity require advanced analysis tools to create or impose a structure that is easier to perceive and interact with.
Facing the challenge of these three kinds of information on the network, the direction of big data's development is becoming clearer. At the O'Reilly Strata Conference held in New York in September this year, that trend was summed up as enterprise big data and agile big data. Enterprise big data is the most challenging problem, and one that must be solved for corporate profitability; agile big data is another issue that deserves attention. Vendors such as Greenplum and Aster are already involved in the enterprise BI field.
If big data meant having to buy enterprise-class products, big data could cost a great deal. But that is not inevitable: with agile big data technology, companies of all sizes can control costs and benefit from big data. The key is to minimize cost while maximizing understanding of large datasets: once the data has been converted into a usable form that yields insight into the business, problems can be framed in various ways and the strengths of enterprise technology brought to bear on solving them.
Ease of use is the biggest obstacle to MapReduce's development
One reason for the success of the MapReduce system is that it provides a simple programming model for writing code that requires large-scale parallel processing. It was inspired by the functional programming features of Lisp and other functional languages, and it is a natural fit for cloud computing. The key feature of MapReduce is its ability to hide the semantics of parallel execution, so developers do not have to write parallel code themselves.
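To make that concrete, here is a minimal word-count sketch in plain Python. It only imitates what a cluster does (map_fn, reduce_fn, and the local shuffle loop are illustrative names, not Hadoop's actual API); in a real job the framework runs the map and reduce phases in parallel across machines and performs the shuffle itself:

```python
from collections import defaultdict

def map_fn(line):
    # Map: emit a (word, 1) pair for every word in one line of input.
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # Reduce: collapse all counts emitted for one word into a total.
    return word, sum(counts)

lines = ["big data needs big tools", "map and reduce big data"]

# Shuffle: group intermediate pairs by key. On a cluster the framework
# does this between the map and reduce phases; the developer never writes it.
groups = defaultdict(list)
for line in lines:
    for word, count in map_fn(line):
        groups[word].append(count)

# Each group is reduced independently, which is what makes the model parallel.
print(dict(reduce_fn(word, counts) for word, counts in groups.items()))
```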
But nowadays it is hard for MapReduce to become the tool business people reach for when they talk about big data, because using MapReduce requires at least four skills:
1. Translating business issues into analytical solutions
2. Converting analytical solutions into MapReduce models
3. The ability to code, debug, and optimize MapReduce programs that process data
4. Experience with Hadoop and MapReduce and the ability to debug code deployed on Hadoop
In the big data era, using a traditional database to query, sort, define, and extract data is somewhat inadequate, and big-data processing of the MapReduce kind requires more skills. But it is unrealistic to hire such highly skilled people in large numbers.
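To see what those skills mean in practice, take a simple, hypothetical business question: total sales per region. Skills 2 and 3 amount to recasting it as the mapper and reducer sketched below in plain Python, over a made-up "region,amount" record layout; on a real cluster, Hadoop would run many copies of each phase in parallel and handle the sort and shuffle between them:

```python
from itertools import groupby

# Mapper (skill 2): restate the business question as key/value pairs.
def mapper(lines):
    for line in lines:
        region, amount = line.split(",")   # assumed "region,amount" layout
        yield region, float(amount)

# Reducer (skill 3): aggregate the values delivered for each key.
def reducer(pairs):
    for region, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield region, sum(amount for _, amount in group)

# Local stand-in for the cluster: Hadoop would sort and shuffle between
# the two phases and distribute the work across many machines.
sales = ["east,10.0", "west,5.5", "east,2.5"]
for region, total in reducer(mapper(sales)):
    print(region, total)   # east 12.5, west 5.5
```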
The combination of SQL and MapReduce: tradition meets modernity
SQL is the familiar pattern through which programming experts and business analysts query data, while MapReduce's charm lies in its ability to handle relatively complex queries programmatically. What happens if you combine the two?
Aster provides a framework called SQL-MapReduce that enables data scientists and business analysts to investigate complex information quickly: programs written in languages such as Java, C#, Python, C++, and R are expressed so that a group of associated computers (a cluster) executes them in parallel, and they are then invoked through standard SQL.
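The flavor of the combination can be suggested with a toy sketch in plain Python using the standard sqlite3 module (this is not Aster's actual syntax, and the clicks table and its columns are invented): SQL does the relational filtering, and a hand-written map/reduce step does the analysis that the query language alone would struggle to express. In SQL-MapReduce, the analogous programmed function is invoked from within the SQL statement itself and runs in parallel across the cluster.

```python
import sqlite3
from collections import defaultdict

# SQL side: the familiar relational query filters and projects the rows.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE clicks (user TEXT, url TEXT)")
db.executemany("INSERT INTO clicks VALUES (?, ?)",
               [("alice", "x.com/1"), ("alice", "x.com/2"), ("bob", "y.com/1")])
rows = db.execute("SELECT user, url FROM clicks WHERE url LIKE 'x.com%'")

# MapReduce side: a programmed analysis step over the query result.
groups = defaultdict(list)
for user, url in rows:                 # map: emit (user, url) pairs
    groups[user].append(url)
for user, urls in groups.items():      # reduce: one aggregate per key
    print(user, len(urls))             # alice 2
```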
Greenplum offers support for parallel processing of both SQL and MapReduce and can handle terabytes to petabytes of enterprise data at a lower cost. Greenplum unifies the MapReduce and SQL technologies, executing both directly inside its parallel dataflow engine, which sits at the center of the Greenplum Database Engine. Greenplum MapReduce enables programmers to analyze petabyte-scale datasets stored both inside and outside the Greenplum Data Engine. The benefit is pairing an increasingly popular programming model with the reliability and familiarity of relational databases.
Leading vendors such as Microsoft are also getting involved. Microsoft has introduced connectivity between Hadoop and SQL Server, so customers will be able to exchange data among Hadoop, SQL Server, and its parallel data warehouse. At the same time, Microsoft is cooperating deeply with Hortonworks; the goal is to combine Hortonworks's expertise in Hadoop with Microsoft's ease-of-use strengths and to simplify downloading, installing, and configuring the various Hadoop technologies.
In the future, as the combination of SQL and MapReduce technology continues to improve, MapReduce will become easier to use and more widely adopted. Believe me, time will prove it.