Trends in big data processing technology: an introduction to five open source technologies


I have not worked in big data processing for long; my first formal project is still in development. But the field drew me in, which is why I wanted to write this article. Big data arrived in the form of technologies like Hadoop and the "NoSQL" databases MongoDB and Cassandra. Real-time analysis of data is now far easier than it used to be, and standing up a cluster has become reliable enough that it can be done in about twenty minutes. But these conventional wins are only the start; beyond them lie untapped advantages and extraordinary opportunities.

You should know that there are more than 250,000 open source technologies on the market today. The systems around us grow ever more complex, as the following chart shows:

Even after narrowing things down, we still have a lot of options. Which should you aim for? Which projects will the next generation of companies treasure? Which are reliable candidates for a real production deployment? Which deserve special attention? We have done a great deal of research and testing, so let's look at five new technologies that are shaking up big data. These are a few new sets of tools; let's take a look.

Storm and Kafka are the future of stream processing, and they are already in use at a number of large companies, including Groupon, Alibaba, and The Weather Channel. Storm, born at Twitter, is a distributed real-time computation system: where Hadoop is built for batch jobs, Storm is designed for real-time computation.

Kafka is a messaging system developed at LinkedIn that serves as the conduit at the base of a data-processing system. Used together, the two let you ingest and process data in real time, with linear scalability.

Why do you need to care?

Using Storm and Kafka, a data stream can be processed at linear scale, with each message handled reliably and in real time. Together they can smoothly handle on the order of 10,000 messages per second.
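Storm's programming model is built around spouts (sources that emit tuples) and bolts (steps that transform them). The following is a minimal sketch of that idea in plain Python, not the real Storm or Kafka APIs; the `spout`/`counting_bolt` names and the word-count task are illustrative assumptions only.

```python
from collections import Counter

def spout(sentences):
    """Simulated spout: emits one word (tuple) at a time from an incoming stream."""
    for sentence in sentences:
        for word in sentence.split():
            yield word

def counting_bolt(word_stream):
    """Simulated bolt: maintains a running count of every word it receives."""
    counts = Counter()
    for word in word_stream:
        counts[word] += 1
    return counts

stream = ["storm processes tuples", "kafka buffers tuples"]
print(counting_bolt(spout(stream))["tuples"])  # prints 2
```

In a real topology the spout would read from a Kafka topic and the bolt's counts would be updated continuously as tuples arrive, rather than after the stream ends.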

Stream-processing stacks like Storm and Kafka have caught the attention of many enterprises that want to build first-class ETL (extract, transform, load) data-integration pipelines. Storm and Kafka are also good at in-memory analytics and real-time decision support. Batch-oriented Hadoop is no good for real-time business needs; real-time stream processing is a necessary module in an enterprise big-data solution because it gracefully handles the "three Vs": volume, velocity, and variety. Storm and Kafka are the two technologies we (Infochimps) recommend most strongly, and they will also become formal components of our platform.

Drill and Dremel make large-scale, ad hoc querying of data fast and low-overhead. They make second-scale searches over petabytes of data possible, answering ad hoc queries and forecasts, and they offer robust visualization support.

Drill and Dremel put powerful query capability in the hands of the business side, not just data engineers; everyone on the business side will love them. Drill is the open source version of Google's Dremel, the technology Google built to support big-data queries, which teams there use to develop their own analysis tools. That pedigree is why everyone is paying close attention to Drill. Although it is only getting started, the open source community has a strong interest in making it mature.

Why should you care?

Drill and Dremel handle ad hoc queries and analysis better than Hadoop does. Hadoop only provides a batch data-processing workflow, which is a drawback.

The Hadoop ecosystem has made MapReduce a friendly tool for analysis. Interface layers from Sawzall to Pig to Hive have made Hadoop friendlier and closer to the business, but, SQL-like as these abstractions are, they ignore an important fact: MapReduce (and Hadoop) is built for systematic, planned data processing. It suits you if you already know what jobs you will run. Ad hoc exploration is different: given the data you already have, how quickly can you get an answer? You should not have to launch a new job and wait every time you think of a new question.

In contrast to the workflow-based methodology, many business-driven BI and analytics queries are basic, ad hoc, and low-latency. Writing Map/Reduce workflows is prohibitive for much business analytics: waiting minutes for a job to start and hours for it to finish kills the interactive experience, and it is those free-flowing comparisons, made at scale, that ultimately produce fundamentally new insight. Some data scientists have long predicted that Drill and Dremel will outperform Hadoop; some are still weighing the question, and a few enthusiasts have embraced the change immediately, but the advantage is clear for query-oriented, low-latency workloads. At Infochimps we like to use the Elasticsearch full-text indexing engine for database search, but for true big-data processing we think Drill will become the mainstream.
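The interactivity gap described above is easy to feel with any in-process SQL engine. The sketch below uses Python's built-in `sqlite3` as a stand-in (Drill and Dremel target far larger, nested, distributed datasets); the `events` table and its rows are hypothetical.

```python
import sqlite3

# A toy in-memory table; real ad hoc engines scan petabytes, but the
# question-to-answer loop is the point, not the data size.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, action TEXT, ms INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("ann", "click", 120), ("bob", "click", 340), ("ann", "view", 80)],
)

# An ad hoc question, answered immediately -- no MapReduce job to schedule,
# no minutes-long wait before the first byte of output.
rows = conn.execute(
    "SELECT user, AVG(ms) FROM events WHERE action = 'click' "
    "GROUP BY user ORDER BY user"
).fetchall()
print(rows)  # [('ann', 120.0), ('bob', 340.0)]
```

The moment the answer suggests the next question, you just edit the query and run it again; that tight loop is exactly what a batch workflow cannot offer.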

R is a powerful open source statistical programming language. Since 1997, more than two million statistical analysts have used it. R is a modern dialect of S, the statistical-computing language born at Bell Labs, and it quickly became the new standard language of statistics. R makes sophisticated data science cheap: it is a serious challenger to SAS and SPSS, and an essential tool for the best statisticians.
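The bread-and-butter task R handles with one call, `lm(y ~ x)`, is ordinary least-squares fitting. As a rough illustration of what that call computes, here is the same slope/intercept arithmetic in plain Python (the function name `lm` is borrowed from R for the analogy; this is not R's implementation):

```python
def lm(x, y):
    """Ordinary least squares for one predictor: returns (intercept, slope),
    mirroring what R's lm(y ~ x) reports as coefficients."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return mean_y - slope * mean_x, slope

# Hypothetical sample data
x = [1, 2, 3, 4]
y = [2.1, 4.0, 6.2, 7.9]
intercept, slope = lm(x, y)
print(round(slope, 2))  # 1.96
```

R's value, of course, is not this ten-line core but the thousands of packages layered on top of it: diagnostics, plots, and models you never have to write yourself.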

Why should you care?

Because R has an extraordinarily strong community behind it, you can find a library for virtually every kind of scientific data analysis without writing new code from scratch. R is exciting because of the people who maintain it and the new work they contribute every day; the R community is one of the most exciting places in the big-data field. R is an excellent technology that will not become obsolete here: in recent months, analysts have contributed thousands of new packages to an ever-growing open knowledge base. Moreover, R works well with Hadoop, and has proven itself as part of big-data processing pipelines.

Worth watching: Julia is an interesting R replacement, built to avoid R's dead-slow interpreter. Julia's community is not strong yet, but if you are not in a hurry you can afford to wait.

Gremlin and Giraph help advance graph analysis. They are used with graph databases such as Neo4j and InfiniteGraph, and Giraph works with Hadoop. GoldenOrb is another example of a project built on high-level graph stream processing, worth a look. Graph databases are charming, niche databases. There are many interesting differences between them and relational databases, which become apparent once you start thinking in graph theory rather than relational theory.

Another graph system built on the same theory is Google's Pregel; Gremlin and Giraph can be seen as its open source alternatives. In fact, all of these are examples of clones of Google technology. Graphs play an important role in network modeling and social networks, and can connect arbitrary data. Another frequent application is mapping and geographic computation: calculating the shortest route from location A to location B. Graphs are also widely used in biological and physical computing, for example to model unusual molecular structures. Massive graphs, graph databases, and the analysis languages and frameworks around them are all part of real-world big-data implementations. Graph theory is a killer application: any problem over a large network of nodes is solved through the paths between nodes. Many creative scientists and engineers understand very well how to use the right tool to solve a problem; make it run beautifully and it will spread widely.

SAP HANA is a fully in-memory analytics platform; it contains an in-memory database and related tooling for building analysis processes and for normalizing data into the correct formats on input and output.

Why should I care?

SAP has begun opening up this powerful product, once reserved for entrenched enterprise customers, for free use. And SAP is not only thinking of startups by letting them use HANA; it has also licensed community solution development. These unusual moves all revolve around getting results out of HANA.

HANA's bet is that other platforms are not fast enough for problems such as financial modeling and decision support, website personalization, and fraud detection. HANA's biggest drawback is also "fully in-memory": RAM is obviously fast to access, but it is expensive compared with disk storage. If, as its backers say, you need not worry about the operating cost, HANA is a fast, low-latency big-data processing tool.
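One reason fully in-memory engines like HANA are fast is column-oriented storage: each column lives as a contiguous in-RAM array, so an aggregate scans only the columns it needs. A toy sketch of that layout (column names and figures are hypothetical; this is the storage idea, not HANA's engine):

```python
# Column-oriented: one list per column, rows aligned by index.
columns = {
    "region":  ["emea", "apac", "emea", "amer"],
    "revenue": [120.0, 90.0, 30.0, 55.0],
}

def sum_by(cols, key_col, val_col):
    """Group-by-sum touching only the two columns involved."""
    totals = {}
    for key, val in zip(cols[key_col], cols[val_col]):
        totals[key] = totals.get(key, 0.0) + val
    return totals

print(sum_by(columns, "region", "revenue"))
# {'emea': 150.0, 'apac': 90.0, 'amer': 55.0}
```

A row store would drag every column of every row through the CPU cache for this query; keeping whole columns hot in RAM is what makes second-scale analytics over large tables feasible, and what makes the RAM bill the platform's main cost.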

D3 was not originally on this list, but its closeness to end users made us feel we had to mention it. D3 is a JavaScript document-visualization library. It is powerful and innovative, letting us see data directly and interact with it naturally. Its author is Michael Bostock, a graphics-interface designer at The New York Times. For example, you can use D3 to create HTML tables from arbitrary arrays of numbers, or turn any data into an interactive bar chart. One practical example of D3 is the 2013 visualization of public opinion on Obama. With D3, programmers can create interfaces and organize all kinds of data.
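D3 itself runs in the browser in JavaScript; what the back end typically contributes is the data-shaping step, producing JSON for a page to fetch (for instance via `d3.json`) and bind to bars. A sketch of that server-side step in Python; the poll numbers and field names are invented for illustration:

```python
import json

# Hypothetical opinion-poll tallies to be rendered as a D3 bar chart.
votes = {"approve": 52, "disapprove": 41, "unsure": 7}

# D3 data joins work most naturally on an array of objects,
# one object per visual element (here: one per bar).
bars = [{"label": k, "value": v} for k, v in votes.items()]
payload = json.dumps(bars)
print(payload)
# [{"label": "approve", "value": 52}, {"label": "disapprove", ...}]
```

On the page, D3 would join `bars` to `rect` elements and scale each bar's height by `value`; the division of labor keeps the visualization code free of data-cleaning logic.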

Although this article is not long, translating it cost me real effort, and where the translation falls short I hope readers will correct me. When I read the original I wanted to share it with people who would enjoy it. Thanks to an open environment, the US IT sector keeps producing surprises like these; of course, we have to keep up.

It has been nearly a year since I started using Hadoop formally, from my time at Baidu to my current work at BitWare, solving problems at different companies with different technologies. But the problems are always the same few, and many companies are only now getting their first taste of Hadoop; given the broader environment, that is understandable.

Here are my personal notes on the article:

I have followed Storm and Kafka since 2011. Storm already has some second-tier applications inside Alibaba, and overall, barely a year old, it has grown more and more stable under Nathan Marz's polishing, with some applications now online. So on the whole I am optimistic about this technology, because Hadoop cannot do real-time processing. Using HBase as the main serving database solves the problem for now, but I still want to try Storm. I have not paid much attention to Kafka, but the two used together are said to work very well; I have not run the combination myself.

Drill is an Apache open source project. I had looked at Google's Dremel paper before but, try as I might, did not really understand it. I have not yet encountered an environment that needs it, and the community is only just heating up, so I have little time to follow it and am shelving it for now.

As for the R language: back at Baidu, the team next door used R in their work. Perhaps only big companies have the capacity for real mining work; in our current line of business we have not used it, so R is still unfamiliar to me. But as I see it, the task is the same in any environment, just accomplished with different technical means: whether you dry your hands with a photoelectric dryer or an ordinary fan, the result is the same.

In the graph-database field I really have no detailed hands-on experience, and no chance yet to join a company that uses one, so this too stays shelved.

As for SAP, I have heard the name but had no concrete contact with the company. Selling solutions is presumably a hard business now, so putting something out to raise visibility still makes sense; the days of coasting on old strengths are over.

Finally, the visualization JS library: I am not very interested, since my work is not on the front end, so I will leave it at that.
