2013 Bossie Awards: The Best Open Source Big Data Tools


MapReduce emerged to break through the limitations of the database; tools such as Giraph, Hama, and Impala were designed to break through the limits of MapReduce. And while all of the above run on Hadoop, graph, document, column-oriented, and other NoSQL databases are also an integral part of big data.

Which big data tool meets your needs? That question is not easy to answer given how rapidly the number of available solutions is growing.

Apache Hadoop

When people talk about "big data" or "data science," they usually mean Hadoop. Hadoop is built around the MapReduce framework, but the project also includes a large collection of important tools for data storage and processing. The arrival of the new YARN framework, effectively MapReduce 2.0, marks a key step in Hadoop's evolution. You can expect this wave of big data to reach your business environment soon.
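To make the MapReduce programming model concrete, here is a minimal word-count sketch against the standard Hadoop MapReduce API; the job driver (input and output paths, job configuration) is omitted, and the class names are illustrative only.

```java
// A minimal word-count sketch using the Hadoop MapReduce API (org.apache.hadoop.mapreduce).
// Job setup (driver class, input/output paths) is omitted for brevity.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: emit (word, 1) for every token in the input split.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```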

No Apache project has spawned more heavyweight startups or enjoyed greater popularity in this area. Analysts expect Hadoop to eventually underpin a market worth tens of billions of dollars a year. Don't let budget constraints cause you to miss the opportunity.

--Andrew C. Oliver

Official website: http://hadoop.apache.org/

Apache Sqoop

When it comes to big data processing, Hadoop is certainly the first thing that comes to mind, but that does not mean traditional databases have no part to play. In fact, in most cases we still need to extract the required data from a traditional database, and that is the forte of Apache Sqoop.

Sqoop moves data efficiently between traditional database systems and Hadoop by using techniques such as concurrent connections, customizable data type mapping, and metadata propagation. You can import data, including just the newly added records, into HDFS, Hive, and HBase, and export analysis results back to the traditional database. Sqoop also hides the complexity of the underlying data connectors and copes with mismatched data formats.
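Sqoop is normally driven from the command line, but the same import can be launched from Java. The sketch below assumes the Sqoop 1.x client entry point (Sqoop.runTool) and a hypothetical MySQL source database; the connection details and credentials are placeholders.

```java
// A minimal sketch of an embedded Sqoop 1.x import (assumptions: Sqoop 1.x on the
// classpath, a reachable MySQL database, and a running HDFS).
import org.apache.sqoop.Sqoop;

public class OrdersImport {
    public static void main(String[] args) {
        String[] sqoopArgs = {
            "import",
            "--connect", "jdbc:mysql://db.example.com/shop",  // hypothetical source database
            "--username", "etl",
            "--password", "secret",                           // placeholder credentials
            "--table", "orders",                              // table to pull into Hadoop
            "--target-dir", "/data/orders",                   // HDFS destination directory
            "--num-mappers", "4"                              // concurrent import connections
        };
        int exitCode = Sqoop.runTool(sqoopArgs);              // runs the import as a MapReduce job
        System.exit(exitCode);
    }
}
```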

--James R. Borck

Official website: http://sqoop.apache.org/

Talend Open Studio for Big Data

Talend Open Studio for Big Data lets you load files into Hadoop (via HDFS, Hive, and Sqoop) without writing code by hand. The native Hadoop code generated by its graphical IDE (which supports YARN/MapReduce 2) performs large-scale data transformations across a Hadoop distributed environment.

Talend's visual mapping tool allows users to build and test data flows without hand-coding Pig. In addition, project scheduling and job optimization tools round out the toolkit's feature set.

The first step in taming a large amount of data is gathering it from multiple sources into Hadoop; afterward it often has to be moved from Hadoop out to other platforms. Talend Open Studio helps you move whatever you need wherever it has to go, without making you worry about the underlying complexity.

--James R. Borck

Official website: http://www.talend.com/products/big-data

Apache Giraph

Apache Giraph is a graph processing system designed for high scalability and high availability. An open-source counterpart to Google's Pregel, Giraph is used by companies such as Facebook to analyze the social graph formed by users and their relationships. The system adopts Pregel's efficient bulk synchronous parallel (BSP) processing model, which avoids the problems inherent in using MapReduce for graph workloads. Best of all, Giraph computations run as Hadoop jobs on your existing Hadoop infrastructure, so alongside the tools you already run you gain distributed graph processing capability.
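Giraph programs are written as vertex-centric compute functions that exchange messages between BSP supersteps. Below is a minimal single-source shortest-paths sketch modeled on the Giraph 1.x examples; the source vertex ID is hypothetical and job configuration is omitted, so treat it as an illustration of the model rather than a drop-in program.

```java
// A minimal Giraph 1.x vertex program sketch: single-source shortest paths.
import java.io.IOException;
import org.apache.giraph.edge.Edge;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

public class ShortestPaths extends
        BasicComputation<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

    private static final long SOURCE_ID = 1L;  // hypothetical source vertex

    @Override
    public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
                        Iterable<DoubleWritable> messages) throws IOException {
        if (getSuperstep() == 0) {
            vertex.setValue(new DoubleWritable(Double.MAX_VALUE));
        }
        // Shortest distance known for this vertex after reading this superstep's messages.
        double minDist = vertex.getId().get() == SOURCE_ID ? 0d : Double.MAX_VALUE;
        for (DoubleWritable message : messages) {
            minDist = Math.min(minDist, message.get());
        }
        if (minDist < vertex.getValue().get()) {
            vertex.setValue(new DoubleWritable(minDist));
            // Propagate the improved distance to all neighbors for the next superstep.
            for (Edge<LongWritable, FloatWritable> edge : vertex.getEdges()) {
                sendMessage(edge.getTargetVertexId(),
                        new DoubleWritable(minDist + edge.getValue().get()));
            }
        }
        vertex.voteToHalt();  // computation ends when all vertices halt and no messages remain
    }
}
```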

--Indika Kotakadeniya

Official website: http://giraph.apache.org/

Apache Hama

Like Giraph, Apache Hama brings the bulk synchronous parallel processing model into the Hadoop ecosystem, building on the Hadoop Distributed File System. But unlike Giraph, which focuses on graph processing, Hama is a more general framework for massive matrix and graph computations. It combines Hadoop compatibility with a more flexible programming model, providing an excellent foundation for data-intensive scientific applications.
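To illustrate Hama's BSP model outside of graphs, the following sketch assumes the Hama 0.x Java API used in Hama's own examples (the BSP and BSPPeer classes): every peer does some local work, messages an aggregator peer, and meets at a synchronization barrier. The class layout, types, and the stand-in "local work" are illustrative assumptions, not a production job.

```java
// A minimal sketch of a Hama BSP step that sums a locally computed value across peers.
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hama.bsp.BSP;
import org.apache.hama.bsp.BSPPeer;
import org.apache.hama.bsp.sync.SyncException;

public class SumBSP extends
        BSP<NullWritable, NullWritable, Text, DoubleWritable, DoubleWritable> {

    @Override
    public void bsp(BSPPeer<NullWritable, NullWritable, Text, DoubleWritable, DoubleWritable> peer)
            throws IOException, SyncException, InterruptedException {
        double local = Math.random();                     // stand-in for real local computation
        String master = peer.getPeerName(0);              // elect peer 0 as the aggregator
        peer.send(master, new DoubleWritable(local));     // message the aggregator
        peer.sync();                                      // barrier: superstep boundary

        if (peer.getPeerName().equals(master)) {
            double total = 0;
            DoubleWritable msg;
            while ((msg = peer.getCurrentMessage()) != null) {
                total += msg.get();
            }
            peer.write(new Text("sum"), new DoubleWritable(total));  // emit the global result
        }
    }
}
```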

--Indika Kotakadeniya

Official website: http://hama.apache.org/

Cloudera Impala

Cloudera Impala is to real-time SQL queries what MapReduce is to batch processing. The Impala engine sits on every data node in the Hadoop cluster, listening for query requests. After parsing a query, it generates an optimized execution plan and coordinates parallel processing across the cluster's compute nodes. The result is low-latency SQL querying in the Hadoop environment and near-real-time insight into big data.

Because Impala works directly with your native Hadoop infrastructure (HDFS, HBase, and the Hive metastore), the pieces combine into a unified platform for comprehensive data analysis without the complexity of connectors, ETL, or an expensive data warehouse. In addition, Impala is accessible from any JDBC client, making it an ideal component for business intelligence suites such as Pentaho.
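Because Impala speaks standard JDBC, querying it from Java looks like querying any other SQL database. The sketch below assumes a Hive-compatible JDBC driver and Impala's default query port (21050), with a hypothetical sales table; substitute the driver class and URL that match your cluster.

```java
// A minimal JDBC sketch for querying Impala; driver class, URL, and table are assumptions.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");  // assumed Hive-compatible driver
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://impala-host.example.com:21050/;auth=noSasl");
             Statement stmt = conn.createStatement();
             // Hypothetical table; anything registered in the Hive metastore works.
             ResultSet rs = stmt.executeQuery(
                     "SELECT product, SUM(amount) AS total FROM sales GROUP BY product")) {
            while (rs.next()) {
                System.out.println(rs.getString("product") + "\t" + rs.getLong("total"));
            }
        }
    }
}
```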

--James R. Borck

Official website: http://www.cloudera.com/content/cloudera/en/home.html

Serengeti

Serengeti is VMware's push to bring virtualization to big data processing, allowing Hadoop clusters to run elastically on shared server infrastructure. The project leverages the Hadoop Virtualization Extensions, developed and contributed by VMware, to make Hadoop run well in virtualized environments.

With Serengeti, you can deploy a Hadoop cluster in minutes without wrestling with configuration options such as node placement, high availability, or job scheduling. Furthermore, by deploying Hadoop across multiple virtual machines on each host, Serengeti can separate data and compute functions, allowing compute capacity to scale while keeping data storage local.

--James R. Borck

Official website: http://projectserengeti.org/

Apache Drill

Apache Drill, inspired by Google's Dremel, is designed to bring very low-latency interactive analysis to large datasets. Drill supports a variety of data sources, including HBase, Cassandra, MongoDB, and traditional relational databases. While Hadoop delivers high data throughput, analyses can take minutes or even hours. With Drill, response times are fast enough for interactive work, making it quick and pleasant to explore data and reach valuable conclusions.

--Steven Nuñez

Official website: http://incubator.apache.org/drill/

Gephi

Graph theory now finds applications everywhere. Link analysis can connect traders and employees to identify suspicious trading activity, and mapping the critical connection points in a system gives an intuitive view of a complex IT environment. As a visual discovery tool, Gephi supports a wide variety of graph types and networks of up to millions of nodes, and it is used by many experts and enterprise organizations. You can find a wealth of instructional material on wikis, forums, and tutorial sites, and an active community offers a plethora of plug-ins; in short, you probably won't have to start from scratch when working with Gephi.

--Steven Nuñez

Neo4j

An agile and capable graph database, Neo4j can help users in a variety of scenarios, including social applications, recommendation engines, fraud detection, resource validation, and data center network management. Neo4j continues to improve steadily in performance (including the speed of streaming query results) and in clustering and high-availability support.

--Michael Scarlett

Official website: http://www.neo4j.org/

MongoDB

Of the many NoSQL databases, MongoDB may be the most popular. It stores data as binary JSON (BSON) documents, supports a wide variety of document structures, and gives developers far more freedom than a traditional relational database, which forces data into a rigid, flat schema spread across many tables. At the same time, MongoDB provides the functionality developers expect from a relational database.
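A quick sketch of what that flexibility looks like with the 2013-era MongoDB Java driver (the 2.x API): no schema is declared, a document is simply inserted and then queried. The database, collection, and field names here are hypothetical.

```java
// A minimal MongoDB sketch using the 2.x Java driver against a local mongod.
import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.MongoClient;

public class MongoExample {
    public static void main(String[] args) throws Exception {
        MongoClient client = new MongoClient("localhost", 27017);
        DB db = client.getDB("shop");
        DBCollection orders = db.getCollection("orders");

        // No schema to declare up front: just insert a document.
        orders.insert(new BasicDBObject("customer", "alice")
                .append("items", 3)
                .append("total", 42.50));

        // Query by field value.
        DBCursor cursor = orders.find(new BasicDBObject("customer", "alice"));
        while (cursor.hasNext()) {
            System.out.println(cursor.next());
        }
        cursor.close();
        client.close();
    }
}
```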

2013 has been an important year in MongoDB's history, bringing two new releases and a raft of new features, including text search and geospatial support. The new versions also deliver performance improvements, such as concurrent index builds and a faster JavaScript engine (V8).

--Michael Scarlett

Official website: http://www.mongodb.com/

Couchbase Server

Like other NoSQL databases, and unlike most relational databases, Couchbase Server does not require a schema to be defined before data is inserted. One of its distinguishing features is its memcached heritage, which lets developers transition seamlessly from a memcached caching layer to other uses, while replication and high availability keep applications running without downtime. Version 2.0 added document database functionality, and 2.1 builds on that with improved storage performance and cross-data-center replication.
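A brief sketch of that schema-free, memcached-style access pattern, assuming the 1.x Couchbase Java SDK (which builds on the spymemcached client); the node address, bucket, and key names are hypothetical.

```java
// A minimal Couchbase 1.x SDK sketch: store and fetch a JSON document by key.
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import com.couchbase.client.CouchbaseClient;

public class CouchbaseExample {
    public static void main(String[] args) throws Exception {
        List<URI> nodes = Arrays.asList(URI.create("http://127.0.0.1:8091/pools"));
        CouchbaseClient client = new CouchbaseClient(nodes, "default", "");  // bucket "default", no password

        // No schema required: store a JSON document under a key (expiry 0 = never).
        client.set("user::42", 0, "{\"name\":\"alice\",\"visits\":17}").get();

        Object doc = client.get("user::42");  // memcached-style get by key
        System.out.println(doc);

        client.shutdown();
    }
}
```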

--Michael Scarlett

Official website: http://www.couchbase.com/why-nosql/nosql-database

Paradigm4 SciDB

SciDB is a distributed database system that uses parallel processing to analyze data streams in real time, and its focus is squarely on supporting very large scientific datasets. It avoids the row-and-column model common to relational databases in favor of native array structures better suited to ordered datasets such as time series and location data. Unlike relational databases or MapReduce, SciDB provides a single unified solution that scales out across a cluster without requiring Hadoop's multi-tier infrastructure or extra movement of data between layers.

--James R. Borck

Official website: http://scidb.org/
