A Roundup of Nine Popular Open Source Big Data Processing Technologies

Source: Internet
Author: User
Keywords: open source big data processing technology

As global corporate and personal data explode, data itself is replacing software and hardware as the next big "oil field" driving the information technology industry and the global economy.

Compared with earlier disruptive information technology revolutions such as the PC and the Web, the biggest difference with big data is that it is a revolution driven by open source software. From giants such as IBM and Oracle to big data start-ups, the combination of open source software and big data has produced an astonishing disruptive force across the industry, and even VMware, which used to rely entirely on proprietary software, has embraced open source big data tools.

Below, we list nine of the most popular open source big data technologies for your reference.

I. Hadoop

Apache Hadoop is an open source software framework that enables distributed processing of large amounts of data. It was introduced by the Apache Software Foundation in fall 2005 as part of Nutch, a Lucene subproject. Its developer, Doug Cutting, originally built Hadoop to meet the cluster processing needs of the open source web search engine Nutch; Cutting implemented MapReduce and a distributed file system (HDFS) and integrated them into Hadoop. Hadoop's name was inspired by a toy elephant belonging to Cutting's son. Using MapReduce, Hadoop decomposes large data sets into small chunks and distributes them across ordinary server nodes. Hadoop is currently the most popular technology for storing and processing big data (including unstructured, semi-structured and structured data). Hadoop is open source under the Apache License 2.0.
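
The chunk-and-distribute idea behind MapReduce can be illustrated with a minimal single-process sketch. This is plain Python, not the Hadoop API: the function names (split_into_chunks, map_chunk, shuffle, reduce_counts, word_count) and the sample input are assumptions made for illustration, and on a real cluster each chunk would be processed on a different node.

```python
from collections import defaultdict


def split_into_chunks(lines, chunk_size):
    """Break the input into small chunks, as a cluster would before distributing work."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def map_chunk(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_chunks):
    """Shuffle step: group all emitted values by key across every chunk."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines, chunk_size=2):
    chunks = split_into_chunks(lines, chunk_size)
    # Here the chunks are mapped one after another in a single process;
    # a cluster would run these map calls in parallel on separate nodes.
    mapped = [map_chunk(chunk) for chunk in chunks]
    return reduce_counts(shuffle(mapped))


sample = ["big data big ideas", "open source data", "data tools"]
print(word_count(sample))
```

The map and reduce steps never share state, which is what makes it possible to spread the chunks over many machines.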

II. R

R is an open source programming language and software environment designed for data mining, analysis and visualization. R is an implementation of the S language, an interpreted language developed at AT&T Bell Laboratories for data exploration, statistical analysis and plotting. Originally, the main implementation of S was S-PLUS. But S-PLUS is commercial software; by contrast, the open source R language is more popular and is known as the "Red Hat of the statistical community."

In the KDnuggets 2012 survey on "data mining/analysis tools that you have used in real projects in the past 12 months," R topped the list with 30.7% of the votes, surpassing Microsoft Excel (29.8%) and RapidMiner (the leader in 2010 and 2011). Notably, four of the top five data mining tools this year are open source software. In addition, R also beat SQL and Java to rank first among the most popular programming languages for data mining applications.

III. Cascading

Cascading is an open source software abstraction layer for Hadoop that allows users to create and run data processing workflows on Hadoop clusters using any JVM-based language. Cascading hides the underlying complexity of MapReduce jobs. Chris Wensel designed Cascading as an alternative API to MapReduce. Cascading is often used for ad targeting statistics, log file analysis, bioinformatics analysis, machine learning, predictive analysis, web content text mining, and ETL applications. Commercial support for Cascading is provided by Concurrent, a company founded by Wensel. Well-known websites that use Cascading include Twitter and Etsy. Cascading is open source under a GNU license.

IV. Scribe

Scribe is server software developed by Facebook and released in 2008. Scribe can aggregate log files from a large number of servers in real time. Facebook designed Scribe to deal with its own scaling challenges, and it now uses Scribe to handle tens of millions of messages a day. Scribe is open source under the Apache License 2.0.

V. Elasticsearch

Elasticsearch is built on Apache Lucene and was developed by Shay Banon. Elasticsearch is a distributed, RESTful, open source search server and a scalable solution that supports near-real-time search and multi-tenancy without special configuration. Many organizations have adopted Elasticsearch, including StumbleUpon and Mozilla. Elasticsearch is open source under the Apache License 2.0.
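
To make "RESTful search server" concrete, the sketch below only builds the HTTP request for a full-text search; it does not send anything. The _search endpoint and the match query are part of the Elasticsearch REST API, while the helper name build_search_request, the index name "articles", the field "title" and the local address are assumptions chosen for illustration.

```python
import json
from urllib.parse import quote


def build_search_request(base_url, index, field, text, size=10):
    """Return (method, url, body) for a full-text match query against one index."""
    url = f"{base_url.rstrip('/')}/{quote(index)}/_search"
    body = {"query": {"match": {field: text}}, "size": size}
    return "POST", url, json.dumps(body)


# Hypothetical index and field names; 9200 is the default Elasticsearch HTTP port.
method, url, body = build_search_request(
    "http://localhost:9200", "articles", "title", "big data", size=5
)
print(method, url)
print(body)
```

Because the interface is plain HTTP and JSON, any language with an HTTP client can send this request; no dedicated driver is required.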

VI. Apache HBase

HBase is a scalable, column-oriented, distributed, non-relational database that runs on HDFS. HBase is written in Java and supports structured data storage for very large tables (in the style of Bigtable). The advantages of HBase are fault-tolerant storage and fast access to massive amounts of sparse data. HBase is one of the representative NoSQL databases that have emerged in the past few years. Facebook used HBase to build its messaging platform in 2010. HBase is open source under the Apache License 2.0.
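
The idea of a sparse, column-oriented table can be sketched with nested dictionaries. This is a toy in-memory model, not the HBase client API: the class SparseTable, its methods, and the row keys and "family:qualifier" column names are all hypothetical. The point is that only cells that actually exist take up space, and rows are read back in sorted key order.

```python
from collections import defaultdict


class SparseTable:
    """Toy model: row key -> {"family:qualifier" -> value}. Missing cells are not stored."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        self._rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        # Use .get on the outer dict so that reading a missing row does not create it.
        return self._rows.get(row_key, {}).get(column, default)

    def scan(self, start_row, stop_row):
        """Yield (row_key, cells) for start_row <= row_key < stop_row, in sorted key order."""
        for row_key in sorted(self._rows):
            if start_row <= row_key < stop_row:
                yield row_key, dict(self._rows[row_key])

    def cell_count(self):
        return sum(len(cells) for cells in self._rows.values())


table = SparseTable()
table.put("user#001", "info:name", "Ada")
table.put("user#001", "msg:2010-11-15", "hello")
table.put("user#002", "info:name", "Lin")  # this row has no "msg" cell at all

print(list(table.scan("user#001", "user#003")))
print(table.cell_count())
```

A relational table would reserve a slot (often NULL) for every column in every row; here a row with one column costs one cell.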

VII. Apache Cassandra

Apache Cassandra is an open source distributed database management system originally developed by Facebook to power search in users' inboxes. Cassandra is also a NoSQL database. In 2010, Facebook abandoned Cassandra in favor of HBase. But Cassandra is still used by many companies, such as Netflix, which uses Cassandra as the back-end database for its video services. Cassandra is open source under the Apache License 2.0.

VIII. MongoDB

Developed by the founders of DoubleClick, MongoDB is a popular open source NoSQL database. MongoDB stores structured data as JSON-like documents with dynamic schemas, in a format called BSON. MongoDB has been adopted by many large companies, including MTV Networks, Craigslist, Disney Interactive Media Group, The New York Times and Etsy. MongoDB is open source under a GNU license, and commercial licenses are available from 10gen.
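
What "JSON-like documents with dynamic schemas" means in practice can be shown without a database. The sketch below uses plain Python dictionaries, not a MongoDB driver: the sample documents and the helper functions find and find_with_field are assumptions for illustration. Documents in the same collection carry different fields, and a query is just a partial document to match against.

```python
import json

# A "collection" here is just a list of dicts; documents need not share the same fields.
articles = [
    {"_id": 1, "title": "Hadoop basics", "tags": ["hadoop", "hdfs"]},
    {"_id": 2, "title": "R for statistics", "author": {"name": "Kim"}, "views": 120},
    {"_id": 3, "title": "Log pipelines", "tags": ["scribe"], "views": 45},
]


def find(collection, criteria):
    """Return documents whose top-level fields equal every key/value in criteria."""
    return [
        doc for doc in collection
        if all(key in doc and doc[key] == value for key, value in criteria.items())
    ]


def find_with_field(collection, field):
    """Return documents that define the given top-level field at all."""
    return [doc for doc in collection if field in doc]


print(json.dumps(find(articles, {"views": 45}), indent=2))
print([doc["_id"] for doc in find_with_field(articles, "tags")])
```

No table definition had to be changed to add the "author" or "views" fields; each document simply carries the fields it has.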

IX. Apache CouchDB

Apache CouchDB is also an open source NoSQL database. It stores data as JSON, uses JavaScript as its query language, and exposes MapReduce and HTTP as its APIs. CouchDB was developed by former IBM Lotus Notes developer Damien Katz as a storage system for large-scale object databases. The media group BBC uses CouchDB as a dynamic content platform. CouchDB is open source under the Apache License 2.0.
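
The combination of JSON, JavaScript and HTTP can be sketched by building a CouchDB design document and the URL used to query its view. Nothing is sent over the network here. Design documents under _design/, the _view path and the built-in _count reduce belong to CouchDB itself; the helper names, the database "pages", the design document "content" and the view "by_type" are assumptions chosen for illustration.

```python
import json


def build_design_doc(name, view_name, map_source, reduce_source=None):
    """Return a design document (a JSON-serializable dict) holding one MapReduce view."""
    view = {"map": map_source}
    if reduce_source is not None:
        view["reduce"] = reduce_source
    return {"_id": f"_design/{name}", "language": "javascript", "views": {view_name: view}}


def view_url(base_url, database, name, view_name):
    """Return the HTTP URL at which the view would be queried with GET."""
    return f"{base_url.rstrip('/')}/{database}/_design/{name}/_view/{view_name}"


# The map function is JavaScript source stored as a string inside the JSON document.
map_source = "function (doc) { if (doc.type) { emit(doc.type, 1); } }"
design_doc = build_design_doc("content", "by_type", map_source, reduce_source="_count")

print(json.dumps(design_doc, indent=2))
# 5984 is the default CouchDB port; the database name is hypothetical.
print(view_url("http://localhost:5984", "pages", "content", "by_type"))
```

The query logic lives in the database as a document, and reading the result is an ordinary HTTP GET.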
