Big Data Overview


Big data has grown rapidly in all walks of life, forcing many organizations to look for new and creative ways to manage and control such large amounts of data: not only to manage and control it, but to analyze and mine its value to facilitate business development. Looking across big data, the past few years have produced a number of disruptive technologies, such as Hadoop, MongoDB, Spark, and Impala, and understanding these cutting-edge technologies will help you better grasp the direction in which big data is developing. It is also true that to understand something, one must first understand the people and organizations behind it. Therefore, understanding the technology alone is not enough; this article introduces the ten giants of the big data field to help you grasp more deeply how the industry is developing.

Ten open source technologies in the big data field

According to the latest Cisco Global Cloud Index, annual global data center IP traffic was expected to reach 7.7 ZB by the end of 2017. Overall, data center IP traffic was projected to grow at a compound annual growth rate (CAGR) of 25% between 2012 and 2017, which implies a 2012 baseline of roughly 2.5 ZB (7.7 / 1.25^5).

Growth is accelerating, and organizations need to rely on large datasets to help them operate, quantify, and grow their businesses. Over the past few years, big data stores have grown from gigabytes to terabytes to petabytes.

In addition, data is no longer stored in one place; as data volumes grow and cloud computing develops, data is increasingly stored in a distributed fashion.

Almost all industries are developing big data and data science

Science: The Large Hadron Collider produces about 600 million collisions per second. Even though less than 0.001% of the sensor stream data is retained after filtering, the four main LHC experiments still generated about 25 PB of data per year (as of 2012); with backups and replication, the total may reach roughly 200 PB.

Research: NASA's Center for Climate Simulation (NCCS) stores about 32 PB of climate observation and simulation data on its supercomputing platform.

Private/public sector: Amazon handles millions of back-end operations per day, plus queries from more than 500,000 third-party sellers. Amazon's core technology runs on Linux-based database systems; as of 2005, Amazon had the world's three largest Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB.

Organizations are forced to look for new, creative ways to manage and control such huge volumes of data, not just to collate the data, but to analyze and mine it for further business development. The following open source big data technologies are worth considering:

Apache HBase: This big data management platform is modeled on Google's powerful Bigtable storage engine. An open source, distributed database written in Java, HBase was originally designed for the Hadoop platform, and Facebook uses this powerful data management tool to handle the vast data of its messaging platform.
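
To make HBase's column-family model concrete, here is a minimal Python sketch using the third-party happybase client; this is an assumption on our part (it talks to HBase's Thrift gateway, which must be running locally, and the messages table and columns are hypothetical):

    import happybase  # third-party client for HBase's Thrift gateway

    # Connect to a local HBase Thrift server (assumed on the default port 9090).
    connection = happybase.Connection('localhost')
    connection.create_table('messages', {'cf': dict()})  # one column family
    table = connection.table('messages')

    # Rows are keyed byte strings; every cell lives under a column family.
    table.put(b'user1-msg0001', {b'cf:body': b'hello', b'cf:sender': b'user2'})
    print(table.row(b'user1-msg0001'))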

Apache Storm: A distributed real-time computation system for processing high-velocity, large-volume data streams. Storm adds reliable real-time data processing capabilities to Apache Hadoop, enabling low-latency dashboards, security alerts, and better operational insight, and helping businesses capture opportunities and develop new business more efficiently.
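
To illustrate the spout-and-bolt idea behind Storm, here is a conceptual sketch in plain Python; it mimics per-tuple stream processing and is not the Storm API, which is Java-based with multi-language adapters:

    # Conceptual spout/bolt pipeline in plain Python.
    import itertools
    import random

    def sensor_spout():
        """Spout: emits an unbounded stream of (sensor_id, reading) tuples."""
        for _ in itertools.count():
            yield random.choice(['s1', 's2']), random.gauss(50, 10)

    def alert_bolt(stream, threshold=65.0):
        """Bolt: processes one tuple at a time, flagging high readings."""
        for sensor, value in stream:
            if value > threshold:
                yield sensor, value

    # Take the first three alerts from the (infinite) stream.
    for sensor, value in itertools.islice(alert_bolt(sensor_spout()), 3):
        print(f'ALERT {sensor}: {value:.1f}')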

Apache Spark: Spark computes in memory. Starting from multi-pass batch processing, it allows data to be loaded into memory and queried repeatedly, and it integrates multiple computational paradigms such as data warehousing, stream processing, and graph computation. Spark is implemented in Scala and built on HDFS, combines well with Hadoop, and can run up to 100 times faster than MapReduce.
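
A minimal PySpark sketch of this load-once, query-repeatedly style (assuming a local pyspark installation; the dataset and queries are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('cache-demo').getOrCreate()

    squares = spark.sparkContext.parallelize(range(1_000_000)) \
                   .map(lambda x: x * x) \
                   .cache()  # keep the computed dataset in memory

    # Repeated queries reuse the cached data instead of recomputing it.
    print(squares.sum())
    print(squares.max())
    spark.stop()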

Apache Hadoop: This technology has quickly become one of the de facto standards for big data management. When used to manage large datasets, Hadoop performs very well for complex distributed applications; the platform's flexibility lets it run on commodity hardware, and it can easily integrate structured, semi-structured, and even unstructured datasets.
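
For illustration, here is the classic MapReduce word count expressed in plain Python; it shows the map and reduce phases conceptually, while a real job would run them across a cluster (for example via Hadoop Streaming):

    from collections import defaultdict

    def mapper(line):
        for word in line.split():
            yield word.lower(), 1          # emit (key, value) pairs

    def reducer(pairs):
        counts = defaultdict(int)
        for word, n in pairs:              # pairs arrive grouped by key
            counts[word] += n
        return dict(counts)

    lines = ['Hadoop manages large datasets', 'large datasets need Hadoop']
    print(reducer(p for line in lines for p in mapper(line)))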

Apache Drill: How big a dataset do you have? No matter how big, Drill can handle it with ease. Drill is an interactive analysis platform that supports data sources such as HBase, Cassandra, and MongoDB, allowing large-scale data throughput and fast results.
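
As a sketch of how interactive queries reach Drill, the snippet below posts SQL to Drill's REST API (assuming a local drill-embedded instance on the default port 8047; cp.`employee.json` is the sample dataset Drill ships on its classpath):

    import requests

    payload = {
        'queryType': 'SQL',
        # Drill queries raw files in place; no schema needs to be loaded first.
        'query': 'SELECT COUNT(*) AS n FROM cp.`employee.json`',
    }
    resp = requests.post('http://localhost:8047/query.json', json=payload)
    print(resp.json()['rows'])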

Apache Sqoop: Perhaps your data is currently locked inside older systems; Sqoop can help you set it free. The platform uses concurrent connections to move data easily between relational database systems and Hadoop, with customizable data types and metadata mappings. In fact, you can also import the data (new data, for example) into HDFS, Hive, and HBase.
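
A sketch of a typical Sqoop import, driven from Python via subprocess (assuming sqoop is on the PATH and a reachable MySQL database; the host, credentials, table, and paths below are placeholders):

    import subprocess

    subprocess.run([
        'sqoop', 'import',
        '--connect', 'jdbc:mysql://dbhost/sales',  # source relational database
        '--username', 'etl_user',
        '--password-file', '/user/etl/.pw',        # avoids passwords on the CLI
        '--table', 'orders',                       # table to copy
        '--target-dir', '/data/orders',            # destination in HDFS
        '--num-mappers', '4',                      # concurrent connections
    ], check=True)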

Apache Giraph: This is a powerful graph processing platform with good scalability and usability. The technology has been adopted by Facebook; Giraph runs in a Hadoop environment and can be deployed directly onto an existing Hadoop cluster. In this way you get powerful distributed graph processing while also leveraging your existing big data processing engine.
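
To show the vertex-centric (Pregel-style) model Giraph implements, here is a conceptual plain-Python sketch of hop-count propagation; real Giraph jobs express the same compute-and-message loop in Java on Hadoop:

    # In each superstep, vertices process incoming messages and send new ones.
    graph = {'a': ['b', 'c'], 'b': ['c'], 'c': []}
    dist = {v: float('inf') for v in graph}        # hop count from the source
    messages = {'a': [0]}                          # seed the source vertex

    while messages:                                # run until no vertex is active
        next_messages = {}
        for vertex, incoming in messages.items():
            best = min(incoming)
            if best < dist[vertex]:                # the vertex "computes"
                dist[vertex] = best
                for nbr in graph[vertex]:          # ...and messages its neighbors
                    next_messages.setdefault(nbr, []).append(best + 1)
        messages = next_messages

    print(dist)  # {'a': 0, 'b': 1, 'c': 1}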

Cloudera Impala: Impala can also be deployed on your existing Hadoop cluster to serve all of your queries. Where MapReduce provides powerful batch processing, Impala provides real-time SQL queries; with efficient SQL, you can quickly explore the data on a big data platform.
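
A minimal sketch of an interactive Impala query from Python, using the third-party impyla client (an assumption on our part; it targets an Impala daemon on the default port 21050, and the logs table is hypothetical):

    from impala.dbapi import connect

    conn = connect(host='impala-host', port=21050)
    cur = conn.cursor()
    # Impala executes the query directly against data in HDFS/HBase, without
    # launching a MapReduce job, which keeps latency low.
    cur.execute('SELECT status, COUNT(*) FROM logs GROUP BY status')
    for row in cur.fetchall():
        print(row)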

Gephi: Gephi can be used to correlate and quantify information, and by building powerful visualizations of your data you can gain new insights. Gephi already supports multiple chart types and can run on large networks with millions of nodes. It has an active user community and offers a large number of plug-ins that integrate well with existing systems, and it can visualize and analyze complex IT connectivity, the nodes of distributed systems, data flows, and more.

MongoDB: This solid platform has long been favored by many organizations, and it performs excellently in big data management. MongoDB was originally created by DoubleClick alumni and is now widely used for big data management. It is an open source NoSQL database that stores and processes data as JSON-like documents. At present, The New York Times, Craigslist, and many other companies use MongoDB to help them manage large datasets. (Couchbase Server is also worth a look.)
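
A minimal pymongo sketch of the document model (assuming a local mongod on the default port 27017; the database, collection, and documents are illustrative):

    from pymongo import MongoClient

    client = MongoClient('localhost', 27017)
    articles = client.newsroom.articles     # created lazily on first insert

    articles.insert_one({'title': 'Big data overview',
                         'tags': ['hadoop', 'nosql']})
    # Documents are schemaless, so queries match fields directly.
    for doc in articles.find({'tags': 'nosql'}):
        print(doc['title'])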

In our DoD (data-on-demand) society, a large amount of data is generated every day and collected across major IT systems. Whether it is photos on social media or international store transactions, large amounts of high-quality, quantifiable data explode into existence every day, and the only way to deal with it is to deploy an efficient management solution quickly.

Remember: in addition to classifying and organizing data quickly, IT managers must also be able to mine the information and apply it to the business. The science behind business intelligence and data quantification will continue to develop and expand, and the key to a competitive advantage is how well an enterprise manages its data.

Ten big data players that can't be ignored

Amazon Web Services

Forrester calls AWS the "cloud overlord," and when it comes to big data in the cloud computing world, Amazon has to be mentioned. The company's Hadoop product is called EMR (Elastic MapReduce). AWS explains that EMR uses Hadoop technology to provide big data management services, but it is not pure open source Hadoop: it has been modified and now runs specifically on the AWS cloud.

Forrester says EMR has good market prospects. Many companies build customer services on top of EMR, and some apply EMR to data query, modeling, integration, and management. AWS keeps innovating, and Forrester says a future EMR will be able to scale automatically based on workload. Amazon plans stronger EMR integration with its other products and services, including its Redshift data warehouse, the newly released Kinesis real-time processing engine, and planned NoSQL database and business intelligence tools. However, AWS does not have its own Hadoop distribution.

Cloudera

Cloudera has its own distribution of open source Hadoop. It incorporates many technologies from the Apache open source ecosystem, and the distribution built on them has made great strides. Cloudera has developed a number of features for its Hadoop distribution, including Cloudera Manager for management and monitoring and the SQL engine Impala. Cloudera's Hadoop distribution is based on open source Hadoop, but it is not a pure open source product: when Cloudera customers need functionality Hadoop does not have, Cloudera engineers implement it or find a partner with the technology. "Cloudera's innovative approach stays true to core Hadoop, but because it enables rapid innovation and responds to customer needs, it differs from other vendors," says Forrester. At present, the Cloudera platform has more than 200 paying customers, some of whom, with Cloudera's technical support, run clusters of more than 1,000 nodes and effectively manage petabyte-scale data.

Hortonworks

Like Cloudera, Hortonworks is a pure-play Hadoop company. Unlike Cloudera, Hortonworks believes open source Hadoop is more powerful than any vendor's Hadoop distribution. Hortonworks' goal is to build up the Hadoop ecosystem and user community to advance the open source project. The Hortonworks platform stays closely aligned with open source Hadoop, and company executives say this benefits users because it protects them from vendor lock-in (if Hortonworks customers want to leave the platform, they can easily move to another open source platform). This is not to say that Hortonworks merely consumes open source Hadoop technology; it returns all the results of its development to the open source community, such as Ambari, a tool Hortonworks developed to fill gaps in Hadoop cluster management. Hortonworks' solution is backed by vendors such as Teradata, Microsoft, Red Hat, and SAP.

IBM

When a business considers a large IT project, many people first think of IBM. IBM is one of the main players in the Hadoop project; Forrester says IBM has more than 100 Hadoop deployments, and many of its customers hold petabytes of data. IBM has extensive experience in fields such as grid computing, global data centers, and large enterprise data projects. "IBM plans to continue integrating technologies such as SPSS analytics, high-performance computing, BI tools, data management and modeling, and workload management for high-performance computing."

Intel

Like AWS, Intel continually improves and optimizes Hadoop to run on its own hardware; specifically, it tunes Hadoop for its Xeon chips, helping users break through some limitations of the Hadoop system and integrating software and hardware more tightly. Intel's Hadoop distribution does this particularly well. Forrester points out that Intel launched the product only recently, so the company has plenty of room to improve; both Intel and Microsoft are considered promising players in the Hadoop market.

MapR Technologies

MapR's Hadoop distribution may be the best so far, though many people may not have heard of it. In Forrester's survey of Hadoop users, MapR earned the highest rating, and its distribution scored highest on architecture and data processing capability. MapR has built a special set of features into its Hadoop distribution, such as Network File System (NFS) support, disaster recovery, and high availability. Forrester notes that MapR is not as visible in the Hadoop market as Cloudera and Hortonworks; for MapR to become a truly big business, it needs to strengthen its partnerships and marketing.

Microsoft

Microsoft has kept a low profile on open source, but with big data it has had to make Windows compatible with Hadoop, and it is actively engaging in open source projects to promote broader development of the Hadoop ecosystem. We can see the results in Microsoft's public cloud product, Windows Azure HDInsight. Microsoft's Hadoop service is based on the Hortonworks distribution and is tailored for Azure.

Microsoft also has a number of other projects, including one called PolyBase, which lets SQL Server queries reach data stored in Hadoop. Forrester says: "Microsoft has great advantages in databases, data warehousing, cloud, OLAP, BI, spreadsheets (including PowerPivot), collaboration, and development tools, and Microsoft has a huge user base, but it still has a long way to go to become an industry leader in Hadoop."

Pivotal Software

Pivotal is the spin-off that combines parts of EMC's and VMware's big data businesses. Pivotal has been working to build a superior Hadoop distribution, adding new tools on top of open source Hadoop, including a SQL engine called HAWQ and Hadoop applications for tackling big data problems. Forrester says the strength of the Pivotal Hadoop platform is that it consolidates many technologies from Pivotal, EMC, and VMware, and that Pivotal's real advantage is the backing of two big companies, EMC and VMware. So far, Pivotal has fewer than 100 customers, mostly small and medium-sized.

Teradata

For Teradata, Hadoop is both a threat and an opportunity. Data management, especially SQL and relational databases, is Teradata's area of expertise, so the rise of NoSQL platforms like Hadoop could threaten Teradata. Instead, Teradata embraced Hadoop: in collaboration with Hortonworks, Teradata integrated its SQL technology with the Hadoop platform, allowing Teradata customers to easily use data stored in the Teradata data warehouse together with the Hadoop platform.

AMPLab

By turning data into information, we can understand the world, and that is what AMPLab works on. AMPLab focuses on machine learning, data mining, databases, information retrieval, natural language processing, and speech recognition, and strives to improve techniques for sifting information, including from opaque datasets. In addition to Spark, the open source distributed SQL query engine Shark also came out of AMPLab; Shark offers very high query efficiency along with good compatibility and scalability. In recent years, computer science has entered a new era, and AMPLab offers flexible solutions that use big data, cloud computing, and communications resources and technologies to address increasingly complex challenges.
