Big Data Bossie Awards: The 20 Best Open Source Big Data Technologies
2015-10-10 | Zhang Xiaodong | Oriental Cloud Insight
InfoWorld's 2015 Bossie Awards recognize the year's best open source tools for distributed data processing, streaming analytics, machine learning, and large-scale data analysis. Here is a brief introduction to the award-winning technologies.
1. Spark
Among Apache's big data projects, Spark is one of the hottest, and heavyweight contributors such as IBM are helping it grow and mature quickly.
Spark's sweet spot is still machine learning. Since last year, the DataFrames API has replaced the SchemaRDD API; modeled on the data frames of R and pandas, it makes data access far simpler than the original RDD interface.
Recent Spark development also brings new workflows for building repeatable machine learning pipelines, scalable and optimized support for various storage formats, simpler interfaces to the machine learning algorithms, and improved monitoring of cluster resources and task tracking.
In Spark 1.5, the Tungsten memory manager is enabled by default, delivering faster processing by fine-tuning the layout of data structures in memory. Finally, the new spark-packages.org site lists more than 100 third-party libraries that extend Spark with a range of useful features.
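To give a feel for the DataFrames API mentioned above, here is a minimal PySpark sketch using the Spark 1.x-era SQLContext entry point; the input file and column name are hypothetical.

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="dataframe-demo")
sqlContext = SQLContext(sc)

# DataFrames replace the old SchemaRDD API: access is columnar and
# declarative rather than RDD-style functional transformation.
df = sqlContext.read.json("events.json")   # hypothetical input file
df.groupBy("country").count().show()       # aggregate and print to stdout

sc.stop()
```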
2. Storm
Storm is a distributed computation framework under the Apache umbrella, aimed at real-time processing of streaming data. It is built around a low-latency, interactive model to address complex event processing needs. Unlike Spark, Storm processes events one at a time rather than in micro-batches, and its memory requirements are lower. In my experience, it has the edge for stream processing, especially when data moves quickly between two sources and must be handled fast.
Spark has stolen much of Storm's limelight, but Spark is a poor fit for many of these fast-moving data applications. Storm is often used in conjunction with Apache Kafka.
3. H2O
H2O is a distributed, in-memory machine learning engine with an impressive array of algorithms. Earlier versions supported only R; version 3.0 added Python and Java support, and H2O can also use Spark as its backend execution engine.
The best way to think of H2O is as a big-memory extension of your R environment: R does not work on the large dataset directly, but communicates with the H2O cluster over protocols such as its REST API, and H2O does the heavy lifting on the data.
It packages several useful R idioms, such as ddply, letting you work past the memory limits of your local machine when handling large datasets. You can run H2O on EC2, on a Hadoop/YARN cluster, or in a Docker container. With Sparkling Water (Spark + H2O) you can access Spark RDDs in parallel on the cluster, process a data frame with Spark, and then hand it to an H2O machine learning algorithm; the sketch below shows the Python side.
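A minimal sketch of H2O's Python API (available since H2O 3.0); the CSV path and column names are hypothetical, and h2o.init() attaches to, or starts, a local H2O cluster that performs the actual in-memory processing.

```python
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

h2o.init()                              # connect to (or launch) the H2O cluster
frame = h2o.import_file("loans.csv")    # parsed on the cluster, not locally

# Train a binomial GLM; the heavy lifting happens inside the cluster.
model = H2OGeneralizedLinearEstimator(family="binomial")
model.train(x=["income", "age"], y="default", training_frame=frame)
print(model.auc())
```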
4. Apex
Apex is an enterprise-grade big data platform for both real-time stream processing and batch processing. It is a YARN-native application that provides a large-scale, scalable, fault-tolerant stream processing engine. It natively supports general event handling and guarantees data consistency (exactly-once, at-least-once, or at-most-once delivery).
Apex grew out of the commercial stream processing software previously developed by DataTorrent, whose code, documentation, and architecture show that Apex cleanly separates application development from operations in support of DevOps; user code generally does not need to know it is running in a stream processing cluster.
Malhar is a companion project that provides more than 300 commonly used application templates implementing common business logic. Malhar's libraries significantly reduce the time it takes to develop an Apex application, offering connectors and drivers for a variety of storage systems, file systems, messaging systems, and databases, all of which can be extended or customized to individual business requirements. All Malhar components are available under the Apache license.
5. Druid
Druid moved to the commercially friendly Apache license this February. It is a hybrid engine built around event streams, targeted at OLAP workloads. Initially it was used mainly in the online advertising market, where Druid lets users run arbitrary, interactive analyses over time series data. Key features include low-latency event ingestion, fast aggregation, and both approximate and exact computation.
At Druid's core is a custom data store that uses specialized nodes to handle each part of the problem. Real-time analysis is handled by real-time (JVM) nodes, while older data lands on historical nodes responsible for aged data. Broker nodes query both the real-time and historical nodes to give the user a complete picture of events. Tests show Druid ingesting 500,000 events per second, with peaks of a million events per second, making it an ideal real-time platform for online advertising, network traffic, and other activity streams.
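Queries go to the broker node over Druid's native JSON-over-HTTP API; here is a hedged sketch, where the broker address, datasource name, and interval are all hypothetical.

```python
import json
import requests

# A simple hourly event count over one day of a hypothetical datasource.
query = {
    "queryType": "timeseries",
    "dataSource": "ad_events",
    "granularity": "hour",
    "intervals": ["2015-10-01/2015-10-02"],
    "aggregations": [{"type": "count", "name": "events"}],
}

resp = requests.post(
    "http://broker:8082/druid/v2/",          # default broker endpoint
    data=json.dumps(query),
    headers={"Content-Type": "application/json"},
)
print(resp.json())                           # one result row per hour bucket
```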
6. Flink
At Flink's core is an event-stream dataflow engine. Although it looks similar to Spark on the surface, Flink actually takes a different in-memory approach. First, Flink was designed as a stream processor from the beginning; batch is just the special case of a stream with a start and an end. Flink offers APIs for both scenarios: a DataSet API for batch and a DataStream API for streaming. Developers coming from the MapReduce world should feel at home with the DataSet API, and porting applications to Flink is easy. In many ways Flink mirrors Spark, and its simplicity and consistency have made it popular. Like Spark, Flink is written in Scala.
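Flink's core APIs are Java and Scala; purely to illustrate the streaming-first model described above, here is a minimal sketch using the modern PyFlink DataStream API (which did not exist in 2015).

```python
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# A bounded collection is just the special case of a stream with an end.
ds = env.from_collection(["flink", "treats", "batch", "as", "a", "stream"])
ds.map(lambda w: (w, 1)).print()

env.execute("wordcount-sketch")
```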
7. Elasticsearch
Elasticsearch is a distributed search server based on Apache Lucene. At its core, Elasticsearch builds indexes over JSON-formatted data in near real time, enabling fast full-text search. Combined with the open source Kibana BI display tool, you can create impressive data visualization interfaces.
Elasticsearch is easy to set up and scale, automatically sharding across new hardware as needed. Its query syntax is not SQL, but it is familiar JSON. Most users never interact with the data at that level anyway; developers can use the native JSON-over-HTTP interface or clients for several common languages, including Ruby, Python, PHP, Perl, Java, and JavaScript.
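A minimal sketch with recent versions of the official Python client; the host, index name, and document fields are hypothetical.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index a JSON document; it becomes searchable in near real time.
es.index(index="articles", id=1, document={"title": "Bossie Awards 2015"})

# Full-text search using the JSON query DSL.
hits = es.search(index="articles", query={"match": {"title": "bossie"}})
print(hits["hits"]["total"])
```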
8. Slamdata
If you are looking for a user-friendly tool for understanding the data in the latest crop of popular NoSQL stores, take a look at SlamData. SlamData lets you run nested queries over JSON data with familiar SQL syntax, with no conversion or syntax changes required.
One of the technology's main features is its connectors: from MongoDB, HBase, Cassandra, and Apache Spark, SlamData can easily integrate most industry-standard external data sources and transform and analyze the data. You might ask, "Wouldn't I be better off with a data lake or data warehouse tool?" Remember, though, that this is the NoSQL domain.
9. Drill
Drill is a distributed system for interactive analysis of large datasets, inspired by Google's Dremel. Designed for low-latency analysis of nested data, Drill has a clearly stated design goal of scaling out to 10,000 servers and handling queries over petabytes of data and trillions of records.
Nested data can come from a variety of sources (such as HDFS, HBase, Amazon S3, and blob stores) and many formats (including JSON, Avro, and protocol buffers), and you do not need to declare a schema up front ("schema-on-read").
Drill uses the ANSI SQL:2003 query language, so data engineers face no learning curve, and it lets you join data across multiple sources (for example, joining HBase tables against logs in HDFS). Finally, Drill provides ODBC and JDBC interfaces so you can connect your favorite BI tools.
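Besides ODBC and JDBC, Drill also exposes a REST interface; here is a hedged sketch against it, with a hypothetical host and file path. The query reads raw JSON files in place with no schema declared up front.

```python
import requests

resp = requests.post(
    "http://localhost:8047/query.json",     # Drill's default REST endpoint
    json={
        "queryType": "SQL",
        # Query nested JSON directly from the filesystem, schema-on-read.
        "query": "SELECT t.user.name FROM dfs.`/logs/events.json` t LIMIT 10",
    },
)
print(resp.json()["rows"])
```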
10. HBase
HBase reached its 1.x milestone this year and continues to improve. Like other non-relational distributed data stores, HBase returns query results very quickly, which is why it often serves as the engine behind search at sites such as eBay, Bloomberg, and Yahoo. As a stable, mature software product, HBase does not gain fresh features constantly, but that stability is often exactly what enterprises care about most.
Recent improvements include better high availability for region servers, rolling upgrade support, and improved YARN compatibility. Feature work includes scanner updates that promise better performance, and the ability to use HBase as the persistent store for streaming applications such as Storm and Spark. HBase also supports SQL queries through the Phoenix project, whose SQL compatibility is steadily improving; Phoenix recently added a Spark connector and support for custom user-defined functions. A quick client-side sketch follows.
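A minimal sketch using the third-party happybase library, which talks to HBase through its Thrift gateway; the host, table, and column family are hypothetical.

```python
import happybase

conn = happybase.Connection("hbase-thrift-host")   # Thrift gateway, port 9090
table = conn.table("metrics")

# Writes address a column family ("cf") plus a column qualifier.
table.put(b"row-001", {b"cf:clicks": b"42"})

# Point reads by row key return quickly, which is why HBase often backs
# low-latency search and serving workloads.
print(table.row(b"row-001"))
```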
11. Hive
After years of development, Hive released its official 1.0 version this year; it is the workhorse of SQL-based data warehousing on Hadoop. The community is now focused on improving performance, scalability, and SQL compatibility. The latest release, 1.2, significantly improves ACID semantics, cross-datacenter replication, and the cost-based optimizer.
Hive 1.2 also brings improved SQL compatibility, making it easier for organizations to migrate from existing data warehouses via their ETL tools. Major improvements on the roadmap include LLAP, built around an in-memory cache; integration with Spark's machine learning library; and better SQL support for nested subqueries, intermediate types, and more.
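A minimal sketch using the third-party PyHive client over HiveServer2; the host, database, and table are hypothetical.

```python
from pyhive import hive

conn = hive.Connection(host="hiveserver2-host", port=10000, database="default")
cursor = conn.cursor()

# Standard SQL; Hive compiles this into a distributed execution plan.
cursor.execute("SELECT country, COUNT(*) FROM visits GROUP BY country")
for row in cursor.fetchall():
    print(row)
```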
12. Kylin
Kylin is an OLAP system developed at eBay for analyzing very large volumes of data, and like many data analysis products it uses standard SQL syntax. Kylin builds cubes with Hive and MapReduce: Hive performs the pre-joins and MapReduce the pre-aggregation; HDFS stores the intermediate files produced while building cubes; HBase stores the cubes themselves, and HBase coprocessors answer queries.
Like most other analysis applications, Kylin supports multiple access methods, including JDBC, ODBC, and a REST API for programmatic access.
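A hedged sketch of the REST query API just mentioned; the host, project, and table names are hypothetical (ADMIN/KYLIN is Kylin's well-known default login).

```python
import requests

resp = requests.post(
    "http://kylin-host:7070/kylin/api/query",
    auth=("ADMIN", "KYLIN"),                 # default credentials; change them!
    json={
        "sql": "SELECT part_dt, SUM(price) FROM kylin_sales GROUP BY part_dt",
        "project": "learn_kylin",            # hypothetical project name
    },
)
print(resp.json()["results"])
```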
13. CDAP
CDAP (Cask Data Application Platform) is a framework running on top of Hadoop that abstracts away the complexity of building and running big data applications. CDAP revolves around two core concepts: data and applications. A CDAP dataset is a logical representation of data, independent of the underlying storage layer, and CDAP provides real-time stream processing on top of it.
Applications use CDAP services to handle concerns such as distributed transactions and service discovery, shielding developers from the low-level details of Hadoop. CDAP ships with a data ingestion framework, some pre-built applications, and generic "packs" such as ETL and web analytics, along with support for testing, debugging, and security. Like most formerly commercial (closed source) projects that go open source, CDAP has good documentation, tutorials, and examples.
14. Ranger
Security has always been Hadoop's sore spot. That does not mean (as is often reported) that Hadoop is "insecure"; the fact is that Hadoop has many security features, they just are not very strong. Worse, each component has its own authentication and authorization implementation, not integrated with the others.
In May 2014, Hortonworks acquired XA Secure, and after a renaming we got Ranger. Ranger brings many of the key parts of Hadoop under one umbrella, letting you set policies that tie your Hadoop security to your existing ACLs from Active Directory-based authentication and authorization systems. Ranger gives you one place to manage Hadoop access control, with administration, auditing, and encryption behind a polished web interface.
15. Mesos
Mesos provides efficient resource isolation and sharing across distributed applications and frameworks, supporting Hadoop, MPI, Hypertable, Spark, and more.
Mesos is an Apache project that uses ZooKeeper for fault-tolerant replicated coordination, Linux containers to isolate tasks, and multi-resource scheduling over memory and CPU. It provides Java, Python, and C++ APIs for developing new parallel applications, and a web-based user interface for viewing cluster state.
A Mesos application (a "framework") takes part in a two-level scheduling mechanism for cluster resources, so writing a Mesos application does not feel like a familiar experience to most programmers. Although Mesos is a young project, it is growing fast; the skeleton below illustrates the handshake.
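A skeletal framework using the old mesos Python bindings, purely to show the two-level scheduling shape: Mesos (level one) offers resources, and the framework (level two) inspects each offer and decides what to launch. The master address is hypothetical, and this sketch simply declines everything.

```python
import mesos.interface
import mesos.native
from mesos.interface import mesos_pb2

class EchoScheduler(mesos.interface.Scheduler):
    def resourceOffers(self, driver, offers):
        # Level two: the framework examines each resource offer and
        # either launches tasks on it or declines it.
        for offer in offers:
            driver.declineOffer(offer.id)    # sketch: decline every offer

framework = mesos_pb2.FrameworkInfo(user="", name="echo-framework")
driver = mesos.native.MesosSchedulerDriver(
    EchoScheduler(), framework, "zk://localhost:2181/mesos")
driver.run()
```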
16. NiFi
Apache NiFi 0.2.0 has been released; the project is currently incubating at the Apache Foundation. Apache NiFi is an easy-to-use, powerful, and reliable system for processing and distributing data. NiFi is designed around data flow, supporting highly configurable graphs of data routing, transformation, and system mediation logic.
Apache NiFi was open sourced to the Apache Foundation by the United States National Security Agency (NSA) and is designed to automate the flow of data between systems. Thanks to its flow-based programming philosophy, NiFi is very easy to use, powerful, reliable, and highly configurable. Its two most important features are its powerful user interface and its data provenance tools.
The NiFi user interface lets users understand and interact with the data flow directly in the browser, allowing faster and safer iteration.
Its data provenance feature lets the user see how an object flowed through the system, replay it, and visualize what happened before and after key steps, including complex schema transformations, forks, joins, and other operations.
In addition, NiFi uses a component-based extension model to quickly add capabilities to complex data flows. Out-of-the-box components handle FTP, SFTP, and HTTP as well as file systems, and HDFS is supported too.
NiFi has drawn consistent praise across the industry, including from the Hortonworks CEO, the Leverage CTO, and Prescient Edge's chief systems architect.
17. Kafka
In the big data world, Kafka has become the de facto standard for distributed publish-subscribe messaging. Its design lets brokers handle enormous message throughput across thousands of clients while maintaining durability through a distributed commit log. A Kafka topic is an append-only log on the brokers' local disks, and because partitions are replicated across brokers, the data itself is well protected.
When a consumer wants to read messages, Kafka looks up their offset in the log and sends them along. Because messages are not deleted immediately, adding consumers or replaying historical messages imposes no extra cost. Kafka has been benchmarked at 2 million messages per second. Although Kafka's version number is still below 1.0, it is in fact a mature, stable product, running in some of the world's largest clusters.
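A minimal sketch with the third-party kafka-python client; the broker address and topic are hypothetical.

```python
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(bootstrap_servers="broker:9092")
producer.send("events", b"page_view")        # appended to the topic's log
producer.flush()

# Consumers track their own offset into the log, so adding a consumer or
# replaying history costs the brokers almost nothing extra.
consumer = KafkaConsumer("events",
                         bootstrap_servers="broker:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.offset, message.value)
    break
```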
18. OpenTSDB
OpenTSDB is a time series database built on HBase. It is designed for analyzing data collected from applications, mobile devices, network equipment, and other hardware. It customizes the HBase schema for storing time series data, aiming for fast aggregation and minimal storage overhead.
By using HBase as the underlying storage layer, OpenTSDB gets distribution and reliability for free. Users do not interact with HBase directly; writes are managed through a Time Series Daemon (TSD), which can easily be scaled out for applications that must handle data at high speed. There are prefabricated connectors for publishing data to OpenTSDB, and clients for reading data from Ruby, Python, and other languages. OpenTSDB is not strong at interactive graphics, but it integrates with third-party tools. If you are already using HBase and want a simple way to store event data, OpenTSDB may be just right for you.
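A sketch of pushing one data point to OpenTSDB's HTTP API (the TSD listens on port 4242 by default); the host, metric name, and tags are hypothetical.

```python
import time
import requests

point = {
    "metric": "sys.cpu.user",
    "timestamp": int(time.time()),
    "value": 42.5,
    "tags": {"host": "web01"},       # at least one tag is required
}
resp = requests.post("http://tsd-host:4242/api/put", json=[point])
print(resp.status_code)              # 204 on success
```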
19. Jupyter
Everyone's favorite notebook application has gone language-agnostic: Jupyter is the IPython notebook stripped out into a separate, language-independent package. Although Jupyter itself is written in Python, the system is modular, and you get the same IPython-style interface for conveniently sharing code, documents, and data visualizations from your notebook.
Kernels for at least 50 languages, including Lisp, R, F#, Perl, Ruby, and Scala, are already supported; in fact, IPython itself is now just a Python kernel for Jupyter. Communication with a kernel's REPL (read-eval-print loop) runs over a defined protocol, similar to nREPL or SLIME. It is nice to see such useful software receive significant nonprofit funding for further development, such as parallel execution and multi-user notebooks.
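To make the kernel/protocol split concrete, here is a minimal sketch using the jupyter_client library, which speaks the same kernel messaging protocol the notebook front end uses; it assumes a locally installed python3 kernel.

```python
from jupyter_client import KernelManager

km = KernelManager(kernel_name="python3")
km.start_kernel()

kc = km.client()
kc.start_channels()
kc.execute("40 + 2")                 # sent over the kernel messaging protocol
reply = kc.get_shell_msg(timeout=10)
print(reply["content"]["status"])    # 'ok' if execution succeeded

kc.stop_channels()
km.shutdown_kernel()
```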
20. Zeppelin
Zeppelin is an Apache incubator project: a web-based notebook for interactive data analytics. You can use SQL, Scala, and more to build data-driven, interactive, collaborative documents (similar to an IPython notebook: write code, take notes, and share, all in the browser).
Some basic charts are already built into Zeppelin, and visualization is not limited to Spark SQL queries; output from any backend language can be recognized and visualized. Zeppelin can also provide a URL that shows only a result, without Zeppelin's menus and buttons, so you can easily embed it in your site as an iframe.
Zeppelin is not mature yet. I wanted to put up a demo, but could not find an easy way to disable the shell as an execution option (among other things). However, its visualizations already look better than those of the IPython notebook. Apache Zeppelin (incubating) is Apache 2.0 licensed and 100% open source.
This article is from the "Oriental Cloud Insights" blog; please keep this attribution when sharing: http://2368606.blog.51cto.com/2358606/1704650