Big data is a broad term for data sets that are so large and complex that they require specially designed hardware and software tools to process. Such data sets typically range from terabytes up to exabytes in size. They are collected from a wide variety of sources: sensors, climate information, and public information such as magazines, newspapers, and articles. Other examples of big data include purchase transaction records, web logs, medical records, military surveillance, video and image archives, and large-scale e-commerce.
Enterprises take a keen interest in big data and big data analysis because of their impact on business. Big data analysis is the process of examining large amounts of data to find the patterns, correlations, and other useful information that can help companies better adapt to change and make more informed decisions.
First, Hadoop
Hadoop is a software framework capable of distributed processing of large amounts of data, and it does so in a reliable, efficient, and scalable way. Hadoop is reliable because it assumes that computing elements and storage will fail, so it maintains multiple copies of working data and can redistribute work around failed nodes. Hadoop is efficient because it works in parallel, speeding up processing through parallel computation. Hadoop is also scalable and can handle petabytes of data. In addition, Hadoop runs on inexpensive commodity servers and is open source, so its cost is relatively low and anyone can use it.
Hadoop is a distributed computing platform that users can easily set up and use. Users can readily develop and run applications that process massive amounts of data on it. Its main advantages are:
1. High reliability. Hadoop's ability to store and process data bit by bit has proven trustworthy.
2. High scalability. Hadoop distributes data and completes computing tasks among available computer clusters. These clusters can be easily extended to thousands of nodes.
3. High efficiency. Hadoop can dynamically move data between nodes and keep each node dynamically balanced, so processing is very fast.
4. High fault tolerance. Hadoop can automatically save multiple copies of data and can automatically redistribute failed tasks.
Hadoop's framework is written in Java, so it is ideal for running on Linux production platforms. Applications on Hadoop can also be written in other languages, such as C++.
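As a rough illustration of what a Hadoop application looks like, the sketch below is the classic word-count job written against Hadoop's Java MapReduce API. The input and output HDFS paths are placeholders taken from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combiner reuses the reducer locally
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, a job like this is typically launched with the hadoop jar command, passing the input and output HDFS directories as arguments.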
Second, HPCC
HPCC is the abbreviation of High Performance Computing and Communications. In 1993, the US Federal Coordinating Council for Science, Engineering, and Technology submitted to Congress the report "Grand Challenges: High Performance Computing and Communications", also known as the HPCC Program, the US President's science strategy project. Its purpose was to solve a number of important scientific and technological challenges by strengthening research and development. HPCC was the United States' plan for implementing the information superhighway, with an estimated cost in the tens of billions of dollars. Its main goals were to develop scalable computing systems and related software that support terabit-level network transmission performance, to develop gigabit network technology, and to expand the networking capabilities of research and educational institutions.
The project is mainly composed of five parts:
1. High-Performance Computing Systems (HPCS), including research on future generations of computer systems, system design tools, advanced prototype systems, and evaluation of existing systems;
2. Advanced Software Technology and Algorithms (ASTA), covering software support for grand-challenge problems, new algorithm design, software branches and tools, and computational and high-performance computing research centers;
3. National Research and Education Network (NREN), covering research and development of relay stations and gigabit-level transmission;
4. Basic Research and Human Resources (BRHR), covering basic research, training, education, and course materials. It is designed to reward investigator-initiated, long-term research in order to increase the flow of innovative ideas in scalable high-performance computing, to enlarge the pool of skilled and well-trained personnel through improved education and training in high-performance computing and communications, and to provide the infrastructure needed to support these research activities;
5. Information Infrastructure Technology and Applications (IITA), whose purpose is to ensure the United States' leading position in the development of advanced information technology.
Third, Storm
Storm is free, open source software: a distributed, fault-tolerant, real-time computation system. Storm can process huge data streams very reliably, doing in real time what Hadoop does for batch data. Storm is simple, supports many programming languages, and is very enjoyable to use. Storm was open sourced by Twitter; other well-known companies using it include Groupon, Taobao, Alipay, Alibaba, Le Element, Admaster, and more.
Storm has many application areas: real-time analytics, online machine learning, continuous computation, distributed RPC (remote procedure call, a way of requesting a service from a program on a remote computer over the network), ETL (Extraction-Transformation-Loading, that is, data extraction, transformation, and loading), and so on. Storm's processing speed is impressive: in tests, each node processed one million tuples per second. Storm is scalable, fault-tolerant, and easy to set up and operate.
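To make the programming model concrete, here is a minimal topology sketch against Storm's Java API (package names assume a Storm 1.x-style release under org.apache.storm; the spout and bolt shown are illustrative placeholders, not part of Storm itself). A spout emits sentence tuples and a bolt splits them into words, which is the basic shape of most Storm applications.

```java
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class WordTopology {

  // Spout: the data source of the stream; here it just emits a fixed sentence repeatedly.
  public static class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
      this.collector = collector;
    }

    @Override
    public void nextTuple() {
      Utils.sleep(100);  // throttle the demo spout
      collector.emit(new Values("storm processes streams of tuples"));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("sentence"));
    }
  }

  // Bolt: splits each sentence tuple into word tuples.
  public static class SplitBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
      for (String word : input.getStringByField("sentence").split(" ")) {
        collector.emit(new Values(word));
      }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word"));
    }
  }

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("sentences", new SentenceSpout(), 1);
    builder.setBolt("split", new SplitBolt(), 2).shuffleGrouping("sentences");

    // Run in-process for demonstration; a production deployment would submit to a cluster.
    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("word-topology", new Config(), builder.createTopology());
    Thread.sleep(10_000);
    cluster.shutdown();
  }
}
```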
Fourth, Apache Drill
To help business users find more effective and faster ways to query Hadoop data, the Apache Software Foundation recently launched an open source project called "Drill". Apache Drill implements Google's Dremel.
According to Tomer Shiran, product manager at Hadoop vendor MapR Technologies, "Drill" is being run as an Apache Incubator project and will continue to be promoted to software engineers worldwide.
The project will create an open source version of Google's Dremel tool, which Google uses to speed up the Internet applications of its Hadoop data analysis tools. "Drill" will help Hadoop users query massive data sets faster.
The "Drill" project is actually inspired by Google ’s Dremel project: this project helps Google realize the analysis and processing of massive data sets, including analyzing and crawling Web documents, tracking application data installed on the Android Market, analyzing spam, and analyzing Test results on Google's distributed build system, etc.
By developing the Apache "Drill" open source project, organizations hope to establish Drill's API interfaces and a flexible, powerful architecture, thereby helping to support a wide range of data sources, data formats, and query languages.
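As a sketch of what querying through Drill can look like, the snippet below issues ANSI SQL against a raw JSON file via Drill's JDBC driver. The connection URL, the dfs storage plugin configuration, and the file path /data/orders.json are assumptions about a particular local installation, not claims about any specific deployment.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillQueryExample {
  public static void main(String[] args) throws Exception {
    // Assumed: a drillbit reachable on localhost and the Drill JDBC driver on the classpath.
    String url = "jdbc:drill:drillbit=localhost";

    try (Connection conn = DriverManager.getConnection(url);
         Statement stmt = conn.createStatement();
         // Drill can query raw files in place; this reads a (hypothetical) JSON file
         // through the dfs storage plugin and aggregates it with plain SQL.
         ResultSet rs = stmt.executeQuery(
             "SELECT t.customer, SUM(t.amount) AS total "
                 + "FROM dfs.`/data/orders.json` t GROUP BY t.customer")) {
      while (rs.next()) {
        System.out.println(rs.getString("customer") + " -> " + rs.getDouble("total"));
      }
    }
  }
}
```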
Fifth, RapidMiner
RapidMiner is a world-leading data mining solution built on highly advanced technology. It covers a very wide range of data mining tasks and techniques and can simplify the design and evaluation of data mining processes.
Functions and features
- Free data mining technology and library
- 100% Java code (runs on any operating system)
- The data mining process is simple, powerful and intuitive
- Internal XML guarantees a standardized format for representing and exchanging data mining processes
- A simple scripting language can be used to automate large-scale processes
- Multi-level data views ensure that data is handled effectively and transparently
- Graphical user interface for interactive prototyping
- Command line (batch mode) for automated large-scale applications
- Java API (application programming interface; see the sketch after this list)
- Simple plugin and extension mechanism
- Powerful visualization engine with visual modeling of cutting-edge, high-dimensional data
- More than 400 data mining operators supported
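As a sketch of the Java API mentioned above, the snippet below embeds RapidMiner in a headless program and runs a process that was designed in the GUI and saved as XML. It assumes the classic RapidMiner 5-style embedding classes (com.rapidminer.RapidMiner and com.rapidminer.Process); newer releases may differ, and the process file path is hypothetical.

```java
import java.io.File;

import com.rapidminer.Process;
import com.rapidminer.RapidMiner;

public class RunRapidMinerProcess {
  public static void main(String[] args) throws Exception {
    // Run RapidMiner headless, matching the command line (batch mode) feature listed above.
    RapidMiner.setExecutionMode(RapidMiner.ExecutionMode.COMMAND_LINE);
    RapidMiner.init();

    // Load a process designed in the GUI and stored as XML; the path is a placeholder.
    Process process = new Process(new File("/path/to/my-process.rmp"));
    process.run();  // executes the operator chain defined in the process
  }
}
```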
RapidMiner (formerly known as YALE) has been successfully applied in many different application fields, including text mining, multimedia mining, functional design, data stream mining, integrated development methods, and distributed data mining.
Sixth, Pentaho BI
The Pentaho BI platform differs from traditional BI products: it is a process-centric, solution-oriented framework. Its purpose is to integrate a series of enterprise-class BI products, open source software, APIs, and other components to make it easier to develop business intelligence applications. With its emergence, a series of independent business-intelligence-oriented products such as JFree and Quartz can be integrated together to form a complex, complete business intelligence solution.
The Pentaho BI platform, the core architecture and foundation of the Pentaho Open BI suite, is process-centric because the central controller is a workflow engine. The workflow engine uses process definitions to define business intelligence processes that execute on the BI platform. Processes can be easily customized or new processes can be added. The BI platform contains components and reports to analyze the performance of these processes. At present, the main components of Pentaho include report generation, analysis, data mining and workflow management. These components are integrated into the Pentaho platform through J2EE, WebService, SOAP, HTTP, Java, JavaScript, Portals and other technologies. The distribution of Pentaho is mainly carried out in the form of Pentaho SDK.
The Pentaho SDK contains five parts: the Pentaho platform, the Pentaho sample database, the stand-alone Pentaho platform, the Pentaho solution example, and a pre-configured Pentaho web server. The Pentaho platform is the most important part and contains the main body of the Pentaho platform's source code. The Pentaho sample database provides the data services the platform needs for normal operation, including configuration information, solution-related information, and so on; it is not strictly required by the platform and can be replaced by another database service through configuration. The stand-alone Pentaho platform is an example of the platform's stand-alone operating mode, demonstrating how to run the Pentaho platform independently without application server support.
The Pentaho solution example is an Eclipse project that demonstrates how to develop relevant business intelligence solutions for the Pentaho platform.
The Pentaho BI platform is built on a foundation of servers, engines, and components. These provide the system's J2EE server, security, portal, workflow, rule engine, charting, collaboration, content management, data integration, analysis, and modeling capabilities. Most of these components are standards-based and can be replaced with other products.