The Necessity of Data Processing Tools
The beauty of Hadoop is that it provides inexpensive distributed storage and processing frameworks, which let us store and process massive amounts of data at very low cost. However, open source Hadoop still demands a lot of its users: they need to know Java and the MapReduce interfaces to write data processing programs, or be familiar with Hive SQL or Pig to write data processing logic in those tool languages.
For most data analysts and data scientists, these skills are not difficult to learn, but learning and using such low-level skills consumes a lot of valuable time. A feature-rich, easy-to-use data processing tool can therefore save business users, data analysts, and data scientists a great deal of time and effort. BigSheets is a graphical tool designed to handle massive amounts of data.
BigSheets Function Introduction
BigSheets is a spreadsheet-style tool for processing and analyzing big data. It has built-in support for a variety of data sources, data filtering, and content completion, offers a range of practical data processing functions for combining and processing data from different tables, and can visualize data as charts. It also provides rich data import and export interfaces.
BigSheets Architecture Introduction
BigSheets provides a complete data processing framework that sits between users and Hadoop. Users create workbooks in a browser interface and define data filters and data transformations as needed. The BigSheets engine translates the flow entered in the front end into an executable job (Pig). BigSheets first runs the data processing flow on sample data and displays the results to the user for preview; after the user confirms, BigSheets runs the processing logic on the full data set and obtains the final result. The architecture of BigSheets is as follows:
BigSheets Use Example
This example shows how to use BigSheets to process massive order data and demonstrates basic data processing operations: parsing, filtering, sorting, merging, and handling the results. The order data to be processed has been uploaded to an HDFS directory in advance.
Step 1, log in to the BigSheets interface:
BigSheets provides a browser-based management interface and user interface. In addition to the most basic Hadoop components (HDFS, YARN, and MapReduce), BigSheets relies on the BigInsights Home and Knox services: the BigInsights Home service provides a unified access interface for the IBM value-added components (BigSheets, Big SQL, and Text Analytics), and Knox provides a secure, unified access portal for external visitors.
In the browser address bar, enter https://<management node IP>:8443/gateway/default/biginsightsweb/index.html to open the interface. You can log in with the default user guest/guest-password.
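If the address has to be generated for several clusters, it can be assembled from the management node IP. The following is only a minimal sketch: the port and path are the ones given above, while the function name `bigsheets_login_url` and the IP 192.0.2.10 (a documentation-only address) are assumptions made for illustration.

```python
from urllib.parse import urlunsplit

# Port and path as given in the address above.
BIGINSIGHTS_PORT = 8443
BIGINSIGHTS_PATH = "/gateway/default/biginsightsweb/index.html"


def bigsheets_login_url(management_node_ip: str) -> str:
    """Build the web interface address for one management node (hypothetical helper)."""
    netloc = f"{management_node_ip}:{BIGINSIGHTS_PORT}"
    # urlunsplit takes (scheme, netloc, path, query, fragment).
    return urlunsplit(("https", netloc, BIGINSIGHTS_PATH, "", ""))


# 192.0.2.10 is a placeholder; substitute the real management node IP.
print(bigsheets_login_url("192.0.2.10"))
```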
Step 2, import the data into HDFS and create a new workbook:
You can create a BigSheets workbook from a local file or directory, or from an HDFS file or directory. BigSheets includes a variety of data parsers, covering basic web crawler data, character-delimited data, CSV-formatted text data, Hive data, JSON data, and TSV data. The following figure shows the creation of a workbook data source from a CSV file in HDFS:
Step 3, define the data processing logic in a copy of the generated workbook:
The initial workbook created from an HDFS file is read-only, so it must be copied to a new workbook before data processing logic can be added. The accompanying figure shows the order data being filtered by a time condition to extract the subset that needs to be processed, and then sorted by time.
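In BigSheets this step is done entirely through the graphical interface, but the logic behind it is easy to state. The sketch below expresses the same filter-then-sort idea in plain Python; the column names (`order_id`, `order_time`, `amount`), the sample rows, and the time window are all hypothetical and are not taken from the example data set.

```python
from datetime import datetime

# Hypothetical order rows; the real data would come from the CSV file in HDFS.
orders = [
    {"order_id": "A3", "order_time": "2015-03-18 09:30:00", "amount": 120.0},
    {"order_id": "A1", "order_time": "2014-12-30 17:05:00", "amount": 75.5},
    {"order_id": "A2", "order_time": "2015-01-02 08:15:00", "amount": 42.0},
]


def parse_time(row):
    """Turn the text timestamp of a row into a datetime for comparison."""
    return datetime.strptime(row["order_time"], "%Y-%m-%d %H:%M:%S")


def filter_and_sort(rows, start, end):
    """Keep rows with start <= order_time < end, then sort them by order_time."""
    selected = [row for row in rows if start <= parse_time(row) < end]
    return sorted(selected, key=parse_time)


result = filter_and_sort(orders, datetime(2015, 1, 1), datetime(2016, 1, 1))
print([row["order_id"] for row in result])
```

The original list is left untouched and a new one is returned, which loosely mirrors how the read-only source workbook stays unchanged while the copy carries the processing logic.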
The data used in an analysis often comes from multiple sources; each source has to be processed according to its actual condition and the results then merged. The accompanying figure shows redundant columns being deleted from the different data sources and the order data from those sources being merged through the Union operation.
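The idea of dropping redundant columns and then merging can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two sources (`web_orders`, `store_orders`) and their columns are hypothetical, and the sketch keeps duplicate rows and rejects mismatched columns, which may differ from what BigSheets itself does.

```python
def drop_columns(rows, redundant):
    """Return copies of the rows without the redundant columns."""
    return [{k: v for k, v in row.items() if k not in redundant} for row in rows]


def union(*tables):
    """Append the rows of all tables; every row must have the same columns."""
    merged = []
    expected = None
    for table in tables:
        for row in table:
            columns = set(row)
            if expected is None:
                expected = columns
            elif columns != expected:
                raise ValueError(f"column mismatch: {sorted(columns)} vs {sorted(expected)}")
            merged.append(row)
    return merged


# Hypothetical order data from two sources with one source-specific column each.
web_orders = [{"order_id": "W1", "amount": 30.0, "browser": "Firefox"}]
store_orders = [
    {"order_id": "S1", "amount": 12.5, "cashier": "c-07"},
    {"order_id": "S2", "amount": 8.0, "cashier": "c-02"},
]

all_orders = union(
    drop_columns(web_orders, {"browser"}),
    drop_columns(store_orders, {"cashier"}),
)
print(all_orders)
```

Removing the source-specific columns first is what makes the merge possible here, since the two tables only share a common set of columns after that step.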
BigSheets provides a large number of ready-made processing tools, including:
Filter: filters out data that does not meet the criteria, for example rows where the user name is empty;
Function: adds a data processing function (96 functions are built in), for example the sum of the input values;
Load: imports data from another workbook, for example to merge data from different tables;
Join: associates data in multiple tables, similar to a join in an SQL statement;
Group: groups the data and performs calculations on each group;
Union: merges data from multiple tables into one;
Intersection: obtains the data that two or more tables have in common on the specified columns; the tables must have the same schema;
Complement: takes the complement of the data on the specified columns, that is, the data in one table that does not appear in the other; the tables must have the same schema;
Limit: limits the number of rows processed, taking the top N rows in order;
Distinct: removes duplicate values from the table, keeping only one row for each group of duplicates;
Copy: copies data from another spreadsheet;
Formula: adds a data processing formula.
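To make the semantics of several of these tools concrete, the sketch below implements simplified versions of Join, Group, Intersection, Complement, Distinct, and Limit in plain Python. All function names, column names, and rows are hypothetical; the real tools are configured in the BigSheets interface and may behave differently in detail (for example in how Complement picks its side).

```python
from collections import defaultdict


def join(left, right, key):
    """Inner join on one column, similar to a join in SQL."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def group_sum(rows, key, value):
    """Group rows by `key` and sum `value` within each group."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


def intersection(a, b, key):
    """Rows of `a` whose `key` value also appears in `b`."""
    keys_b = {row[key] for row in b}
    return [row for row in a if row[key] in keys_b]


def complement(a, b, key):
    """Rows of `a` whose `key` value does not appear in `b`."""
    keys_b = {row[key] for row in b}
    return [row for row in a if row[key] not in keys_b]


def distinct(rows):
    """Drop exact duplicate rows, keeping the first occurrence."""
    seen, unique = set(), []
    for row in rows:
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique


def limit(rows, n):
    """Keep only the first n rows."""
    return rows[:n]


# Hypothetical tables.
orders = [
    {"order_id": 1, "customer": "ann", "amount": 10.0},
    {"order_id": 2, "customer": "bob", "amount": 5.0},
    {"order_id": 3, "customer": "ann", "amount": 7.5},
    {"order_id": 3, "customer": "ann", "amount": 7.5},  # exact duplicate
]
returned = [{"order_id": 2, "customer": "bob", "amount": 5.0}]
customers = [{"customer": "ann", "region": "north"}]

deduped = distinct(orders)
print(group_sum(deduped, "customer", "amount"))
print(join(deduped, customers, "customer"))
print(intersection(deduped, returned, "order_id"))
print(complement(deduped, returned, "order_id"))
print(limit(deduped, 2))
```

Note that `intersection` and `complement` are applied to two tables with the same columns, in line with the same-schema requirement listed above, while `join` combines tables with different columns.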
After you have defined the data processing flow, you can view it in the management interface as a flow diagram, as shown in the figure:
Step 4, process the full data set and save the results:
While you are editing the data processing flow, the results displayed in BigSheets are simulated on the first 2000 rows of the data set, and the top 50 rows of the result are displayed. After confirming that the data processing logic is correct, click the "Run" button to process the full data set.
BigSheets starts MapReduce jobs in the background through Pig and shows the progress with a progress bar in the foreground. After the job is completed, the results of the data processing are ready to use.
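The difference between the preview and the full run can be illustrated with a small in-memory sketch. It is not how BigSheets is implemented (BigSheets generates Pig jobs that run as MapReduce); it only mimics the behavior described above. The figures 2000 and 50 come from the text, while the function names, the generated rows, and the two processing steps are assumptions made for the example.

```python
from itertools import islice

SAMPLE_ROWS = 2000   # rows the preview is simulated on (from the text)
DISPLAY_ROWS = 50    # rows of the preview result that are displayed (from the text)


def run_flow(rows, steps):
    """Apply each step (a function from rows to rows) in order."""
    for step in steps:
        rows = step(rows)
    return list(rows)


def preview(rows, steps):
    """Run the flow on the first SAMPLE_ROWS rows and show DISPLAY_ROWS of the result."""
    sample = islice(rows, SAMPLE_ROWS)
    return run_flow(sample, steps)[:DISPLAY_ROWS]


def run_full(rows, steps):
    """Run the same flow on every row, as the Run button does for the full data set."""
    return run_flow(rows, steps)


def make_rows():
    """Hypothetical data set of 10,000 order rows."""
    return ({"order_id": i, "amount": float(i % 7)} for i in range(10_000))


# Hypothetical flow: drop zero-amount orders, then sort by order_id descending.
steps = [
    lambda rows: (r for r in rows if r["amount"] > 0),
    lambda rows: sorted(rows, key=lambda r: r["order_id"], reverse=True),
]

shown = preview(make_rows(), steps)
final = run_full(make_rows(), steps)
print(len(shown), shown[0]["order_id"])
print(len(final), final[0]["order_id"])
```

Because the preview only sees the first rows of the data set, its top row after sorting is not the top row of the full result. This is why the preview is suitable for checking the logic but the full run is needed for the final answer.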
There are three common usage scenarios: using the data within BigSheets, including viewing it and charting it in spreadsheets; creating Big SQL or Hive tables for the data set and accessing the data through Big SQL or Hive SQL; and exporting the spreadsheet data to HDFS for external use. The following figure shows how to export files and create data tables in BigSheets:
You can also chart the data directly to visualize it as needed. BigSheets supports a variety of common charts, including pie charts, histograms, line charts, and geographic maps. The following figure shows a pie chart of sales by region:
Postscript
In big data analytics, the amount of data processed ranges from terabytes to petabytes, and data processing is where an analytics team spends most of its time and effort. The data processing capabilities of BigSheets can effectively reduce the time spent developing and maintaining data processing flows, which makes it a rare and valuable data processing tool for big data analysis teams.
BigInsights Diamond BigSheets: Zero Programming! Processing Massive Amounts of Data