Introduction to Big Data (3): Adoption and Planning of Big Data Solutions


Big data projects are driven by business needs. A complete, well-designed big data solution is of strategic significance to an enterprise's development.

Because data comes from many different sources, its types and scales have different characteristics, and processing and analyzing big data involves additional dimensions (governance, security, and so on).

Therefore, before undertaking big data analysis, you must plan the overall project management process and decision-making framework in advance. The main considerations include:

1. Prerequisites

High-quality data, well-defined processes, skilled staff, and a planned schedule.

 

2. Data sources

Consider data from all channels (internal and external) and all data that can be used for analysis, including its format, collection method, and scale.

The main sources are internal enterprise data (business systems and the data management system, DMS) and external data (public data and commercial data).

The DMS stores logical data, processes, policies, and various other types of documents.

 

3. Data Privacy Management

Protect sensitive data by developing data masking (tagging and anonymization) and storage measures.
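As an illustration, here is a minimal Python sketch of field-level masking; the record layout and the `name`/`email` fields are hypothetical, and a real deployment would use a managed salt and, where re-identification is required, a reversible tokenization service.

```python
import hashlib

def mask_record(record, sensitive_fields=("name", "email")):
    """Replace sensitive fields with salted one-way hash tokens (anonymization)."""
    masked = dict(record)  # work on a copy; keep the original intact
    for field in sensitive_fields:
        if field in masked:
            digest = hashlib.sha256(
                ("demo-salt:" + str(masked[field])).encode()
            ).hexdigest()
            masked[field] = digest[:12]  # truncated token stands in for the raw value
    return masked

print(mask_record({"name": "Alice", "email": "alice@example.com", "amount": 42}))
```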

 

4. Data Security

Consider user authentication and authorization mechanisms to secure the database management system.

Many non-relational databases exchange data through plaintext communication APIs, which lack built-in security.

An application programming interface (API) lets software components communicate with one another; tools such as Postman can be used to test API calls.
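As a hedged sketch of the authentication point above, the following snippet (using the third-party `requests` library) sends an authenticated request over HTTPS; the endpoint and token are placeholders, not a real service.

```python
import requests  # third-party HTTP client: pip install requests

API_URL = "https://db.example.com/api/v1/records"  # placeholder endpoint
TOKEN = "REPLACE_WITH_A_REAL_TOKEN"                # issued by your auth system

# HTTPS plus a bearer token: the request is encrypted in transit and the
# server can authenticate and authorize the caller before returning data.
response = requests.get(
    API_URL,
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=10,
)
response.raise_for_status()  # fail loudly on 401/403 and other errors
print(response.json())
```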

 

5. Metadata

During its life cycle, big data may pass through different states as it is transmitted, processed, and stored across analysis stages. These state changes automatically trigger the generation of metadata, which can later be used to trace results and to verify data accuracy and reliability. A framework is therefore required to store this metadata.
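A minimal sketch of such a framework, assuming a hypothetical two-stage pipeline: each state change appends a metadata record (stage, timestamp, row count, checksum) that can later be used to trace how a result was produced.

```python
import hashlib
import json
import time

def record_metadata(stage, data, log):
    """Append a provenance record each time the data changes state."""
    payload = json.dumps(data, sort_keys=True).encode()
    log.append({
        "stage": stage,
        "timestamp": time.time(),
        "row_count": len(data),
        "checksum": hashlib.md5(payload).hexdigest(),  # detects silent changes
    })

lineage = []
raw = [{"id": 1, "v": 10}, {"id": 2, "v": None}]
record_metadata("ingest", raw, lineage)

cleaned = [r for r in raw if r["v"] is not None]
record_metadata("clean", cleaned, lineage)

print(json.dumps(lineage, indent=2))  # the trail used to trace results later
```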

 

6. Timeliness

Different services have different timeliness requirements. There are two processing modes: batch processing and stream processing.

Each mode is supported by different platforms and hardware (for example, Storm, a free open-source distributed stream processing system, and Hadoop, a free open-source distributed batch processing system).
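The difference between the two modes can be shown with a toy Python example (illustrative only): the batch function answers once after all data has arrived, Hadoop-style, while the stream function yields an updated answer per event, Storm-style.

```python
# Batch: compute over the complete dataset in one pass (Hadoop-style).
def batch_average(values):
    return sum(values) / len(values)

# Stream: update the result incrementally as each event arrives (Storm-style).
def stream_average(events):
    count, total = 0, 0.0
    for value in events:
        count += 1
        total += value
        yield total / count  # a fresh answer after every event

readings = [3.0, 5.0, 4.0, 8.0]
print(batch_average(readings))         # one answer once all data is in: 5.0
print(list(stream_average(readings)))  # running answers: [3.0, 4.0, 4.0, 5.0]
```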

 

7. Hardware performance

Because data volumes are large, query and transfer times can become excessively long, so the relevant hardware may need to be upgraded.

 

8. Data management framework

As data flows into the enterprise to be processed, stored, analyzed, and eventually purged, you should also develop processes and guidelines for monitoring, structuring, storing, and protecting it; this helps manage data complexity and related problems.

The data management framework also considers the following:

  • Manage large amounts of data in various formats;
  • Continuously train and manage the statistical models needed to preprocess and analyze unstructured data;
  • Set policies and compliance systems for the retention and use of external data;
  • Define data archiving and purging policies (a sketch of a purging-policy check follows this list);
  • Create policies for how data is copied across systems;
  • Set data encryption policies.
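As one concrete instance of these policies, here is a minimal sketch of an archiving/purging check; the record types and retention windows are illustrative assumptions, not prescribed values.

```python
from datetime import datetime, timedelta

# Illustrative retention windows per record type.
RETENTION = {"logs": timedelta(days=90), "transactions": timedelta(days=365 * 7)}

def due_for_purge(record_type, created_at, now=None):
    """Return True once a record has outlived its retention window."""
    now = now or datetime.utcnow()
    return now - created_at > RETENTION[record_type]

print(due_for_purge("logs", datetime(2020, 1, 1)))       # True: past 90 days
print(due_for_purge("transactions", datetime.utcnow()))  # False: still retained
```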

 

9. Establish a feedback loop mechanism

Establish an appropriate feedback loop mechanism to refine the analysis steps and obtain more accurate results.

 

10. Storage and computing environments

Multiple data storage options are available, such as cloud storage, relational databases, non-relational databases, and distributed file systems (DFS).

In practice, however, all or part of the big data environment is usually cloud-hosted.

 

Once all these preparations are in place, you can start addressing the actual business problem.

For a specific project, because big data differs from traditional data and big data analysis has diverse requirements, it follows a distinct life cycle, which can be divided into nine stages:

Figure 1: Life cycle of big data analysis

 

1. Case study:

Apply the SMART criteria, identify the big data problem (based on the 5 V characteristics), and evaluate budget and expected return:

  • Specific: clarify the business reasons, motivations, and objectives;
  • Measurable: create KPIs;
  • Attainable: analyze the available resources;
  • Relevant: analyze potential threats;
  • Timely: determine whether the project can be delivered on schedule.

 

2. Data Identification:

Try to find different types of related datasets and uncover the hidden information within them.

 

3. Data Acquisition and filtering

Classify the acquired data and filter out corrupt data; back up and compress the data before filtering.

"Corrupt data" includes missing values, meaningless values, and null values, as well as unstructured or unrelated records.

 

4. Data Extraction

Query and extract the data required for analysis, transforming it into the desired format according to the analysis type and the capabilities of the big data solution.

Currently, the main challenge is converting semi-structured formats (such as XML and JSON) into formats that are easy to analyze.
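As an illustration, the following sketch flattens a nested JSON document (the structure is hypothetical) into flat column/value pairs suitable for tabular analysis; production pipelines would use dedicated tools, but the idea is the same.

```python
import json

doc = json.loads(
    '{"order": {"id": 7, "customer": {"name": "A", "region": "EU"}, "total": 99.5}}'
)

def flatten(obj, prefix=""):
    """Collapse nested JSON objects into flat column-name/value pairs."""
    row = {}
    for key, value in obj.items():
        name = prefix + key
        if isinstance(value, dict):
            row.update(flatten(value, name + "."))
        else:
            row[name] = value
    return row

print(flatten(doc))
# {'order.id': 7, 'order.customer.name': 'A',
#  'order.customer.region': 'EU', 'order.total': 99.5}
```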

 

5. Verification and cleaning

Verify related datasets by integrating validation fields, filling in missing data, and removing known invalid data, drawing on redundant datasets where they overlap. (Apparently invalid data may still contain hidden patterns; outliers, for example, can be used to study risk.)

For batch data, validation and cleaning are performed in offline ETL systems; for stream processing they must happen in memory, which is more complex.
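A minimal sketch of this stage under assumed data: missing values are filled from a redundant, overlapping dataset, and extreme values are flagged rather than deleted so they remain available for risk analysis.

```python
import statistics

primary = {101: {"price": 9.5}, 102: {"price": None}, 103: {"price": 840.0}}
redundant = {102: {"price": 10.1}}  # overlapping dataset used to fill gaps

# Fill missing values from the redundant dataset instead of discarding rows.
for key, row in primary.items():
    if row["price"] is None and key in redundant:
        row["price"] = redundant[key]["price"]

# Flag (rather than delete) extreme values: outliers may encode hidden
# patterns, such as risk, that are worth studying separately.
median = statistics.median(r["price"] for r in primary.values())
for row in primary.values():
    row["outlier"] = row["price"] > 10 * median

print(primary)  # 840.0 is kept but flagged as an outlier
```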

 

6. Data Integration and representation

This stage logically or physically integrates data of different sources and formats and presents it through a single view (such as a two-dimensional table). Some of the integrated data is also stored for subsequent analysis.

Data integration involves two layers, formal and semantic (a combined sketch follows this list):

  • Formal data integration: different operating systems, databases, and programming languages define different basic data types, so the same data may be represented and stored differently, and exchanging it directly between systems produces incorrect results. Data formats must therefore be integrated, applying conversion rules, to build a data warehouse with a unified, standard structure.
  • Semantic data integration: different datasets may represent the same content with different values, so the semantics of the integrated data must be aligned before the system can process it. This work can be done manually or partly by machines, but the current state of the art does not allow full automation.
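A combined sketch of both layers, with all source schemas hypothetical: formal integration normalizes field names and types, while semantic integration maps differing encodings of the same content (gender codes, here) onto one shared vocabulary.

```python
from datetime import datetime

# Source A stores dates as "YYYY-MM-DD" strings and gender as "M"/"F";
# source B stores Unix timestamps and gender as integer codes.
GENDER_A = {"M": "male", "F": "female"}  # semantic alignment tables
GENDER_B = {0: "male", 1: "female"}

def from_source_a(rec):
    return {
        "user": rec["name"],
        "signup": datetime.strptime(rec["date"], "%Y-%m-%d"),  # formal: unify types
        "gender": GENDER_A[rec["sex"]],                        # semantic: unify values
    }

def from_source_b(rec):
    return {
        "user": rec["username"],
        "signup": datetime.fromtimestamp(rec["ts"]),
        "gender": GENDER_B[rec["gender_code"]],
    }

unified = [
    from_source_a({"name": "Li", "date": "2021-03-04", "sex": "F"}),
    from_source_b({"username": "Chen", "ts": 1614816000, "gender_code": 0}),
]
print(unified)  # one schema, one vocabulary, ready for a single view
```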

 

7. Data Analysis

Different analysis methods are used to extract business insights from data.

Data analysis methods can be divided into descriptive analysis, confirmatory analysis (hypothesis, then test), and exploratory analysis (induction).

At the same time, establish an appropriate iteration method and repeat the analysis to increase the likelihood of reliable results.
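To make the first two method types concrete, here is a small sketch with illustrative numbers: descriptive statistics summarize the data, and a toy confirmatory step states a hypothesis first and then tests it against the data.

```python
import statistics

daily_sales = [120, 132, 101, 134, 90, 230, 210]  # Mon..Sun, illustrative

# Descriptive analysis: summarize what the data says.
print("mean:", statistics.mean(daily_sales))
print("stdev:", statistics.stdev(daily_sales))

# Confirmatory analysis: state a hypothesis first, then test it.
# Hypothesis (illustrative): weekend sales exceed the weekday average.
weekday_mean = statistics.mean(daily_sales[:5])
weekend_mean = statistics.mean(daily_sales[5:])
print("hypothesis holds:", weekend_mean > weekday_mean)
```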

 

8. Data Visualization

Different visualization techniques are used to present analysis results graphically for different application scenarios, enabling professional analysts to communicate with users and surface potential answers.
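As a minimal illustration using the third-party matplotlib library, the following sketch renders results as a bar chart; the regions and figures are assumptions, not real analysis output.

```python
import matplotlib.pyplot as plt  # third-party: pip install matplotlib

regions = ["North", "South", "East", "West"]  # illustrative analysis output
revenue = [420, 310, 515, 275]

plt.bar(regions, revenue)
plt.title("Revenue by region (illustrative)")
plt.ylabel("Revenue (thousand USD)")
plt.savefig("revenue_by_region.png")  # or plt.show() in an interactive session
```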

 

9. Application of analysis results

The output of the analysis layer is consumed by users, which may be visualization applications, people (decision makers), or business processes.

 

 

Once you have decided to build a new big data solution or update an existing one, the next step is to identify the required components from the following two perspectives.

1. Logical layers of the big data solution:

The logical layers provide a way to organize related components, with different components performing different functions.

These layers are purely logical; they do not operate independently. On the contrary, they are closely linked, and data flows between them.

Big Data Solutions generally consist of the following logical layers:

  • Big data sources
  • Data transformation and storage layer
  • Analysis layer
  • Consumption layer

2. Vertical layers:

The vertical layers cover the aspects that affect every component of the logical layers. They include the following:

  • Information integration
  • Big data governance
  • Systems management
  • Quality of service

 

For details about the components of the logical and vertical layers and the relationships between them, see Figure 2.

Figure 2: Logical layer and vertical layer components
