The previous articles combined everyday observation with earlier work to build a preliminary understanding of the concept of big data. In the next four articles, we set aside the concepts and basics and move to the core, discussing the technologies and knowledge points involved in big data applications from four angles: data collection, data storage, data management, and data analysis and mining.
Core technical architecture challenges:
1. Challenges to existing database management technology.
2. Classic database technology did not take the variety of data into account: SQL, the structured query language, was not designed with the storage of unstructured data in mind.
3. Challenges of real-time processing: traditional data warehouse systems and BI applications generally place few demands on processing time, so it is still acceptable for such applications to obtain results after a modeling run of one or two days. Real-time processing, however, is one of the key differences between big data applications and traditional data warehouse and BI technology.
4. Challenges in network architecture, data centers, and operations: with the volume of data created every day growing explosively, there is limited room to improve existing technology, while the risk of data loss keeps rising. Storing such massive data is the first serious problem; the pace of hardware upgrades is supposed to be the cornerstone of big data development, but in practice the results have not been ideal.
Analysis technologies:
1. Data processing: natural language processing (NLP)
2. Statistics and analysis: A/B testing, top-N ranking, region proportions, text sentiment analysis (a small counting sketch follows this list)
3. Data mining: association rule analysis, classification, and clustering
4. Model prediction: predictive models, machine learning, modeling and simulation
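As a small, hypothetical illustration of the statistics-and-analysis items, a top-N ranking and region proportions can be computed with nothing more than a counter; the page-view data and field meanings below are invented.

```python
from collections import Counter

# Hypothetical page-view log: (region, page) pairs.
page_views = [
    ("asia", "/home"), ("asia", "/cart"), ("europe", "/home"),
    ("asia", "/home"), ("europe", "/search"), ("asia", "/search"),
]

# Top-N ranking: the most viewed pages.
page_counts = Counter(page for _, page in page_views)
print(page_counts.most_common(2))   # [('/home', 3), ('/search', 2)]

# Region proportion: each region's share of total views.
region_counts = Counter(region for region, _ in page_views)
total = sum(region_counts.values())
print({region: count / total for region, count in region_counts.items()})
```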
Storage:
1. Structured data: queries, statistics, updates, and other operations on massive data are inefficient.
2. Unstructured data: images, videos, Word, PDF, PPT, and other files are stored in ways that make retrieval, querying, and storage difficult.
3. Semi-structured data: converted either to structured storage or to unstructured storage (a small flattening sketch follows this list)
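As a minimal sketch of the semi-structured case, a nested JSON record can be flattened into fixed columns before being written to a structured table; the record layout and field names here are made up.

```python
import json

# Hypothetical semi-structured record, e.g. from an application log.
raw = '{"user": {"id": 42, "name": "alice"}, "action": "login", "ts": "2014-01-01T08:00:00"}'

def flatten(record: dict, parent: str = "") -> dict:
    """Flatten nested dicts into column-like keys such as user_id, user_name."""
    row = {}
    for key, value in record.items():
        column = parent + key
        if isinstance(value, dict):
            row.update(flatten(value, column + "_"))
        else:
            row[column] = value
    return row

print(flatten(json.loads(raw)))
# {'user_id': 42, 'user_name': 'alice', 'action': 'login', 'ts': '2014-01-01T08:00:00'}
```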
Solutions:
1. Storage: HDFS, HBase, Hive, MongoDB, etc.
2. Parallel computing: MapReduce (a word-count sketch follows this list)
3. Stream computing: Twitter's Storm and Yahoo's S4
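To give a feel for the MapReduce model listed above, here is a minimal word-count sketch in Python written in the Hadoop Streaming style (the mapper emits key/value pairs, the reducer aggregates them); it illustrates the programming model only, not a production job.

```python
import sys
from itertools import groupby

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    """Reduce phase: sum the counts for each word (pairs must be grouped by key)."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Locally this runs map and reduce in one process; on a cluster the
    # framework would shuffle and sort the pairs between the two phases.
    for word, count in reducer(mapper(sys.stdin)):
        print(word, count, sep="\t")
```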
Big data and cloud computing:
1. Cloud computing is a business model; its essence is data processing technology.
2. Data is an asset. The cloud provides storage, access, and computing for data assets.
3. Current cloud computing still focuses on massive storage and computing, and on providing cloud services to run cloud applications. What it lacks is the ability to put data assets to work. Mining valuable information and performing predictive analysis to provide decision-making solutions and services for countries, enterprises, and individuals is the core topic of big data and the ultimate direction of cloud computing.
Big Data Platform Architecture:
This architecture diagram should be familiar to anyone who works on big data processing.
IaaS: Infrastructure as a Service. Infrastructure resources such as storage and databases delivered over the Internet.
PaaS: Platform as a Service. A platform on which users build, run, and manage their own applications.
SaaS: Software as a Service. Complete applications that can be used directly, such as managing enterprise resources over the Internet.
These concepts are not covered further here. In the next few articles, we will discuss the relevant parts of this stack (mainly those that fall within the PaaS layer) together with the technical challenges and related technologies mentioned above.
Outline:
Data collection: ETL
Data storage: relational databases, NoSQL, SQL, etc.
Data management: (infrastructure support) cloud storage and distributed file systems
Data analysis and mining: (result presentation) data visualization
The purpose of this article is not to give everyone a thorough understanding of the detailed ETL process; it is enough to know that this is the first step in data processing and the starting point for everything that follows.
Big data technology: data collection (ETL)
The data collection process can be understood simply: wherever there is a database, there is data to collect.
Here we are more concerned with the ETL process itself; at this early stage you only need to understand the basic categories of ETL.
Within the scope of data mining, the preliminary data cleansing stage can simply be regarded as the ETL process. ETL has matured alongside the development of data mining. We will not discuss the ETL process much further here; if it comes up later, it will be broken down in more detail.
Concept:
ETL stands for extract, transform, and load. ETL is responsible for extracting data from distributed, heterogeneous sources, such as relational tables and flat files, into a temporary staging layer, where it is cleansed, converted, and integrated, and finally loading it into a data warehouse or data mart, where it provides the basis for online analytical processing and data mining to support decision making.
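To make the extract/transform/load split concrete, here is a minimal, hypothetical skeleton in Python; the source file, cleansing rules, and target table are placeholders rather than a real project.

```python
import csv
import sqlite3

def extract(path):
    """Extract: read raw rows from a flat-file source (hypothetical orders.csv)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: cleanse, convert, and integrate rows into the warehouse format."""
    cleaned = []
    for row in rows:
        if not row.get("order_id"):                       # drop records missing the key
            continue
        cleaned.append({
            "order_id": int(row["order_id"]),
            "amount": float(row.get("amount") or 0.0),    # fill missing values
            "country": (row.get("country") or "UNKNOWN").upper(),
        })
    return cleaned

def load(rows, db_path="warehouse.db"):
    """Load: write the cleansed rows into the target table of the warehouse."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS fact_orders "
                 "(order_id INTEGER, amount REAL, country TEXT)")
    conn.executemany("INSERT INTO fact_orders VALUES (:order_id, :amount, :country)", rows)
    conn.commit()
    conn.close()

if __name__ == "__main__":
    load(transform(extract("orders.csv")))
```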
ETL is an important part of building a data warehouse: the user extracts the required data from the data sources, the data is cleansed, and finally it is loaded into the data warehouse according to a pre-defined data warehouse model. The field is less than ten years old, and the technology should by now be fairly mature. At first glance there seems to be nothing technically deep or esoteric about it, yet in real projects this stage often consumes far too much manpower, and later maintenance frequently consumes even more time. The causes are that the ETL work was under-estimated and not thought through carefully in the early stage of the project, and that it depends heavily on tool support.
When selecting an ETL product, four points matter: cost, staff experience, reference cases, and technical support. ETL tools such as DataStage, PowerCenter, and ETLAutomation differ in what they provide during the ETL process; in practical use the comparison comes down to metadata support, support for data quality, ease of maintenance, and support for custom development. A single project, from the data sources to the final target tables, may involve anywhere from a dozen or so to well over a hundred ETL processes. The dependencies between these processes, and the handling of error control and recovery, are all important considerations for any tool. I will not discuss this further here; specifics will be described alongside concrete applications.
Process:
In building an entire data warehouse, ETL accounts for roughly 50-70% of the work, so here we look at how the ETL process is carried out across a team. When analyzing massive amounts of data, the first requirement is good team collaboration. ETL covers not only E, T, and L themselves but also log control, the data model, source data verification, data quality, and so on.
For example, suppose we want to integrate an enterprise's data across the Asia-Pacific region, but each country has its own data sources: some run ERP systems, some use Access, and the databases differ, so network performance has to be considered. Connecting the data sources in two locations directly over ODBC is clearly unreasonable: with poor network performance and frequent connections, the database links can easily hang because they are never released. If instead we place a program on the servers in each region that exports the data to Access or to flat files, the files can then be transferred easily over FTP, as sketched below.
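A rough sketch of the export-then-FTP pattern just described; the local database, query, host, and credentials are all placeholders.

```python
import csv
import sqlite3
from ftplib import FTP

def export_to_flat_file(db_path, query, out_path):
    """Run the export query against the local database and write a flat file."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(query).fetchall()
    with open(out_path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    conn.close()

def ftp_upload(local_path, host, user, password, remote_name):
    """Push the exported file to the central server over FTP."""
    with FTP(host) as ftp:
        ftp.login(user, password)
        with open(local_path, "rb") as f:
            ftp.storbinary("STOR " + remote_name, f)

if __name__ == "__main__":
    export_to_flat_file("erp_local.db", "SELECT * FROM sales", "sales_cn.csv")
    ftp_upload("sales_cn.csv", "ftp.example.com", "etl_user", "secret", "sales_cn.csv")
```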
The following describes the work required in the above case:
1. Someone writes a general data export tool, in Java, a scripting language, or anything else. The point is that it must be generic: it can be controlled through different script files, the files it exports from the different databases in the different regions share the same format, and it can run exports in parallel.
2. Someone writes the FTP transfer program; a .bat script, an ETL tool, or other approaches can be used. The point is that it must be reliable and easy to call and control.
3. Someone designs the data model, including the structure of the files exported in step 1 as well as the table structures in the ODS and DWH.
4. Someone writes the stored procedures, both those used within the ETL itself and those for routine system maintenance, such as data quality checks.
5. Someone analyzes the source data, including table structures, data quality, null values, and the business logic.
6. Someone is responsible for developing the process, including implementing the various functions and the logging.
7. The team carries out the ETL testing together; the strength of one person is limited.
What the seven steps above really emphasize is that this is difficult for one person to accomplish alone: the work is team-oriented.
Here is a brief description of the ETL process itself, covering the basic handling of E, T, L, and exceptions. If you apply it in practice, you will no doubt investigate more deeply.
1. Data cleansing (a small sketch follows these items):
· Data fill: fill in empty and missing values where possible; data that cannot be filled is handled separately.
· Data replacement: replace invalid data.
· Format normalization: convert the data formats extracted from the sources into the target formats that are easy to process in the warehouse.
· Primary and foreign key constraints: use primary and foreign key constraints to replace illegal data, or export it to an error file for reprocessing.
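A minimal sketch of these cleansing rules applied to a flat file, with rejected rows routed to an error file; the column names and the reference set used for the constraint check are hypothetical.

```python
import csv

VALID_COUNTRIES = {"CN", "JP", "US"}   # hypothetical reference data for the constraint check

def clean_row(row):
    """Apply the cleansing rules; return (cleaned_row, None) or (None, reason)."""
    if "amount" in row:
        row["amount"] = row["amount"] or "0"                      # data fill
    if "order_date" in row:
        row["order_date"] = row["order_date"].replace("/", "-")   # format normalization
    if row.get("country") not in VALID_COUNTRIES:                 # constraint check
        return None, "unknown country code"
    return row, None

def cleanse(in_path, out_path, err_path):
    with open(in_path, newline="") as src, \
         open(out_path, "w", newline="") as good, \
         open(err_path, "w", newline="") as bad:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(good, fieldnames=reader.fieldnames)
        errors = csv.writer(bad)
        writer.writeheader()
        for row in reader:
            cleaned, reason = clean_row(row)
            if cleaned is not None:
                writer.writerow(cleaned)
            else:
                errors.writerow(list(row.values()) + [reason])   # export for reprocessing
```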
2. Data conversion (a small merge and de-duplication sketch follows the implementation notes below):
· Data merging: combine data from multiple sources; use a lookup when joining a large table against a small table, and a join for intersections of large tables (index the join fields to keep the associated queries efficient).
· Data splitting: split data according to defined rules.
· Row-column transposition, sorting/renumbering, and removal of duplicate records.
· Data verification: lookup, sum, count.
Implementation methods:
· In the ETL engine (for logic SQL cannot express)
· In the database (for logic SQL can express)
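A small, hypothetical illustration of the merge (lookup against a small table), duplicate-removal, and verification steps; the table contents are made up and everything is held in memory for simplicity.

```python
# Hypothetical fact rows and a small dimension table used as the lookup side.
orders = [
    {"order_id": 1, "country_code": "CN", "amount": 10.0},
    {"order_id": 1, "country_code": "CN", "amount": 10.0},   # duplicate record
    {"order_id": 2, "country_code": "JP", "amount": 25.0},
]
countries = {"CN": "China", "JP": "Japan"}                   # small table held in memory

# Data merging: lookup join of the large table against the small table.
merged = [dict(row, country_name=countries.get(row["country_code"], "N/A")) for row in orders]

# Removing duplicate records: keep the first occurrence of each order_id.
seen, deduped = set(), []
for row in merged:
    if row["order_id"] not in seen:
        seen.add(row["order_id"])
        deduped.append(row)

# Data verification: simple count check against the source keys.
assert len(deduped) == len({r["order_id"] for r in orders})
print(deduped)
```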
3. Data Loading
Methods (a sketch of the timestamp approach follows this list):
· Timestamp method: add a timestamp field to the business table; whenever business data is inserted or modified, the timestamp value is updated as well, so that only changed rows need to be extracted.
· Log table method: add a log table; whenever business data changes, the log table is updated and maintained accordingly.
· Full table comparison: extract all of the source data and, before updating the target table, compare records by primary key and field values, then update changed records and insert new ones.
· Full table delete and insert: delete the data in the target table, then insert all of the source data.
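A hedged sketch of the timestamp method combined with an update-or-insert apply step; the table and column names are invented and the SQL uses SQLite-style placeholders.

```python
import sqlite3

def incremental_extract(conn, last_run_ts):
    """Timestamp method: pull only the rows changed since the previous load."""
    return conn.execute(
        "SELECT order_id, amount, last_modified FROM source_orders "
        "WHERE last_modified > ?",
        (last_run_ts,),
    ).fetchall()

def apply_changes(conn, rows):
    """Update existing target rows by primary key, insert the ones that are new."""
    for order_id, amount, ts in rows:
        cursor = conn.execute(
            "UPDATE target_orders SET amount = ?, last_modified = ? WHERE order_id = ?",
            (amount, ts, order_id),
        )
        if cursor.rowcount == 0:
            conn.execute("INSERT INTO target_orders VALUES (?, ?, ?)", (order_id, amount, ts))
    conn.commit()
```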
Exception Handling
Data exceptions are unavoidable in the ETL process. Solutions:
1. Either write the erroneous records out separately and let the ETL run continue, loading the corrected data separately later, or stop the ETL run and re-execute it after the data has been fixed. The guiding principle is to accept as much data as possible.
2. For exceptions caused by external factors such as network interruptions, set a maximum number of retries or a retry window; once the retries are exhausted or the window has expired, fall back to manual intervention (a small retry sketch follows this list).
3. For exceptions such as changes in the source data structure or in an interface, load the data only after the definitions have been synchronized.
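A minimal retry wrapper along the lines of point 2; the attempt count, wait time, and the step being retried are arbitrary placeholders.

```python
import time

def run_with_retry(step, max_attempts=3, wait_seconds=30):
    """Retry a flaky ETL step (e.g. an FTP transfer) before asking for manual intervention."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except OSError as err:                      # network-style failures
            print("attempt", attempt, "failed:", err)
            if attempt == max_attempts:
                raise RuntimeError("retries exhausted, manual intervention required")
            time.sleep(wait_seconds)

# Usage sketch: run_with_retry(lambda: ftp_upload("sales_cn.csv", "ftp.example.com",
#                                                 "etl_user", "secret", "sales_cn.csv"))
```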
That is as far as we go with ETL here. The point to take away is that it is not as simple as it may look: in practice you will run into all sorts of problems, including communication between departments, and the earlier estimate that it occupies 50-70% of an entire data mining or analysis project may well be an understatement.
If a later article involves the ETL process, it will be discussed there in detail.
Copyright BUAA