Common strategies for large Internet sites to address massive data volumes

Source: Internet
Author: User
Tags: mysql, columnar database
Compared with a traditional storage environment, the data storage of a large Internet site is not simply a server plus a database; it is a complex system made up of network devices, storage devices, application servers, public access interfaces, and applications. It can be divided into a business data layer, a computing layer, data warehousing, and data backup. It provides data storage services through application server software and monitors the storage units through monitoring tools.

As the number of users in the system grows, the volume of data grows with it. In such a constantly expanding environment, the system is flooded with data, which becomes difficult to search and retrieve. With massive data, a user-submitted request may not return its result until the next day, which directly hurts user satisfaction and the rollout of new business. Technically, this severely limits an RDBMS in large-scale application scenarios, and the only option is to scale out: add the resources of multiple logical units and make them serve as a single centralized resource, achieving system scalability.

The data in a system is like the items in a house: clothes in the closet, dishes in the cupboard. The database and the storage system are the containers, and the different kinds of data are the clothes and dishes. Put each type of thing into a suitable container and the efficiency and utilization of the system will be higher, so we make the following design:

[Diagram: structure model of a large system storage unit]

The structure model of a large system storage unit consists of six parts:

1. Business Data Layer
The various types of file data generated by each business, including user information, user operation records, real-time business data, mobile client upgrade packages, images, and so on.

2. Computing Layer
Different data formats and types of data files are processed with different tools and computing methods. Distributed and parallel computing frameworks such as MapReduce and BSP are used for large-scale computation. In addition, some data is cached to relieve pressure on the storage application servers.
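
As a concrete illustration of the MapReduce style of computation, here is a minimal sketch of the classic word count against Hadoop's mapreduce API; the class names are illustrative, not part of the system described in the article.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: emit (word, 1) for every token in the input split.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                ctx.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word across all mappers.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```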

3. Data Storage Layer
For querying and storing massive data, especially user behavior logs, columnar database servers are used. The data for business processing and the business rules are still stored in relational databases; MySQL is used for this storage.

4. Data Warehousing
The data warehouse is mainly used for user behavior logs and user behavior analysis; it is also where a large amount of data is generated in the system. Apache Hive, Pig, and Mahout are used to build the data warehouse.

5. Data Backup
It can be divided into online data backup and offline data backup. The backup process requires accumulated O&M experience to customize reasonable backup rules based on the business and user traffic.

6. Hardware
The hardware environment is the most basic part of a storage unit. It is divided into disk, memory, and network device storage. Different business data and files are stored on different hardware devices.

Technical Implementation
Different business data and application servers in the system require different read/write methods, as well as different data storage types, data warehouse construction, data cold/hot separation, and data indexing. The components involved include business applications, log collection agents, Filesystem in Userspace (FUSE), the data access proxy layer (DDAL/Cache Handler), OLAP, log servers, Oracle (tentative), MySQL, Redis, Hive, HDFS, and MooseFS.


[Diagram: technical implementation architecture]

The components of the above architecture are described as follows:

1. Data Access Proxy Layer
Collectively referred to as the data access proxy layer (DAPL), it encapsulates the DDAL and Cache Handler layers and abstracts them away from the applications written on top of it, which simplifies expansion and maintenance. For example, the upper layer does not need to know the specific operations performed on HDFS; it only needs to work with the provided interfaces. The DAPL encapsulates many read/write policies for the various data sources, so transaction integrity can be ensured for operations that span different databases and data sources.
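
As an illustration of the kind of abstraction the DAPL provides, here is a hedged sketch of a storage interface an upper layer might program against; all names here are hypothetical, not from the original system.

```java
// Callers depend only on this interface; the DAPL implementation behind it
// decides whether a read is served from Redis, a MySQL slave, or HDFS,
// and routes writes to a writable master.
public interface UserStore {
    /** Read path: the proxy may answer from cache, a slave database, or HDFS. */
    String findNameById(long userId);

    /** Write path: the proxy sends this to a master and invalidates any cache. */
    void saveName(long userId, String name);
}
```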

2. DDAL
The distributed data access layer (DDAL) mainly handles read/write splitting for the relational databases. To implement read/write splitting, the incoming SQL statements must first be parsed; a Round-Robin algorithm is then used to balance the heavy read load. In the code implementation, MySQL-JDBC parameter configuration is used to balance load across the MySQL slaves.
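
Below is a minimal sketch of the routing idea, under the simplifying assumption that any statement beginning with SELECT is a read; the JDBC URLs and class name are illustrative. (On the driver side, MySQL Connector/J also offers a jdbc:mysql:loadbalance:// URL that balances connections across hosts.)

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class ReadWriteRouter {
    private final String master;          // e.g. "jdbc:mysql://master:3306/app"
    private final List<String> slaves;    // JDBC URLs of the slave databases
    private final AtomicInteger next = new AtomicInteger();

    public ReadWriteRouter(String master, List<String> slaves) {
        this.master = master;
        this.slaves = slaves;
    }

    /** Parse just enough of the SQL to decide read vs. write. */
    public String route(String sql) {
        boolean isRead = sql.trim().toUpperCase().startsWith("SELECT");
        if (!isRead) return master;                  // writes go to the master
        // Round-robin across slaves to spread the read load.
        int i = Math.floorMod(next.getAndIncrement(), slaves.size());
        return slaves.get(i);
    }
}
// Usage: DriverManager.getConnection(router.route(sql)) picks the right node.
```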

3. Cache Handler
Similar to the DDAL, except that the Round-Robin algorithm for balancing heavy read loads is implemented in-house; in addition, when the Redis master goes down, a new master can be assigned to take over write operations.
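
A hedged sketch of that failover idea using the Jedis client: if a write to the master fails, promote one of the slaves with SLAVEOF NO ONE and retry. The class is hypothetical; production systems would typically delegate this to Redis Sentinel or similar.

```java
import java.util.Deque;
import redis.clients.jedis.Jedis;

public class FailoverWriter {
    private Jedis master;
    private final Deque<Jedis> slaves;   // candidates for promotion

    public FailoverWriter(Jedis master, Deque<Jedis> slaves) {
        this.master = master;
        this.slaves = slaves;
    }

    public void set(String key, String value) {
        try {
            master.set(key, value);
        } catch (RuntimeException e) {
            // Master looks down: promote a slave and retry the write once.
            Jedis promoted = slaves.removeFirst();
            promoted.slaveofNoOne();     // stop replicating; accept writes
            master = promoted;
            master.set(key, value);
        }
    }
}
```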

4. Redis: One Master, Multiple Slaves
Read/write splitting of the cache data reduces the I/O bottleneck of a single machine. It is worth noting that the cache is not reliable storage, so the design must tolerate the loss of cache data: whenever cache data is lost, it is reloaded from the database.
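
A minimal cache-aside sketch matching the "reload from the database" rule; the DAO, key scheme, and TTL are illustrative assumptions.

```java
import redis.clients.jedis.Jedis;

public class UserCache {
    private final Jedis cache;
    private final UserDao dao;                // hypothetical DB access object
    private static final int TTL_SECONDS = 3600;

    public UserCache(Jedis cache, UserDao dao) {
        this.cache = cache;
        this.dao = dao;
    }

    public String userName(long id) {
        String key = "user:name:" + id;
        String cached = cache.get(key);
        if (cached != null) return cached;      // cache hit
        String fromDb = dao.loadName(id);       // cache miss: go to MySQL
        cache.setex(key, TTL_SECONDS, fromDb);  // repopulate; loss is tolerable
        return fromDb;
    }

    public interface UserDao { String loadName(long id); }
}
```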

5. MySQL: Dual Master, Multiple Slaves
This approach is a well-proven solution in MySQL architecture design. It handles the data access pressure while guaranteeing data reliability. The two master MySQL databases at the front end back each other up, and the many slave MySQL databases at the back end synchronize the data written to the masters, so the MySQL data on every server node stays consistent. The DDAL application writes data to the master MySQL databases in a polling (round-robin) fashion.

6. Database Read/Write Splitting
MySQL-Proxy's policies were studied and used to develop a read/write splitting method for the MySQL nodes. The MySQL driver supports read/write splitting while preserving data integrity, and sharding is used when the data volume becomes very large.
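
As one concrete option on the driver side, recent versions of MySQL Connector/J ship a replication-aware mode: statements issued on a read-only connection go to slaves, everything else goes to the master. A hedged sketch with illustrative host names, credentials, and table:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        // First host is the master; the rest are slaves.
        String url = "jdbc:mysql:replication://master:3306,slave1:3306,slave2:3306/app";
        try (Connection conn = DriverManager.getConnection(url, "app", "secret")) {
            conn.setReadOnly(false);            // writes are sent to the master
            try (Statement st = conn.createStatement()) {
                st.executeUpdate("UPDATE users SET name = 'a' WHERE id = 1");
            }
            conn.setReadOnly(true);             // reads are balanced over slaves
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery("SELECT name FROM users WHERE id = 1")) {
                while (rs.next()) System.out.println(rs.getString(1));
            }
        }
    }
}
```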

7. Cache Read/Write Splitting
This is the caching policy for Redis. In the self-developed application, the Round-Robin algorithm is implemented to perform read/write splitting across the Redis master and slave cache clusters.
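
A minimal sketch of that round-robin read splitting with the Jedis client; the hosts and class name are illustrative. Writes always go to the master, reads rotate over the slaves.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import redis.clients.jedis.Jedis;

public class RedisReadSplitter {
    private final Jedis master;
    private final List<Jedis> slaves;
    private final AtomicInteger next = new AtomicInteger();

    public RedisReadSplitter(Jedis master, List<Jedis> slaves) {
        this.master = master;
        this.slaves = slaves;
    }

    public void write(String key, String value) { master.set(key, value); }

    public String read(String key) {
        // Rotate across slaves so no single node absorbs all the reads.
        int i = Math.floorMod(next.getAndIncrement(), slaves.size());
        return slaves.get(i).get(key);
    }
}
```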

8. ETL Tools
Pig, from the Apache Hadoop project, is used to clean the massive behavior data. Pig can run SQL-like scripts over regular semi-structured data, and the computing load is distributed and processed in parallel across the servers.
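
A hedged sketch of driving such a cleanup job from Java through Pig's PigServer API; the paths and the field layout are hypothetical.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class LogCleaner {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        // Load raw behavior logs and keep only well-formed rows.
        pig.registerQuery("raw = LOAD '/logs/raw' USING PigStorage('\\t') "
                + "AS (uid:chararray, action:chararray, ts:long);");
        pig.registerQuery("clean = FILTER raw BY uid IS NOT NULL AND ts > 0;");
        // Materialize the cleaned relation back onto the cluster.
        pig.store("clean", "/logs/clean");
    }
}
```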

9. Hive Cluster
Apache Hive is a data warehouse framework built on Hadoop. It provides a convenient data integration method and an SQL-like query language, Hive QL, which is compiled into Map/Reduce jobs that support large-scale data analysis on the Hadoop framework.
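
A hedged sketch of issuing a Hive QL query over JDBC via HiveServer2; the host, table, and columns are illustrative.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hive-host:10000/default", "hive", "");
             Statement st = conn.createStatement();
             // Hive QL looks like SQL but runs as MapReduce jobs on the cluster.
             ResultSet rs = st.executeQuery(
                     "SELECT action, COUNT(*) FROM behavior_log GROUP BY action")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```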

10. HDFS Distributed File System
All of Hive's data is stored in the Hadoop Distributed File System, and every stored block has replicated copies, which guarantees data reliability.
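
A minimal sketch of writing a file to HDFS and requesting three replicas through the standard FileSystem API; the namenode URI and path are illustrative.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");   // keep three copies of each block
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        try (FSDataOutputStream out = fs.create(new Path("/logs/2024/behavior.log"))) {
            out.writeBytes("uid\taction\tts\n");
        }
        fs.close();
    }
}
```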

11. MooseFS Distributed File System
Unlike HDFS above, MooseFS does not require a dedicated client program to operate on the distributed files on the server; it can be mounted directly in any running environment, and the server side also provides copy replication.

12. Cold/Hot Data Separation
The content generated in the system is classified and stored by tier. The content users care about most, and the hot topics generated in the last few days, are abstracted as "hot data"; the older the data, the more it is classified as "cold data". The "hot" nodes therefore store the newest and most frequently accessed data, and for this data we want to give users the fastest possible query speed, so there are clear distinctions in both hardware and software choices. For example, recently and frequently accessed data is kept in the system cache, and business data that must be accessed frequently is stored in MySQL or Oracle database systems.
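
A hedged sketch of the routing rule just described: records younger than a cutoff stay in the hot tier (cache plus MySQL/Oracle), older ones move to the cold tier (Hive/HDFS). The seven-day cutoff and the names are illustrative assumptions.

```java
import java.time.Duration;
import java.time.Instant;

public class TierRouter {
    enum Tier { HOT, COLD }

    // Anything accessed or created within this window counts as "hot".
    private static final Duration HOT_WINDOW = Duration.ofDays(7);

    static Tier tierFor(Instant createdAt) {
        boolean recent = Duration.between(createdAt, Instant.now())
                                 .compareTo(HOT_WINDOW) <= 0;
        return recent ? Tier.HOT : Tier.COLD;
    }
}
```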

Related Articles
Common strategies for large Internet sites to address high concurrency

-End-

Original article: Common strategies for large Internet sites to address massive data volumes. Thanks to the original author for sharing.
