MySQL performance optimization and Architecture Design-Chapter 1 Utilization of Cache and Search in scalability design, mysql Optimization
Chapter 1 scalability design: Utilization of Cache and Search
The scalable architecture solutions analyzed in the previous sections are basically centered on the database itself. Will this block our thinking on the path to scalability ", it cannot be more extensive. In this chapter, we will jump out of the problem of improving the scalability by relying entirely on the database itself, and extend the scalability of data services to the world outside the database!
15.1 extension beyond the database with scalable design
Databases provide data access services for applications to improve database scalability. They also provide data access capabilities, including reliability, efficiency, and ease of use. Therefore, our most fundamental goal is to make the storage service capability of the data layer more scalable, so that our investment is as direct as possible to the output.
We all understand that the data itself must have a persistent location, but is it necessary for us to make all our redundant data persistent? I think readers will certainly feel that this is not necessary, as long as there are at least two redundant copies of data for persistence. In addition, we can use some special technologies to replace the databases that require persistence, such as memory Cache, Search, and Disk File Cache and Search.
Looking for caches and Search outside of the database software to solve the scalability of the data itself has become a very important architecture upgrade actively attempted by various large Internet sites. This not only improves the scalability of the data processing layer in the entire application system, but also improves the performance to a greater extent.
For this architecture, there are already mature solutions mainly including the object-based distributed memory Cache solution Memcached and the high-performance embedded database programming library Berkeley DB, powerful full-text search engine Lucene and so on.
Of course, when using mature third-party products and occasionally implementing Cache and Search in specific application scenarios, it is not worth trying, it is also very meaningful for the company's technical accumulation. Of course, it is absolutely necessary to make a comprehensive assessment before you decide to develop and implement it on your own, not only including your own technical strength, but also to evaluate the business needs of your applications.
In fact, whether using existing third-party mature solutions or independent R & D, there is a need for investment in development resources. First of all, there are a variety of ideas to better integrate with the existing MySQL database. You can implement data communication (data update) with the Cache or Search on the database side, or directly update the Cache and Search data on the application side. Databases, Cache, and Search can be at different levels in the overall architecture or at the same level.
Next, I will conduct some targeted analysis and architecture ideas on the use of third-party mature solutions and independent R & D, and hope to help readers and friends.
15.2 rational use of third-party Cache Solutions
The biggest advantage of using mature third-party solutions is that while saving R & D costs, you can find more documentation on the internet, it is very helpful to help us solve some daily problems.
Currently, the most popular third-party Cache solutions include the object-based distributed memory Cache software Memcached and the embedded database programming library Berkeley DB. Next I will make an analysis and architecture discussion on these two solutions.
15.2.1 distributed memory Cache software Memcached
I believe that for many readers, Memcached is no stranger. It is no longer popular than MySQL. Memcached is so popular because of the following reasons:
◆ Simple communication protocol and clear API Interfaces;
◆ Efficient Cache algorithm, libevent-based event processing mechanism, superior performance;
◆ Object-oriented features, which are very friendly to application developers;
◆ All data is stored in the memory for efficient data access;
◆ Open-source software, based on BSD open-source protocol;
For the details of Memcached itself, I will not cover much here. After all, this is not the focus of this book. Next we will focus on how to use Memcached to help us improve the scalability of your data services (if you use the database again, it may be inappropriate.
To better integrate Memcached into the system architecture, you must first make an accurate positioning of Memcached in the application system. Is it just a Cache tool to improve data service performance, or makes it better integrated with the MySQL database to become a more efficient and ideal data service layer.
1. As a Cache tool to improve system performance
If we only use Memcached to improve the system performance as a Cache software, we need to maintain the synchronization of data in Memcached and that in the database through applications. At this time, Memcached can basically be understood as a front-end Cache layer than the MySQL database.
If we use Memcached as a data Cache service for the application system, there is basically no need to make any changes to the MySQL database. We only need to maintain and update the Cache through the application program. The biggest advantage of this operation is that it can completely eliminate the need for database-related architectures, but it also has a drawback: if there are many data objects to Cache, the amount of code that applications need to add will increase a lot, and the complexity and maintenance costs of the system will also rise.
The following figure shows the architecture of Memcached when it is used as a simple Cache service layer.
We can see that all data is written to the MySQL Master, including the INSERT when the data is first written, and the UPDATE and DELETE of existing data. However, for existing data, you must DELETE the data in Memcached while updating or deleting the data in MySQL to ensure the overall data consistency. All read requests are first sent to Memcached. If data is read, the system directly returns the data. If no data is read, the system reads the data from MySQL Slaves, write the read data to Memcached for Cache.
This method is generally applicable to environments with a small number of cache object types and a large amount of data to be cached. It is a fast and effective solution for performance problems. Because this architecture method is not closely related to the MySQL database itself, it does not involve too many technical details here.
2. Integrate with MySQL into the data service layer
In addition to using Memcached as a tool to quickly improve efficiency, we can also use it to improve the scalability of the data service layer and integrate it with our database, or as a buffer for the database. Let's first look at how to integrate Memcached and MySQL databases into a whole to provide external services.
In general, we have two ways to integrate Memcached and MySQL databases into a whole to provide external data services. One is to directly use Memcached's memory capacity as the second-level cache of the MySQL database to increase the cache size of the MySQL Server, and the other is to use the MySQL UDF to communicate with Memcached, maintain and update data in Memcached, while the application directly reads data through Memcached.
The first method is mainly used for scenarios where business requirements are very special and it is difficult to split data, and it is difficult to transform applications to utilize the Cache outside the database. Of course, under normal circumstances, this is certainly not possible. At present, we must use the power of the outside world. The open-source project Waffle Grid is the external force we need. I
Waffle Grid is an amazing idea created by several DBAs outside China: Since the low cost of PC Server attracts us, however, it is difficult to make a major breakthrough in its Scale-Up capability. Why not use the popular Memcached as a breakthrough in the memory ceiling of a single PC Server? Under the impetus of this idea, several young men started the Waffle Grid open-source project, using the open-source features of MySQL and Memcached, and combining the Simple Features of Memcached communication protocol, memcached is successfully implemented as an external "second-level cache" for the MySQL host. Currently, only Buffer pools for Innodb are supported.
The implementation principle of Waffle Grid is not complicated. What he does is that when Innodb is in the Local Buffer Pool (we should call it the Local Buffer Pool, before reading data from a disk data file, use the Memcached communication API to try to read the corresponding cached data from Memcached (we call it Remote Buffer ), innodb accesses disk files to read data only when the required data does not exist in the Remote Buffer. Moreover, only the data in the LRU List in the Innodb Buffer pool will be sent to the Remote Buffer Pool. Once the data is modified, Innodb will move it to the FLUSH List, waffle Grid also clears the data that enters the FLUSH List from the Remote Buffer Pool. So it can be said that Dirty Pages will never exist in the Remote Buffer Pool, which ensures that data loss will not occur when the Remote Buffer Pool fails. Is the architecture diagram of the Waffle Grid project:
As shown in the architecture diagram, we first apply the Waffle Grid Patch on the MySQL database to communicate with other Memcached servers through its connection. To ensure network communication performance, use a high-bandwidth private network between MySQL and Memcached. In addition, the architecture diagram here does not distinguish the Master database from the Slave database. It does not mean that the database cannot be distinguished, but only one. In the actual application process, you only need to apply Waffle Grid on the Slave, and the Master does not need such a large memory.
After reading the implementation principles of Waffle Grid, some readers may have some questions. Isn't the performance of all queries that require physical reads directly affected? All operations to read Remote Buffer must be obtained through the network. Is the performance high enough? In this regard, I also use the author's actual Waffle data to contact everyone's concerns:
I don't need to worry too much about the performance of this set of test comparison data obtained through DBT2. As to whether Waffle Grid is suitable for your application scenario, you can only evaluate it by yourself.
Next we will introduce another Integration Method of Memcached and MySQL, that is, through the UDF function provided by MySQL, compile the corresponding program to implement data communication update between MySQL and Memcached.
Unlike Waffle Grid, the data in Memcached is not controlled and maintained by MySQL, but maintained by applications and MySQL. Every time an application reads data from Memcached, if it finds that it cannot find the data it needs, it then reads the data from the database, and then writes the data to Memcached. MySQL controls the cleaning of invalid data in Memcached. Every time data in the database is updated or deleted, mySQL calls the Memcached API through user-compiled udfs to notify Memcached that some data has expired and deleted the data.
Based on the above implementation principle, we can design a data service layer architecture as follows:
As shown in, this architecture is compared with the above-mentioned Memcached completely and MySQL read-out as a conventional Cache server. The biggest difference is that Memcached data is changed to maintained and updated by the MySQL database, instead of updating applications. First, the data is written into the MySQL database by the application. At this time, the UDF compiled by the user above MySQL will be triggered, and then the UDF will call the communication interface of Memcached to write the data to Memcached. When data in MySQL is updated or deleted, related udfs in MySQL will also update or delete data in Memcached. Of course, we can also allow MySQL to do fewer things, just to delete the data in Memcached through UDF when the data is updated or deleted, write operations are performed by applications like the previous architecture.
Because Memcached uses Object-based data access and Hash for data retrieval, we need to set a Key to identify all data stored in Memcached, this Key is used to access all data. That is to say, if you cannot use one or more keyword conditions like the Query statement of MySQL to read the result set containing multiple data records, it is only applicable to the Data Reading method that uses a unique key to obtain a single piece of data.
15.2.2 embedded database programming library Berkeley DB
To be honest, the name of the database programming library is really awkward, but I can't find any other proper term to call Berkeley DB, so let's just use the more common online name.
What Memcached implements is a memory-based Cache. If we do not have such high performance requirements and we do not have enough budget, we can also choose database-type Cache software such as Berkeley DB. Many readers may be confused again. Why should we use another "Database" such as Berkeley DB for the MySQL database we use? In fact, Berkeley DB was also one of MySQL's storage engines in the past, but it was removed from the supported storage engine in the future because it did not know why (obtaining is related to commercial competition. While using databases, we also use database-type caches such as Berkeley DB because we can give full play to their respective advantages while using traditional general-purpose databases, at the same time, Berkeley DB's efficient key-value pairs can be used as a supplement to the performance of efficient data retrieval to achieve better data service layer scalability and higher overall performance.
The architecture of Berkeley DB can be divided into five functional modules. The five modules are relatively independent throughout the system, and can be set to use or disable one or more modules, therefore, five subsystems may be more appropriate. These five subsystems and their basic introductions are as follows:
◆ Data Access
The Data Access Subsystem is mainly responsible for storing and retrieving data. Berkeley DB also supports the following four Data Storage Methods: Hash, B-Tree, Fixed Length, and Dynamic Length. In fact, these four methods correspond to the actual format of the four data file storage. The data storage subsystem can be used independently and must be enabled.
◆ Transaction Management
The transaction management subsystem provides complete ACID transaction attributes for data processing services with transaction requirements. When you start the transaction management subsystem, you must at least enable the lock management subsystem and log system to help achieve transaction consistency and integrity.
◆ Lock Management
The lock management system provides the shared data control function to ensure data consistency. Supports row-level and page-level locking mechanisms and provides services for the transaction management subsystem.
◆ Shared memory
Shared Memory sub-system I want you to see the name should basically know what to do, that is, to manage and maintain shared Cache and Buffer, to provide data Cache services for the system to improve performance.
◆ Log System
The log system is mainly used in the transaction management system. To ensure transaction consistency, Berkeley DB also adopts the policy of first writing logs and then writing data. Generally, it is also used together with the transaction management system to close at the same time.
Based on the features of Berkeley DB, it is difficult for us to integrate it with the MySQL database as closely as Memcached does. Data Maintenance and update operations mainly need to be completed through applications. In general, the main reason for using Berkeley DB while using MySQL is to improve system performance and scalability. Therefore, most of the time, the data storage formats of the Hash and B-Tree structures are used, especially the Hash format, which is the most widely used, because this method is also the most efficient access.
In an application, each data request is first retrieved to the Berkeley DB using a preset Key. If data exists, the obtained data is returned. If data is retrieved in bits, then read it from the database again. Then, the read data is stored in the Berkeley DB according to the preset Key, and then returned to the client. When data is modified, the application must delete the data in the Berkeley DB after modifying the data in MySQL. Of course, if you want to, you can also directly modify the data in the Berkeley DB, but this may introduce more data consistency risks and increase system complexity.
From the principle point of view, the use of Berkeley DB is not much different from the use of Memcached as a pure Cache. Why don't we use Memcached? There are two main reasons: Memcached uses pure memory to store data, while Berkeley DB can use physical disks. In addition to the Hash storage format used by Memcached, Berkeley DB also supports other storage formats, such as B-Tree. The basic usage principle of Memcached is not much different from that of Memcached.
15.3 self-implemented Cache service
In fact, in addition to the mature third-party software solutions, if we have some technical strength, we can achieve the same effect through our own Cache software.
Of course, you should not be intimidated by the "technical strength" mentioned above. In fact, it is not as difficult as you think. As long as you do not want to create a software that can solve all the problems at the beginning and contains all the advantages of other third-party Cache software, you cannot leave any disadvantages behind, do not start by hoping to make software that is perfect for a product. Start from an early age and start from excellent. Never try to be a fat man in one breath. This solution is probably swallowed up.
The premise of self-development and implementation of Cache service software is that there are special application scenarios in the system. Through self-development, you can achieve personalized requirements to the maximum extent. Of course, you can also optimize your application scenarios to maximize scalability and performance. After all, only ourselves are the people who really know our application systems.
One of the most important questions to consider when deciding whether to develop on your own is whether my application system scenario is so special that the existing third-party software is difficult to solve the current main problems?
If the current third-party software has basically solved more than 80% of the problems encountered by our system, we may need to consider whether it is necessary to completely develop independently. After all, all the third-party software we choose is open-source. If some small areas cannot meet the requirements, we can add some of our own things based on the third-party software to satisfy some Personalized Requirements.
When we choose to develop the Cache service software independently, we need to pay attention to the following points:
1. Functional Requirements
A) Is it full memory or some disks?
B) Do I Need To synchronize updates in real time or can I allow latency of Cache data?
C) Do I Need To support distributed architecture?
The function mentioned here is actually the requirement range setting. Before we start R & D, we need to have a very clear scope of requirements, instead of making adjustments while developing and thinking about what to do. After all, any software system needs to be oriented to the needs of the first line, and once development begins, demand control cannot be sloppy. Otherwise, it is very likely that it will die halfway and end with failure.
2. Technical Implementation
A) data synchronization (or asynchronous) update mechanism;
B) Data Storage Method (Hash Or B-Tree );
C) communication protocol;
Technical implementation may become a major difficulty in the development process. Whether a stable and reliable data synchronization (or asynchronous) update mechanism determines the final success or failure of the Cache software. Of course, you can say that the data synchronization (or asynchronous) update is completely handed over to the application that needs to access the data from row maintenance, however, it is not clear whether you have sufficient capabilities to apply to develop a Cache software on your own, but you also need to make huge adjustments to the front-end application to adapt to this Cache software. The boss may say that since you have developed and implemented it on your own, why cannot we complete the data update and maintenance function? The data storage method directly determines the data access method, and the implementation of algorithms also directly determines the performance of the software. Finally, the communication protocol for data transmission may also be confusing. It may not be easy to design a communication protocol that is simple enough but does not limit the subsequent expansion and upgrade. After all, if each upgrade requires a data transmission communication protocol, the application transformation cost caused by each upgrade is too high. The complexity may affect the convenience of front-end applications and the performance.
A) convenient management interfaces;
B) High Availability support (automatic or manual switch );
C) basic monitoring interface;
Never ignore the maintenance cost of a software system. Once a software is used, the main task is to maintain it. If the maintainability is too poor, it may bring a lot of maintenance work, and even bring the trust and enthusiasm of frontline application staff and O & M personnel for the software.
The self-developed Cache service may have different architectures based on different functional features, but it is basically different from the architecture used in Memcached, so we will not discuss it in detail here.
Finally, I personally have a suggestion that, when using the more common Cache service (including self-implemented Cache software services, we should try our best to integrate the Cache software with our MySQL database so that they can complement each other. In addition, the front-end applications should try not to directly operate the back-end data service cluster, and try to accept and process all data processing services through an intermediate proxy layer to make the front-end applications transparent. In this way, the backend Data Service (database and Cache) layer is fully transparent to the front end, this gives our data layer real high scalability.
15.4 use Search for efficient full-text Search
Whether Memcached or Berkeley DB is used, data retrieval can only be performed in a specific way, meeting only a small part of the retrieval requirements. The database itself should be clear about the performance of Full-fuzzy LIKE operations, which is very low because indexes cannot be used for such operations. Although MySQL's MyISAM storage engine supports full-text indexing, the official version does not support multi-byte character sets, therefore, users who need to store Chinese characters or use storage engines other than MyISAM are completely unusable.
In this case, we have only one solution, that is, full-text indexing of data through full-text indexing software, which we often call the Search engine, in order to achieve a more efficient data retrieval efficiency.
Similarly, the use of Search software has two mature third-party solutions and self-developed methods.
Currently, the most famous third-party solution is Java-based Lucene, which is a sub-project under the Jakarta Project Team of the Apache Software Foundation. Of course, he is not a complete search engine tool, but a full-text search engine framework. He also provides a complete query engine and Data Index engine for retrieval.
Here I will not discuss the technical details of Lucene itself, interested readers can visit the official site (http://lucene.apache.org) to learn more and more authoritative details. Here I will mainly introduce what Luence can bring to us and how we can use it.
Because Lucene's efficient full-text indexing and Word Segmentation Algorithms and efficient data retrieval implementation, we can use this advantage to solve the full-text fuzzy search function that is completely unavailable to databases and traditional Cache software.
Our requirements are different from those of traditional generic search engines, and we do not need "Spider" to crawl data on the Internet everywhere, you only need to write the persistent data in our database by calling Lucene APIs through the application, and create indexes using Lucene, then, you can call the data retrieval API provided by Lucene to obtain the data to be accessed and perform full Fuzzy Matching. Since the process from database to Lucene is completely implemented by ourselves, it is very easy for us to control the real-time data. It can be completely real-time, and can also be refreshed in a fixed (or dynamic) period.
Although Lucene data is stored on disks rather than in memory, it is very efficient due to efficient Word Segmentation Algorithms and index structures. According to some discussions on the internet, when the data volume is slightly larger, for example, dozens of GB, Lucene's efficiency will decrease very quickly. In fact, this is not a scientific statement, from what I have seen with my own eyes, there are hundreds of GB of data in Lucene, and the performance is still outstanding. I have had a deep experience in performance optimization over the past few years, that is, the performance of a software is not only determined by its own, in many cases, different users of a very efficient software have different effects. Therefore, when the performance of the third-party software we use fails, do not rush to the conclusion that it is a problem with the software, we need to find out from ourselves whether we actually use it correctly.
In addition to using third-party Search software such as Lucene, we can also develop Search software that is more suitable for our own application scenarios. Like my current company, I have independently developed a set of high-performance distributed Search software with pure memory storage that is more suitable for my own application scenarios, this allows various application systems to make many efficient and more personalized features. After years of technical and experience, we have developed into another application system data source that is tied with databases.
Of course, the technical threshold for self-developed Search software may also be relatively high, and there are not many development teams with this technical strength. Therefore, we must make a good evaluation of all aspects before deciding on self-developed software. However, if we cannot implement a general Search software, but it is only for some specific functions, it may not be as complicated as we think, moreover, in today's open-source world, there are countless types of software. using existing tools and self-tailored secondary development, the implementation of some specific functions may be easier.
After the Search software is added to implement the efficient full-text Search function, we can use the following figure for our architecture:
15.5 use distributed parallel computing to achieve high-performance computing of large data volumes
When talking about large-scale high-performance computing with large data volumes, many people may think of a keyword recently popular across the IT industry: cloud computing, or "Grid Computing" a few years ago ".
A friend suggested that I change "Distributed Parallel Computing" to "cloud computing" in the title of this section, and I still did not change it after repeated consideration. To be honest, from my personal understanding, both "cloud computing" and "Grid Computing" are essentially the same. They are all "Distributed Parallel Computing ", it's just some "Concept Games" played by various commercial companies to attract everyone's attention. I personally think it is purely commercial. Of course, some people may think that every "Upgrade" from "Distributed Parallel Computing" to "Grid Computing" to "cloud computing" is an extension of the available computing resources. But such expansion is all about the concept, and the true technical implementation relies not on these concepts, but on the development of various software and hardware. What's more, the initial concept of "Distributed Parallel Computing" does not limit the methods in which we can only use resources.
Currently, the most popular distributed parallel computing framework is Google's MapReduce and Yahoo's Hadoop. In fact, Google's MapReduce + GFS + BigTable and Yahoo's Hadoop + HDFS + HBase architecture should be more accurate. Both are composed of three components responsible for different functions. MapReduce and Hadoop are the same functions for solving task decomposition and merging. GFS and HDFS are both distributed file systems, solve the infrastructure problem of data storage. BigTable and HBase are the same class database modules that process structured data storage formats. The three modules work together to form a distributed parallel computing framework.
In fact, the implementation principles of the distributed parallel computing architecture framework of the two Internet giants are basically the same. The preceding task decomposition and merge engine is used to separate computing (or data access) tasks into multiple tasks and send them to multiple computing (or data) nodes for computing, each of the following nodes uses a distributed file system as the basic platform for storing computing data. Of course, both pre-computing and post-computing data are organized through modules such as BigTable and Hbase. The three constitute a complete whole and are mutually dependent. Of course, what I have to say is that Hadoop is also developed by Google based on the principle of MapReduce. However, Google's contribution to Open thinking is often limited to the form of papers, and there is little information about its own technical architecture.
In fact, in addition to the two heavyweight distributed computing frameworks, there are also some complete solutions that fully utilize the existing open-source database, such as the BI solution implemented by Inforbright and MySQL, greenplum and Sun use the Greenplum System Implemented by PostGresql open source database, and both systems are implemented by MapReduce theory.
Although these two systems are not as scalable as the previous two distributed computing frameworks at present, they are targeted at real DB scenarios, and data access interfaces fully implement SQL specifications. It is no doubt tempting to use the analysis methods used to play the database through SQL statements. What's more, these two systems basically do not need to be developed and are already a complete product.
Considering the length and not the focus of this book, we will not discuss the relevant technical details here. If you are interested in this content, you can learn more and more comprehensive information on their respective official websites.
A database is only a tool for storing data. Its particularity is that it can persist data and provide unified and standardized access interfaces. In addition to databases, we can also use many other data storage and processing methods, combined with various data storage and processing methods, to give full play to their respective features, foster strengths and circumvent weaknesses, and form a comprehensive data center, in this way, the scalability of the system's data processing system can be maximized, and the performance can be optimized.
From: MySQL performance tuning and architecture design, Jian Chaoyang
Reprinted please indicate the source: