I used to use csdn and google blogs in the past. Recently I saw my friends using the blog Park and simply applied for one. I am a cainiao and don't like it. I hope you can give more advice.
This article is a solution written when processing a project with another friend.
To be honest, the 40 W data is not big, but the processing method is similar. Not much nonsense.
I. Database Structure Design
First, designing a rational database model will not only reduce the programming and maintenance difficulties of the client and server segment programs, but also greatly increase the performance of the system. Therefore, a complete database model must be designed before a system is implemented.
Ii. Query Optimization
Ensure that the number of accesses to the database is minimized based on the implementation of the function. By searching for parameters, the number of access rows to the table is minimized, and the result set is minimized, thus reducing the network burden; operations that can be separated should be processed separately to improve the response speed each time. When using SQL in the data window, try to place the indexes in the selected first column; the algorithm structure should be as simple as possible; during query, do not use wildcards such as SELECT * FROM T1. SELECT COL1 and COL2 FROM T1 if you want to use several columns; if possible, try to limit the number of rows in the result set as much as possible, for example, select top 300 COL1, COL2, COL3 FROM T1, because in some cases users do not need that much data.
If no index is created, the database must perform a full table scan to search for a certain data record, traverse all the data, and find the matching record.
3. Algorithm Optimization
Avoid using a cursor as much as possible because the efficiency of the cursor is poor. If the cursor operation has more than 10 thousand rows of data, you should consider rewriting .. Before using the cursor-based or temporary table method, you should first find a set-based solution to solve the problem. The set-based method is generally more effective. Like a temporary table, a cursor is not unavailable. Using a FAST_FORWARD cursor for a small dataset is usually better than other row-by-row processing methods, especially when several tables must be referenced to obtain the required data. A routine that includes "sum" in the result set is usually faster than a cursor. If the development time permits, you can try both the cursor-based method and the set-based method to see which method works better.
A cursor provides a method to scan rows in a specific set. Generally, a cursor is used to traverse data row by row and perform different operations based on different conditions of the retrieved data. In particular, the multi-table and Big Table-defined cursors (large data sets) cycle can easily make the program into a long, special, or even dead-end.
In some cases, you sometimes have to use a cursor. In this case, you can also consider transferring qualified data rows to a temporary table, and then operate on the temporary table definition cursor, which can significantly improve the performance. Encapsulate stored procedures
4. Create efficient Indexes
You can create an index for the following purposes: Maintain the uniqueness of the indexed column and provide a policy to quickly access data in the table. Large Databases have two types of indexes: Cluster indexes and non-cluster indexes. A table without cluster indexes stores data in a heap structure, and all the data is added at the end of the table, A table with a cluster index is physically stored in the order of the cluster index key. A table can only have one cluster index. Therefore, according to the B tree structure, it can be understood that adding any index can increase the query speed by index column, but it will reduce the performance of insert, update, and delete operations, especially when the Fill Factor is large. Therefore, tables with many indexes are frequently inserted, updated, and deleted. When creating tables and indexes, a small fill factor is set, in this way, we can leave more free space on each data page to reduce page segmentation and re-organization.
Indexing is one of the most efficient ways to obtain data from a database. Index technology can be used to solve 95% of database performance problems. As a rule, I usually use a unique group index for the logical primary key and a unique non-group index for the system key (as a stored procedure, use non-group indexes for any foreign key column [field. However, the index is like salt, and too much food is salty. You have to consider how much space the database has, how tables are accessed, and whether these accesses are mainly used for reading and writing.
Other considerations
The index is also the same. Indexing helps improve the search performance, but too many or improper indexing will also lead to low system efficiency. Because every time a user adds an index to a table, the database needs to do more work. Too many indexes may even cause index fragmentation.
Therefore, we need to establish an "appropriate" index system, especially for the creation of aggregate indexes, so that your database can enjoy high performance.
Solution: create an intermediate table and schedule Daily Data steps through DTS.
Design principles for intermediate tables
The number of records is the same as that of the original table. Multi-Table connections are reduced and the calculated values are saved. If the records are modified, the median value is recalculated based on the modified logs.
Incremental synchronization data (DTS)
Reads records directly from the daily Database Change log, updates the intermediate table, and schedules DTS Based on the server space to reduce the data synchronization time.
Calculation of intermediate data
The query is not executed by fields, so the calculation has been calculated during the process of generating an intermediate table.
Optimize index Design Based on queries
Index the where, group by, and other operation fields based on the data query features to improve the query speed.
Optimize Data Types
Use a large number of Int to increase query and Statistics speed
Optimize intermediate table keywords
Use Int to increase insertion speed
Data File Optimization Design: a major business, a data file, estimate the data volume when creating a data file, and create a large file at a time. In this way, the allocated file is a continuous file block.
V. JVM Optimization
The system uses Java + Tomcat + Mysql to optimize the JVM configuration of the Java Virtual Machine. Generally, the current garbage collection method of JVM is implemented by generational collection. Configure Java-Xmx3800m-Xms3800m-Xmn2g-Xss128k-XX: + UseParallelGC-XX: ParallelGCThreads = 20 to configure the memory and concurrency capabilities. Detailed configuration is not described. JVM Optimization requires on-site O & M testing and optimization.
Vi. Buffer Pool Technology
For a system with 0.4 million users, it is inevitable that the system will carry up to 0.4 million concurrent online requests. It is estimated that up to 0.4 million concurrent online voting requests will be carried. Buffer Pool processing is performed when three processes consume the most resources. One process is to verify the user name and password when a user logs on; the second process is the amount of access that the system receives during the user's voting period. The third process is message push, and multiple threads are used to push users in queue.
VII. Static Page Technology and Load Balancing Technology
The data submitted by the user is refreshed locally, that is, only the data submitted by the user needs to be used. Some unchanged data does not need to be submitted or downloaded.
(1) Server Load balancer is used to process client requests, which can handle tens of millions of concurrent requests.
This chapter describes some outline schemes for processing large concurrency of the system. The feasibility of the scheme is described through the tested data. The Mysql database uses the SELECT * Statement to query 0.5 million entries in full for a period of 1.130 seconds.
It takes 0.564 seconds to directly run an SQL statement in the database to query the id. in java, it takes 0.52 million seconds to query 2.776 pieces of data using a database connection.
Diagram of database usage using SQL statements
When you use java to operate databases to query all IDs
After the optimization, it takes only 1.070 seconds to log on to the system. Therefore, it is not a problem to log on to a large number of users.
8. ehcache Cache Technology + LRU and other elimination algorithms
This system is a high-concurrency, large data volume, and high-performance system. Therefore, cache plays a very important role. ehcache is a pure Java in-process cache framework and features fast and efficient. I used a variety of Java-based Open Source Cache components, including OSCache, JBossCache, and EHCache. OSCache is powerful and flexible. It can be used for Object Caching, Filter caching, and direct use of cache tags in JSP. In the recent use process, we found that when the concurrency is high, OSCache may encounter thread blocking and data errors. By analyzing the source code, we found that it is a defect of its internal implementation. The biggest advantage of JBossCache is that it supports cluster Synchronization Based on Object Attributes. However, the configuration and usage of JBossCache are complicated. When the concurrency is high, synchronization of object property data in a cluster also increases system overhead. EHCache is an open source project from sourceforge (http://ehcache.sourceforge.net/). It is also a simple and fast Cache component implemented in pure Java. EHCache supports memory and disk caching, multiple elimination algorithms such as LRU, LFU, and FIFO, and distributed Cache. It can be used as a Cache plug-in for Hibernate. It also provides a Filter-based Cache, which can Cache the response content and use Gzip compression to increase the response speed.
9. commons-pool
Creating and initializing new objects may consume a lot of time. This is especially true when initialization involves time-consuming operations (for example, reading data from a host other than 20,000 km. When a large number of such objects need to be generated, it may have some negligible impact on performance. To alleviate this problem, in addition to selecting better hardware and better virtual machines, it is also an effective countermeasure to appropriately adopt some encoding techniques that can reduce the number of object creation times. Object Pooling is a well-known technique in this field, while the Jakarta Commons Pool component is a powerful foreign aid for processing Object Pooling.