1. Differences from small-scale services with only a few servers
Ensure scalability and load balancing
Ensure Redundancy
Low-effort operation: reduce manual intervention (there are too many machines to keep track of by hand)
2. Difficulties in large-scale data processing: memory vs. disk
Memory access is roughly 10^5 to 10^6 times (up to about a million times) faster than a random disk access
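A rough back-of-the-envelope check of this gap. The two latency figures below are assumed, order-of-magnitude values (they are not from the original notes), and real hardware varies widely, so treat the result as a scale estimate only.

```python
# Assumed, order-of-magnitude latencies (illustrative, not measured).
MEMORY_ACCESS_SECONDS = 100e-9   # about 100 ns for a main-memory access
DISK_SEEK_SECONDS = 10e-3        # about 10 ms for a random seek on a spinning disk


def speed_ratio(slow_seconds: float, fast_seconds: float) -> float:
    """Return how many times faster the fast operation is than the slow one."""
    return slow_seconds / fast_seconds


ratio = speed_ratio(DISK_SEEK_SECONDS, MEMORY_ACCESS_SECONDS)
print(f"memory is about {ratio:,.0f} times faster than a disk seek")
```

With faster memory or a slower disk the ratio grows by another order of magnitude, which is why the gap is usually quoted as five to six orders of magnitude.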
3. Techniques for large-scale data
When writing programs:
Do the work in memory whenever possible
Use algorithms that cope with data growth (e.g. binary tree lookup, O(log n))
Use data compression and search techniques
Prerequisites: the underlying foundations
Operating system cache
Operating an RDBMS with a distributed architecture in mind
How to use data structures and algorithms in a large-scale environment
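A small sketch of why O(log n) algorithms cope with data growth. Binary search over a sorted list stands in here for the binary tree mentioned in the notes (same logarithmic lookup cost); the data set and function names are made up for illustration.

```python
import bisect
import math


def linear_search(sorted_ids, target):
    """O(n): compare against every element until the target is found."""
    for index, value in enumerate(sorted_ids):
        if value == target:
            return index
    return -1


def binary_search(sorted_ids, target):
    """O(log n): bisect halves the remaining search range on each step."""
    index = bisect.bisect_left(sorted_ids, target)
    if index < len(sorted_ids) and sorted_ids[index] == target:
        return index
    return -1


ids = list(range(0, 2_000_000, 2))  # one million sorted even numbers
last = ids[-1]
assert linear_search(ids, last) == binary_search(ids, last)

# Worst-case comparisons: about n for the scan, about log2(n) for bisection.
print("linear:", len(ids), "binary:", math.ceil(math.log2(len(ids))))
```

Growing the data a thousandfold adds only about ten more steps to the binary search, while the linear scan grows a thousandfold too.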
4. Operating system cache
Virtual memory: processes do not use physical memory addresses directly. The kernel abstracts memory so that each process sees its own address space starting at 0. The operating system allocates memory in units of pages.
Page cache: data read from disk into memory is not released right away; it stays cached for later reads.
VFS: the disk cache is implemented by the page cache. The VFS layer hides the differences between underlying file systems, so caching works the same way for all of them.
File cache: eviction uses LRU. The minimum unit is one page.
Reducing I/O load: if memory is larger than the data files, all the data can be cached, and data compression need not be considered.
Locality-aware distribution: distribute data according to access patterns
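The page cache behaviour described above can be sketched as a toy LRU cache keyed by file and page number. This is only an illustration of the idea, not how any kernel implements it; `PageCache`, `fake_disk_read`, and the 4096-byte page size are assumptions.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size in bytes


class PageCache:
    """Toy LRU cache keyed by (file name, page number)."""

    def __init__(self, capacity_pages, read_page_from_disk):
        self.capacity_pages = capacity_pages
        self.read_page_from_disk = read_page_from_disk
        self.pages = OrderedDict()  # oldest entry first
        self.hits = 0
        self.misses = 0

    def read(self, filename, offset):
        key = (filename, offset // PAGE_SIZE)  # caching is per page, not per byte
        if key in self.pages:
            self.hits += 1
            self.pages.move_to_end(key)  # mark as most recently used
            return self.pages[key]
        self.misses += 1
        data = self.read_page_from_disk(*key)
        self.pages[key] = data
        if len(self.pages) > self.capacity_pages:
            self.pages.popitem(last=False)  # evict the least recently used page
        return data


def fake_disk_read(filename, page_number):
    """Hypothetical stand-in for a slow disk read."""
    return f"{filename}:page{page_number}"


cache = PageCache(capacity_pages=2, read_page_from_disk=fake_disk_read)
cache.read("data.db", 0)      # miss: page 0 is loaded
cache.read("data.db", 100)    # hit: same page 0
cache.read("data.db", 4096)   # miss: page 1 is loaded
cache.read("data.db", 8192)   # miss: page 2 is loaded, page 0 is evicted
cache.read("data.db", 0)      # miss again: page 0 was evicted
print(cache.hits, cache.misses)
```

When the cache holds fewer pages than the working set, evicted pages must be read from disk again, which is why memory larger than the data file removes most read I/O.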
5. Database scale-out strategy
Key points for distributed MySQL
Make good use of the operating system cache
Set indexes correctly
Design applications on the assumption that they will scale out
Distributing MySQL
Replication: master/slave
Scaling updates/writes: table partitioning, key-value stores
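A minimal sketch of how an application might route queries under these two techniques together: writes go to the master of the partition that owns the key, reads rotate over that partition's slaves. The class, host names, and the CRC32-based partitioning rule are all hypothetical; this is not a MySQL API.

```python
import itertools
import zlib


class ShardedCluster:
    """Route writes to a partition's master and reads to its slaves."""

    def __init__(self, shards):
        # shards: list of {"master": str, "slaves": [str, ...]}
        self.shards = shards
        # Fall back to the master when a partition has no slaves.
        self._read_cycles = [
            itertools.cycle(shard["slaves"] or [shard["master"]]) for shard in shards
        ]

    def shard_index(self, key):
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        return zlib.crc32(key.encode("utf-8")) % len(self.shards)

    def host_for_write(self, key):
        return self.shards[self.shard_index(key)]["master"]

    def host_for_read(self, key):
        return next(self._read_cycles[self.shard_index(key)])


cluster = ShardedCluster([
    {"master": "db1-master", "slaves": ["db1-slave-a", "db1-slave-b"]},
    {"master": "db2-master", "slaves": ["db2-slave-a"]},
])
print(cluster.host_for_write("user:42"), cluster.host_for_read("user:42"))
```

Replication spreads reads only; the partitioning step is what spreads writes. A real setup also has to handle replication lag, for example by reading from the master right after a write.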
6. Special-purpose indexes
Inverted index: used for full-text search. It can be served from a separate index server.
7. Full-text index implementation
Steps: crawl, store, build the index, search, score, display results
Inverted index structure: Dictionary + Postings
Dictionary creation: a word dictionary + Aho-Corasick, or morphological analysis
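A minimal sketch of the index-building and search steps. A naive whitespace tokenizer stands in for the Aho-Corasick or morphological-analysis step, and scoring is plain term frequency; both are simplifying assumptions, and the sample documents are made up.

```python
from collections import defaultdict


def tokenize(text):
    """Naive whitespace tokenizer; a stand-in for real term extraction."""
    return text.lower().split()


def build_inverted_index(documents):
    """Map each term (the dictionary) to {doc_id: [positions]} (its entries)."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index


def search(index, term):
    """Return (doc_id, score) pairs, scored by term frequency, best first."""
    entries = index.get(term.lower(), {})
    scored = [(doc_id, len(positions)) for doc_id, positions in entries.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


documents = {
    1: "memory is faster than disk",
    2: "disk cache keeps disk pages in memory",
    3: "replication copies data",
}
index = build_inverted_index(documents)
print(search(index, "disk"))
```

A lookup touches only the entries for the query term, never the full document set, which is what makes the structure suitable for a dedicated index server.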
8. Thinking about scalability
Load tuning: make the load visible
Consider each machine's purpose, e.g. crawlers
9. Ensure Redundancy
Application server: increase the number of servers. The load balancer handles failover and failback.
Database server: increase the number of servers. Multi-master setups risk the masters falling out of sync when replication switches over. If that risk is accepted, recovery has to be done manually.
Storage Server:
System Stability
There is a trade-off with resource utilization: keep an appropriate margin
Growing resource usage and memory leaks can hurt stability. Automate the response: automatic DoS detection, automatic restart on abnormal behavior, automatic termination of long-running queries
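A hedged sketch of two of the automatic checks, reading the note above as "restart on abnormal memory growth" and "terminate long-running queries". Only the decision logic is shown; the thresholds, `RunningQuery`, and both function names are hypothetical, and actually killing a query or restarting a process is left to the surrounding tooling.

```python
from dataclasses import dataclass


@dataclass
class RunningQuery:
    query_id: int
    seconds_running: float
    sql: str


def queries_to_terminate(running, max_seconds=30.0):
    """Return the ids of queries that have run longer than the allowed time."""
    return [q.query_id for q in running if q.seconds_running > max_seconds]


def should_restart(memory_used_mb, memory_limit_mb):
    """Flag a process whose memory use suggests a leak or runaway growth."""
    return memory_used_mb > memory_limit_mb


running = [
    RunningQuery(1, 0.2, "SELECT * FROM entry WHERE id = 1"),
    RunningQuery(2, 95.0, "SELECT * FROM entry ORDER BY RAND()"),
]
print(queries_to_terminate(running))
print(should_restart(memory_used_mb=900, memory_limit_mb=512))
```

Setting the limits with some margin reflects the trade-off noted above: tighter limits use resources more fully but trigger more restarts.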
10. Improve Efficiency
Virtualization: scalable, cost-effective, and highly available. Disadvantage: performance overhead of roughly 2% for CPU, 10% for memory, 50% for network, and 5% for I/O
Effective use of cheap hardware: multi-core CPUs, SSDs
11. Network
Breakpoints: 1 Gbps of traffic, 500 hosts, going global (CDN)
Knowledge of lasting value: the keys to understanding server load are the operating system, caching, multithreading/multiprocessing, virtual memory, and file systems
Viewing the load on a single server: check the load average, then determine whether the bottleneck is CPU or I/O
Load average: shown by top and uptime. It is the number of tasks waiting for CPU plus tasks waiting for I/O, averaged per unit time
CPU bottleneck: check with sar or vmstat
Use top or sar to check whether the CPU time is going to user programs or to system (kernel) processing
ps: view process state and CPU time, and identify the problematic process.
Once the process is identified, use strace or oprofile to narrow down the problem
I/O bottleneck: the disk is accessed frequently, either because of too many I/O requests or because of swapping. Check the state of the swap area with sar or vmstat.
If swapping occurs:
Use ps to check whether a process is consuming a large amount of memory
If the program is at fault, improve the program
If memory is genuinely insufficient, add memory. If it cannot be added, consider distributing the load
If there is no swapping and I/O is still frequent, the cache is probably insufficient:
Increase memory
If memory cannot be increased, or more memory is still not enough, consider distributing the data or adding cache servers.
Operating system tuning means finding the bottleneck and resolving it.
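The diagnostic flow above can be summarized as a small decision function. This is a sketch under stated assumptions: the inputs are values you would read off uptime, sar, or vmstat, and the thresholds (load average above the CPU count, I/O wait above combined CPU time) are illustrative rules of thumb, not figures from the notes.

```python
ADVICE = {
    "ok": "load is low; no tuning needed",
    "cpu-user": "find the process with ps, then profile it (strace, oprofile)",
    "cpu-system": "kernel-side time is high; check system calls with strace",
    "io-swapping": "find the memory-hungry process, fix it, add memory, or distribute",
    "io-cache": "cache is too small; add memory, distribute data, or add cache servers",
}


def diagnose(load_average, cpu_count, user_pct, system_pct, iowait_pct, swapping):
    """Classify the bottleneck from readings taken with uptime, sar, or vmstat."""
    if load_average <= cpu_count:      # assumed threshold: one task per CPU
        return "ok"
    if iowait_pct > user_pct + system_pct:
        return "io-swapping" if swapping else "io-cache"
    return "cpu-user" if user_pct >= system_pct else "cpu-system"


result = diagnose(load_average=9.5, cpu_count=4, user_pct=10.0,
                  system_pct=5.0, iowait_pct=70.0, swapping=False)
print(result, "->", ADVICE[result])
```

On Unix systems, `os.getloadavg()` returns the 1-, 5-, and 15-minute load averages and could supply the first argument.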
This article is from the "Ying: Good memory as bad pen" blog. Please be sure to keep this source: http://yingtju.blog.51cto.com/3760152/1299911