Some time ago I read the book "Large-Scale Web Service Development Technology". Today I spent an afternoon going through it again and wrote down the key points, so that I can review them later when the details fade. Since I am already fairly familiar with data compression and full-text retrieval, these notes mainly cover the first five chapters, followed by some scattered notes. This article may be helpful to readers who: 1. are interested in the book but not sure what it covers; 2. already have some experience with large-scale web services.
Hatena Scale (April 2010)
- About 1.5 million registered users; roughly 19 million unique users (UU) per month
- Traffic at peak times: 850 Mbps (excluding images)
- More than 600 physical servers; over 1,300 hosts when counting virtualized hosts
- Log data generated per day: on the order of GB to TB
System Growth Strategy
- Start minimally, and manage and design with foreseeable changes in mind
Balance efficiency and quality
- Meetings, standardization, documentation, agile methods, etc.
For a GB-scale text database (roughly 10 million records) without an index, a single SELECT query cannot finish within seconds.
Speed Difference between memory and Hard Disk
- Seek (addressing): memory is roughly 10^5 to 10^6 times faster than disk
- Transfer speed (over the bus): about 7.5 GB/s for memory vs. about 58 MB/s for disk
Finding the bottleneck on a single machine (make full use of a single machine first; measure, don't guess; a small sketch follows this list)
- Use sar or vmstat to check whether the problem is CPU or I/O
- CPU problems
- top or sar: check whether the CPU time is consumed by system processes or by user programs
- ps: check process state and CPU usage to identify the problematic process
- Use strace or oprofile to locate the specific problem inside the program or process
- I/O Problems
- Frequent paging/swapping ---> insufficient memory
- ps: check how much memory the program uses
- Can the program be improved to use less memory?
- If not, add hardware or go distributed
- If there is no paging problem, the memory available for caching is probably insufficient
- Add memory
- If memory cannot be added: add machines and distribute the data
Scaling for CPU load is relatively easy, but scaling for I/O load is hard
- Check the actual load: load average in the top output (1-minute, 5-minute, 15-minute averages)
- Check whether the load is I/O-bound or CPU-bound: sar -P ALL (per-core statistics on multi-core machines)
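To make these checks concrete, here is a minimal sketch (my own illustration, not from the book) that samples /proc/stat and /proc/loadavg directly; this is roughly the same information sar and vmstat report, and the 1-second sampling interval is an arbitrary choice:

```python
"""Rough check of whether a loaded Linux box is CPU-bound or I/O-bound.

Only a sketch of the idea behind sar/vmstat: sample /proc/stat twice and
report how CPU time was spent; also print the load averages from
/proc/loadavg. The 1-second interval is an arbitrary choice.
"""
import time

def read_cpu_times():
    # First line of /proc/stat: "cpu  user nice system idle iowait irq softirq ..."
    with open("/proc/stat") as f:
        return [int(x) for x in f.readline().split()[1:]]

t1 = read_cpu_times()
time.sleep(1)
t2 = read_cpu_times()

delta = [b - a for a, b in zip(t1, t2)]
total = sum(delta) or 1
user, nice, system, idle, iowait = delta[:5]

with open("/proc/loadavg") as f:
    load1, load5, load15 = f.read().split()[:3]

print(f"load average: {load1} (1 min)  {load5} (5 min)  {load15} (15 min)")
print(f"user {100*user/total:.1f}%  system {100*system/total:.1f}%  "
      f"iowait {100*iowait/total:.1f}%  idle {100*idle/total:.1f}%")
# High %user/%system with little iowait -> CPU-bound: profile with strace/oprofile.
# High %iowait -> I/O-bound: check memory and the page cache first.
```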
Key points for handling large-scale data
- Try to do everything in memory; achieve this through distribution and by exploiting locality
- Algorithm complexity: going from O(n) to O(log n) is a qualitative leap (see the sketch after this list)
- Data Compression and Retrieval Technology
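As a quick illustration of that leap (my own toy example, not from the book), compare a linear scan with binary search over 10 million sorted IDs:

```python
"""Why O(n) -> O(log n) is a qualitative leap: linear scan vs. binary search.

A toy example: over 10 million sorted IDs, a linear scan may touch millions
of elements, while binary search needs only ~24 comparisons.
"""
import bisect

ids = list(range(0, 20_000_000, 2))   # 10 million sorted even IDs
target = 19_999_998                   # worst case for the linear scan

# O(n): scan until we hit the target.
linear_steps = next(i for i, v in enumerate(ids) if v == target) + 1

# O(log n): binary search over the sorted list.
pos = bisect.bisect_left(ids, target)
assert ids[pos] == target

print(f"linear scan touched {linear_steps:,} elements")
print(f"binary search needed about {len(ids).bit_length()} comparisons (index {pos})")
```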
Cache Mechanism
- Page cache
- All modern operating systems use virtual memory.
- Pages the kernel has read from disk are kept in memory as long as possible, so the next access needs no disk I/O; this is the page cache
- The operating system caches data in units of pages, the smallest unit of virtual memory
- Increasing the memory can increase the cache hit rate and reduce the IO load.
- The sar command
- sar -r: view the current memory status (the kbbuffers/kbcached columns show the physical memory used as cache; a sketch follows this list)
- sar 1 3: sample once per second, 3 times in total
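For a self-contained way to look at the same numbers, here is a small sketch (my own, not from the book) that reads /proc/meminfo, which carries essentially the counters sar -r prints as kbbuffers/kbcached:

```python
"""See how much physical memory the page cache is using.

A minimal sketch that reads /proc/meminfo on Linux; sar -r reports
essentially the same counters (kbbuffers / kbcached).
"""
def meminfo_kb(key: str) -> int:
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith(key + ":"):
                return int(line.split()[1])   # values are reported in kB
    return 0

total   = meminfo_kb("MemTotal")
free    = meminfo_kb("MemFree")
buffers = meminfo_kb("Buffers")
cached  = meminfo_kb("Cached")

print(f"total   {total / 1024:8.0f} MB")
print(f"free    {free / 1024:8.0f} MB")
print(f"buffers {buffers / 1024:8.0f} MB  (block-device buffers)")
print(f"cached  {cached / 1024:8.0f} MB  (page cache: grows as files are read)")
```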
Policies for reducing I/O load
- Increase cache, that is, add memory
- Extend to multiple servers
- Note: adding servers by itself may not raise the cache hit rate (each machine still holds the same full data set), so the data needs to be split (partitioned)
Partitioning: distributing data according to locality
- Split by data range
- e.g. a-c on server 1, d-f on server 2, ... (a small routing sketch follows this list)
- Divide the system into different "islands" by purpose
- Crawler
- Image API
- General access
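A minimal sketch of such locality-based routing (the server names, key ranges, and island names are invented for illustration):

```python
"""Partitioning by locality: route a key to a server by its first letter,
and route requests to separate "islands" by purpose.

A minimal sketch; server names and ranges are made up for illustration.
"""
RANGE_PARTITIONS = [
    ("a", "c", "db-server-1"),
    ("d", "f", "db-server-2"),
    ("g", "z", "db-server-3"),
]

ISLANDS = {
    "crawler": "island-bot",       # bots get their own server group and cache
    "image_api": "island-image",
    "default": "island-web",       # ordinary user traffic
}

def server_for_key(key: str) -> str:
    first = key[0].lower()
    for lo, hi, server in RANGE_PARTITIONS:
        if lo <= first <= hi:
            return server
    raise ValueError(f"no partition for key {key!r}")

def island_for_request(purpose: str) -> str:
    return ISLANDS.get(purpose, ISLANDS["default"])

print(server_for_key("alice"))          # -> db-server-1
print(server_for_key("dave"))           # -> db-server-2
print(island_for_request("crawler"))    # -> island-bot
```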
Basic O&M rules that follow from the page cache
- When a server (OS) has just started, do not put it into production immediately; first warm it up by reading the data files once so they enter the page cache (a small sketch follows this list)
- Performance tests should only be run after the cache has been warmed up and optimized
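A minimal warm-up sketch, equivalent in spirit to cat-ing the data files to /dev/null; the directory path is only an example and would be the real data directory in practice:

```python
"""Warm the page cache before putting a freshly booted server into production.

A minimal sketch: sequentially read every file under a data directory so the
OS loads it into the page cache. The directory path is just an example.
"""
import os

DATA_DIR = "/var/lib/mysql"     # example path; use the real data directory
CHUNK = 1 << 20                 # read in 1 MB chunks

def warm(path: str) -> int:
    read_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                with open(os.path.join(root, name), "rb") as f:
                    while chunk := f.read(CHUNK):
                        read_bytes += len(chunk)
            except OSError:
                pass            # skip files we cannot read
    return read_bytes

total = warm(DATA_DIR)
print(f"read {total / (1 << 30):.2f} GB into the page cache")
```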
Database horizontal scaling policy
Make flexible use of the operating system cache
- Try to keep the whole database within the size of physical memory
- Consider the impact of table schema design on database size
Create an index
- Improves search efficiency (O(log n)) and reduces the number of disk seeks
- Use MySQL's EXPLAIN command to check whether the index is actually being used (see the sketch below)
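The book refers to MySQL's EXPLAIN; as a self-contained illustration of the same idea, this sketch uses Python's built-in sqlite3 and its EXPLAIN QUERY PLAN (the table and column names are invented), showing the plan switch from a full scan to an index search once the index exists:

```python
"""Check whether a query can use an index.

The book talks about MySQL's EXPLAIN; this sketch uses Python's built-in
sqlite3 and its EXPLAIN QUERY PLAN so it runs anywhere. Table and column
names are invented for illustration.
"""
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT)")
conn.executemany("INSERT INTO entries (user_id, body) VALUES (?, ?)",
                 [(i % 1000, "text") for i in range(10_000)])

def plan(sql):
    # EXPLAIN QUERY PLAN rows: (id, parent, notused, detail)
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Without an index: full table scan.
print(plan("SELECT * FROM entries WHERE user_id = 42"))
# e.g. ['SCAN entries']

conn.execute("CREATE INDEX idx_entries_user ON entries (user_id)")

# With the index: the planner switches to an index search.
print(plan("SELECT * FROM entries WHERE user_id = 42"))
# e.g. ['SEARCH entries USING INDEX idx_entries_user (user_id=?)']
```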
MySQL distribution
- Master/slave design (writes go to the master, reads go to the slaves; a read/write-splitting sketch follows this list)
- The master itself cannot be scaled out (because of data consistency)
- This is usually acceptable, since for most web applications around 90% of queries are reads
- Load on the master can be handled by splitting databases/tables or by switching to a different storage implementation (e.g. key-value storage)
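A minimal sketch of the read/write-splitting idea (host names are invented; a real router would hold database connections and use a more careful query classifier):

```python
"""Master/slave read-write splitting.

A minimal sketch of the routing idea: all writes go to the master, reads are
spread round-robin over the slaves. Host names are made up; a real setup
would hold actual DB connections instead of strings.
"""
import itertools

class ReadWriteRouter:
    def __init__(self, master, slaves):
        self.master = master
        self._slave_cycle = itertools.cycle(slaves)

    def route(self, sql: str) -> str:
        # Very rough classification: only SELECTs are sent to slaves.
        if sql.lstrip().lower().startswith("select"):
            return next(self._slave_cycle)
        return self.master

router = ReadWriteRouter("db-master", ["db-slave-1", "db-slave-2", "db-slave-3"])
print(router.route("SELECT * FROM entries WHERE id = 1"))            # -> a slave
print(router.route("UPDATE entries SET body = '...' WHERE id = 1"))  # -> db-master
```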
MySQL Partition
- Place tables that are not closely related to each other on different machines
- Avoid JOIN operations across tables that live on different machines
- Use INNER JOIN within one machine, or replace a cross-machine JOIN with separate queries and WHERE ... IN ... (a sketch follows below)
- Cost of Partition
- O & M becomes complex, failure rate increases, and cost increases
- The minimum number of machines required for redundancy increases
- Four per partition: one master and three slaves
- Of the three slaves, if one fails: one of the remaining slaves keeps serving requests, while the other is taken out of service and used as the copy source to rebuild the failed server
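A minimal sketch of replacing a cross-machine JOIN with two queries and WHERE ... IN, using two separate SQLite databases to stand in for two partitioned MySQL servers (table and column names are invented):

```python
"""Replacing a cross-machine JOIN with two queries and WHERE ... IN.

`entries` and `users` live in two separate databases (standing in for two
partitioned servers), so instead of a JOIN we query one side, collect the
keys, and use WHERE ... IN on the other side.
"""
import sqlite3

db_entries = sqlite3.connect(":memory:")   # "server 1"
db_users = sqlite3.connect(":memory:")     # "server 2"

db_entries.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT)")
db_entries.executemany("INSERT INTO entries VALUES (?, ?, ?)",
                       [(1, 10, "hello"), (2, 11, "world"), (3, 10, "again")])

db_users.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db_users.executemany("INSERT INTO users VALUES (?, ?)", [(10, "alice"), (11, "bob")])

# Step 1: query the first partition.
rows = db_entries.execute("SELECT id, user_id, title FROM entries").fetchall()
user_ids = sorted({user_id for _id, user_id, _title in rows})

# Step 2: WHERE ... IN on the second partition instead of a JOIN.
placeholders = ",".join("?" * len(user_ids))
names = dict(db_users.execute(
    f"SELECT id, name FROM users WHERE id IN ({placeholders})", user_ids))

for entry_id, user_id, title in rows:
    print(entry_id, title, "by", names[user_id])
```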
Three important aspects of Web Service Infrastructure
- Low cost and high efficiency
- 100% reliability should not be pursued
- Design is important
- Scalability and response time
- Development speed is very important
- Web services frequently add or change features, so the infrastructure must provide resources flexibly
Traffic limit that one server can handle
- Hatena Standard Server: 4-core CPU, 8 GB memory;
- Performance: thousands of requests per minute during busy hours
- If 4-core cpu x 2, 32 GB memory
Optimization
Redundancy and System Stability
Master Redundancy
- Multi-Master
- Generally, there are two servers in the active/standby structure.
- One is active and the other is standby.
- The two masters replicate from each other (two-way replication), so data written to one is copied to the other
- When the standby detects via VRRP that the active server is down, it automatically takes over and becomes the new active master
- A virtual IP address points at the active master: whichever machine holds the virtual IP is the active master (a toy sketch of the failover logic follows below)
- Disadvantages
- There is still a risk of inconsistency
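In practice the virtual IP is moved by a VRRP implementation such as keepalived rather than by application code; purely as a toy illustration of the failover decision (hosts, port, and threshold are made up):

```python
"""Toy illustration of active/standby failover logic.

In real deployments the virtual IP is moved by a VRRP implementation such
as keepalived; this sketch only mimics the decision: if the active master
fails several health checks in a row, promote the standby. Host names,
the port, and the threshold are made up.
"""
import socket
import time

FAIL_THRESHOLD = 3

def is_alive(host: str, port: int = 3306, timeout: float = 1.0) -> bool:
    # Health check: can we still open a TCP connection to the MySQL port?
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def monitor(active: str, standby: str, virtual_ip: str) -> None:
    failures = 0
    while True:
        if is_alive(active):
            failures = 0
        else:
            failures += 1
            if failures >= FAIL_THRESHOLD:
                print(f"{active} looks dead; promoting {standby}")
                # Here VRRP/keepalived would move the virtual IP to the
                # standby, making it the new active master.
                return
        time.sleep(1)

# monitor("db-master-1", "db-master-2", "192.0.2.100")
```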
System Stability
- Keep a margin in resource usage: run at only about 70% of capacity
- Remove unstable factors (automate as much as possible)
- Mitigate memory leaks; restart processes automatically
- Autonomous (self-regulating) control of abnormal behavior
- Automatically kill long-running (time-consuming) queries
Virtualization Technology
- Benefits
- Scalability
- Minimize additional overhead
- Dynamic migration
- Cost effectiveness
- Improve resource utilization
- Improve O & M flexibility
- Software-level Host Control
- Hatena's virtualization application
- Xen (CentOS 5.2, Xen 3.0.3) + LVM built on local disks
- Using the hypervisor in place of IPMI
- Control Resource Consumption
- High Load warning
- Adjust Load
- Improve resource utilization
- Hosts with spare I/O capacity ---> add a database server (I/O-heavy)
- Hosts with spare memory ---> add a cache server (memory-heavy)
- Avoid combining guests whose resource-consumption patterns are the same
- Additional virtualization overhead
- Memory: 10%
- Network Performance: 50%
SSD lifetime
- Wear indicator: the Media_Wearout_Indicator attribute among the S.M.A.R.T. values ---> check it with the smartctl command (a small sketch follows)
- Hatena's most write-intensive SSD wore out in roughly nine months
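A small sketch of reading that indicator with smartctl from Python (assumes an Intel-style attribute name and an example device path; running smartctl usually requires root):

```python
"""Check SSD wear via S.M.A.R.T.

A minimal sketch that shells out to smartctl (usually needs root) and looks
for the Media_Wearout_Indicator attribute; Intel SSDs report it under this
name, other vendors use different attribute names. /dev/sda is an example.
"""
import subprocess

def media_wearout(device: str = "/dev/sda"):
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if "Media_Wearout_Indicator" in line:
            # Attribute table columns: ID# NAME FLAG VALUE WORST THRESH ...
            return int(line.split()[3])   # normalized value: 100 = new, counts down
    return None

value = media_wearout()
print("Media_Wearout_Indicator:", value if value is not None else "not reported")
```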
Network demarcation point
- 1 Gbps, i.e. roughly 300,000 packets per second, is the limit of a PC router (1 Gbps is the limit of Gigabit Ethernet; ~300 kpps is the limit of the Linux kernel)
- Countermeasure: buy expensive off-the-shelf routers, or use multiple PC routers
- Around 500 hosts is the limit of one subnet (ARP table limit)
- Countermeasure: build the network hierarchically
RDBMS or K-V Storage
- Judgment basis
- Average data size
- Maximum data size
- Frequency of adding new data
- Update frequency
- Deletion frequency
- Access Frequency
- MyISAM vs. InnoDB
- MyISAM
- Advantages
- Tables that receive no UPDATE or DELETE can be appended to (INSERTed into) very quickly
- Start and Stop very quickly
- The table can be moved or renamed directly from the file system.
- Disadvantages
- Abnormal stop may damage the table
- Transactions are not supported.
- UPDATE, DELETE, and INSERT take a table lock (appending data is the exception), so performance is poor in update-heavy applications
- Applicable scenarios
- Workloads that only append data
- Workloads that rely on SELECT COUNT(*) (fast on MyISAM)
- InnoDB
- Advantages
- Support transactions
- Recovers after an abnormal shutdown (crash recovery)
- Uses row-level locking for updates
- Disadvantages
- Slow Start and Stop
- Table operations must go through the database (table files cannot simply be moved around on the file system)
- Applicable scenarios
- High update frequency
- Transactions required
Cache System
- Squid
- Works as a (reverse) proxy for HTTP, HTTPS, FTP, etc.
- Access control and authentication
- Varnish
- High-performance HTTP accelerator
- Flexible configuration language (VCL)
- Basically all executed in memory
- Nginx, pound ......
- Things to watch when bringing cache servers online
- If there are only two cache servers and one fails, the remaining one cannot withstand the load
- Even when there are enough servers, care is still needed
- A new (or just restarted) cache server has to be warmed up: increase its share of traffic gradually from small to large (see the sketch below)
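A minimal sketch of that gradual ramp-up as weighted backend selection (host names, weights, and the ramp duration are invented; real setups would do this in the load balancer configuration):

```python
"""Warming up a new cache server by ramping its traffic share gradually.

A minimal sketch of weighted backend selection: the freshly added server
starts with a tiny weight that grows to a full share over RAMP_SECONDS, so
it can fill its cache before taking full traffic. Host names, the ramp
duration, and the weights are made up.
"""
import random
import time

BACKENDS = {"cache-1": 100, "cache-2": 100}   # established servers: full weight
NEW_BACKEND = "cache-3"                        # just started, cache still cold
RAMP_SECONDS = 600                             # reach full weight after 10 minutes
START = time.time()

def current_weights() -> dict:
    ramp = min((time.time() - START) / RAMP_SECONDS, 1.0)   # 0.0 -> 1.0
    weights = dict(BACKENDS)
    weights[NEW_BACKEND] = max(1, int(100 * ramp))
    return weights

def pick_backend() -> str:
    weights = current_weights()
    names = list(weights)
    return random.choices(names, weights=[weights[n] for n in names])[0]

# Early in the ramp the new server gets only a small fraction of requests.
sample = [pick_backend() for _ in range(1000)]
print({name: sample.count(name) for name in sorted(set(sample))})
```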