Reading Notes on "Large-Scale Web Service Development Technology" (Part 1)

Source: Internet
Author: User
Tags mysql query

As a newcomer to web development, I found this book very instructive.

1. What does this book cover?
The opening of the book lays out its contents:
   1. What is large-scale Web service development?
   2. What are the basic ideas and priorities for dealing with large-scale data? For example, caching mechanisms and how to use a database when the data is large.
   3. How to choose algorithms and data structures.
   4. What to do when the scale exceeds what an RDBMS can handle.
These points run through the whole book and cannot be overemphasized.
2. How large is a large-scale Web service?
   1. Millions of registered users and tens of millions of unique users.
   2. Billions of requests per month.
   3. Peak traffic at the hundred-gigabit bps level.
   4. Hundreds of hardware servers (this figure changes as hardware evolves).
Scale tiers: Google, Facebook, Yahoo, YouTube and the like are ultra-large-scale, with up to millions of servers.
What is the difference between a small-scale service and a large-scale service, and what happens when the scale grows?
1. Scalability and load balancing must be ensured. Scale-out (horizontal scaling, adding more machines) is the low-cost option compared with scale-up (vertical scaling, buying a bigger machine).
   Problems raised by horizontal scaling:
   A. How are user requests distributed at low cost? Load balancer -> application servers (which process the user requests).
   B. How are the databases kept in sync? Data synchronization between database servers.
   C. How is network latency handled?
2. Redundancy must be guaranteed.
3. Operation and maintenance must be low-cost.
4. Development must be standardized.
How does data flow inside the computer that serves a Web service, and where are the bottlenecks? Disk -> memory -> cache -> CPU.
What are the speed differences between these layers?
3. What makes Web services difficult?
   1. Growing data volume (retrieval and storage take longer).
   2. Growing numbers of reads and writes.
When the number of users grows and one server can no longer handle their requests, what is the solution? An open-source load balancer. Today's mainstream HTTP servers, such as Nginx, have good load-balancing capability built in:
http://zyan.cc/post/306/
http://blog.csdn.net/zmj_88888888/article/details/8823474

Clustered servers can use ZooKeeper for routing and load balancing, along with various cluster monitoring tools.
How should a Web service be designed? Start as small as possible, but design and manage it with future growth in mind. (From the book.)
4. What problems appear first when facing large-scale data?
   1. Table sizes in the database grow sharply, reaching tens or hundreds of GB, with record counts in the billions.
   2. When querying all the data, memory capacity is insufficient.
   3. Linear-scan queries take too long.
5. Given the hardware characteristics of memory and disk, what must be considered when processing data?
When the data is too large to fit in memory, a serious bottleneck appears: disk reads are far slower than memory reads, by a factor of up to 10^6 (a million times), so disk I/O becomes the bottleneck of data processing. The disk is slow because it locates data mechanically, by moving the head and rotating the platter.
Bus transfer speed differs as well and also has to be considered. Memory to CPU: 7.5 GB/s in the book, higher now (frequency * number of channels * bus width in bytes). Disk to memory: only at the MB/s level.

Even with solid-state drives, bus speed differences are present.
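
The bandwidth formula mentioned above (frequency * number of channels * bus width in bytes) can be worked through as a small sketch. The figures used here (3200 million transfers per second, two channels, a 64-bit bus) are illustrative assumptions, not numbers from the book, and the result is a theoretical peak rather than a measured value.

```python
def peak_bandwidth_gb_per_s(transfers_per_second: float,
                            channels: int,
                            bus_width_bytes: int) -> float:
    """Theoretical peak = transfer rate * channels * bytes per transfer."""
    return transfers_per_second * channels * bus_width_bytes / 1e9


# Assumed example values: 3200 million transfers/s, 2 channels, 8-byte bus.
example = peak_bandwidth_gb_per_s(3200e6, channels=2, bus_width_bytes=8)
print(f"{example:.1f} GB/s")
```

Whatever the exact figures, the memory side stays orders of magnitude ahead of a bus that delivers data at MB/s.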

How can the average server load be viewed and measured? With the top and uptime commands.
How can CPU and I/O bottlenecks be found?
sar: CPU utilization.
vmstat: I/O wait ratio.
Too many I/O requests from programs, or too much paging, cause frequent disk access.
Ideas for a solution:
1. Can memory be added so that caching stays effective?
2. Is the data volume simply too large?
3. Does the program's I/O algorithm need to be changed?
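
As a rough illustration of reading the load average that uptime and top report, the sketch below divides it by the number of CPU cores. The threshold of 1.0 runnable task per core is a common rule of thumb assumed here, not a figure from the book, and the helper names are made up for this example.

```python
import os


def load_per_core(load_average: float, cpu_count: int) -> float:
    """Normalize a load average by the number of CPU cores."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_average / cpu_count


def is_overloaded(load_average: float, cpu_count: int,
                  threshold: float = 1.0) -> bool:
    """True when more tasks are waiting per core than the assumed threshold."""
    return load_per_core(load_average, cpu_count) > threshold


# os.getloadavg() is only available on Unix-like systems.
if hasattr(os, "getloadavg"):
    try:
        one_minute, _, _ = os.getloadavg()
        cores = os.cpu_count() or 1
        print(f"1-minute load per core: {load_per_core(one_minute, cores):.2f}")
    except OSError:
        print("load average is not available on this system")
```

A high load average alone does not say whether the CPU or the disk is the cause; that is what sar and vmstat are for.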
6. Which factors matter for scalability, and which do not?
1. Considering cost and flexibility, scale-out is more cost-effective.
2. How is the CPU load made scalable? Add Web application servers. An application server accepts HTTP requests, queries the database, and turns the data into HTML or JSON for the client, so it consumes mostly CPU. A database server, by contrast, mostly needs I/O resources.
   CPU load scaling: add application servers with the same configuration and let a load balancer spread the requests.
3. Three-tier structure: proxy server -> application server -> database server. Load balancing can be done by the proxy server ("reverse proxy load balancing").
   Computation and business logic are handled by the application servers. CPU consumption is high.
   Data reads and writes are handled by the database servers. I/O consumption is high.
   I/O scaling: this comes down to scaling the database and to methods for processing large-scale data. Concretely: optimize how the database is written to and optimize the data-processing algorithms.
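
To make the idea of spreading requests over identical application servers concrete, here is a minimal round-robin sketch. It is only an illustration: the server names are hypothetical, and a real reverse proxy such as Nginx also handles health checks, weights and connection state.

```python
import itertools


class RoundRobinBalancer:
    """Hand out application servers in turn, one per request."""

    def __init__(self, servers):
        self._servers = list(servers)
        if not self._servers:
            raise ValueError("at least one server is required")
        self._cycle = itertools.cycle(self._servers)

    def pick(self) -> str:
        return next(self._cycle)


# Hypothetical application servers with identical configuration.
balancer = RoundRobinBalancer(["app1", "app2", "app3"])
assignments = [balancer.pick() for _ in range(6)]
print(assignments)
```

Because the application servers hold no data of their own, adding a fourth name to the list is all it takes to scale the CPU-bound tier in this model.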

7. What are the priorities when dealing with large-scale data?
1. How much of the processing can be done in memory? Minimize disk reads and writes and finish as much of the processing as possible in memory. Minimize the number of disk seeks. Use distributed (chunked) processing.
2. Optimize algorithms to reduce time or space complexity.
3. Make use of data compression and search techniques.
================ cache-related background knowledge ================
What are the basics of dealing with large-scale data?
1. The operating system cache.
2. Using an RDBMS on the premise that it will be distributed.
3. Algorithms and data structures.
8. The caching mechanism of the operating system (why is caching an efficient way to process data?)
Goal and resources:
1. Data is stored on disk, so how can it be accessed at high speed? Memory is 10^5 to 10^6 times faster than disk.
2. Solution: use memory as much as possible.
How memory is put to work: the caching mechanism.
Virtual memory -> address translation -> physical memory
Basic unit of the cache: a page, 4 KB of data.
Virtual file system -> (page cache) -> virtual memory
How can it be determined quickly whether part of a file is already in memory? The file's inode and the offset within the file serve as the key for the lookup.
How memory is used: Linux uses all free memory for caching. sar -r shows the memory status.
How is data read from disk into memory? Through the page cache, which has both advantages and problems of its own.
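
The lookup by inode and offset described above can be sketched as a dictionary keyed by (inode, page index). This is a toy model under stated assumptions: the least-recently-used eviction, the `fake_disk` loader and the class name are inventions for this example, and the real Linux page cache uses different data structures.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # 4 KB pages, as in the notes


class PageCache:
    """Toy page cache keyed by (inode, page index) with LRU eviction."""

    def __init__(self, capacity_pages, read_page_from_disk):
        self._capacity = capacity_pages
        self._read_page_from_disk = read_page_from_disk
        self._pages = OrderedDict()
        self.disk_reads = 0

    def read(self, inode: int, offset: int) -> bytes:
        key = (inode, offset // PAGE_SIZE)
        if key in self._pages:
            # Cache hit: no disk access, just mark the page as recently used.
            self._pages.move_to_end(key)
            return self._pages[key]
        # Cache miss: go to the (slow) disk and keep the page in memory.
        self.disk_reads += 1
        page = self._read_page_from_disk(inode, key[1])
        self._pages[key] = page
        if len(self._pages) > self._capacity:
            self._pages.popitem(last=False)  # evict the least recently used page
        return page


def fake_disk(inode: int, page_index: int) -> bytes:
    """Stand-in for a real disk read (hypothetical)."""
    return f"inode {inode} page {page_index}".encode()


cache = PageCache(capacity_pages=2, read_page_from_disk=fake_disk)
cache.read(42, 0)      # miss: page 0 is loaded
cache.read(42, 100)    # hit: offset 100 is still inside page 0
cache.read(42, 5000)   # miss: offset 5000 falls in page 1
print(cache.disk_reads)
```

Once the working set is larger than `capacity_pages`, every eviction turns a later read back into a disk access, which is exactly the situation the next sections try to avoid.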
9. From cache to I/O load: how are the two related?
Policy for reducing I/O load: build on the cache.
Whether to add memory depends on the following:
1. If physical memory can be made larger than the data, consider caching all of it.
2. Balance the size of the memory against its cost.
What if memory cannot cache everything? Consider adding database servers. For a database the I/O load matters most, and more servers mean more total cache capacity and better efficiency. (Spreading CPU load is simpler: just add servers.)

For I/O load, adding servers has to take locality into account.
Simply adding cache generally cannot keep up with rapidly growing data volume. Cache capacity remains the bottleneck.
What happens when a database server restarts, its cache is emptied, and a large number of requests arrive? An I/O-intensive server without a warm cache accesses the disk constantly and suffers a severe I/O bottleneck. Workaround: read the database's data through once so that it is cached, and only then put the server back into production.
10. What does distribution based on locality mean?
Requests of a given kind are always handled by the same server, and different kinds of requests are split across different servers. Adding database servers increases the total cache, and a sensible splitting policy makes that cache more effective.
Ways to split:
1. Split by table.
2. Split by data ID, that is, partition by index. Drawback: data may have to be merged or moved when the granularity of the split changes.
3. Split by access pattern, for example by the User-Agent and URL of the HTTP request, which makes it easy to tell "hot" requests from "cold" ones. Separating crawler access from user access lets crawler requests be served at lower priority while user requests get a better experience. User requests are handled separately and their cache sits on its own servers, which improves server efficiency.
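
The splitting strategies above can be sketched as two small routing functions. This is a simplified illustration: the crawler markers, pool names and the modulo scheme are assumptions made for the example, not the book's implementation.

```python
# Hypothetical substrings that suggest a crawler in the User-Agent header.
CRAWLER_MARKERS = ("bot", "crawler", "spider")


def pick_pool(user_agent: str) -> str:
    """Split by access pattern: crawlers and real users get separate servers."""
    lowered = user_agent.lower()
    if any(marker in lowered for marker in CRAWLER_MARKERS):
        return "crawler_pool"
    return "user_pool"


def pick_shard(record_id: int, shard_count: int) -> int:
    """Split by data ID: a simple modulo scheme (an assumption for this sketch)."""
    return record_id % shard_count


print(pick_pool("Googlebot/2.1"))
print(pick_pool("Mozilla/5.0 (X11; Linux x86_64)"))

# Changing the split granularity from 4 to 5 shards relocates most records,
# which is the drawback noted for splitting by ID.
moved = sum(1 for i in range(1000) if pick_shard(i, 4) != pick_shard(i, 5))
print(moved)
```

With this modulo scheme 800 of the first 1000 IDs change shard when a fifth server is added, so the granularity is worth choosing carefully up front.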

============= database-related background knowledge ===========
11. What does an index do in a database, and what role does it play in a distributed database?
MySQL is an RDBMS. Once an index is added, a query can return data quickly. This comes from the database's own data structures and algorithms.
The main points of running MySQL in a distributed setup:
1. Flexible use of the cache: whether to add memory, and whether to add servers to enlarge and partition the cache.
2. Set indexes correctly.
3. Design on the premise of horizontal scaling.
Cache-related factors:
1. Whether to add memory, and whether to add servers to enlarge and partition the cache.
2. How does table design affect memory use with large-scale data? With 300 million records, adding one 8-byte column increases the data by 300 million * 8 B = 2.4 GB. If the table has a very large number of rows, consider changing its structure.

3. Normalization, that is, splitting one table into several: what are the pros and cons?
Pro: normalization reduces the amount of data held in memory. Con: queries that join the tables become slower.
Look for a balance between the two, for example by separating the essential columns from the non-essential ones.
How an index works: the B+ tree. A B+ tree is a balanced multi-way tree, which helps in two ways.
Multi-way: the size of one node can be kept at around 4 KB, which suits disk reads and storage and minimizes the number of seeks.
Balanced: lookups are fast, with a search complexity of O(log N).
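
Why a multi-way tree needs so few seeks can be seen from a quick calculation of how many levels are required to address N records. The fan-out of 256 is an assumed value (for example, a 4 KB node holding 16-byte entries) and the function name is made up for this sketch.

```python
def levels_needed(record_count: int, fanout: int) -> int:
    """Smallest number of levels h such that fanout ** h >= record_count."""
    if record_count < 1 or fanout < 2:
        raise ValueError("record_count must be >= 1 and fanout >= 2")
    levels, reachable = 0, 1
    while reachable < record_count:
        reachable *= fanout
        levels += 1
    return levels


# 300 million records, as in the example earlier in these notes.
binary_levels = levels_needed(300_000_000, fanout=2)
wide_levels = levels_needed(300_000_000, fanout=256)
print(binary_levels, wide_levels)
```

Both trees are O(log N), but the base of the logarithm is the fan-out, so the wide tree touches 4 nodes where a binary tree would touch 29, and each node visit may cost a disk seek.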
Which columns are worth indexing? The columns used in WHERE, ORDER BY and GROUP BY clauses.
Which indexes exist in MySQL? 1. Explicitly created indexes. 2. Indexes created by PRIMARY KEY and UNIQUE constraints.
A MySQL query used to be able to use only one index at a time; I do not know whether this has since improved.
Put EXPLAIN in front of a query statement to check whether the query actually uses an index.
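
The effect of checking a query plan can be tried without a MySQL server. The sketch below uses Python's built-in sqlite3 module and SQLite's EXPLAIN QUERY PLAN as a stand-in, so the output format differs from MySQL's EXPLAIN; the table and index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entry (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT)"
)
conn.executemany(
    "INSERT INTO entry (user_id, title) VALUES (?, ?)",
    [(i % 10, f"title {i}") for i in range(100)],
)


def plan_for(sql: str) -> str:
    """Return the human-readable plan; the detail text is the last column."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


query = "SELECT * FROM entry WHERE user_id = 3"

plan_before = plan_for(query)   # no index on user_id yet: full table scan
conn.execute("CREATE INDEX idx_entry_user ON entry (user_id)")
plan_after = plan_for(query)    # the plan now names idx_entry_user

print(plan_before)
print(plan_after)
```

In MySQL the same check is done by prefixing the query with EXPLAIN and looking at which key, if any, the plan reports.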


12. How is a system designed on distributed MySQL, and what does a real MySQL distributed setup look like?
MySQL's replication feature is the basis for distributing it.
1. Prepare several servers.
2. Make one of them the master and the others slaves.
3. The slaves poll the master and apply its updates to stay in sync.
4. SELECT queries go to the slaves (a load balancer makes it easy to spread queries across the slaves).
5. Inserts and updates go to the master.
6. Go back to step 3 and repeat.
In typical workloads the ratio of SELECT to INSERT is about 10:1. With the structure above the slaves can be scaled out easily, but the master cannot.
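
The read/write split in a master-slave setup can be sketched as a small router that sends SELECT statements to the slaves in turn and everything else to the master. The class and server names are hypothetical, and the sketch ignores replication lag, so a read issued right after a write may still see old data on a slave.

```python
import itertools


class ReplicatedRouter:
    """Route reads to slaves (round robin) and writes to the master."""

    def __init__(self, master: str, slaves):
        slave_list = list(slaves)
        if not slave_list:
            raise ValueError("at least one slave is required")
        self._master = master
        self._slaves = itertools.cycle(slave_list)

    def route(self, sql: str) -> str:
        words = sql.split()
        if not words:
            raise ValueError("empty statement")
        if words[0].upper() == "SELECT":
            return next(self._slaves)
        return self._master


router = ReplicatedRouter("master", ["slave1", "slave2"])
statements = [
    "SELECT title FROM entry WHERE id = 1",
    "SELECT title FROM entry WHERE id = 2",
    "INSERT INTO entry (title) VALUES ('new')",
    "SELECT title FROM entry WHERE id = 3",
]
targets = [router.route(sql) for sql in statements]
print(targets)
```

Adding a slave only means extending the list, whereas every write still lands on the single master, which is why the write side is the harder part to scale.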
How can a write-heavy database server be made scalable?
1. Split tables to reduce the size of each one.
2. Replace the RDBMS with a key-value store (a NoSQL database). This kind of database is good at simple reads and writes and does not need complex relational operations or aggregation and sorting, so it is cheap and fast.
How does MySQL scale out? In summary: 1. Keep the data in memory. 2. Add memory. 3. Partition.
Limitation of partitioning: when two tables, such as the entry and tag tables, are placed on different servers, they can no longer be joined. MySQL does not support JOIN across servers.
How can a multi-table query across servers be implemented? Workarounds:
1. If real-time results are not required, build an intermediate table according to business needs and refresh it periodically. Queries against the intermediate table are fast.
2. If real-time results are required: still to be investigated.
Instead of JOIN, the application can combine the data with its own logic, or use a query such as WHERE ... IN ... in place of the JOIN.
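
Replacing a JOIN with WHERE ... IN ... can be illustrated with two separate in-memory SQLite databases standing in for two MySQL servers. The schema and data are made up for this sketch; the point is only the two-step lookup.

```python
import sqlite3

# Two independent connections play the role of two database servers.
entry_db = sqlite3.connect(":memory:")
tag_db = sqlite3.connect(":memory:")

entry_db.execute("CREATE TABLE entry (id INTEGER PRIMARY KEY, title TEXT)")
entry_db.executemany(
    "INSERT INTO entry VALUES (?, ?)",
    [(1, "caching"), (2, "indexes"), (3, "partitioning")],
)
tag_db.execute("CREATE TABLE tag (entry_id INTEGER, name TEXT)")
tag_db.executemany(
    "INSERT INTO tag VALUES (?, ?)",
    [(1, "mysql"), (2, "mysql"), (3, "ops")],
)


def titles_for_tag(tag_name: str) -> list:
    """Step 1: collect IDs from the tag server. Step 2: fetch entries by ID."""
    ids = [row[0] for row in tag_db.execute(
        "SELECT entry_id FROM tag WHERE name = ?", (tag_name,))]
    if not ids:
        return []
    placeholders = ", ".join("?" for _ in ids)
    rows = entry_db.execute(
        f"SELECT title FROM entry WHERE id IN ({placeholders}) ORDER BY id",
        ids,
    )
    return [row[0] for row in rows]


mysql_titles = titles_for_tag("mysql")
print(mysql_titles)
```

The cost is one extra round trip and an ID list that travels through the application, which stays cheap as long as the first query returns a modest number of IDs.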
Partitioning also has a price: operations become more complex and the failure rate rises. Partitioning is the last resort.
For a scale-out system with redundancy, how many servers are needed at minimum? Four: 1 master + 2 working slaves + 1 backup.
