2013-09-14
Think2go Gordon Camp: First Session Review
- The application of the Go language in a CDN download system
- The application of Go in Weibo data analysis
- Golang and high-intensity online services
The application of the Go language in a CDN download system
Today I went to the Go language Shanghai offline meetup, so let me do a quick review here. This is just my personal understanding and may be wrong; please throw bricks gently.
First of all, thanks to Xie Da for taking the stage and sharing his application of the Go language in Shanda's CDN system. A round of applause.
I think the main content divides into two parts: one is the file distribution process from the central node to the IDCs, the other is the scheduling design for after a user request arrives.
The main application scenario is things like game client distribution. First, the distribution process from the central node server to the IDC servers.
We all know how colorful the domestic network environment is: the farthest distance in the world is not the physical distance from city to city, but that I'm on Telecom and you're on Unicom, or he's on the education network. An IDC machine room is a node with relatively smooth connectivity to all the major carriers, and there are about six such nodes (I don't remember exactly). Above them sits a central node server. The whole process is internal file distribution: a file is first uploaded via FTP to the central server, the central server distributes it to each IDC, and each IDC distributes it onward to the next level.
Let me first talk about their previous scheme and its problems. They used to make torrent seeds for the files to be distributed, using CTorrent: the IDC servers pulled from the central server, and then, BT-style, all the servers could download from one another. But there were two major problems.
The first problem: traffic from the central server to an IDC machine room, and between IDC machine rooms, costs money, while the BT protocol is distributed and its internals are not controllable. For example, you cannot control its tracker, so you cannot prevent a server in one IDC from requesting data from another IDC, and all of that traffic is extra money... Traffic between servers inside the same IDC, however, is internal and costs nothing. So BT was rather a trap here.
The second problem is that CTorrent has bugs: files often got to 99% and then the transfer died, with no good fix. According to Han Tuo later on, most current open-source BT implementations have this kind of problem. So Xie Da started building this piece himself.
The distribution from the central node to the IDCs is designed as a two-layer transfer. The first layer: the central server splits the file into blocks and uploads different blocks to different servers within an IDC, so that the IDC as a whole holds all the shards of the file.
The second layer runs between the servers inside an IDC: they exchange data with one another until every server in the IDC has the complete file. For example, the central server splits a file into four blocks a, b, c, d and sends them to servers s1, s2, s3, s4 in the IDC... then s1, s2, s3, s4 exchange data among themselves, and eventually every machine has the complete file. The beauty of it is that transfers between servers inside an IDC cost nothing; for this point alone I think Shanda should give Xie Da a bonus, hehe. A rough sketch of the idea is below.
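Since no code survived into my notes, here is a minimal sketch of what the first-layer split-and-assign could look like, just to make the idea concrete; the Block type and the function names are all mine, not from the talk.

```go
package main

import "fmt"

// Block is one shard of a file (this type and all names here are my own
// invention, sketching the first-layer assignment described above).
type Block struct {
	FileID string
	Index  int
	Data   []byte
}

// splitFile cuts a file's contents into fixed-size blocks.
func splitFile(fileID string, data []byte, blockSize int) []Block {
	var blocks []Block
	for i, off := 0, 0; off < len(data); i, off = i+1, off+blockSize {
		end := off + blockSize
		if end > len(data) {
			end = len(data)
		}
		blocks = append(blocks, Block{FileID: fileID, Index: i, Data: data[off:end]})
	}
	return blocks
}

// assignBlocks spreads blocks round-robin over the servers of one IDC,
// so the IDC as a whole holds a complete copy after layer one.
func assignBlocks(blocks []Block, servers []string) map[string][]Block {
	plan := make(map[string][]Block)
	for i, b := range blocks {
		s := servers[i%len(servers)]
		plan[s] = append(plan[s], b)
	}
	return plan
}

func main() {
	data := make([]byte, 32) // stand-in for file contents
	blocks := splitFile("game.pkg", data, 8)
	plan := assignBlocks(blocks, []string{"s1", "s2", "s3", "s4"})
	for s, bs := range plan {
		fmt.Println(s, "gets", len(bs), "block(s)")
	}
}
```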
There are some nice details. For instance, the first-layer and second-layer transfers can run concurrently; there is no need to wait until the whole file has reached the IDC before starting the second step of mutual exchange. The central server is the control node: each IDC server reports back after receiving a block of data, and the central server can reply to tell it to begin the second-layer transfer. The central server is a master/standby pair; if the master goes down, it switches to the standby.
Each IDC server has MD5 checksum information for every file block, and caches a .astaxie file locally that records the completed blocks. This way, when a failed task is retransmitted, only the failed blocks are resent rather than everything.
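A small sketch of the per-block MD5 check plus a local manifest of completed blocks; the real format of the .astaxie file was not shown, so this JSON layout is purely my assumption.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"os"
)

type manifest struct {
	Done map[int]string `json:"done"` // block index -> md5 of the verified block
}

func blockMD5(data []byte) string {
	sum := md5.Sum(data)
	return hex.EncodeToString(sum[:])
}

// markDone records a verified block in the local manifest file, so that a
// later retransmit only needs to resend the blocks that are not listed.
func markDone(path string, m *manifest, index int, data []byte) error {
	m.Done[index] = blockMD5(data)
	buf, err := json.Marshal(m)
	if err != nil {
		return err
	}
	return os.WriteFile(path, buf, 0644)
}

// needsRetransmit reports whether a block is missing or fails its checksum.
func needsRetransmit(m *manifest, index int, data []byte) bool {
	want, ok := m.Done[index]
	return !ok || want != blockMD5(data)
}

func main() {
	m := &manifest{Done: map[int]string{}}
	block := []byte("block 0 payload")
	if err := markDone("file.astaxie", m, 0, block); err != nil {
		panic(err)
	}
	fmt.Println("block 0 retransmit needed:", needsRetransmit(m, 0, block)) // false
	fmt.Println("block 1 retransmit needed:", needsRetransmit(m, 1, nil))  // true
}
```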
The block size is 8 MB, an empirical value supplied by the operations side; it used to be 4 MB. Why an empirical value? Network jitter, dropped connections, all kinds of problems come up frequently. If the block is too big, more is lost after each failure; if it is too small, the number of transfers and the amount of communication grow and the whole file takes longer, so it is a value born of experience. Then there is timeout retransmission, with the timeout set to 10 minutes: for example, if a block uploaded from the central server to an IDC server hasn't finished in time, it is retransmitted (per server, if I understood correctly).
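And a sketch of the timeout-and-retransmit behaviour, using a context deadline; sendBlock is a stand-in I made up, and only the 10-minute figure comes from the talk.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

var errSlowLink = errors.New("link too slow, retries exhausted")

// sendBlock is a stand-in for the real upload of one block to one server.
func sendBlock(ctx context.Context, server string, index int) error {
	select {
	case <-time.After(50 * time.Millisecond): // simulated transfer time
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// sendWithRetry retransmits a block until it succeeds or retries run out.
func sendWithRetry(server string, index int, timeout time.Duration, retries int) error {
	for i := 0; i <= retries; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		err := sendBlock(ctx, server, index)
		cancel()
		if err == nil {
			return nil
		}
		fmt.Printf("block %d to %s timed out, retrying (attempt %d)\n", index, server, i+1)
	}
	return errSlowLink
}

func main() {
	// 10 minutes per the talk; the simulated send finishes well within it.
	if err := sendWithRetry("idc1-s1", 3, 10*time.Minute, 2); err != nil {
		fmt.Println(err)
	}
}
```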
Xie Da showed us bits of code while talking through all this.
That wraps up the first part, the internal transfer section, distributing from the central server to the IDCs. Next is the other part, the scheduler design.
The scheduler has to decide, for a given download request, which machine should serve the user's download, taking into account network conditions, geographic location, current server load, and so on.
The basic CDN technique is to use the user's IP segment to find out which network they belong to: Telecom? Netcom? Then a server on the matching network is assigned to serve the download. Their previous practice was, having found the servers on the same network, to assign one to the user at random. The problem with random assignment is unbalanced server load: some machines may be busy while others sit idle.
Shanda has an IP library recording the various IP segments per network and the corresponding servers to assign. In the code Xie Da stores this in a treap, a KV data structure: lookups are binary-tree searches, and a random weight per node keeps the tree balanced. I still don't understand why a treap was chosen here. Is there some relationship between the treap's node weights and server load? No idea; I went to the WC and missed a bit.
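For reference, a minimal treap: a binary search tree ordered by key, kept balanced by a random heap priority on each node. Mapping an IP-segment start to a server pool is my guess at how it is used here; the talk only said it is a KV structure.

```go
package main

import (
	"fmt"
	"math/rand"
)

type node struct {
	key         uint32 // e.g. start of an IP segment (my guess at the usage)
	val         string // e.g. the server pool for that segment
	prio        int    // random heap priority; keeps the tree balanced
	left, right *node
}

func rotateRight(n *node) *node {
	l := n.left
	n.left, l.right = l.right, n
	return l
}

func rotateLeft(n *node) *node {
	r := n.right
	n.right, r.left = r.left, n
	return r
}

// insert keeps BST order on key and max-heap order on prio via rotations.
func insert(n *node, key uint32, val string) *node {
	if n == nil {
		return &node{key: key, val: val, prio: rand.Int()}
	}
	switch {
	case key < n.key:
		n.left = insert(n.left, key, val)
		if n.left.prio > n.prio {
			n = rotateRight(n)
		}
	case key > n.key:
		n.right = insert(n.right, key, val)
		if n.right.prio > n.prio {
			n = rotateLeft(n)
		}
	default:
		n.val = val
	}
	return n
}

// floor finds the entry with the largest key <= ip, i.e. the segment an
// address falls into.
func floor(n *node, ip uint32) (string, bool) {
	var best *node
	for n != nil {
		if n.key <= ip {
			best, n = n, n.right
		} else {
			n = n.left
		}
	}
	if best == nil {
		return "", false
	}
	return best.val, true
}

func main() {
	var root *node
	root = insert(root, 0x01020000, "telecom-pool")
	root = insert(root, 0x05060000, "unicom-pool")
	if pool, ok := floor(root, 0x01020304); ok {
		fmt.Println("route to", pool)
	}
}
```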
A load-allocation part has now been added, giving the servers state. For example, pick a machine with low load among the servers on the user's network; if every machine is at medium load, pick one at random; and if all of them are at full load, stop restricting to the same network and pick one at random from the global pool. You can hardly return the user a 404, after all.
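A sketch of that selection rule as I understood it; the load thresholds are invented, since none were given.

```go
package main

import (
	"fmt"
	"math/rand"
)

type server struct {
	addr    string
	network string // "telecom", "unicom", ...
	load    int    // 0..100
}

// pickServer prefers a low-load server on the user's network, falls back
// to any non-full server there, and finally to the global pool.
func pickServer(all []server, network string) server {
	var low, notFull []server
	for _, s := range all {
		if s.network != network {
			continue
		}
		if s.load < 30 { // "low load" threshold is my guess
			low = append(low, s)
		}
		if s.load < 90 { // "full load" threshold is my guess
			notFull = append(notFull, s)
		}
	}
	switch {
	case len(low) > 0:
		return low[rand.Intn(len(low))]
	case len(notFull) > 0: // all medium: pick one at random
		return notFull[rand.Intn(len(notFull))]
	default: // everything on this network is full: go global, no 404s
		return all[rand.Intn(len(all))]
	}
}

func main() {
	pool := []server{
		{"t1", "telecom", 95}, {"t2", "telecom", 92},
		{"u1", "unicom", 10},
	}
	fmt.Println("serving from", pickServer(pool, "telecom").addr)
}
```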
According to Xie Da, since the reimplementation in Go, transfer speeds have improved greatly over the old system: large files transfer at almost 10x the original speed, and small files improved by about 30%. User downloads have also become noticeably faster. Finally he discussed the optimizations possible in the next stage. One is managing the bandwidth between uploads and downloads: sometimes a task of dozens of GB pops in with no limits in place, which can eat a lot of bandwidth and affect user services already in progress.
I asked whether their CDN servers are dedicated. If a server also runs other services, the optimization just mentioned, controlling upload and download bandwidth within this system, is meaningless, because the other services won't necessarily cooperate in limiting their own transfer bandwidth. The answer was roughly that their CDN servers are dedicated and run only the related services. Then I asked about block transfers: since a block can fail, had they considered sending the same block to several servers at once, counting the first success as success and cancelling the other transfers of that block? Xie Da's explanation was that they don't do this; it would complicate the system, and for now they simply handle it with timeouts.
Time was limited, so after only one or two questions I was cut off by the host. Don't go thinking my questions lacked level, eh?
Actually, for the first question, what services a server runs, there are two schools of thought. One trend is custom hardware, tailored to run a single service. Back when I went to Baidu to hear them share, their data center chose hardware per workload: for services with low CPU demands that are mainly network IO and disk IO, they go straight to ARM CPUs and blade servers, and some scenarios also use SSD drives. Their optimization really does squeeze the limits of the hardware; to save costs they don't even buy finished memory modules, they buy the memory chips directly. I don't know whether Shanda goes as far as buying specific hardware for specific services. The other trend is balanced hardware with optimization on the software side, so that one server can run all kinds of CPU-bound and IO-bound tasks without letting any one of them become the bottleneck; Google's virtualization leans this way. Obviously in the CDN scenario CPU is not the problem; according to Xie Da, the bottleneck is the hard drives. As for the other question, I remember the GFS paper mentioning something like multi-way sends: because hardware differs between servers and network environments differ, some transfers take exceptionally long, so the same data is sent to multiple servers and the first success cancels the other tasks. But that is a different application scenario; I didn't dig into it.
I digress... Anyway, a wonderful share... Let's give another round of applause.
The application of Go in Weibo data analysis
Next came the share brought by Shao Tianyu. Honestly, I personally feel his content didn't match the Go theme all that well: for many things he did in the project, another language or other open-source libraries might have done better, and the advantages of Go were not really highlighted; choosing Go was simply his strong personal preference, about which I'll reserve judgment. This was the first time this classmate had done this kind of sharing, and in any case, using Go is also Go practice; the content itself I found quite exciting.
Weibo data analysis, I think, can be viewed in the following parts: first data source acquisition, then data storage, then data analysis.
He first introduced his selection of open-source libraries. For data acquisition he wrote his own crawler to grab Weibo data, and showed us a use of Go interfaces here: a URL plus a handler (a sketch of the idea follows). For word segmentation and indexing he first tried Wukong, an open-source Go library, but it has a problem: it does no persistence and keeps all data in memory, so the memory footprint is huge. After communicating with the author and not getting a satisfactory solution, he switched to ES (Elasticsearch? I didn't catch the exact name). He listed a lot of open-source projects; I believe he did a lot of research.
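Here is what I imagine the "URL plus a handler" interface looked like, in the spirit of net/http's mux; all of these types are my own reconstruction, not his code.

```go
package main

import "fmt"

// Handler processes the body fetched from one URL.
type Handler interface {
	Handle(url string, body []byte)
}

// HandlerFunc lets a plain function satisfy Handler, as net/http does.
type HandlerFunc func(url string, body []byte)

func (f HandlerFunc) Handle(url string, body []byte) { f(url, body) }

type Crawler struct {
	handlers map[string]Handler
}

func NewCrawler() *Crawler { return &Crawler{handlers: make(map[string]Handler)} }

// Register associates a URL with the handler that will process its result.
func (c *Crawler) Register(url string, h Handler) { c.handlers[url] = h }

// Run would fetch each URL over HTTP in the real thing; the body is faked
// here so the sketch stays self-contained.
func (c *Crawler) Run() {
	for url, h := range c.handlers {
		h.Handle(url, []byte("fake body for "+url))
	}
}

func main() {
	c := NewCrawler()
	c.Register("https://weibo.com/some/feed", HandlerFunc(func(u string, b []byte) {
		fmt.Printf("got %d bytes from %s\n", len(b), u)
	}))
	c.Run()
}
```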
He also mentioned that their old system was in PHP, on hardware with 16-core CPUs and 32 GB of memory, while the Go version runs on just an ordinary PC. He cited lots of figures too: Weibo's active user count, the number of records crawled, and so on. Anyway, he let the data do the talking; I didn't follow it all, but it sounded impressive, hehe.
For data storage he uses LevelDB. He feels MySQL's performance is not to be trusted, and since the current scenario is just some simple storage and querying, LevelDB is more appropriate. They have crawled about 700 GB of data so far; I forget the record count. The host later came out to clarify that you can't really say MySQL doesn't fit here: MySQL is a storage framework, and the specific storage engine can be chosen according to need.
The data analysis part uses the label propagation algorithm, which should count as a very basic machine-learning algorithm. For example, Weibo users may attach all kinds of tags to themselves, such as IT or 80s. From the existing user tags and the relationships between users, label propagation learns which circle a user belongs to. The algorithm is a bit domain-specific, and students without the background won't necessarily follow it, but happily you can use such things without fully understanding them. He implemented this algorithm himself in Go too, rather than hunting down some data-mining library.
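For students without the background, a toy version of label propagation; real implementations weight edges, randomize update order, and handle ties, but the core loop is roughly this.

```go
package main

import "fmt"

// propagate spreads seed labels over the friendship graph: each unlabelled
// user repeatedly adopts the most common label among its neighbours.
func propagate(edges map[string][]string, labels map[string]string, rounds int) map[string]string {
	cur := make(map[string]string, len(labels))
	for k, v := range labels {
		cur[k] = v
	}
	for r := 0; r < rounds; r++ {
		next := make(map[string]string, len(cur))
		for k, v := range cur {
			next[k] = v
		}
		for user, friends := range edges {
			if _, seeded := labels[user]; seeded {
				continue // a user's own tag stays fixed
			}
			count := map[string]int{}
			for _, f := range friends {
				if l, ok := cur[f]; ok {
					count[l]++
				}
			}
			best, bestN := "", 0
			for l, n := range count {
				if n > bestN {
					best, bestN = l, n
				}
			}
			if best != "" {
				next[user] = best
			}
		}
		cur = next
	}
	return cur
}

func main() {
	edges := map[string][]string{
		"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"},
	}
	seed := map[string]string{"a": "it", "d": "80s"}
	fmt.Println(propagate(edges, seed, 3))
}
```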
Training the label propagation algorithm takes about a day. The data is trained incrementally; real-time is hard to do, so it computes incrementally on each new wave of data. It mainly serves internal users, so it doesn't matter if a query takes a second or so.
In the Q&A, one classmate raised doubts about the restrictions on crawling Weibo data, Sina constantly changing the API, crawl permissions, and so on. These days everyone treads carefully, afraid that crawling too much gets the account banned. Another student asked about his choice of open-source libraries, and about LevelDB not accepting writes during a merge.
I remember Shao Tianyu was very nervous at the beginning, modestly saying he had only been at this for a month and didn't have much to share... To be this good after one month, this young man shows great promise... OK, let's applaud the next one.
Golang and high-intensity online services
Shared by Han Tuo. The title was changed at the last minute; he confessed that "Golang and high-intensity online services" sounded a bit too pretentious. This share was fairly abstract: rather than a specific project, it was a collection of Go usage experience.
Of the many things said, what I remember includes: at their company, a panic must never be allowed to escape to the process level; it must be captured inside the goroutine.
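The reason is a basic property of Go: recover only works in a deferred call on the panicking goroutine's own stack, so an uncaught panic in any goroutine brings down the whole process. A common wrapper (my own, not Qiniu's code) looks like this.

```go
package main

import (
	"fmt"
	"sync"
)

// safeGo runs f in a goroutine and turns a panic into a log line instead
// of a process crash.
func safeGo(wg *sync.WaitGroup, f func()) {
	wg.Add(1)
	go func() {
		defer wg.Done()
		defer func() {
			if r := recover(); r != nil {
				fmt.Println("recovered from panic:", r)
			}
		}()
		f()
	}()
}

func main() {
	var wg sync.WaitGroup
	safeGo(&wg, func() { panic("boom") })
	wg.Wait()
	fmt.Println("process still alive")
}
```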
Likewise with memory usage: don't use Go for big-memory services (over 1 GB). The right tool for the right job, that's my takeaway, e.g. Go + memcached. The main reason is that Go's garbage collection isn't perfect: with heavy memory allocation, collection will cause stalls. Something like memcached, written in C, is certainly more professional at that.
HTTP is their most basic communication protocol.
CGO is to be avoided as much as possible. Even for things like audio and video transcoding, where only C libraries exist, their practice is to write the service in C and have Go call it.
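The pattern, as I understood it: the C library is wrapped in a small standalone daemon, and the Go side just talks to it over the wire. The endpoint and payload below are invented for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

// transcode posts raw media to a hypothetical C-based transcoding daemon
// instead of linking the C library via cgo.
func transcode(raw []byte) ([]byte, error) {
	resp, err := http.Post("http://127.0.0.1:9000/transcode",
		"application/octet-stream", bytes.NewReader(raw))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("transcoder: %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

func main() {
	out, err := transcode([]byte("fake audio"))
	if err != nil {
		fmt.Println("transcode failed:", err) // expected unless the daemon runs
		return
	}
	fmt.Println(len(out), "bytes transcoded")
}
```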
Then there was memory alignment, and a set of usage conventions Qiniu has settled on after stepping in most of the pits. What impressed me was their log handling: they rewrote the log package into two kinds of logs, the program log and the transaction log.
The program log adds line numbers, function names and other information. The cooler one is their transaction log: almost every function's first parameter is a log instance, so each layer of function calls adds another layer of records, and they keep all the log records. Logs can be used for locating bugs, for auditing, even for billing customers, and so on. When a request comes in, it is assigned something like a session ID, and all the log.* calls carry that session ID. These logs are eventually collected and made searchable, so that all the records for a single request can be traced.
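A sketch of the "first parameter is the log" idea: a per-request logger carries the session ID, every layer logs through it, and a collector can later reassemble all the lines for one request. The ReqLogger type is mine; Qiniu's real package surely differs.

```go
package main

import (
	"fmt"
	"math/rand"
)

// ReqLogger carries the session ID assigned when a request arrives, so
// every line it writes can be traced back to that request.
type ReqLogger struct {
	SessionID string
}

func (l *ReqLogger) Printf(format string, args ...interface{}) {
	fmt.Printf("[%s] "+format+"\n", append([]interface{}{l.SessionID}, args...)...)
}

func handleRequest() {
	log := &ReqLogger{SessionID: fmt.Sprintf("%08x", rand.Uint32())}
	log.Printf("request accepted")
	loadUser(log, 42)
}

// every layer takes the logger as its first argument, adding one more
// traceable record per call
func loadUser(log *ReqLogger, id int) {
	log.Printf("loading user %d", id)
}

func main() { handleRequest() }
```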
The error package was similarly rewritten.
It's getting late, so I'll just write up to here. Finally, to help run an advertisement: big cows bring little cows, little cows bring snails... Remember this slogan next time you attend the event; you can get a free book~