Out of curiosity, I recently picked up webmagic, a Java web crawler framework. A crawler is really just a program that automates downloading. Saving a single image is quick: right-click and download. Saving 1000 images by hand takes a long time, but a computer can repeat those steps automatically once you give it the rules. After a few days of hard work, I was eager to crawl zhihu's pages and download the answers with more than 1000 likes. When I started the program, zhihu's server rejected my request with a 403 Forbidden error. After joining a web crawler QQ group, I learned that the crawler has to pretend to be a real "user" to fool the server; otherwise the server rejects such requests by default.
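To make the "pretend to be a user" idea concrete, here is a minimal sketch that attaches a browser-like User-Agent header to a request. It uses the JDK's built-in `java.net.http` client (Java 11+) rather than webmagic, and it only builds the request without sending it. The class name `BrowserLikeRequest` and the User-Agent string are assumptions chosen for illustration, and whether this header alone is enough to avoid a 403 depends on the site.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class BrowserLikeRequest {
    // Hypothetical browser-style User-Agent string, for illustration only.
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/120.0 Safari/537.36";

    // Builds (but does not send) a GET request that carries browser-like headers.
    static HttpRequest build(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", USER_AGENT)
                .header("Accept", "text/html")
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = build("https://www.zhihu.com/");
        // Print the request line and headers the server would see.
        System.out.println(request.method() + " " + request.uri());
        request.headers().map()
                .forEach((name, values) -> System.out.println(name + ": " + values));
    }
}
```

Without an explicit header, an HTTP client library usually announces itself with its own default User-Agent, which is one of the easiest signals for a server to filter on.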

This made me interested in network protocols. Typing an address into the browser and pressing Enter to open a page is, in essence, not much different from running a one-line command on the computer, and the server cannot tell the two apart. All it knows is that a connection request came from Changsha, Hunan; it does not know whether the request was sent by a real user or by a disguised crawler. As the saying goes, "on the Internet, no one knows whether the one on the other side is a human or a dog".

So what actually happens between the moment I press Enter and the moment the nicely rendered page appears in the browser? I remember that in my computer networking course I heard about protocol layering, router forwarding, transmission delay, man-in-the-middle attacks, and other TCP/IP topics. However, I have long since given all of that back to my teacher. I have just read the book Illustrated HTTP, which briefly introduces protocol layering, HTTP status codes, HTTP header fields, and web security. Most of it is easy to follow. Below I am sharing the mind map I made of the first seven chapters. If you are interested, feel free to download it.
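The point that a browser and a one-line command look alike to the server is easier to see once the request is written out as text. The sketch below assembles the plain-text HTTP/1.1 message a client would send for a simple GET. The class name `RawHttpRequest`, the host, and the User-Agent value are assumptions for illustration; a real browser sends more headers, and HTTPS wraps this text in TLS.

```java
public class RawHttpRequest {
    // Assembles the text of a minimal HTTP/1.1 GET request.
    // Each line ends with CRLF, and an empty line marks the end of the headers.
    static String build(String host, String path, String userAgent) {
        return "GET " + path + " HTTP/1.1\r\n"
                + "Host: " + host + "\r\n"
                + "User-Agent: " + userAgent + "\r\n"
                + "Accept: text/html\r\n"
                + "Connection: close\r\n"
                + "\r\n";
    }

    public static void main(String[] args) {
        // Hypothetical User-Agent value; the server only ever sees this text.
        System.out.print(build("www.zhihu.com", "/", "Mozilla/5.0 (hypothetical browser)"));
    }
}
```

Whoever produces these bytes, browser or crawler, the server receives the same kind of message, which is why the header values are what it has to judge by.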


P.S. The first webmagic code I ever committed was the program that crawled that page. So excited :)


