A simple crawler that breaks through complex verification codes and IP access restrictions
Article Address: http://www.cnblogs.com/likeli/p/4730709.html
Well, as the title suggests, this post is about a crawler I wrote. The target site has some anti-crawling measures in place, hence this article.
Let me describe the scenario first:
For work we often have to look up a lot of data on the internet and archive it. At some point this kind of task was assigned to me. After looking at the target web site, my first reaction was to fetch the data with a crawler; why do this kind of mechanical work by hand?
But this site has some anti-crawler awareness and has done enough work to cause my crawler a bit of trouble.
Let's start by listing the problems:
- First, the verification code (CAPTCHA): the site uses simple arithmetic made up of digits and Chinese characters as its verification code.
- The query target's path and parameters are encrypted, so I can't skip straight to certain pages by constructing the path and parameters myself.
- IP restrictions: the site limits the number of visits per IP. In my tests, a single IP could crawl at most about 40 valid records per hour (for my crawl target that means close to 200 HTTP requests; and if an IP makes more than 25 HTTP requests within 30 seconds, it is blocked outright).
Those are the main problems. I won't list the small issues that came up during the crawl, since the blog garden (cnblogs) is full of solutions for those. Here I mainly want to talk about the verification code and the IP restrictions.
Of course, my solution is nothing brilliant; it's all the same old well-trodden path.
1. Verification Code
Original:
The difficulty with this kind of verification code lies in characters sticking together and being randomly rotated. For these two problems I used projection-histogram segmentation to cut the characters apart, and the "jam" method (minimum projection width, described below) to correct the rotation angle.
I first wrote a tool to test:
As you can see from the above, dear readers, my approach is fairly simple and traditional: build a feature library, split the image into individual characters, and decide what each character is by matching it against the library for similarity. I didn't use third-party optical character recognition (OCR), because its recognition rate for Chinese characters is still fairly poor, and the verification code actually uses only a handful of specific Chinese characters (plus, minus and so on), so a feature library is more than enough.
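Just to illustrate the idea (this is a minimal sketch, not my production code), matching a character against the feature library can be as simple as counting matching pixels on a fixed-size binarized grid; the TemplateMatcher class and the 16x16 grid size here are made up for the example:

using System;
using System.Collections.Generic;

// Sketch of feature-library matching: each template is a fixed-size boolean
// grid (true = ink). The candidate character is compared pixel by pixel and
// the template with the highest match ratio wins.
class TemplateMatcher
{
    // label -> 16x16 binarized template (hypothetical storage format)
    private readonly Dictionary<string, bool[,]> _library = new Dictionary<string, bool[,]>();

    public void Add(string label, bool[,] template) { _library[label] = template; }

    public string Recognize(bool[,] candidate)
    {
        string best = null;
        double bestScore = 0;
        foreach (var pair in _library)
        {
            int same = 0, total = 0;
            for (int y = 0; y < 16; y++)
                for (int x = 0; x < 16; x++)
                {
                    if (pair.Value[y, x] == candidate[y, x]) same++;
                    total++;
                }
            double score = (double)same / total;
            if (score > bestScore) { bestScore = score; best = pair.Key; }
        }
        return best;   // e.g. a digit or one of the few arithmetic characters
    }
}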
A few notes on the verification code. For grayscale conversion and binarization there are plenty of algorithms in the blog garden, but for noise reduction, that is, removing the interference lines, you have to write an algorithm specific to your target. I strip the lines off layer by layer: each pass peels a 1px layer off the dark strokes and fills it with white. Of course this approach is not universal; different verification codes need the noise removed in different ways, depending on what you observe.
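A rough sketch of that "peel 1px and fill with white" idea in its most naive form (illustrative only; GetPixel/SetPixel is slow, and the 0.5 brightness cutoff is just an assumption for the example):

using System.Drawing;

class NoiseRemover
{
    // Turn every dark pixel that touches a white neighbour into white.
    // A 1px-wide interference line disappears after one pass, while the
    // thicker character strokes only lose a thin outer layer.
    public static Bitmap PeelOnePixel(Bitmap src)
    {
        var result = new Bitmap(src);
        for (int y = 1; y < src.Height - 1; y++)
        {
            for (int x = 1; x < src.Width - 1; x++)
            {
                if (src.GetPixel(x, y).GetBrightness() > 0.5f) continue; // already background
                bool touchesWhite =
                    src.GetPixel(x - 1, y).GetBrightness() > 0.5f ||
                    src.GetPixel(x + 1, y).GetBrightness() > 0.5f ||
                    src.GetPixel(x, y - 1).GetBrightness() > 0.5f ||
                    src.GetPixel(x, y + 1).GetBrightness() > 0.5f;
                if (touchesWhite) result.SetPixel(x, y, Color.White);
            }
        }
        return result;
    }
}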
Segmentation, that is, the histogram. Actually, for my verification code a single-colour histogram based on colour would have worked, completing character segmentation and noise reduction in one step (I had this idea but never actually implemented it; judging from some experts' blogs, the approach is feasible). I also looked into the drip (water-drop) segmentation method and dug up some papers on it, but I didn't really understand them. Here's a simple way to draw the histogram:
// Draw a histogram
var zftBit = new Bitmap(bit4.Width, bit4.Height);
using (Graphics g = Graphics.FromImage(zftBit))
{
    Pen pen = new Pen(Color.Blue);
    for (int i = 0; i < bit4.Width; i++)
    {
        g.DrawLine(pen, i, bit4.Height - yzhifang[i] * 2, i, bit4.Height);
    }
    // Valve value (threshold line)
    g.DrawLine(new Pen(Color.Red), 0, bit4.Height - 2, bit4.Width, bit4.Height - 2);
}
p_zft.Image = zftBit;
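For context, yzhifang above is simply the count of dark pixels in each column. A minimal sketch of computing that projection and cutting characters wherever it drops below the valve (threshold) value; illustrative only, the ProjectionSegmenter name and the 0.5 brightness cutoff are assumptions:

using System.Collections.Generic;
using System.Drawing;

class ProjectionSegmenter
{
    // Count dark pixels in every column (this is what yzhifang holds above).
    public static int[] VerticalProjection(Bitmap bmp)
    {
        var projection = new int[bmp.Width];
        for (int x = 0; x < bmp.Width; x++)
            for (int y = 0; y < bmp.Height; y++)
                if (bmp.GetPixel(x, y).GetBrightness() < 0.5f)
                    projection[x]++;
        return projection;
    }

    // Cut the image into character ranges wherever the projection stays
    // above the threshold (the red "valve value" line in the histogram).
    public static List<Rectangle> Split(Bitmap bmp, int threshold)
    {
        var projection = VerticalProjection(bmp);
        var ranges = new List<Rectangle>();
        int start = -1;
        for (int x = 0; x < projection.Length; x++)
        {
            bool ink = projection[x] > threshold;
            if (ink && start < 0) start = x;          // a character begins
            if (!ink && start >= 0)                   // a character ends
            {
                ranges.Add(new Rectangle(start, 0, x - start, bmp.Height));
                start = -1;
            }
        }
        if (start >= 0) ranges.Add(new Rectangle(start, 0, bmp.Width - start, bmp.Height));
        return ranges;
    }
}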
As for the randomly rotated characters, my approach is to split the verification code into individual characters, rotate each one between minus 30 and plus 30 degrees, and compute the projection width after every rotation. Since Chinese characters are basically square blocks, the rotation with the smallest width is the upright one. There is one small catch: if the source character has been rotated by more than 45°, the "lying down" orientation also has the smallest width; but if you train the feature library with samples in all four orientations, that takes care of it. This is the "jam" method.
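A minimal sketch of the jam method's core step (illustrative only; the angle range and one-degree step are assumptions): project the character's dark pixels at each candidate angle and keep the angle that gives the smallest width.

using System;
using System.Collections.Generic;
using System.Drawing;

class RotationCorrector
{
    // Try angles from -30 to +30 degrees and return the one that minimises
    // the projected width of the character's dark pixels.
    public static double FindBestAngle(Bitmap ch)
    {
        // Collect the dark pixels once.
        var points = new List<Point>();
        for (int y = 0; y < ch.Height; y++)
            for (int x = 0; x < ch.Width; x++)
                if (ch.GetPixel(x, y).GetBrightness() < 0.5f)
                    points.Add(new Point(x, y));

        double bestAngle = 0, bestWidth = double.MaxValue;
        for (int deg = -30; deg <= 30; deg++)
        {
            double rad = deg * Math.PI / 180.0;
            double min = double.MaxValue, max = double.MinValue;
            foreach (var p in points)
            {
                // Project the pixel onto an axis rotated by deg degrees.
                double rx = p.X * Math.Cos(rad) + p.Y * Math.Sin(rad);
                if (rx < min) min = rx;
                if (rx > max) max = rx;
            }
            double width = max - min;
            if (width < bestWidth) { bestWidth = width; bestAngle = deg; }
        }
        return bestAngle;   // rotate the character back by this angle before matching
    }
}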
2. IP Restrictions
Here I used the crudest, least technical "non-solution": I simply break through by switching proxies, with no technical content whatsoever. After a proxy is attached, I visit the target site and judge from the response whether the proxy is still valid; if not, I roll the current query target back once and switch to the next proxy.
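Sketched very roughly, the proxy handling amounts to the loop below. The ProxyFetcher class, the proxy-list source and the "访问受限" (access restricted) marker used to detect a blocked response are all hypothetical; the real check depends on what the target site actually returns.

using System;
using System.Collections.Generic;
using System.IO;
using System.Net;

class ProxyFetcher
{
    // Proxy entries like "http://host:port"; where they come from is site-specific.
    private readonly Queue<string> _proxies;

    public ProxyFetcher(IEnumerable<string> proxies)
    {
        _proxies = new Queue<string>(proxies);
    }

    // Fetch a URL through the current proxy; if the response looks blocked or
    // the proxy fails, switch to the next proxy and retry the same target
    // (this is the "roll back the current query target" step).
    public string Fetch(string url)
    {
        while (_proxies.Count > 0)
        {
            string proxy = _proxies.Peek();
            try
            {
                var request = (HttpWebRequest)WebRequest.Create(url);
                request.Proxy = new WebProxy(proxy);
                request.Timeout = 10000;
                using (var response = (HttpWebResponse)request.GetResponse())
                using (var reader = new StreamReader(response.GetResponseStream()))
                {
                    string html = reader.ReadToEnd();
                    if (!html.Contains("访问受限"))   // hypothetical "blocked" marker
                        return html;
                }
            }
            catch (WebException)
            {
                // Proxy unreachable or rejected: fall through and discard it.
            }
            _proxies.Dequeue();   // current proxy is dead, switch to the next one
        }
        throw new InvalidOperationException("No working proxy left");
    }
}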
3. Crawler
Now for the protagonist, the crawler. My earliest design didn't control the pacing of consecutive requests at all, which burned through proxies especially fast, so I had to find a way around that. Also, since there was no dedicated crawler server, I could only get the job done with the office computers. So I designed a bus-style crawler.
I wrote a crawler server and a crawler client: the server acts as the central dispatcher that allocates the work, and the clients do the actual crawling. In practice the clients don't all run at the same speed: requests come back fast or slow, and validating proxies also takes time, so the clients never finish their tasks at the same moment. That's why I set up one machine as the central dispatcher, handing out the task list in small batches, receiving the results the clients send back, and, once all tasks are done, exporting everything or writing it to the database in one go.
Crawler node
On each crawler node I run 17 threads: 10 for validating proxy IPs and 7 for crawling data. Install the software on 10 office notebooks and crawl together, and it's the equivalent of 70 people hitting the site at once. With that, the efficiency problem was solved as well.
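A bare-bones sketch of that thread split on a node (the CrawlerNode class, the queues and the IsProxyAlive/Crawl stubs are placeholders, not my actual code):

using System.Collections.Concurrent;
using System.Threading;

class CrawlerNode
{
    private readonly BlockingCollection<string> _rawProxies = new BlockingCollection<string>();
    private readonly BlockingCollection<string> _goodProxies = new BlockingCollection<string>();
    private readonly BlockingCollection<string> _tasks = new BlockingCollection<string>();

    public void Start()
    {
        // 10 threads validate proxies and feed the "good" queue...
        for (int i = 0; i < 10; i++)
            new Thread(() =>
            {
                foreach (var proxy in _rawProxies.GetConsumingEnumerable())
                    if (IsProxyAlive(proxy)) _goodProxies.Add(proxy);
            }) { IsBackground = true }.Start();

        // ...while 7 threads take tasks and crawl through validated proxies.
        for (int i = 0; i < 7; i++)
            new Thread(() =>
            {
                foreach (var url in _tasks.GetConsumingEnumerable())
                    Crawl(url, _goodProxies.Take());
            }) { IsBackground = true }.Start();
    }

    private bool IsProxyAlive(string proxy) { /* site-specific check */ return true; }
    private void Crawl(string url, string proxy) { /* fetch and report back to the bus */ }
}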
Bus
On the bus side, the task list is handed out to the nodes in small batches (this is the diagram from the previous section). I originally split the whole list up front, but then found that the clients don't finish at the same time: some are fast and some are slow, so the fast ones sat idle while the slow ones were still plodding along. After switching to small-batch allocation, I got dynamic scheduling of the workload essentially for free.
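The "small dose" allocation on the bus side boils down to a loop like this (sketch only; the Bus class, the batch size of 10 and the Report callback are illustrative, and the transport between the bus and the nodes can be whatever you already have):

using System.Collections.Concurrent;
using System.Collections.Generic;

class Bus
{
    private readonly ConcurrentQueue<string> _pending = new ConcurrentQueue<string>();
    private readonly List<object> _results = new List<object>();

    public Bus(IEnumerable<string> allTargets)
    {
        foreach (var t in allTargets) _pending.Enqueue(t);
    }

    // A node calls this whenever it runs dry: it gets a small batch, so fast
    // nodes simply come back more often -- dynamic scheduling for free.
    public List<string> TakeBatch(int batchSize = 10)
    {
        var batch = new List<string>();
        string target;
        while (batch.Count < batchSize && _pending.TryDequeue(out target))
            batch.Add(target);
        return batch;
    }

    // Nodes push their results back; the bus just collects them here, and the
    // final export or database write happens once everything is finished.
    public void Report(IEnumerable<object> results)
    {
        lock (_results) _results.AddRange(results);
    }
}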
Postscript
That's basically it for this article. There isn't much code; I mainly described the ideas behind it, because my solution isn't universal: no two verification codes are alike (except for some extremely simple, regular, purely numeric or alphabetic ones, which don't really count).
A simple crawler breaking through IP access restrictions and complex verification codes: a small summary.