Shortly after I learned Python, I came across a web crawler project from South China Normal University. I studied it carefully and then wrote my own web crawler in Python that downloads pages with multiple threads.
Basic principles of web crawlers
I. Basic structure and workflow of Web crawlers
The following figure shows a general web crawler framework:
The basic workflow of web crawler is as follows:
1. Select a set of carefully chosen seed URLs;
2. Put these URLs into the URL queue to be crawled;
3. Take a URL from the queue of URLs to be crawled, resolve its DNS to get the host IP address, download the webpage at that URL, and store it in the library of downloaded pages. Then move the URL into the queue of crawled URLs.
4. Analyze the URLs in the queue of crawled URLs, extract other URLs from the downloaded pages, and put them into the queue of URLs to be crawled, entering the next loop. (A minimal sketch of this loop is given after this list.)
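As a rough sketch of this workflow (my own illustration, written in the same Python 2 style as the code later in this article; the regex-based extract_urls here is only a stand-in for a real link extractor such as the GetUrl module shown below):

import re
import urllib

def extract_urls(page):
    # crude link extraction for illustration only
    return re.findall(r'http://[^\s<>"()]+', page)

def crawl(seed_urls, max_pages=50):
    to_crawl = list(seed_urls)        # queue of URLs to be crawled
    crawled = set()                   # URLs that have already been crawled
    pages = {}                        # url -> downloaded content
    while to_crawl and len(pages) < max_pages:
        url = to_crawl.pop(0)         # take the next URL from the queue
        if url in crawled:
            continue
        crawled.add(url)
        try:
            s = urllib.urlopen(url).read()   # resolve the host and download the page
        except IOError:
            continue
        pages[url] = s                # store the page in the downloaded-page library
        for new_url in extract_urls(s):
            if new_url not in crawled:
                to_crawl.append(new_url)     # feed the next loop with the new URLs
    return pages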
II. Dividing the Internet from the crawler's perspective
Correspondingly, all the pages on the Internet can be divided into five parts:
1. Downloaded, unexpired webpages.
2. Downloaded, expired webpages: a crawled webpage is really a snapshot and backup of Internet content. The Internet is dynamic, and some of its content changes; when that happens, the corresponding crawled copies become outdated.
3. Webpages to be downloaded: the pages whose URLs are in the queue of URLs to be crawled.
4. Known webpages: pages that have not been crawled yet and are not in the queue of URLs to be crawled, but whose URLs can be discovered by analyzing pages that have already been crawled or the URLs waiting to be crawled.
5. Unknown webpages: pages that crawlers cannot directly discover or download.
III. Crawling policies
In a crawler system, the queue of URLs to be crawled is an important component. How the URLs in this queue are ordered also matters, because it determines which pages get crawled first and which later. The method that decides the order of these URLs is called the crawling policy. Several common crawling policies are described below:
1. Depth-first traversal policy
The depth-first traversal policy means that the crawler starts from a start page and follows one chain of links to its end; only after finishing that line does it move on to the next start page and continue following links. Taking the following figure as an example:
Traversal path: A-F-G E-H-I B C D
2. Breadth-first traversal policy
The basic idea of the breadth-first traversal policy is to append the links found on a newly downloaded page to the end of the queue of URLs to be crawled. That is, the crawler first crawls all pages linked from the start page, then picks one of those pages and crawls all pages linked from it, and so on. Taking the preceding figure as an example (a small sketch of both traversals follows the path below):
Traversal path: A-B-C-D-E-F G H I
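The only difference between the two traversals is whether the frontier behaves as a stack (depth-first) or as a queue (breadth-first). A minimal sketch of my own, assuming the link structure is given as an adjacency dict (the graph below is made up for illustration and is not the figure's graph):

from collections import deque

graph = {'A': ['B', 'C', 'D', 'E', 'F'],   # hypothetical link structure
         'B': [], 'C': [], 'D': [],
         'E': ['H', 'I'], 'F': ['G'],
         'G': [], 'H': [], 'I': []}

def depth_first(start):
    visited, stack = [], [start]
    while stack:
        node = stack.pop()               # LIFO: follow one branch to the end
        if node not in visited:
            visited.append(node)
            stack.extend(reversed(graph[node]))
    return visited

def breadth_first(start):
    visited, queue = [], deque([start])
    while queue:
        node = queue.popleft()           # FIFO: finish one level before the next
        if node not in visited:
            visited.append(node)
            queue.extend(graph[node])
    return visited

print depth_first('A')    # for the graph above: ['A', 'B', 'C', 'D', 'E', 'H', 'I', 'F', 'G']
print breadth_first('A')  # for the graph above: ['A', 'B', 'C', 'D', 'E', 'F', 'H', 'I', 'G']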
3. Reverse link count policy
The number of reverse links is the number of links on other webpages that point to a given webpage. It indicates the degree to which the page's content is recommended by others, so search engines' crawling systems often use this indicator to evaluate the importance of webpages and to decide the order in which different pages are crawled.
In a real network environment, because of advertising links and spam links, the raw number of reverse links does not reliably reflect a page's importance. Search engines therefore tend to count only reliable reverse links.
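As a rough sketch of this idea (my own illustration, not part of the crawler described later), backlink counts can be accumulated from the pages already parsed and then used to sort the queue; link_graph is an assumed mapping from each crawled page to the URLs it links to:

def sort_by_backlinks(to_crawl, link_graph):
    # link_graph: dict mapping each crawled page URL to the list of URLs it links to
    backlinks = {}
    for src, targets in link_graph.items():
        for t in targets:
            backlinks[t] = backlinks.get(t, 0) + 1
    # crawl the most-referenced URLs first
    return sorted(to_crawl, key=lambda u: backlinks.get(u, 0), reverse=True)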
4. Partial PageRank policy
The Partial PageRank algorithm borrows the idea of the PageRank algorithm: the already-downloaded webpages, together with the URLs in the queue to be crawled, form a collection of webpages, and a PageRank value is computed for each page in this collection. After the computation, the URLs in the queue to be crawled are sorted by PageRank value and crawled in that order.
Recomputing the PageRank values every time a single page is crawled would be too expensive. A compromise is to recompute them after every K pages have been crawled. This raises another problem, though: the links extracted from downloaded pages, i.e. the part we earlier called unknown webpages, have no PageRank value yet. To solve this, those pages are given a temporary PageRank value: the PageRank contributions passed in by all of a page's inbound links are summed to form the temporary value of the unknown page, which then takes part in the sorting. The following figure gives an example.
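Apart from the figure, a very simplified sketch of my own of this idea: run a few PageRank iterations over all known pages, let uncrawled pages accumulate the rank passed in by their inbound links, and sort the frontier by the result. The link_graph argument is an assumed adjacency dict of downloaded pages:

def partial_pagerank(to_crawl, link_graph, damping=0.85, iterations=10):
    # all pages we know about: downloaded pages, their outlinks, and the frontier
    pages = set(link_graph.keys()) | set(to_crawl)
    for targets in link_graph.values():
        pages |= set(targets)
    if not pages:
        return list(to_crawl)
    rank = dict.fromkeys(pages, 1.0 / len(pages))
    for _ in range(iterations):
        new_rank = dict.fromkeys(pages, (1.0 - damping) / len(pages))
        for src, targets in link_graph.items():
            if not targets:
                continue
            share = damping * rank[src] / len(targets)
            for t in targets:
                # pages not downloaded yet simply accumulate the rank passed in
                # by their inbound links (the "temporary PageRank" above)
                new_rank[t] += share
        rank = new_rank
    # crawl the URLs with the highest PageRank first
    return sorted(to_crawl, key=lambda u: rank.get(u, 0.0), reverse=True)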
5. OPIC policy (Online Page Importance Computation)
This algorithm in effect also rates the importance of pages. Before the algorithm starts, every page is given the same initial amount of cash. Whenever a page P is downloaded, P's cash is distributed among all the links going out of P, and P's own cash is cleared. All pages in the queue of URLs to be crawled are then sorted by the amount of cash they hold.
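A minimal sketch of the cash-distribution step (my own illustration; cash is a dict from URL to its current cash, and outlinks is the list of links extracted from the page that was just downloaded):

def opic_step(page_url, cash, outlinks):
    # distribute the downloaded page's cash evenly among its outgoing links
    if outlinks:
        share = cash.get(page_url, 0.0) / len(outlinks)
        for link in outlinks:
            cash[link] = cash.get(link, 0.0) + share
    cash[page_url] = 0.0    # clear the cash of the page that was just downloaded

def next_to_crawl(to_crawl, cash):
    # crawl the URLs with the most cash first
    return sorted(to_crawl, key=lambda u: cash.get(u, 0.0), reverse=True)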
6. Big site priority strategy
All URLs in the queue to be crawled are grouped by the website they belong to, and websites with the largest number of pages waiting to be downloaded are downloaded first. For this reason it is called the big site priority policy.
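A minimal sketch of this grouping (my own illustration), using the host name from urlparse as the site key:

from urlparse import urlparse    # Python 2; in Python 3 this lives in urllib.parse

def sort_by_site_size(to_crawl):
    # count how many pending URLs each site has
    site_count = {}
    for url in to_crawl:
        site = urlparse(url).netloc
        site_count[site] = site_count.get(site, 0) + 1
    # URLs belonging to the sites with the most pending pages come first
    return sorted(to_crawl, key=lambda u: site_count[urlparse(u).netloc], reverse=True)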
The algorithm my crawler uses to download webpages is breadth-first search (BFS). In the evaluations of crawler algorithms I found online, breadth-first search ranks second; the best results come from ranking pages by importance first and then deciding the download order (such algorithms are flexible, and I am not familiar enough with them to sort pages that way).
Now to the point: how to implement it.
With the approach described above, we can work top-down: first describe the big steps, then break them into smaller problems and solve them one by one.
For a web crawler that downloads pages by breadth-first traversal, the process works like this:
1. Download the first webpage from the given entry URL.
2. Extract all new webpage addresses from the first webpage and put them into the download list.
3. Download all new webpages in the download list.
4. Find the addresses that have not been downloaded yet among the new webpages and update the download list.
5. Repeat steps 3 and 4 until the updated download list is empty.
In fact, this simplifies to the following steps:
1. Download according to the download list.
2. Update the download list.
3. Repeat steps 1 and 2 until the list is empty.
So the original idea was to write a function that does this:
def craw():
    while len(urlList) != 0:
        init_url_list()
        download_list()
        update_list()
Of course, the function above cannot work as written; it is only the top-level idea, and the underlying parts are not implemented yet. But this step matters: at least it makes clear what has to be done.
The next step is to implement each part. This can be done in a class, which I named WebCrawler.
In Python, downloading a webpage from an address is not difficult. You can use urlopen in urllib to connect to the page and then call the read method of the returned object to get the page content as a string, as shown in the following code:
IDLE 2.6.6      ==== No Subprocess ====
>>> import urllib
>>> f = urllib.urlopen('http://www.hfut.edu.cn')
>>> s = f.read()
>>>
The variable s now holds the content of the page at http://www.hfut.edu.cn as a str. You can do whatever you like with it: write it to a file or extract new addresses from it. Of course, if all you want is to save the page, writing it to a file is enough.
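For example, continuing the session above, saving the page is just a file write (the file name page0.htm is arbitrary):

>>> fout = file('page0.htm', 'w')   # file() is Python 2; open() works as well
>>> fout.write(s)
>>> fout.close()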
Download speed is clearly a very important issue for a crawler; nobody wants a single-threaded crawler that downloads only one page at a time. On my school's campus network, a single-threaded crawler I tested averaged about 1 KB per second. The solution is multi-threading: opening several connections and downloading simultaneously is much faster. I am a beginner in Python and simply used what was at hand.
The download thread is another class, named CrawlerThread, which inherits from threading.Thread.
Because the download list has to be updated, the threads need to synchronize reads and writes of the shared tables. In the code I use a thread lock, constructed with threading.Lock(); calling the lock object's acquire() and release() ensures that only one thread operates on a table at a time. To keep the tables consistent I use several lists rather than one, because the crawler needs to know both which addresses still have to be downloaded and which have already been downloaded, and the downloaded addresses must be removed from the list of URLs extracted from new pages, which requires some temporary lists.
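The acquire/release pattern looks roughly like this (a stripped-down sketch of the same idea used later in CrawlerThread.run):

import threading

g_mutex = threading.Lock()
g_dledUrl = []                 # shared table of downloaded URLs

def record_downloaded(url):
    g_mutex.acquire()          # lock --> only one thread may touch the table
    try:
        g_dledUrl.append(url)
    finally:
        g_mutex.release()      # release --> let other threads proceed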
When the crawler downloads a page, it is best to record which file the page was saved to and, for breadth-first search, how deep the page was found. If you ever want to build a search engine, these records are useful for indexing and ranking pages; at the very least you will want to know what the crawler downloaded for you and where it was stored. The statements that write these records are marked with ## at the end of the line in the code.
I have written a lot of text already and don't want to write more, so here is the code.
The content of the file Test.py is as follows (it calls WebCrawler; run this file to start the crawler):
--------------------------------------------------------
# -*- coding: cp936 -*-
import WebCrawler

url = raw_input('Set the entry url (for example --> http://www.baidu.com): \n')
thNumber = int(raw_input('Set the number of threads: '))    # without the int() conversion there was a bug earlier

wc = WebCrawler.WebCrawler(thNumber)
wc.Craw(url)
The content of the file WebCrawler.py is as follows:
--------------------------------------------------------
# -*- coding: cp936 -*-
import threading
import GetUrl
import urllib

g_mutex = threading.Lock()
g_pages = []       # after a thread downloads a page, the page content is appended to this list
g_dledUrl = []     # all downloaded URLs
g_toDlUrl = []     # URLs to be downloaded
g_failedUrl = []   # URLs that failed to download
g_totalcount = 0   # number of downloaded pages

class WebCrawler:
    def __init__(self, threadNumber):
        self.threadNumber = threadNumber
        self.threadPool = []
        self.logfile = file('log.txt', 'w')    ##

    def download(self, url, fileName):
        Cth = CrawlerThread(url, fileName)
        self.threadPool.append(Cth)
        Cth.start()

    def downloadAll(self):
        global g_toDlUrl
        global g_totalcount
        i = 0
        while i < len(g_toDlUrl):
            j = 0
            while j < self.threadNumber and i + j < len(g_toDlUrl):
                g_totalcount += 1    # one more page entering the download loop
                self.download(g_toDlUrl[i + j], str(g_totalcount) + '.htm')
                print 'Thread started:', i + j, '-- File number =', g_totalcount
                j += 1
            i += j
            for th in self.threadPool:
                th.join(30)          # wait for the threads to finish, 30-second timeout
            self.threadPool = []     # clear the thread pool
        g_toDlUrl = []               # clear the list

    def updateToDl(self):
        global g_toDlUrl
        global g_dledUrl
        newUrlList = []
        for s in g_pages:
            newUrlList += GetUrl.GetUrl(s)    # see GetUrl.py for the implementation
        g_toDlUrl = list(set(newUrlList) - set(g_dledUrl))    # lists are unhashable, so convert to sets

    def Craw(self, entryUrl):    # breadth-first search by levels; ends when g_toDlUrl is empty
        g_toDlUrl.append(entryUrl)
        depth = 0
        while len(g_toDlUrl) != 0:
            depth += 1
            print 'Searching depth ', depth, '...\n\n'
            self.downloadAll()
            self.updateToDl()
            content = '\n>>>Depth ' + str(depth) + ':\n'    ## (## marks the statements that write the log file)
            self.logfile.write(content)                     ##
            i = 0                                           ##
            while i < len(g_toDlUrl):                       ##
                content = str(g_totalcount + i) + '->' + g_toDlUrl[i] + '\n'    ##
                self.logfile.write(content)                 ##
                i += 1                                      ##

class CrawlerThread(threading.Thread):
    def __init__(self, url, fileName):
        threading.Thread.__init__(self)
        self.url = url              # the URL this thread downloads
        self.fileName = fileName

    def run(self):    # the thread's work --> download one html page
        global g_mutex
        global g_failedUrl
        global g_dledUrl
        try:
            f = urllib.urlopen(self.url)
            s = f.read()
            fout = file(self.fileName, 'w')
            fout.write(s)
            fout.close()
        except:
            g_mutex.acquire()    # thread lock --> lock
            g_dledUrl.append(self.url)
            g_failedUrl.append(self.url)
            g_mutex.release()    # thread lock --> release
            print 'Failed downloading and saving', self.url
            return None          # remember to return!

        g_mutex.acquire()        # thread lock --> lock
        g_pages.append(s)
        g_dledUrl.append(self.url)
        g_mutex.release()        # thread lock --> release
The content of the file GetUrl.py is as follows (GetUrl extracts all URLs from a string of webpage content and returns them as a list; there are many ways to implement this, and you can certainly write a better one yourself):
--------------------------------------------------------
urlSep = ['<', '>', "'", '(', ')', r'"', ' ', '\t', '\n']   # characters that terminate a URL
urlTag = ['http://']

def is_sep(ch):
    for c in urlSep:
        if c == ch:
            return True
    return False

def find_first_sep(i, s):
    while i < len(s):
        if is_sep(s[i]):
            return i
        i += 1
    return len(s)

def GetUrl(strPage):
    rtList = []
    for tag in urlTag:
        i = 0
        i = strPage.find(tag, i, len(strPage))
        while i != -1:
            begin = i
            end = find_first_sep(begin + len(tag), strPage)
            rtList.append(strPage[begin:end])
            i = strPage.find(tag, end, len(strPage))
    return rtList    # return the list of extracted URLs