Implementation of a simple crawler in Python


Much of Python's power lies in its rich set of full-featured modules. Used sensibly, they spare you a lot of tangled low-level detail and improve development efficiency.

A fairly complete crawler takes only a few dozen lines of Python. Think how involved the same thing would be in low-level C: just fetching a web page would mean assembling packets byte by byte over a raw socket, parsing the response packets, and only then analysing the page data, which is quite a headache.

Below is a detailed look at how to build a crawler with Python.

0x01 Main functional modules of a simple crawler

URL manager: manages the set of URLs waiting to be crawled and the set of URLs already crawled, which prevents repeated crawling and crawling in loops. It mainly needs to support: adding a new URL to the to-be-crawled set, checking whether a URL about to be added is already in either container, checking whether there are still URLs left to crawl, handing out one URL to crawl, and moving a URL from the to-be-crawled set to the crawled set. The URL manager can be backed by an in-memory set(), a relational database, or a cache database. For a small crawler, keeping the data in memory is generally enough.
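As a rough illustration of what that interface looks like in code, here is a minimal sketch of a URL manager backed by two in-memory set() collections. The class and method names (UrlManager, add_new_url, has_new_url, get_new_url) are illustrative choices, not taken from the original project.

# A minimal in-memory URL manager sketch; names are illustrative, not the article's own code.
class UrlManager(object):
    def __init__(self):
        self.new_urls = set()   # URLs waiting to be crawled
        self.old_urls = set()   # URLs already crawled

    def add_new_url(self, url):
        # Only add a URL that is in neither set, to avoid repeats and loops.
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        if urls is None:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        # Hand out one URL and move it to the crawled set.
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url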

Web downloader: given a URL, downloads the HTML page and saves it as a text file or an in-memory string. Python provides the urllib2 module and the requests module for this; the concrete code is analysed in detail below.
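For illustration, a downloader can be a thin wrapper around urllib2. The sketch below assumes a class name HtmlDownloader and a simple check for HTTP status 200; neither is taken from the article's own code.

import urllib2

# A minimal downloader sketch built on urllib2.
class HtmlDownloader(object):
    def download(self, url):
        if url is None:
            return None
        response = urllib2.urlopen(url)
        if response.getcode() != 200:
            return None           # only accept a successful response
        return response.read()    # return the page content as a string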

Web parser: given the HTML document, extracts the new URLs and the data you care about. How do you get the information you need out of an HTML document? You can study the structure surrounding that information and pull it out by fuzzy matching with Python regular expressions, but this approach becomes painful with complex HTML. Alternatively, the document can be parsed in a structured way, either with Python's built-in html.parser or with third-party modules such as Beautiful Soup or lxml. What is structured parsing? It treats the web page as a tree structure, officially called the DOM (Document Object Model).

The node data is then obtained by searching the tree for the nodes you need.
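To make the parser's two jobs concrete (collecting new URLs and extracting the data you care about), here is a rough sketch using BeautifulSoup. The class name HtmlParser, the urlparse.urljoin call for resolving relative links, and the choice of the page title as the "data" are illustrative assumptions, not the article's own code.

import urlparse
from bs4 import BeautifulSoup

# A rough parser sketch: returns the new URLs found on the page and some page data.
class HtmlParser(object):
    def parse(self, page_url, html_content):
        if page_url is None or html_content is None:
            return set(), None
        soup = BeautifulSoup(html_content, 'html.parser')
        new_urls = set()
        for link in soup.find_all('a', href=True):
            # Resolve relative links against the current page URL.
            new_urls.add(urlparse.urljoin(page_url, link['href']))
        # "The data you care about" depends on the site; here we just take the title.
        title_tag = soup.find('title')
        data = title_tag.get_text() if title_tag is not None else None
        return new_urls, data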

Run process: the scheduler asks the URL manager whether there is still a URL to crawl; if so, it takes one, hands it to the downloader to fetch the HTML content, hands that content to the parser to extract the new URLs and the data of interest, and finally puts the newly found URLs back into the URL manager.
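Put together, the run process can be written as a short scheduling loop. The sketch below reuses the UrlManager, HtmlDownloader and HtmlParser sketches above; the SpiderMain name and the fixed crawl limit are illustrative assumptions rather than the article's actual code.

# A scheduling-loop sketch tying the three components together.
class SpiderMain(object):
    def __init__(self):
        self.urls = UrlManager()
        self.downloader = HtmlDownloader()
        self.parser = HtmlParser()
        self.datas = []

    def craw(self, root_url, max_count=100):
        self.urls.add_new_url(root_url)
        count = 0
        while self.urls.has_new_url() and count < max_count:
            new_url = self.urls.get_new_url()                 # ask the manager for a URL
            html_content = self.downloader.download(new_url)  # download the page
            new_urls, data = self.parser.parse(new_url, html_content)  # parse it
            self.urls.add_new_urls(new_urls)                  # feed the new URLs back in
            if data is not None:
                self.datas.append(data)                       # keep the data we care about
            count += 1

if __name__ == '__main__':
    spider = SpiderMain()
    spider.craw("http://www.baidu.com")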

0x02 Using the urllib2 module

There are several ways to use urllib2.

The first method:

Fetch the HTML directly with urlopen.

" http://www.baidu.com " Print ' The first method '  = urllib2.urlopen (URL)print  response1.getcode ()print len ( Response1.read ())

The second method:

This method builds the HTTP request headers yourself and disguises the request as a browser, which can bypass some anti-crawling mechanisms; constructing the request headers yourself is also more flexible.

url = "http://www.baidu.com"

print 'The second method'
request = urllib2.Request(url)
request.add_header("User-Agent", "Mozilla/5.0")
response2 = urllib2.urlopen(request)
print response2.getcode()
print len(response2.read())

The third method:

Add cookie handling, so that pages which require a login can still be fetched.

" http://www.baidu.com " Print ' The third method '      = = = Urllib2.urlopen (URL) print response3.getcode () Print CJ print len (Response3.read ())

Of course, using these methods requires importing urllib2; the third one additionally requires importing cookielib.
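Concretely, that means putting the following at the top of the scripts above:

import urllib2      # needed by all three methods
import cookielib    # additionally needed by the third (cookie) method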

0x03 Using BeautifulSoup

The following is a brief look at how BeautifulSoup is used. It basically takes three steps: create a BeautifulSoup object, find the nodes, and read the node content.

from bs4 import BeautifulSoup
import re

# sample document from the BeautifulSoup documentation
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf-8')

print 'Get all links'
links = soup.find_all('a')
for link in links:
    print link.name, link['href'], link.get_text()

print 'Get Lacie link'
link_node = soup.find('a', href='http://example.com/lacie')
print link_node.name, link_node['href'], link_node.get_text()

print 'Match'
link_node = soup.find('a', href=re.compile(r'ill'))
print link_node.name, link_node['href'], link_node.get_text()

print 'P'
p_node = soup.find('p', class_="title")
print p_node.name, p_node.get_text()

0x04 A simple crawler implementation

No need to be long-winded here; the full code has been uploaded to GitHub: github.com/zibility/spider
