Introduction to Python web crawlers

This article introduces the basic methods of writing a web crawler in Python. A web crawler, or WebSpider, is a very vivid name: if the Internet is compared to a spider's web, then a spider is a program crawling across that web. If you are interested in web crawlers, this article covers the basics.


1. Web crawler definition

A web crawler finds web pages by their link addresses. Starting from one page of a site (usually the homepage), it reads the page's content, extracts the other link addresses it contains, and then follows those links to the next pages. This repeats until every page of the site has been crawled. If the entire Internet is regarded as one site, a web spider can use this principle to capture every page on the Internet. In short, a web crawler is a program that fetches pages, and its basic operation is capturing web pages.
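The crawl loop described above can be sketched without any network access. The following is a minimal, hedged sketch: the `site` dictionary stands in for real pages and their links (a hypothetical example, not real data), and a queue drives the breadth-first traversal.

```python
# In-memory sketch of the crawl loop: start from one page, follow
# links, and repeat until every reachable page has been visited.
from collections import deque

def crawl(start, links):
    """links maps page -> list of linked pages; returns visit order."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)            # "capture" the page here
        for nxt in links.get(page, []):
            if nxt not in seen:       # skip pages already crawled
                seen.add(nxt)
                queue.append(nxt)
    return order

# Hypothetical link graph standing in for a small website.
site = {'index': ['a', 'b'], 'a': ['b', 'c'], 'b': [], 'c': ['index']}
pages = crawl('index', site)
```

A real crawler would fetch each page over HTTP and extract its links by parsing the HTML, exactly as the later sections of this article show.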

2. Web page browsing process

In fact, capturing a web page is the same process as browsing one in a browser such as IE. For example, enter www.baidu.com in the browser's address bar.

Opening a web page really means that the browser, acting as the "client", sends a request to the server, "grabs" the server's file, and then interprets and displays it.

HTML is a markup language that tags content so it can be parsed and distinguished. The browser's job is to parse the HTML code it receives and turn that raw code into the page we actually see.
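To make the "client sends a request" step concrete, the sketch below builds the text of a minimal HTTP GET request by hand. This is only an illustration of the request/response model; real code would let a library such as urllib handle this.

```python
# Build the raw text a browser "client" sends to a web server when
# it opens a page: a minimal HTTP/1.1 GET request.
def build_get_request(host, path='/'):
    """Return the text of a minimal HTTP/1.1 GET request."""
    return ('GET {path} HTTP/1.1\r\n'
            'Host: {host}\r\n'
            'Connection: close\r\n'
            '\r\n').format(path=path, host=host)

request = build_get_request('www.baidu.com')
```

The server answers with a response whose body is the HTML document; the browser then parses and renders that HTML.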

3. Python-based web crawler

1) Obtain the HTML page using Python

In fact, the most basic page fetch takes only two lines:


import urllib2
content = urllib2.urlopen('http://XXXX').read()

In this way, we obtain the entire HTML document. The key issue is that we usually need only the useful information from this document, not the whole thing, and that requires parsing HTML that is filled with all kinds of tags.
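The snippet above uses urllib2, which exists only in Python 2. For reference, here is a hedged sketch of the Python 3 equivalent using urllib.request; the data: URL is used purely so the example can run without network access (a real call would pass an http:// URL).

```python
# Python 3 equivalent of the urllib2 fetch above.
from urllib.request import urlopen

def fetch(url):
    """Fetch a URL and return the response body as bytes."""
    return urlopen(url).read()

# Demonstrate with a data: URL so no network access is needed;
# in practice you would pass a normal http:// or https:// URL.
html = fetch('data:text/html,<h1>hello</h1>')
```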

2) Parse the HTML after the crawler fetches the page

Python crawler HTML parsing library: SGMLParser

Python comes with HTMLParser, SGMLParser, and others by default. The former is awkward to use, so here is a sample program written with SGMLParser:


import urllib2
from sgmllib import SGMLParser

class ListName(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        self.is_h4 = ""
        self.name = []
    def start_h4(self, attrs):
        self.is_h4 = 1
    def end_h4(self):
        self.is_h4 = ""
    def handle_data(self, text):
        if self.is_h4 == 1:
            self.name.append(text)

content = urllib2.urlopen('http://169it.com/xxx.htm').read()
listname = ListName()
listname.feed(content)
for item in listname.name:
    print item.decode('gbk').encode('utf8')

A class called ListName is defined here, inheriting from SGMLParser. The is_h4 variable marks whether we are currently inside an h4 tag in the HTML; when an h4 tag is encountered, its content is appended to the list variable name. The start_h4() and end_h4() methods follow this prototype:


start_tagname(self, attrs)
end_tagname(self)

tagname is the tag name. For example, when <pre> is encountered, start_pre is called; when </pre> is encountered, end_pre is called. attrs holds the tag's attributes, returned as a list of the form [(attribute, value), (attribute, value), ...].
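Note that sgmllib (and SGMLParser with it) was removed in Python 3; the standard library's html.parser.HTMLParser offers the same kind of start/end/data callbacks. A hedged Python 3 sketch of the same ListName idea, fed a small literal HTML string instead of a downloaded page:

```python
# Python 3 version of the ListName idea, using the standard
# html.parser module (sgmllib no longer exists in Python 3).
from html.parser import HTMLParser

class ListName(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.is_h4 = False   # are we inside an <h4> tag?
        self.name = []       # collected h4 text
    def handle_starttag(self, tag, attrs):
        if tag == 'h4':
            self.is_h4 = True
    def handle_endtag(self, tag):
        if tag == 'h4':
            self.is_h4 = False
    def handle_data(self, data):
        if self.is_h4:
            self.name.append(data)

parser = ListName()
parser.feed('<div><h4>First</h4><p>x</p><h4>Second</h4></div>')
```

Unlike SGMLParser's per-tag start_h4/end_h4 methods, HTMLParser dispatches every tag to handle_starttag/handle_endtag, so the tag name is checked inside the callback.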

Python crawler HTML library: pyQuery

pyQuery is an implementation of jQuery in Python; it makes parsing HTML documents with jQuery-style syntax very convenient. Install it with easy_install pyquery, or:


sudo apt-get install python-pyquery

Example:


from pyquery import PyQuery as pyq

doc = pyq(url=r'http://169it.com/xxx.html')
cts = doc('.market-cat')

for i in cts:
    print '====', pyq(i).find('h4').text(), '===='
    for j in pyq(i).find('.sub'):
        print pyq(j).text(),
    print '\n'

Python crawler HTML library: BeautifulSoup

One headache is that most web pages do not fully comply with the standards, and all sorts of inexplicable errors make parsing them difficult. To solve this problem, we can choose the well-known BeautifulSoup to parse HTML documents; it has good fault tolerance.

That is all the content of this article. I have analyzed and introduced the implementation of a Python web crawler in detail, and I hope it helps you learn.

