Writing a Python crawler from scratch: crawling Baidu Tieba (code sharing)
I won't waste words here: the code is below, and it is explained in the comments. If something is unclear, go back and learn the basics first!
The code is as follows:
# -*- coding: utf-8 -*-
#---------------------------------------
#   Program:    Baidu Tieba crawler
#   Version:    0.1
#   Author:     why
#   Date:       2013-05-14
#   Language:   Python 2.7
#   Usage:      Enter a paginated post address with the trailing page number removed,
#               then set the start and end page numbers.
#   Function:   Downloads every page in the given range and stores each one as an HTML file.
#---------------------------------------

import string, urllib2

# Define the Baidu Tieba download function
def baidu_tieba(url, begin_page, end_page):
    for i in range(begin_page, end_page + 1):
        sName = string.zfill(i, 5) + '.html'  # zero-pad the page number to a five-digit file name
        print 'Downloading page ' + str(i) + ' and saving it as ' + sName + '......'
        f = open(sName, 'w+')
        m = urllib2.urlopen(url + str(i)).read()
        f.write(m)
        f.close()

#-------- Enter the parameters here --------------------
# Example: the address of a post on the Shandong University Baidu Tieba.
# bdurl = 'http://tieba.baidu.com/p/2296017831?pn='
# iPostBegin = 1
# iPostEnd = 10

bdurl = str(raw_input(u'Enter the address of the post, without the number after pn=:\n'))
begin_page = int(raw_input(u'Enter the start page number:\n'))
end_page = int(raw_input(u'Enter the end page number:\n'))
#-------- Enter the parameters here --------------------

# Call the function
baidu_tieba(bdurl, begin_page, end_page)
The above is a simple Python script that crawls Baidu Tieba. It is quite practical, and you can expand it on your own.
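For instance, one possible extension is basic error handling plus a short pause between requests. The sketch below assumes Python 2.7 like the original script; the function name, timeout, and delay value are only illustrative choices, not part of the original code:

# -*- coding: utf-8 -*-
# Sketch of one possible extension: skip pages that fail to download and pause between requests.
import time
import string
import urllib2

def baidu_tieba_safe(url, begin_page, end_page, delay=1):
    for i in range(begin_page, end_page + 1):
        sName = string.zfill(i, 5) + '.html'
        try:
            # timeout is in seconds; an illustrative value
            m = urllib2.urlopen(url + str(i), timeout=10).read()
        except urllib2.URLError as e:
            print 'Failed to download page ' + str(i) + ': ' + str(e)
            continue
        f = open(sName, 'w+')
        f.write(m)
        f.close()
        time.sleep(delay)  # be polite to the server between requests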
How does a Python crawler use regular expressions to extract the id and the id content from the source code of a static webpage?
I only see the id, not the id content. Where is the content?
If you only want to extract the id, the regular expression is id-\d+.
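A minimal sketch of that kind of extraction is below, assuming the page source contains markers of the form id-12345; the URL and the exact pattern are assumptions for illustration only:

# -*- coding: utf-8 -*-
# Sketch: pull every "id-<digits>" marker out of a static page's source.
import re
import urllib2

html = urllib2.urlopen('http://example.com/page.html').read()
ids = re.findall(r'id-\d+', html)  # matches strings such as id-12345
print ids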
In the Scrapy framework, how do you make a Python crawler automatically follow to the next page and capture its content?
A crawler can follow the next page by extracting the next-page link and then issuing a new request for it. See:
item1 = Item()
yield item1
item2 = Item()
yield item2
req = Request(url='next page link', callback=self.parse)
yield req
Do not use the return statement when using yield.
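Put together, a minimal Scrapy spider following that pattern might look like the sketch below. It assumes a reasonably recent Scrapy version; the spider name, start URL, and selectors are illustrative assumptions, not taken from the original answer:

# -*- coding: utf-8 -*-
# Sketch of a Scrapy spider that yields items for the current page,
# then yields a Request for the next page with the same callback.
import scrapy

class PostSpider(scrapy.Spider):
    name = 'post_spider'
    start_urls = ['http://example.com/list?page=1']

    def parse(self, response):
        # Yield one item per entry on the current page.
        for title in response.xpath('//h2/a/text()').extract():
            yield {'title': title}
        # Follow the "next page" link, if any, and parse it with the same method.
        next_url = response.xpath('//a[@class="next"]/@href').extract_first()
        if next_url:
            yield scrapy.Request(url=response.urljoin(next_url), callback=self.parse)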