Writing a Python Crawler from Zero Basics: Crawling Baidu Tieba (Code Sharing)

Source: Internet
Author: User

I won't waste words here; let's go straight to the code, which is explained in the comments. If something is unclear, go back and review the basics first!

The code is as follows:

#-*- coding: utf-8 -*-
#---------------------------------------
#   Program: Baidu Tieba crawler
#   Version: 0.1
#   Author: why
#   Date: 2013-05-14
#   Language: Python 2.7
#   Usage: enter the address of a paged thread, strip the trailing number, and set the start and end pages.
#   Function: download every page in the range and save each one as an html file.
#---------------------------------------
import string, urllib2

# Baidu Tieba download function
def baidu_tieba(url, begin_page, end_page):
    for i in range(begin_page, end_page + 1):
        sName = string.zfill(i, 5) + '.html'  # zero-pad to a five-digit file name
        print 'Downloading page ' + str(i) + ' and saving it as ' + sName + '......'
        f = open(sName, 'w+')
        m = urllib2.urlopen(url + str(i)).read()
        f.write(m)
        f.close()

#-------- Enter the parameters here --------------------------------
# This is the address of a thread in the Shandong University Baidu Tieba:
# bdurl = 'http://tieba.baidu.com/p/2296017831?pn='
# iPostBegin = 1
# iPostEnd = 10

bdurl = str(raw_input(u'Enter the thread address, dropping the number after pn=:\n'))
begin_page = int(raw_input(u'Enter the start page number:\n'))
end_page = int(raw_input(u'Enter the end page number:\n'))
#-------- Enter the parameters here --------------------------------
# Call
baidu_tieba(bdurl, begin_page, end_page)

The above is a simple Python script that crawls Baidu Tieba pages. It is quite practical, and you can extend it on your own.
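The script above targets Python 2.7, where urllib2 and string.zfill no longer exist in Python 3. Below is a minimal Python 3 sketch of the same logic (the thread URL is the example from the article; the function names are my own):

```python
import urllib.request

def page_filename(page):
    """Zero-pad the page number into a five-digit html file name."""
    return str(page).zfill(5) + '.html'

def baidu_tieba(url, begin_page, end_page):
    """Download each page of the thread and save it as an html file."""
    for i in range(begin_page, end_page + 1):
        name = page_filename(i)
        print('Downloading page %d and saving it as %s ...' % (i, name))
        # Append the page number to the base URL (the part before 'pn=')
        with urllib.request.urlopen(url + str(i)) as resp:
            with open(name, 'wb') as f:
                f.write(resp.read())

# Example (requires network access):
# baidu_tieba('http://tieba.baidu.com/p/2296017831?pn=', 1, 10)
```

Writing the response bytes directly (mode 'wb') avoids having to guess the page's character encoding before saving it.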


How does a crawler capture an id and the id's content from the source code of a static web page using a regular expression in Python?

The question only shows the id itself and not the id's content, so it is unclear where that content is.
If the goal is just to extract the id, the regular expression is id-\d+.
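As a concrete sketch of that answer, assuming the static page contains tokens of the form id-<number> (the HTML snippet below is hypothetical), re.findall with a capture group pulls out the numeric parts:

```python
import re

# Hypothetical static-page snippet containing id-<number> tokens
html = '<div class="id-1024">first</div><div class="id-2048">second</div>'

# The capture group (\d+) returns just the digits after 'id-'
ids = re.findall(r'id-(\d+)', html)
```

Here `ids` is `['1024', '2048']`; dropping the parentheses (`r'id-\d+'`) would return the full tokens `id-1024` and `id-2048` instead.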

In the Scrapy framework, how does one make a Python crawler automatically follow to the next page and capture its content?

The crawler can follow to the next page by extracting the next-page link and yielding a new Request for it. For example:

item1 = Item()
yield item1
item2 = Item()
yield item2
req = Request(url='next page link', callback=self.parse)
yield req

Do not mix a return statement with yield in the same callback.
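The reason yield works here can be illustrated without Scrapy itself: a parse callback is a generator, so one call can emit several items and then a follow-up request. In this stand-in sketch, plain dicts play the role of Item and Request objects (hypothetical names, not the Scrapy API):

```python
def parse(response_url):
    """Generator standing in for a Scrapy parse() callback."""
    # Emit two scraped items first
    yield {'item': 1}
    yield {'item': 2}
    # Then emit a follow-up "request" for the next page;
    # in real Scrapy this would be a scrapy.Request with callback=self.parse
    yield {'request': response_url + '?pn=2'}

results = list(parse('http://example.com/page'))
```

All three yielded objects come out of the single call, which is exactly why a return statement (which would end the generator early) must not be mixed in.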
