Using Python for Web Crawlers: a Usage Guide to the Basic Modules and the Scrapy Framework

Source: Internet
Author: User


Basic modules
A Python crawler (web spider) fetches pages from a website, then parses them and extracts the data of interest.

The basic approach uses the urllib, urllib2, re, and other standard-library modules.
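Before the individual examples, here is a minimal end-to-end sketch of how the two halves fit together: urllib2 downloads the page and re extracts the parts of interest. The URL and the regular expression are illustrative placeholders only, not taken from the examples below.

#!coding=utf-8
# Minimal fetch-and-extract sketch: urllib2 downloads a page, re pulls data out of it.
# The URL and the regular expression are illustrative placeholders.
import urllib2
import re

url = 'http://example.com/'
html = urllib2.urlopen(url).read()

# extract the text of every <title>...</title> element on the page
titles = re.findall(r'<title>(.*?)</title>', html, re.DOTALL)
for t in titles:
    print t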

Some basic usage examples:

(1) Perform a basic GET request to obtain the HTML

#!coding=utf-8
import urllib
import urllib2

url = 'http://xxx'

# GET request
request = urllib2.Request(url)
try:
    # send the request and receive the response
    response = urllib2.urlopen(request)
except urllib2.HTTPError, e:
    if hasattr(e, 'reason'):
        print e.reason

# read the response body
html = response.read()
# read the response headers
headers = response.info()


(2) Form submission

#!coding=utf-8
import urllib2
import urllib

post_url = ''
post_data = urllib.urlencode({
    'username': 'username',
    'password': 'password',
})
post_headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:31.0) Gecko/20100101 Firefox/31.0',
}
request = urllib2.Request(
    url=post_url,
    data=post_data,
    headers=post_headers,
)
response = urllib2.urlopen(request)
html = response.read()

(3) Capture the content of a Baidu Tieba post and save it to a text file

#!coding=utf-8
import urllib2
import re
import sys

reload(sys)
sys.setdefaultencoding('utf-8')

page_num = 1
url = 'http://tieba.baidu.com/p/3238280985?see_lz=1&pn=' + str(page_num)
myPage = urllib2.urlopen(url).read().decode('gbk')

# extract the post content blocks
myRe = re.compile(r'class="d_post_content j_d_post_content ">(.*?)</div>', re.DOTALL)
items = myRe.findall(myPage)

f = open('baidu.txt', 'a+')

i = 0
texts = []
for item in items:
    i += 1
    print i
    text = item.replace('<br>', '')
    text = text.replace('\n', '').replace(' ', '') + '\n'
    print text
    f.write(text)
f.close()

(4) Simulate logging in to the 163 mailbox and download mail content

# coding: utf-8
'''
Simulate logging in to the 163 mailbox and download the mail content
'''
import urllib
import urllib2
import cookielib
import re
import time
import json


class Email163:
    header = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
    user = ''
    cookie = None
    sid = None
    mailBaseUrl = 'http://twebmail.mail.163.com'

    def __init__(self):
        self.cookie = cookielib.CookieJar()
        cookiePro = urllib2.HTTPCookieProcessor(self.cookie)
        urllib2.install_opener(urllib2.build_opener(cookiePro))

    def login(self, user, pwd):
        '''
        Log on
        '''
        postdata = urllib.urlencode({
            'username': user,
            'password': pwd,
            'type': 1
        })
        # note that the logon URL is different from the mailbox URL
        req = urllib2.Request(
            url='https://ssl.mail.163.com/entry/coremail/fcg/ntesdoor2?funcid=loginone&language=-1&passtype=1&iframe=1&product=mail163&from=web&df=email163&race=-2_45_-2_hz&module=&uid=' + user + '&style=10&net=t&skinid=null',
            data=postdata,
            headers=self.header,
        )
        res = str(urllib2.urlopen(req).read())
        # print res
        patt = re.compile('sid=([^"]+)', re.I)
        patt = patt.search(res)

        uname = user.split('@')[0]
        self.user = user
        if patt:
            self.sid = patt.group(1).strip()
            # print self.sid
            print '%s Login Successful.....' % (uname)
        else:
            print '%s Login failed....' % (uname)

    def getInBox(self):
        '''
        Get the mailbox list
        '''
        print '\nGet mail lists.....\n'
        sid = self.sid
        url = self.mailBaseUrl + '/jy3/list.do?sid=' + sid + '&fid=1&fr=folder'
        res = urllib2.urlopen(url).read()
        # get the mail list
        mailList = []
        patt = re.compile(r'<div\s+class="tdLike Ibx_Td_From"[^>]+>.*?href="([^"]+)"[^>]+>(.*?)<\/a>.*?<div\s+class="tdLike Ibx_Td_Subject"[^>]+>.*?href="[^>]+>(.*?)<\/a>', re.I | re.S)
        patt = patt.findall(res)
        if patt is None:
            return mailList
        for i in patt:
            line = {
                'from': i[1].decode('utf8'),
                'url': self.mailBaseUrl + i[0],
                'subobject': i[2].decode('utf8')
            }
            mailList.append(line)
        return mailList

    def getMailMsg(self, url):
        '''
        Download the email content
        '''
        content = ''
        print '\n Download.....%s\n' % (url)
        res = urllib2.urlopen(url).read()
        patt = re.compile(r'contenturl:"([^"]+)"', re.I)
        patt = patt.search(res)
        if patt is None:
            return content
        url = '%s%s' % (self.mailBaseUrl, patt.group(1))
        time.sleep(1)
        res = urllib2.urlopen(url).read()
        Djson = json.JSONDecoder(encoding='utf8')
        jsonRes = Djson.decode(res)
        if 'resulvar' in jsonRes:
            content = Djson.decode(res)['resulvar']
        time.sleep(3)
        return content


'''
Demo
'''
# initialize
mail163 = Email163()
# log on
mail163.login('lpe234@163.com', '123')
time.sleep(2)
# get the inbox
elist = mail163.getInBox()
# get the mail content
for i in elist:
    print 'Subject: %s  From: %s  Content:\n%s' % (
        i['subobject'].encode('utf8'),
        i['from'].encode('utf8'),
        mail163.getMailMsg(i['url']).encode('utf8'))

(5) Sites that require login

#1 handle cookies
import urllib2, cookielib

cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXX').read()

#2 use a proxy as well as cookies
opener = urllib2.build_opener(proxy_support, cookie_support, urllib2.HTTPHandler)

#3 process the login form
import urllib
postdata = urllib.urlencode({
    'username': 'xxxxx',
    'password': 'xxxxx',
    'continueURI': 'http://www.verycd.com/',
    'fk': fk,  # fk is a value taken from the login page
    'login_submit': 'login'
})
req = urllib2.Request(
    url='http://secure.verycd.com/signin/#/http://www.verycd.com/',
    data=postdata
)
result = urllib2.urlopen(req).read()

#4 disguise the request as a browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
req = urllib2.Request(
    url='http://secure.verycd.com/signin/#/http://www.verycd.com/',
    data=postdata,
    headers=headers
)

#5 deal with anti-leeching (set the Referer)
headers = {'Referer': 'http://www.cnbeta.com/articles'}
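The fragments above can be combined into a single login flow: install a cookie-aware opener first, then post the login form with browser-like headers; every later request made through the same opener carries the session cookie automatically. A minimal sketch, where the login URL and the form field names are hypothetical placeholders rather than those of a real site:

#!coding=utf-8
# Combined sketch of #1, #3 and #4: cookie handling + form POST + browser headers.
# login_url, the form fields and the member page URL are hypothetical placeholders.
import urllib
import urllib2
import cookielib

# 1. install an opener that keeps cookies for the whole process
cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)

# 2. post the login form while pretending to be an ordinary browser
login_url = 'http://example.com/login'
postdata = urllib.urlencode({'username': 'xxxxx', 'password': 'xxxxx'})
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
req = urllib2.Request(url=login_url, data=postdata, headers=headers)
urllib2.urlopen(req).read()

# 3. later requests reuse the session cookie automatically
member_page = urllib2.urlopen('http://example.com/member').read()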

(6) Multithreading

from threading import Thread
from Queue import Queue
from time import sleep

# q is the task queue
# NUM is the total number of concurrent threads
# JOBS is the number of jobs to process
q = Queue()
NUM = 2
JOBS = 10

# the processing function, responsible for handling a single task
def do_somthing_using(arguments):
    print arguments

# the worker thread: gets a task from the queue and processes it
def working():
    while True:
        arguments = q.get()
        do_somthing_using(arguments)
        sleep(1)
        q.task_done()

# fork NUM threads waiting on the queue
for i in range(NUM):
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()

# put the JOBS into the queue
for i in range(JOBS):
    q.put(i)

# wait for all JOBS to complete
q.join()
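Applied to crawling, the same worker/queue pattern lets several pages be downloaded concurrently: the queue holds URLs and each worker thread fetches one page at a time. A minimal sketch with placeholder URLs:

#!coding=utf-8
# The worker/queue pattern applied to page fetching; the URL list is a placeholder.
from threading import Thread
from Queue import Queue
import urllib2

q = Queue()
NUM = 2  # number of worker threads

def fetch(url):
    # download one page and report its size
    try:
        html = urllib2.urlopen(url).read()
        print '%s: %d bytes' % (url, len(html))
    except urllib2.URLError, e:
        print '%s failed: %s' % (url, e)

def working():
    while True:
        url = q.get()
        fetch(url)
        q.task_done()

for i in range(NUM):
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()

for url in ['http://example.com/', 'http://example.org/']:
    q.put(url)

q.join()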

Scrapy framework
Scrapy is a fast, high-level screen scraping and web crawling framework written in Python. It is used to crawl web sites and extract structured data from their pages, and is widely applied to data mining, monitoring, and automated testing.

I have only just started learning this framework, so I cannot really judge it yet. My first impression is that it feels a bit Java-like and depends on quite a few supporting modules.

(1) Create a Scrapy project

# Create the project with: scrapy startproject scrapy_test
scrapy_test
├── scrapy.cfg
└── scrapy_test
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

(2) Description of the generated files

scrapy.cfg: the project configuration file
items.py: defines the data structures to be extracted
pipelines.py: pipeline definitions, used to further process the data extracted into items, for example to store it (a minimal pipeline sketch follows this list)
settings.py: the crawler configuration file
spiders: the directory where the spiders are stored
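As an illustration of the pipeline mentioned above, here is a minimal sketch of a pipelines.py that writes every crawled item to a JSON-lines file. The class name and the output file are my own placeholders, and the pipeline still has to be enabled through the ITEM_PIPELINES setting in settings.py.

# pipelines.py -- minimal sketch: write every crawled item to a JSON-lines file.
# The class name and the output file name are illustrative placeholders.
import json

class JsonWriterPipeline(object):
    def __init__(self):
        self.file = open('items.jl', 'wb')

    def process_item(self, item, spider):
        # serialize the item as one JSON line and pass the item on unchanged
        line = json.dumps(dict(item)) + '\n'
        self.file.write(line)
        return item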
(3) Dependent packages

Installing the dependent packages is a bit troublesome.

# install the python-dev package
apt-get install python-dev
# twisted, w3lib, six, queuelib, cssselect, libxslt
pip install w3lib
pip install twisted
pip install lxml
apt-get install libxml2-dev libxslt-dev
apt-get install python-lxml
pip install cssselect
pip install pyOpenSSL
sudo pip install service_identity
# After installation, you can create a project with: scrapy startproject test

(4) A crawling example
(1) Create a Scrapy project

digoal@digoal-pc:~/Python/spit$ scrapy startproject itzhaopin
New Scrapy project 'itzhaopin' created in:
    /home/digoal/Python/spit/itzhaopin

You can start your first spider with:
    cd itzhaopin
    scrapy genspider example example.com

digoal@digoal-pc:~/Python/spit$ cd itzhaopin
digoal@digoal-pc:~/Python/spit/itzhaopin$ tree
.
├── itzhaopin
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg

# scrapy.cfg: the project configuration file
# items.py: defines the data structures to be extracted
# pipelines.py: pipeline definitions, used to further process the data extracted into items, for example to store it
# settings.py: the crawler configuration file
# spiders: the directory where the spiders are placed

(2) Define the data structure to be crawled in items.py

from scrapy.item import Item, Field

# define the data to be captured
class TencentItem(Item):
    name = Field()           # job name
    catalog = Field()        # job category
    workLocation = Field()   # job location
    recruitNumber = Field()  # number of people to recruit
    detailLink = Field()     # job detail link
    publishTime = Field()    # publish time

(3) Implement the Spider class

  • A spider is a Python class that inherits from scrapy.contrib.spiders.CrawlSpider (the simple example below inherits from scrapy.spider.BaseSpider instead) and has three required members:
  • name: the identifier of the spider.
  • start_urls: a list of URLs from which the spider starts crawling.
  • parse(): a method that is called with the downloaded content of each page in start_urls; it parses the page and returns the extracted items and/or the requests for the next pages to crawl (a sketch that does both follows the basic example below).

Create a spider under the spiders directory, tencent_spider.py:

#coding=utf-8
from scrapy.spider import BaseSpider


class DmozSpider(BaseSpider):
    name = 'dmoz'
    allowed_domains = ['dmoz.org']
    start_urls = [
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/',
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/'
    ]

    def parse(self, response):
        # save each page body to a file named after the last path segment
        filename = response.url.split('/')[-2]
        open(filename, 'wb').write(response.body)

This one is simpler. Run the spider with: scrapy crawl dmoz
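To tie this back to the TencentItem defined in items.py, here is a hedged sketch of a spider whose parse() both yields items and follows a pagination link, using the old-style HtmlXPathSelector API that matches the Scrapy version of this article. The start URL and the XPath expressions are placeholders that would have to be adapted to the real pages being crawled.

#coding=utf-8
# Sketch only: the start URL and the XPath expressions are placeholders.
import urlparse

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request

from itzhaopin.items import TencentItem


class TencentSpider(BaseSpider):
    name = 'tencent'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/jobs/?page=1']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        # one item per row of the job listing table (placeholder XPath)
        for row in hxs.select('//table[@class="joblist"]/tr'):
            item = TencentItem()
            item['name'] = row.select('./td[1]/a/text()').extract()
            item['detailLink'] = row.select('./td[1]/a/@href').extract()
            item['catalog'] = row.select('./td[2]/text()').extract()
            item['recruitNumber'] = row.select('./td[3]/text()').extract()
            item['workLocation'] = row.select('./td[4]/text()').extract()
            item['publishTime'] = row.select('./td[5]/text()').extract()
            yield item

        # follow the "next page" link, if there is one (placeholder XPath)
        next_page = hxs.select('//a[@class="next"]/@href').extract()
        if next_page:
            yield Request(urlparse.urljoin(response.url, next_page[0]), callback=self.parse)

Running it with scrapy crawl tencent would then push each extracted item through whatever pipelines are enabled.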

