Writing web crawlers in Python: a usage guide for the basic modules and the Scrapy framework

Source: Internet
Author: User


Basic module
A Python crawler (web spider) crawls websites, fetches web pages, and parses them to extract data.

The basic approach uses the urllib, urllib2, re, and other standard modules.

Basic usage examples:

(1) Perform a basic GET request to obtain the HTML

#!coding=utf-8
import urllib
import urllib2

url = 'http://example.com'  # placeholder URL

# build the GET request
request = urllib2.Request(url)
try:
    # send the request and get the response
    response = urllib2.urlopen(request)
except urllib2.HTTPError, e:
    if hasattr(e, 'reason'):
        print e.reason

# read the response body
html = response.read()
# read the response headers
headers = response.info()


(2) Form submission

#!coding=utf-8
import urllib2
import urllib

post_url = ''
post_data = urllib.urlencode({
    'username': 'username',
    'password': 'password',
})
post_headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:31.0) Gecko/20100101 Firefox/31.0',
}
request = urllib2.Request(
    url=post_url,
    data=post_data,
    headers=post_headers,
)
response = urllib2.urlopen(request)
html = response.read()

(3) Extract post content from a Baidu Tieba thread with a regular expression

#!coding=utf-8
import urllib2
import re

page_num = 1
url = 'http://tieba.baidu.com/p/3238280985?see_lz=1&pn=' + str(page_num)
myPage = urllib2.urlopen(url).read().decode('gbk')

# match the post content blocks
myRe = re.compile(r'class="d_post_content j_d_post_content ">(.*?)</div>', re.DOTALL)
items = myRe.findall(myPage)

f = open('baidu.txt', 'a+')

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

i = 0
texts = []
for item in items:
    i += 1
    print i
    text = item.replace('<br>', '')
    text = text.replace('\n', '').replace(' ', '') + '\n'
    print text
    f.write(text)

f.close()

(4) Simulate login to the 163 mailbox and download mail content

# coding: utf-8
'''
Simulate login to the 163 mailbox and download the mail content
'''
import urllib
import urllib2
import cookielib
import re
import time
import json


class Email163:
    header = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
    user = ''
    cookie = None
    sid = None
    mailBaseUrl = 'http://twebmail.mail.163.com'

    def __init__(self):
        # install a global opener that keeps the session cookies
        self.cookie = cookielib.CookieJar()
        cookiePro = urllib2.HTTPCookieProcessor(self.cookie)
        urllib2.install_opener(urllib2.build_opener(cookiePro))

    def login(self, user, pwd):
        '''Log in'''
        postdata = urllib.urlencode({
            'username': user,
            'password': pwd,
            'type': 1
        })
        # note that the login URL is different from the mailbox URL
        req = urllib2.Request(
            url='https://ssl.mail.163.com/entry/coremail/fcg/ntesdoor2?funcid=loginone&language=-1&passtype=1&iframe=1&product=mail163&from=web&df=email163&race=-2_45_-2_hz&module=&uid=' + user + '&style=10&net=t&skinid=null',
            data=postdata,
            headers=self.header,
        )
        res = str(urllib2.urlopen(req).read())
        # print res
        patt = re.compile('sid=([^"]+)', re.I)
        patt = patt.search(res)

        uname = user.split('@')[0]
        self.user = user
        if patt:
            self.sid = patt.group(1).strip()
            # print self.sid
            print '%s Login Successful.....' % (uname)
        else:
            print '%s Login failed....' % (uname)

    def getInBox(self):
        '''Get the mailbox list'''
        print '\nGet mail lists.....\n'
        sid = self.sid
        url = self.mailBaseUrl + '/jy3/list.do?sid=' + sid + '&fid=1&fr=folder'
        res = urllib2.urlopen(url).read()
        # get the mail list
        mailList = []
        patt = re.compile('<div\s+class="tdLike Ibx_Td_From"[^>]+>.*?href="([^"]+)"[^>]+>(.*?)<\/a>.*?<div\s+class="tdLike Ibx_Td_Subject"[^>]+>.*?href="[^>]+>(.*?)<\/a>', re.I | re.S)
        patt = patt.findall(res)
        if patt == None:
            return mailList

        for i in patt:
            line = {
                'from': i[1].decode('utf8'),
                'url': self.mailBaseUrl + i[0],
                'subject': i[2].decode('utf8')
            }
            mailList.append(line)
        return mailList

    def getMailMsg(self, url):
        '''Download email content'''
        content = ''
        print '\n Download.....%s\n' % (url)
        res = urllib2.urlopen(url).read()

        patt = re.compile('contenturl:"([^"]+)"', re.I)
        patt = patt.search(res)
        if patt == None:
            return content
        url = '%s%s' % (self.mailBaseUrl, patt.group(1))
        time.sleep(1)
        res = urllib2.urlopen(url).read()
        Djson = json.JSONDecoder(encoding='utf8')
        jsonRes = Djson.decode(res)
        if 'resulvar' in jsonRes:
            content = Djson.decode(res)['resulvar']
        time.sleep(3)
        return content


'''
Demo
'''
# initialize
mail163 = Email163()
# log in
mail163.login('lpe234@163.com', '123')
time.sleep(2)

# get the inbox list
elist = mail163.getInBox()

# get the mail content
for i in elist:
    print 'Subject: %s   From: %s   Content:\n%s' % (i['subject'].encode('utf8'), i['from'].encode('utf8'), mail163.getMailMsg(i['url']).encode('utf8'))

(5) Sites that require login

#1 use cookies
import urllib2, cookielib

cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXX').read()

#2 use a proxy together with cookies
# proxy_support = urllib2.ProxyHandler({'http': 'http://XX.XX.XX.XX:XXXX'})  # build the proxy handler first
opener = urllib2.build_opener(proxy_support, cookie_support, urllib2.HTTPHandler)

#3 process the form
import urllib
postdata = urllib.urlencode({
    'username': 'XXXXX',
    'password': 'XXXXX',
    'continueURI': 'http://www.verycd.com/',
    'fk': fk,
    'login_submit': 'login'
})
req = urllib2.Request(
    url='http://secure.verycd.com/signin/#/http://www.verycd.com/',  # the signin URL, as in #4
    data=postdata
)
result = urllib2.urlopen(req).read()

#4 disguise as a browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
req = urllib2.Request(
    url='http://secure.verycd.com/signin/#/http://www.verycd.com/',
    data=postdata,
    headers=headers
)

#5 anti-leeching: set the Referer header
headers = {'Referer': 'http://www.cnbeta.com/articles'}

(6) Multithreading

from threading import Thread
from Queue import Queue
from time import sleep

# q is the task queue
# NUM is the total number of concurrent worker threads
# JOBS is the number of jobs
q = Queue()
NUM = 2
JOBS = 10

# the processing function, responsible for a single task
def do_somthing_using(arguments):
    print arguments

# a worker thread: gets a task from the queue and processes it
def working():
    while True:
        arguments = q.get()
        do_somthing_using(arguments)
        sleep(1)
        q.task_done()

# fork NUM threads waiting on the queue
for i in range(NUM):
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()

# put the JOBS into the queue
for i in range(JOBS):
    q.put(i)

# wait for all JOBS to be done
q.join()
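In the snippet above, do_somthing_using() only prints its argument. In a crawler the queued task would typically be a URL to fetch; the variant below is a minimal sketch under that assumption (the URL list is hypothetical) that swaps in a download function and enqueues URLs instead of integers, while the worker threads and q.join() stay the same.

import urllib2

# the task is now a URL to download
def do_somthing_using(url):
    try:
        html = urllib2.urlopen(url).read()
        print '%s -> %d bytes' % (url, len(html))
    except urllib2.URLError, e:
        # print the error instead of raising, so the worker thread keeps running
        print '%s failed: %s' % (url, e)

# enqueue URLs instead of integers (hypothetical URL list)
for u in ['http://example.com/page/%d' % i for i in range(JOBS)]:
    q.put(u)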

Scrapy framework
The Scrapy framework is a fast, high-level screen-scraping and web-crawling framework written in Python. It is used to crawl web sites and extract structured data from their pages, and is widely applied to data mining, monitoring, and automated testing.

I have only just started learning this framework, so I won't offer much of a judgement on it; my impression is that it feels somewhat Java-like and depends on quite a few other modules.

(1) Create a Scrapy project

# Create a project with: scrapy startproject scrapy_test
scrapy_test
├── scrapy.cfg
└── scrapy_test
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

(2) Description

scrapy.cfg: the project configuration file
items.py: defines the data structures to be extracted
pipelines.py: pipeline definitions, used to further process the data extracted into items, such as storing it (a minimal sketch follows this list)
settings.py: the crawler configuration file
spiders: the directory where spiders are stored
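To make the pipeline idea concrete, here is a minimal sketch of a pipelines.py that writes each scraped item to a file as one JSON object per line. The class name and the output filename are my own choices for illustration, not part of the generated project, and the pipeline still has to be enabled through the ITEM_PIPELINES setting in settings.py.

# pipelines.py -- minimal illustrative sketch (class name and output file are assumptions)
import json

class JsonWriterPipeline(object):
    def __init__(self):
        # one JSON object per line will be written to this file
        self.file = open('items.jl', 'w')

    def process_item(self, item, spider):
        # convert the Item to a plain dict and write it out
        self.file.write(json.dumps(dict(item)) + '\n')
        return item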
(3) Dependencies

Installing the dependent packages is a bit troublesome.

# install the python-dev package
apt-get install python-dev

# twisted, w3lib, six, queuelib, cssselect, libxslt
pip install w3lib
pip install twisted
pip install lxml
apt-get install libxml2-dev libxslt-dev
apt-get install python-lxml
pip install cssselect
pip install pyOpenSSL
sudo pip install service_identity

# after installation, you can create a project with: scrapy startproject test

(4) A crawling example
(1) Create a Scrapy project

digoal@digoal-pc:~/Python/spit$ scrapy startproject itzhaopin
New Scrapy project 'itzhaopin' created in:
    /home/digoal/Python/spit/itzhaopin

You can start your first spider:
    cd itzhaopin
    scrapy genspider example example.com
digoal@digoal-pc:~/Python/spit$ cd itzhaopin
digoal@digoal-pc:~/Python/spit/itzhaopin$ tree .
.
├── itzhaopin
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg

# scrapy.cfg: the project configuration file
# items.py: defines the data structures to be extracted
# pipelines.py: pipeline definitions, used to further process the data extracted into items, such as saving it
# settings.py: the crawler configuration file
# spiders: the directory where spiders are placed

(2) Define the data structure to be extracted in items.py

from scrapy.item import Item, Field

# define the data to be extracted
class TencentItem(Item):
    name = Field()           # job name
    catalog = Field()        # job category
    workLocation = Field()   # job location
    recruitNumber = Field()  # number of people to recruit
    detailLink = Field()     # link to the job details
    publishTime = Field()    # publish time

(3) Implement the spider class

  • A spider is a Python class that inherits from scrapy.contrib.spiders.CrawlSpider and has three required members.
  • name: the identifier of the spider.
  • start_urls: a list of URLs from which the spider starts crawling.
  • parse(): a method. After each page in start_urls has been downloaded, this method is called to parse the page content; it returns the extracted items and/or the next pages to crawl.

Create a spider under the spiders directory, tencent_spider.py:

#coding=utf-8

from scrapy.spider import BaseSpider


class DmozSpider(BaseSpider):
    name = 'dmoz'
    allowed_domains = ['dmoz.org']
    start_urls = [
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/',
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/'
    ]

    def parse(self, response):
        # save the raw page under a file named after the last path segment
        filename = response.url.split('/')[-2]
        open(filename, 'wb').write(response.body)

This one is simpler. Run the spider with: scrapy crawl dmoz
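The DmozSpider above just saves raw pages and does not use the TencentItem defined earlier. As a rough sketch of how the two pieces could be connected, the spider below yields TencentItem objects from its parse() method; the class name, the start URL, and the XPath expressions are assumptions for illustration, not taken from the original project, while the old-style HtmlXPathSelector API matches the Scrapy version used elsewhere in this article.

#coding=utf-8
# tencent_spider.py -- illustrative sketch only; the start URL and XPaths are assumed

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from itzhaopin.items import TencentItem


class TencentSpider(BaseSpider):
    name = 'tencent'
    allowed_domains = ['hr.tencent.com']
    start_urls = ['http://hr.tencent.com/position.php']  # hypothetical start URL

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # one item per row of the (assumed) job listing table
        for row in hxs.select('//table[@class="tablelist"]/tr[@class="even" or @class="odd"]'):
            item = TencentItem()
            item['name'] = row.select('./td[1]/a/text()').extract()
            item['detailLink'] = row.select('./td[1]/a/@href').extract()
            item['catalog'] = row.select('./td[2]/text()').extract()
            item['recruitNumber'] = row.select('./td[3]/text()').extract()
            item['workLocation'] = row.select('./td[4]/text()').extract()
            item['publishTime'] = row.select('./td[5]/text()').extract()
            yield item

It would be run the same way, with scrapy crawl tencent, and the items it yields would pass through any pipelines enabled in settings.py.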

