Python web crawler


I. The problem to be solved

The task is to automatically search Google Scholar for a custom keyword, parse the search results page, and download the PDFs linked from all of the matching papers. Python is used for the implementation.

II. Getting started with Python

Python auto-indent: Shift+Tab shifts a selected block to the left and Tab shifts it to the right. This is handy when restructuring a whole block of code, for example when pulling a block out into a function that can run on its own.

Learn the basics of Python: variables, packages (modules), function definitions, and so on.
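As a minimal sketch of those basics, using only the standard library (the keyword value and the helper function make_query below are purely illustrative, not part of the original article):

import urllib                                # importing a package (module)

keyword = 'text detection'                   # a variable

def make_query(keyword):                     # a function definition
    # build the query-string fragment used later in the Google Scholar URL
    return 'q=' + urllib.quote_plus(keyword)

print make_query(keyword)                    # prints: q=text+detection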

III. Web page knowledge

3.1 The process of browsing a web page

Opening a web page really means that the browser, acting as the "client", sends a request to the server; the server "fetches" the requested file back to the local machine, which then interprets and displays it.

HTML is a markup language: it tags the content so that it can be parsed and distinguished.

The browser's job is to parse the HTML it receives and turn that source code into the web page we actually see.
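A crawler performs exactly this "client" step, only without rendering the result. A minimal sketch with urllib2 (any reachable URL would do; http://www.baidu.com is just an example):

import urllib2

response = urllib2.urlopen('http://www.baidu.com')   # send the request to the server, as a browser would
html = response.read()                               # the file "fetched" to the local machine
print html[:200]                                      # a browser would now parse and render this HTML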

3.2 The concept of URIs and URLs

In simple terms, a URL is the string, such as http://www.baidu.com, that is typed into the browser's address bar.

Before looking at URLs, it helps to first understand the concept of a URI.

A URL is a subset of URIs. It is the abbreviation of Uniform Resource Locator.

In layman's terms, a URL is a string that describes an information resource on the Internet, used mainly by WWW client and server programs.

URLs describe all kinds of information resources in a uniform format, including files, server addresses, and directories.

The general format of a URL is (optional parts in square brackets []):

protocol://hostname[:port]/path/[;parameters][?query]#fragment

The format of the URL consists of three parts:

① The first part is the protocol (or service mode).

② The second part is the host IP address (and sometimes the port number) where the resource is located.

③ The third part is the specific address of the host resource, such as directory and file name.

The first and second parts are separated by the "://" symbol, and the second and third parts are separated by a "/" symbol.

The first two parts are indispensable; the third part can sometimes be omitted.
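These parts can be inspected with the standard library's urlparse module; a small sketch (the sample URL below is made up for illustration):

from urlparse import urlparse

parts = urlparse('http://scholar.google.com:80/scholar?q=text+detection#results')
print parts.scheme       # 'http'                   -> part 1: the protocol
print parts.netloc       # 'scholar.google.com:80'  -> part 2: host (and port)
print parts.path         # '/scholar'               -> part 3: the resource path
print parts.query        # 'q=text+detection'
print parts.fragment     # 'results'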

Reference: http://blog.csdn.net/pleasecallmewhy/article/details/8922826

IV. Web crawler

4.1 Solving the problem that Google cannot be accessed

Because the pages to crawl are on Google Scholar, and Google is blocked in China, GoAgent has to be set up on the machine first, and a proxy is then configured in the code as follows:

import urllib2

proxy = urllib2.ProxyHandler({"http": "http://127.0.0.1:8087", "https": "https://127.0.0.1:8087"})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)

4.2 Getting around crawl blocking

The method above works for a small number of queries, but it is no longer enough if you want to make thousands of them: Google detects where the requests come from, and if a machine is used to scrape its search results frequently, it will soon block the IP and return a 503 error page. One workaround is to set the request headers so as to disguise our user agent. Simply put, the user agent is a string that client applications (browsers, mail clients, search engine spiders) send with every HTTP request, so that the server knows which client is accessing it. Sometimes, to achieve our goal, we have to politely deceive the server and tell it that we are not a machine accessing it.

user_agents = ['Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20130406 Firefox/23.0',
               'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0',
               'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533+ (KHTML, like Gecko) Element Browser 5.0',
               'IBM WebExplorer /v0.94', 'Galaxy/1.0 [en] (Mac OS X 10.5.6; U; en)',
               'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)',
               'Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14',
               'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25',
               'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1468.0 Safari/537.36',
               'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; TheWorld)']

proxy = urllib2.ProxyHandler({"http": "http://127.0.0.1:8087", "https": "https://127.0.0.1:8087"})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)

4.3 Parsing web pages with regular expressions

A regular expression uses a single string to describe and match a whole family of strings that follow a given syntactic rule. For example, the requirement here is a regular expression that matches strings ending with the suffix ".pdf".
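A quick check of that pattern with the re module (the two sample strings are purely illustrative):

import re

pdf_pattern = '.*\.pdf$'        # matches any string that ends in ".pdf"

print bool(re.match(pdf_pattern, 'http://example.org/paper.pdf'))    # True
print bool(re.match(pdf_pattern, 'http://example.org/index.html'))   # False

The crawler below applies the same idea to every line of the Google Scholar results page: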

inputurl = 'http://scholar.google.com/scholar?q=text+detection&btnG=&hl=en&as_sdt=0%2C5'

request = urllib2.Request(inputurl)
index = random.randint(0, 9)             # pick one of the ten user agents at random
user_agent = user_agents[index]
request.add_header('User-Agent', user_agent)
f = urllib2.urlopen(request).read()      # open the page and read its content
print f

localdir = 'E:\\download\\'              # local folder in which to store the downloaded PDF files
urllist = []                             # list used to store the extracted PDF download URLs
for eachline in f.splitlines():          # iterate over every line of the page
    line = eachline.strip()              # strip leading/trailing whitespace (habitual notation)
    if re.match('.*pdf.*', line):        # only the lines containing the string "pdf" are of interest
        wordlist = line.split('"')       # split the line on '"' so that the URL is isolated
        for word in wordlist:            # iterate over each fragment
            if re.match('.*\.pdf$', word):   # keep only the fragments ending in ".pdf", i.e. the URLs
                urllist.append(word)     # store the extracted URL in the list

for everyurl in urllist:                 # iterate over the list: each entry is the URL of one PDF
    worditems = everyurl.split('/')      # split the URL on '/' to extract the PDF file name
    for item in worditems:               # iterate over each fragment
        if re.match('.*\.pdf$', item):   # find the PDF file name
            pdfname = item               # the PDF file name has been found
    localpdf = localdir + pdfname        # join the local directory and the PDF file name
    try:
        urllib.urlretrieve(everyurl, localpdf)   # download from the URL and store it locally under that name
    except Exception, e:
        continue

V. Problems encountered and solutions

5.1 Using the HTTP protocol

When opening Google, the page content could not be fetched over HTTPS; only the HTTP protocol worked. The reason may be the encryption that HTTPS adds. The complete script follows:

# -*- coding: utf-8 -*-
"""
Created on Fri Feb 13 16:27:02 2015

@author: Dwanminghuang
"""
import urllib                            # urllib module, used for urlretrieve
import urllib2                           # urllib2 module, used for requests and proxy handling
import re, random, types                 # regular expression module (re) plus helpers

user_agents = ['Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20130406 Firefox/23.0',
               'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0',
               'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533+ (KHTML, like Gecko) Element Browser 5.0',
               'IBM WebExplorer /v0.94', 'Galaxy/1.0 [en] (Mac OS X 10.5.6; U; en)',
               'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)',
               'Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14',
               'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25',
               'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1468.0 Safari/537.36',
               'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; TheWorld)']

proxy = urllib2.ProxyHandler({"http": "http://127.0.0.1:8087", "https": "https://127.0.0.1:8087"})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)

inputurl = 'http://scholar.google.com/scholar?q=text+detection&btnG=&hl=en&as_sdt=0%2C5'

request = urllib2.Request(inputurl)
index = random.randint(0, 9)             # pick one of the ten user agents at random
user_agent = user_agents[index]
request.add_header('User-Agent', user_agent)
f = urllib2.urlopen(request).read()      # open the page and read its content
print f

localdir = 'E:\\download\\'              # local folder in which to store the downloaded PDF files
urllist = []                             # list used to store the extracted PDF download URLs
for eachline in f.splitlines():          # iterate over every line of the page
    line = eachline.strip()              # strip leading/trailing whitespace (habitual notation)
    if re.match('.*pdf.*', line):        # only the lines containing the string "pdf" are of interest
        wordlist = line.split('"')       # split the line on '"' so that the URL is isolated
        for word in wordlist:            # iterate over each fragment
            if re.match('.*\.pdf$', word):   # keep only the fragments ending in ".pdf", i.e. the URLs
                urllist.append(word)     # store the extracted URL in the list

for everyurl in urllist:                 # iterate over the list: each entry is the URL of one PDF
    worditems = everyurl.split('/')      # split the URL on '/' to extract the PDF file name
    for item in worditems:               # iterate over each fragment
        if re.match('.*\.pdf$', item):   # find the PDF file name
            pdfname = item               # the PDF file name has been found
    localpdf = localdir + pdfname        # join the local directory and the PDF file name
    try:
        urllib.urlretrieve(everyurl, localpdf)   # download from the URL and store it locally under that name
    except Exception, e:
        continue
