I. The problem to be solved
The task is to automatically query Google Scholar with a custom keyword, parse the search result pages, and download the PDF linked from each matching paper. The solution is implemented in Python.
II. Getting started with Python
Python auto-indent: select a block of code and press Shift+Tab to move the whole block left, or Tab to move it right. This is handy when reworking an entire block, for example when moving code into or out of a function.
Also learn the basics of Python: variables, packages, function definitions, and so on.
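As a minimal sketch of those basics (the names radius and circle_area are purely illustrative, not part of the crawler):

import math                # a package (module) is pulled in with import

radius = 2.0               # a variable; no type declaration is needed

def circle_area(r):        # a function definition; the body is marked by indentation
    return math.pi * r * r

print circle_area(radius)  # prints 12.566...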
III. Web page knowledge
3.1 The process of browsing the web
Opening a web page is really the browser, acting as the "client", sending a request to the server side; the server "fetches" the requested files back to the local machine, which then interprets and displays them.
HTML is a markup language: it tags content so that the content can be parsed and distinguished.
The browser's job is to parse the HTML code it receives and turn that raw code into the site page we actually see.
3.2 The concept of URIs and URLs
In simple terms, a URL is a string such as http://www.baidu.com that you enter in the browser's address bar.
Before looking at URLs, it helps to first understand the concept of a URI.
A URL is a subset of URIs. It is the abbreviation of Uniform Resource Locator.
In layman's terms, URLs are strings that describe information resources on the Internet, used primarily by WWW client and server programs.
A URL describes all kinds of information resources in one uniform format, including files, server addresses, and directories.
The general format of a URL is (optional parts in square brackets []):
protocol://hostname[:port]/path/[;parameters][?query]#fragment
The format of a URL consists of three parts:
① The first part is the protocol (or service mode).
② The second part is the host IP address (sometimes with a port number) where the resource is stored.
③ The third part is the specific address of the resource on the host, such as a directory and file name.
The first and second parts are separated by the "://" symbol, and the second and third parts by a "/" symbol.
The first and second parts are indispensable; the third part can sometimes be omitted.
Reference: http://blog.csdn.net/pleasecallmewhy/article/details/8922826
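To see the three parts concretely, here is a minimal sketch using the standard urlparse module from Python 2 (the example URL is just an illustration):

from urlparse import urlparse  # Python 2 module; in Python 3 this lives in urllib.parse

parts = urlparse('http://scholar.google.com/scholar?q=text+detection&hl=en')
print parts.scheme  # 'http'               -> part 1, the protocol
print parts.netloc  # 'scholar.google.com' -> part 2, the host
print parts.path    # '/scholar'           -> part 3, the resource path
print parts.query   # 'q=text+detection&hl=en'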
IV. Web crawler
4.1 Working around blocked access to Google
Because we need to crawl Google Scholar pages but Google is blocked in mainland China, first configure GoAgent on the machine, then set up the proxy in code as follows:
proxy = urllib2.ProxyHandler({"http": "http://127.0.0.1:8087", "https": "https://127.0.0.1:8087"})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
4.2 Working around crawler blocking
The method above works for a small number of queries, but if you want to make thousands of them it is no longer enough: Google detects the source of the requests, and if a machine crawls its search results frequently, Google will soon block the IP and return a 503 error page. One countermeasure is to set the request's User-Agent header to disguise the client. Simply put, the user agent is a string defined by the HTTP protocol that applications such as browsers (mail clients, search-engine spiders) send to the server with every HTTP request, telling the server what kind of client is accessing it. Sometimes, to achieve our goal, we have to tell the server a well-meaning lie: that it is not a machine that is accessing it.
user_agents = ['Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20130406 Firefox/23.0',
               'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0',
               'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533+ (KHTML, like Gecko) Element Browser 5.0',
               'IBM WebExplorer /v0.94',
               'Galaxy/1.0 [en] (Mac OS X 10.5.6; U; en)',
               'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)',
               'Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14',
               'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25',
               'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1468.0 Safari/537.36',
               'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; TheWorld)']
proxy = urllib2.ProxyHandler({"http": "http://127.0.0.1:8087", "https": "https://127.0.0.1:8087"})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
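With the list in place, each request can carry a different disguise. A minimal sketch of attaching one (assuming the user_agents list above is in scope; random.choice avoids hard-coding the list length, and the query URL here is just an illustration):

import random
import urllib2

request = urllib2.Request('http://scholar.google.com/scholar?q=text+detection')
request.add_header('User-Agent', random.choice(user_agents))  # pick a user agent at random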
4.3 Parsing web pages with regular expressions
A regular expression uses a single string to describe and match a whole family of strings that follow a certain syntactic rule. For example, the requirement here is a regular expression that matches strings ending with the suffix ".pdf".
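A quick check of such a pattern on two made-up strings:

import re

pattern = '.*\.pdf$'  # any characters, then a literal ".pdf" at the end of the string
print re.match(pattern, 'paper.pdf') is not None       # True: ends in ".pdf"
print re.match(pattern, 'paper.pdf.html') is not None  # False: the suffix is ".html"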
inputUrl = 'http://scholar.google.com/scholar?q=text+detection&btnG=&hl=en&as_sdt=0%2C5'
request = urllib2.Request(inputUrl)
index = random.randint(0, 9)                 # pick one of the ten user agents at random
user_agent = user_agents[index]
request.add_header('User-Agent', user_agent)
f = urllib2.urlopen(request).read()          # open the web page and read its source
print f

localDir = 'E:\\download\\'                  # local folder in which to store the downloaded PDFs
urlList = []                                 # list for the extracted PDF download URLs

for eachLine in f.splitlines():              # walk through every line of the page source
    line = eachLine.strip()                  # strip leading/trailing whitespace (a good habit)
    if re.match('.*pdf.*', line):            # keep only lines containing the string "pdf"
        wordList = line.split('"')           # split on double quotes so each URL becomes its own piece
        for word in wordList:                # walk through the pieces
            if re.match('.*\.pdf$', word):   # keep only pieces ending in ".pdf", i.e. the PDF URLs
                urlList.append(word)         # store the extracted URL in the list

for everyUrl in urlList:                     # each list item is the URL of one PDF
    wordItems = everyUrl.split('/')          # split the URL on "/" to extract the PDF file name
    for item in wordItems:                   # walk through the pieces
        if re.match('.*\.pdf$', item):       # the piece ending in ".pdf" is the file name
            pdfName = item                   # PDF file name found
    localPDF = localDir + pdfName            # join the local directory and the PDF file name
    try:
        urllib.urlretrieve(everyUrl, localPDF)  # download from the URL and save under that name
    except Exception, e:
        continue
V. Problems encountered and solutions
5.1 Use the HTTP protocol
When opening Google Scholar, fetching the page over the HTTPS protocol returned no content; only the HTTP protocol worked. The likely reason is that HTTPS adds an encryption layer that this proxy setup does not handle.
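In code this simply means keeping the http:// scheme in the query URL. A minimal illustration (same URL as in the full listing below):

# Works through the local GoAgent proxy: plain HTTP.
inputUrl = 'http://scholar.google.com/scholar?q=text+detection&btnG=&hl=en&as_sdt=0%2C5'
# May return no content in this setup: HTTPS adds TLS encryption on top.
# inputUrl = 'https://scholar.google.com/scholar?q=text+detection&btnG=&hl=en&as_sdt=0%2C5'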
# -*- coding: utf-8 -*-
"""
Created on Fri Feb 13 16:27:02 2015
@author: Dwanminghuang
"""
import urllib    # urllib module, for urlretrieve
import urllib2   # urllib2 module, for building and opening requests
import re        # regular expression module
import random    # random module, for picking a user agent
user_agents = ['Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20130406 Firefox/23.0',
               'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0',
               'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533+ (KHTML, like Gecko) Element Browser 5.0',
               'IBM WebExplorer /v0.94',
               'Galaxy/1.0 [en] (Mac OS X 10.5.6; U; en)',
               'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)',
               'Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14',
               'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25',
               'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1468.0 Safari/537.36',
               'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; TheWorld)']
proxy = urllib2.ProxyHandler({"http": "http://127.0.0.1:8087", "https": "https://127.0.0.1:8087"})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
inputUrl = 'http://scholar.google.com/scholar?q=text+detection&btnG=&hl=en&as_sdt=0%2C5'
request = urllib2.Request(inputUrl)
index = random.randint(0, 9)                 # pick one of the ten user agents at random
user_agent = user_agents[index]
request.add_header('User-Agent', user_agent)
f = urllib2.urlopen(request).read()          # open the web page and read its source
print f

localDir = 'E:\\download\\'                  # local folder in which to store the downloaded PDFs
urlList = []                                 # list for the extracted PDF download URLs

for eachLine in f.splitlines():              # walk through every line of the page source
    line = eachLine.strip()                  # strip leading/trailing whitespace (a good habit)
    if re.match('.*pdf.*', line):            # keep only lines containing the string "pdf"
        wordList = line.split('"')           # split on double quotes so each URL becomes its own piece
        for word in wordList:                # walk through the pieces
            if re.match('.*\.pdf$', word):   # keep only pieces ending in ".pdf", i.e. the PDF URLs
                urlList.append(word)         # store the extracted URL in the list

for everyUrl in urlList:                     # each list item is the URL of one PDF
    wordItems = everyUrl.split('/')          # split the URL on "/" to extract the PDF file name
    for item in wordItems:                   # walk through the pieces
        if re.match('.*\.pdf$', item):       # the piece ending in ".pdf" is the file name
            pdfName = item                   # PDF file name found
    localPDF = localDir + pdfName            # join the local directory and the PDF file name
    try:
        urllib.urlretrieve(everyUrl, localPDF)  # download from the URL and save under that name
    except Exception, e:
        continue