Using Python for a Simple Web Crawler


Overview:

This is a simple crawler with an equally simple job: given a URL, it fetches that page, extracts the URL addresses that meet certain requirements, and puts those addresses in a queue. Once the given page has been captured, each URL in the queue is taken in turn as a parameter and the program crawls that page's data as well. It stops once it reaches a certain depth (specified by a parameter). The captured web page data is saved locally; I use a MySQL database for this. Let's start from the beginning.
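Before looking at the full program, here is a minimal sketch of that level-by-level idea, just to make the flow clear. It is not the author's code: fetch_links is a hypothetical stand-in for the real link extraction, which the program below performs with urllib2, a regular expression, and BeautifulSoup, saving each page to MySQL along the way.

# Minimal sketch of the level-by-level crawl described above (illustration only).
# fetch_links(url) is a hypothetical helper returning the links found on a page.
def crawl(start_url, depth, fetch_links):
    seen = set([start_url])
    current_level = [start_url]
    for _ in range(depth):
        next_level = []
        for url in current_level:
            for link in fetch_links(url):      # candidate links on this page
                if link not in seen:           # skip anything already queued
                    seen.add(link)
                    next_level.append(link)
        current_level = next_level             # newly found URLs form the next level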

Create a database:

Start mysql and create a database

create database sp character set utf8;


Create a table with three fields: the URL, the original HTML code, and the text left after removing the HTML tags. The third field exists to make future searches more efficient.

use sp;

create table webdata (url longtext, html longtext, puredata longtext);
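If you would rather do this from Python than from the mysql client, a snippet along the following lines should work; it assumes the same connection settings the crawler itself uses (root user, socket at /tmp/mysql.sock):

# Optional: create the database and table from Python instead of the mysql client.
# Connection settings match the ones used by the crawler below.
import MySQLdb

conn = MySQLdb.connect(user='root', unix_socket='/tmp/mysql.sock')
cur = conn.cursor()
cur.execute('create database if not exists sp character set utf8')
cur.execute('use sp')
cur.execute('create table if not exists webdata '
            '(url longtext, html longtext, puredata longtext)')
conn.commit()
conn.close()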

With the database ready, we can start writing the code.

Python program:

I will not explain the program in too much detail; the key parts are annotated with comments. The program's parameters are described as follows:


-u: the URL to start crawling from.

-d: capture depth, the number of layers of pages crawled along links. Each additional layer grows the number of pages roughly exponentially. The default is 2.

-t: number of concurrent threads. The default is 10.

-o: timeout value, the timeout threshold for urlopen. The default is 20 seconds.

-l: path and name of the log file. The default is logSpider.log in the current directory.

-v: log verbosity. It accepts three values; the default is normal.

simple records only the error messages.

normal records the error messages plus status information while the program runs.

all records everything, including every crawled URL.
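A few illustrative invocations (the flag combinations here are made up for the example; the script name spider.py and the test site are the ones used later in this article):

python spider.py -u www.chinaunix.net                            # all defaults: depth 2, 10 threads, 20 s timeout
python spider.py -u www.chinaunix.net -d 3 -t 15 -o 10           # deeper crawl, more threads, shorter timeout
python spider.py -u www.chinaunix.net -v all -l /tmp/spider.log  # log every crawled URL to a custom file

Note that the program prepends http:// itself if the URL is given without a scheme.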

The default value of timeout for each system is described as follows:

BSD: 75 seconds
Linux: 189 seconds
Solaris: 225 seconds
Windows XP: 22 seconds
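The -o option sidesteps these per-system defaults by setting one global timeout before any page is fetched. Stripped down to its essence (the same call appears in the full listing below), it is just:

# How the -o option takes effect: one global call before any urlopen(),
# so every subsequent socket operation (including urllib2.urlopen) uses it.
import socket

socket.setdefaulttimeout(20)   # seconds; the crawler passes the -o value here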

Modify the mysql configuration file:

During my experiments I found that the program runs smoothly with a capture depth of 2, but when the depth is raised to 3 a "2006: MySQL server has gone away" error appears. Following advice I found online, I solved the problem by editing the MySQL configuration file: change max_allowed_packet = 1M to max_allowed_packet = 16M.
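On most installations this setting lives in the [mysqld] section of the MySQL configuration file (my.cnf or my.ini, depending on the system); after editing it, restart the MySQL server so the new value takes effect:

[mysqld]
max_allowed_packet = 16M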

Run the program with these parameters:

python spider.py -u http://www.chinaunix.net -d 3 -t 15 -o 10

After running for 26 minutes, the program successfully captured 4346 pages and produced a 35 KB log, which works out to about 2.8 pages per second on average. The most common log entry is that a website could not be opened.

Okay, here is the code:

# -*- coding: utf-8 -*-
from re import search
import urllib2
import MySQLdb
from BeautifulSoup import BeautifulSoup
import threading
from datetime import datetime
from optparse import OptionParser
import sys
import logging
import socket
from urlparse import urlparse
import httplib

# URLS maps each crawl level to the list of URLs collected for that level
URLS = {}
lock = threading.Lock()

class newThread(threading.Thread):
    def __init__(self, level, url, db):
        threading.Thread.__init__(self)
        self.level = level
        self.url = url
        self.db = db

    def run(self):
        global lock
        global log
        for i in self.url:
            log.debug('%s: %s' % (datetime.now(), i))
            print i
            temp, html, data = getURL(i)
            # If the url cannot be opened (the returned status code is not 200),
            # discard this url and move on to the next one
            if not temp:
                continue
            # Acquire the lock so that this thread can safely update shared data
            if lock.acquire():
                self.db.save(i, html, data)
                # Every thread appends the URLs it collected to the shared URLS dict;
                # duplicates are removed later in the main thread.
                URLS[self.level].extend(temp)
                lock.release()

class saveData():
    def __init__(self):
        self.db = MySQLdb.connect(user='root', db='sp', unix_socket='/tmp/mysql.sock')
        self.cur = self.db.cursor()
        # Start with an empty table on every run
        self.cur.execute('delete from webdata')
        self.commit()
        log.info('%s: Connect database success' % datetime.now())

    def save(self, url, html, pureData):
        global log
        # Plain string interpolation is fragile; getURL() has already replaced the
        # single quotes in the html with double quotes so the statement is not broken.
        SQL = '''insert into webdata values ('%s', '%s', '%s')''' % (url, html, pureData)
        try:
            self.cur.execute(SQL)
        except (MySQLdb.ProgrammingError, MySQLdb.OperationalError), e:
            log.error('%s: %s' % (datetime.now(), e))
            return
        self.commit()

    def commit(self):
        self.db.commit()

    def close(self):
        self.db.close()

def getURL(url):
    URLS = []
    global log
    global source
    global domainName
    try:
        page = urllib2.urlopen(url)
    except (urllib2.URLError, httplib.BadStatusLine):
        log.error('%s: url can not open ---- %s' % (datetime.now(), url))
        return ('', '', '')
    else:
        if page.code == 200:
            try:
                html = page.read().decode('gbk', 'ignore').encode('utf-8')
            except:
                log.error('%s: time out ---- %s' % (datetime.now(), url))
                print 'time out'
                return ('', '', '')
        else:
            log.error('%s: response code is not 200 ---- %s' % (datetime.now(), url))
            return ('', '', '')
    html = html.replace("'", '"')
    # Extract the text that is left after removing the HTML elements
    try:
        pureData = ''.join(BeautifulSoup(html).findAll(text=True)).encode('utf-8')
    except UnicodeEncodeError:
        pureData = html
    # The code below searches the page for url addresses that meet the criteria
    rawHtml = html.split('\n')
    for i in rawHtml:
        times = i.count('</a>')
        if times:
            for y in range(times):
                pos = i.find('</a>')
                if pos != -1:
                    # Look for the <a> tag and extract the link; after the
                    # replace() above, all links are wrapped in double quotes.
                    newURL = search('<a href=".+"', i[:pos])
                    if newURL is not None:
                        newURL = newURL.group().split(' ')[1][6:-1]
                        if '">' in newURL:
                            newURL = search('.+">', newURL)
                            if newURL is None:
                                continue
                            newURL = newURL.group()[:-2]
                        # If the address is blank, go on to the next loop
                        if not newURL:
                            continue
                        # Convert relative addresses to absolute addresses
                        if not newURL.startswith('http'):
                            if newURL[0] == '/':
                                newURL = source + newURL
                            else:
                                newURL = source + '/' + newURL
                        # Skip URLs outside the target domain, duplicates and the page itself
                        if domainName not in newURL or newURL in URLS or newURL == url or newURL == url + '/':
                            continue
                        URLS.append(newURL)
                    # Move past the processed anchor and look for the next one
                    i = i[pos + 4:]
    return (URLS, html, pureData)

if __name__ == '__main__':
    USAGE = '''
spider -u [url] -d [num] -t [num] -o [secs] -l [filename] -v [level]

-u: url of a website
-d: the depth the spider will get into. default is 2
-t: how many threads work at the same time. default is 10
-o: url request timeout. default is 20 secs
-l: assign the logfile name and location. default name is 'logSpider.log'
-v: values are 'simple' 'normal' 'all'. default is 'normal'
    'simple' ---- only log the error messages
    'normal' ---- error messages and some additional messages
    'all'    ---- not only messages, but also the crawled urls

Examples:
    spider -u http://www.chinaunix.net -t 16 -v normal
'''
    LEVELS = {'simple': logging.WARNING,
              'normal': logging.INFO,
              'all': logging.DEBUG}

    opt = OptionParser(USAGE)
    opt.add_option('-u', type='string', dest='url')
    opt.add_option('-d', type='int', dest='level', default=2)
    opt.add_option('-t', type='int', dest='nums', default=10)
    opt.add_option('-o', type='int', dest='out', default=20)
    opt.add_option('-l', type='string', dest='name', default='logSpider.log')
    opt.add_option('-v', type='string', dest='logType', default='normal')

    options, args = opt.parse_args(sys.argv)
    source = options.url
    level = options.level
    threadNums = options.nums
    timeout = options.out
    logfile = options.name
    logType = options.logType

    if not source or level < 0 or threadNums < 1 or timeout < 1 or logType not in LEVELS.keys():
        opt.print_help()
        sys.exit(1)
    if not source.startswith('http://'):
        source = 'http://' + source
    if source.endswith('/'):
        source = source[:-1]

    # Take the second-level domain as the "domain name" used to filter links;
    # skip common public suffixes such as com.cn, net.cn and so on.
    domainName = urlparse(source)[1].split('.')[-2]
    if domainName in ['com', 'edu', 'net', 'org', 'gov', 'info', 'cn']:
        domainName = urlparse(source)[1].split('.')[-3]

    socket.setdefaulttimeout(timeout)

    log = logging.getLogger()
    handler = logging.FileHandler(logfile)
    log.addHandler(handler)
    log.setLevel(LEVELS[logType])

    startTime = datetime.now()
    log.info('started at %s' % startTime)

    subURLS = {}
    threads = []
    for i in range(level + 1):
        URLS[i] = []

    # Initialization -- connect to the database
    db = saveData()

    # Fetch the urls on the homepage
    URLS[0], html, pureData = getURL(source)
    if not URLS[0]:
        log.error('could not open %s' % source)
        print 'cannot open ' + source
        sys.exit(1)
    db.save(source, html, pureData)

    for le in range(level):
        # Split the current level's URL list into small lists, one per thread
        nowL = '------------- level %d ------------' % (le + 1)
        print nowL
        log.info(nowL)
        preNums = len(URLS[le]) / threadNums
        for i in range(threadNums):
            temp = URLS[le][:preNums]
            if i == threadNums - 1:
                # The last thread takes whatever is left over
                subURLS[i] = URLS[le]
            else:
                subURLS[i] = temp
            URLS[le] = URLS[le][preNums:]
        # Empty the thread pool, then create the threads and start them
        threads = threads[0:0]
        for i in range(threadNums):
            t = newThread(le + 1, subURLS[i], db)
            t.setDaemon(True)
            threads.append(t)
        for i in threads:
            i.start()
        # Wait until all threads have finished
        for i in threads:
            i.join()
        nowLevel = le + 1
        # Remove duplicate urls from the new level,
        # as well as urls that already appeared in earlier levels
        URLS[nowLevel] = list(set(URLS[nowLevel]))
        for i in range(nowLevel):
            for url in URLS[i]:
                if url in URLS[nowLevel]:
                    URLS[nowLevel].remove(url)

    # Write the data to the database
    # db.commit()
    db.close()

    endTime = datetime.now()
    log.info('ended at %s' % endTime)
    log.info('takes %s' % (endTime - startTime))

Search:

With the data stored locally, you can search it. To be honest, I don't know how a real search engine matches keywords against the whole Internet; this is just a demonstration. If you remember the three fields of the database table described earlier, the program needs little explanation: it looks for the entered words in puredata and, whenever they are found, prints the corresponding url.

import MySQLdb

db = MySQLdb.connect(user='root', db='sp', unix_socket='/tmp/mysql.sock')
cur = db.cursor()
nums = cur.execute('select * from webdata')
print '%d items' % nums
x = cur.fetchall()
print 'input something to search, "exit" to exit'

while True:
    key = raw_input('> ')
    if key == 'exit':
        break
    # Look for the keyword in the puredata column; print the url when it matches
    for i in range(nums):
        if key in x[i][2]:
            print x[i][0]
    print 'search finished'

db.close()

Finally, here is a sample search result:
