This article is reposted from Villanch's blog. Original address: http://www.freebuf.com/articles/system/100668.html?utm_source=tuicool&utm_medium=Referral
0x00 Introduction
0x01 Requirements
0x02 What you can learn
0x03 Knowledge Supplement
0x04 The simplest start
0x05 More Elegant Solutions
0x06 URL Legality Judgment
0x07 Summary and Notice

0x00 Introduction
Crawler technology is an important part of data mining and testing, and it is the core of search engine technology.
But as a common technology, ordinary people can also use crawlers to do a lot of things. For example, if you want to learn about everything freebuf has published on crawler technology, you can write a crawler to fetch and parse freebuf articles. If you want to track the price of some kind of merchandise on taobao, you can write a crawler that automatically searches for it, extracts the information, and crawls once a day, so you can decide when the price is low enough to buy. Or if you want to collect some kind of information to build your own database, but copying and pasting by hand is tedious, then crawler technology can help a lot, right?

0x01 Requirements
So this series of articles aims to popularize crawler technology, certainly not by dumping a crawler framework on you and explaining it. In this series I will try to go from simple to difficult, concisely introducing the various elements of a crawler and how to quickly write useful code of your own. There is a small requirement for readers: be able to read Python code and do some hands-on work, and in addition have a basic understanding of HTML elements.

0x02 What you can learn
Of course, articles about crawlers are easy to find on the internet, but well-crafted, systematic explanations are relatively rare. In this article and the ones that follow, the author will introduce various kinds of knowledge about crawlers:
Generally speaking, this series is written from the single crawler to the distributed crawler, from functional implementation to overall design, from the micro to the macro.
1. Writing a simple crawler with simple modules
2. A more elegant crawler
3. Basic crawler theory and general methods
4. Simple web data mining
5. Dynamic web crawlers (crawlers that can handle JS)
6. Data storage for crawlers
7. Multi-threaded and distributed crawler design
If a reader wants an introductory book on crawlers, I recommend Web Scraping with Python. There is currently no Chinese translation of the English edition, but enthusiasts online are translating it; interested readers can look into it.

0x03 Knowledge Supplement
Here I will briefly introduce some of the current mainstream modules for writing crawlers:
htmllib (sgmllib): a very old, low-level module that simply parses HTML documents. It does not support searching for tags and its fault tolerance is poor. A reminder: if the HTML document passed in does not end correctly, this module will not finish parsing until correct data is passed in or it is forcibly shut down.
BeautifulSoup: this module parses HTML very professionally, with good fault tolerance; you can search for any tag, and it comes with encoding-handling solutions.
Selenium: an automated web testing solution. It is similar to BeautifulSoup but not the same: Selenium comes with a JS interpreter, that is, Selenium drives a browser, and can therefore be used to crawl, analyze, and mine dynamic web pages.
scrapy framework: a professional crawler framework (stand-alone) with a relatively complete solution.
API crawlers: these are crawler APIs that you probably have to pay for, such as the Google and Twitter solutions; they are not introduced here.
Only the first three approaches will appear in the author's articles.

0x04 The simplest start
As a first example, I'll introduce the simplest module and write the simplest single-page crawler:
urllib: here we use this module to fetch the HTML document of a page. The specific usage is:

web = urllib.urlopen(url)
data = web.read()

Note that this is Python 2; Python 3 is different.
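For readers on Python 3 (outside the scope of this article, which sticks to Python 2), the equivalent call lives in urllib.request; a minimal sketch for comparison:

# Python 3 equivalent of the two lines above (not used in this article)
import urllib.request

web = urllib.request.urlopen("http://freebuf.com/")
data = web.read()          # returns bytes; decode() it if you need a str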
sgmllib: this library is the underlying layer of htmllib, but it can also provide a solution for processing HTML text. The specific usage is:
1. Define a class that inherits from sgmllib's SGMLParser;
2. Override SGMLParser's methods, adding your own custom tag-handling functions;
3. Feed the data to be parsed into the parser through an object of the custom class; the custom methods then take effect automatically.
import urllib
import sgmllib

class handle_html(sgmllib.SGMLParser):
    # unknown_starttag is called whenever the start of any tag is parsed
    # tag is the tag name, attrs holds the tag's attributes
    def unknown_starttag(self, tag, attrs):
        print "-------" + tag + " start--------"
        print attrs

    # unknown_endtag is called whenever the end of any tag is parsed
    def unknown_endtag(self, tag):
        print "-------" + tag + " end----------"

web = urllib.urlopen("http://freebuf.com/")
web_handler = handle_html()
# feed the data into the parser
web_handler.feed(web.read())
With just over 10 lines of code, the simplest single-page crawler is complete. Below is the output: we can see that the start and end of each tag are marked, and the attributes of each tag are printed as well.
[Screenshot A1.png: http://image.3001.net/images/20160223/145619506490.png]
We can then use this low-level parsing method to build a basic example:
The small example below inspects the attrs of each start tag and prints the value of every href attribute; readers will recognize that collecting hrefs is basically how a crawler finds its paths.
import urllib
import sgmllib

class handle_html(sgmllib.SGMLParser):
    def unknown_starttag(self, tag, attrs):
        # try/except is used here to avoid errors.
        # This is not recommended: it is harmless in a small script like this,
        # but in a real project this practice hides serious problems.
        try:
            for attr in attrs:
                if attr[0] == "href":
                    print attr[0] + ":" + attr[1].encode('utf-8')
        except:
            pass

web = urllib.urlopen("http://freebuf.com/")
web_handler = handle_html()
web_handler.feed(web.read())
The result of the analysis is:
[Screenshot A2.png: http://image.3001.net/images/20160223/14561950806172.png]
We notice some discordant elements among the parsed hrefs: javascript: links appear, other domain names appear, and some readers will point out that there are duplicate URLs. For our freebuf site this is manageable, but for the many complex environments on the internet the considerations above are completely inadequate. We will talk about this later.
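As a quick illustration of that kind of cleanup (my own sketch, not the approach the author settles on later), the noisy href list could be de-duplicated with a set and stripped of javascript: pseudo-links:

# Hedged sketch: de-duplicate hrefs and drop javascript: pseudo-links.
# "raw_hrefs" is a hypothetical list collected by unknown_starttag above.
seen = set()
for href in raw_hrefs:
    if href.lower().startswith("javascript:"):
        continue          # skip javascript:void(0) style links
    if href in seen:
        continue          # skip duplicates
    seen.add(href)
    print href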
But the author does not plan to solve our problems entirely with this method, because we have a more elegant solution.

0x05 More Elegant Solutions
Of course I mean BeautifulSoup. Why choose this module? I personally think it parses HTML very professionally; it is commonly abbreviated BS4, and readers who have used BS4 know this well. In fact, BeautifulSoup does much more than simply parse HTML documents: several parsers can be selected automatically or specified manually, each with a different bias, some emphasizing speed and some emphasizing correctness; it automatically detects the HTML document's encoding and handles it gracefully; it supports CSS-style filtering; and its parameters are convenient to use.
General steps for using BeautifulSoup:
1. Import the BeautifulSoup library: from bs4 import BeautifulSoup
2. Pass in the data and build the object: soup = BeautifulSoup(data)
3. Operate on soup to complete the analysis you need.
Let's look at the specific code example:
from bs4 import BeautifulSoup
import urllib
import re

web = urllib.urlopen("http://freebuf.com/")
soup = BeautifulSoup(web.read())
tags_a = soup.findAll(name="a", attrs={'href': re.compile("^https?://")})
for tag_a in tags_a:
    print tag_a["href"]
This snippet does the same job as the second short sgmllib script, but is more elegant to write. It also introduces a regular expression to filter the links slightly, dropping the javascript: entries, and the result clearly looks a lot cleaner:
[Screenshot A3.png: http://image.3001.net/images/20160223/14561951119777.png]
A brief explanation of the warning above:

UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

To get rid of this warning, change code that looks like this:

BeautifulSoup([your markup])

to this:

BeautifulSoup([your markup], "html.parser")

In other words: no particular parser was specified, so BS4 used what it considers the best parser, html.parser. That is generally not a problem, but if you run the script in a different environment the parser may not be the same. To remove the warning, change the call to BeautifulSoup(data, "html.parser"). The warning simply reflects BS4's feature of automatically selecting a parser.
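To illustrate both points mentioned above, explicitly naming the parser and CSS-style filtering, here is a small sketch (my own example, not from the original article):

from bs4 import BeautifulSoup
import urllib

web = urllib.urlopen("http://freebuf.com/")
# explicitly name the parser so the warning disappears
soup = BeautifulSoup(web.read(), "html.parser")
# CSS selector: every <a> tag that carries an href attribute
for tag_a in soup.select("a[href]"):
    print tag_a["href"]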
0x06 URL Legality Judgment
A URL and a URI are effectively the same thing here; leaving URIs aside, let's talk about URL processing. If, as we did at the beginning, we analyze every URL by hand or with regular expressions, we have to consider the many possible structures of a URL, such as the following examples:
path?ss=1#arch
http://freebuf.com/geek
?ss=1
path/me
javascript:void(0)
/freebuf.com/s/s/s/
sssfadea://ssss.ss
path?ss=1&s=1
ftp://freeme.com/ss/s/s
path?ss=1
#arch
//freebuf.com/s/s/s/
https://freebuf.com:443/geek?id=1#sid
//freebuf.com/s/s/s
These are roughly the forms of URL we have to deal with, all of which can plausibly appear on a web page. So how do we judge their legality?
First split on '//': to the left is the protocol plus ':', to the right up to the first '/' is the domain name, and after the domain name comes the path. After '?' come the query parameters, and after '#' is the anchor.
Writing code for this analysis shouldn't be very difficult, but we don't have to write it ourselves every time; after all, we are using Python, and these things don't need to be done by hand.
In fact, I personally think this is where Python's powerful modules pay off: urlparse. This module implements exactly the URL analysis idea described above, and using it is pythonic:
import urlparse

url = set()
url.add('javascript:void(0)')
url.add('http://freebuf.com/geek')
url.add('https://freebuf.com:443/geek?id=1#sid')
url.add('ftp://freeme.com/ss/s/s')
url.add('sssfadea://ssss.ss')
url.add('//freebuf.com/s/s/s')
url.add('/freebuf.com/s/s/s/')
url.add('//freebuf.com/s/s/s/')
url.add('path/me')
url.add('path?ss=1')
url.add('path?ss=1&s=1')
url.add('path?ss=1#arch')
url.add('?ss=1')
url.add('#arch')

for item in url:
    print item
    o = urlparse.urlparse(item)
    print o
    print
Then execute the code, and we can look at the specific parsing results:
path?ss=1#arch
ParseResult(scheme='', netloc='', path='path', params='', query='ss=1', fragment='arch')

http://freebuf.com/geek
ParseResult(scheme='http', netloc='freebuf.com', path='/geek', params='', query='', fragment='')

?ss=1
ParseResult(scheme='', netloc='', path='', params='', query='ss=1', fragment='')

path/me
ParseResult(scheme='', netloc='', path='path/me', params='', query='', fragment='')

javascript:void(0)
ParseResult(scheme='javascript', netloc='', path='void(0)', params='', query='', fragment='')

/freebuf.com/s/s/s/
ParseResult(scheme='', netloc='', path='/freebuf.com/s/s/s/', params='', query='', fragment='')

sssfadea://ssss.ss
ParseResult(scheme='sssfadea', netloc='ssss.ss', path='', params='', query='', fragment='')

path?ss=1&s=1
ParseResult(scheme='', netloc='', path='path', params='', query='ss=1&s=1', fragment='')

ftp://freeme.com/ss/s/s
ParseResult(scheme='ftp', netloc='freeme.com', path='/ss/s/s', params='', query='', fragment='')

path?ss=1
ParseResult(scheme='', netloc='', path='path', params='', query='ss=1', fragment='')

#arch
ParseResult(scheme='', netloc='', path='', params='', query='', fragment='arch')

//freebuf.com/s/s/s/
ParseResult(scheme='', netloc='freebuf.com', path='/s/s/s/', params='', query='', fragment='')

https://freebuf.com:443/geek?id=
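As a closing sketch (my own hedged illustration of the idea, not the author's final design), the ParseResult fields can be used to decide whether a harvested link is worth crawling, resolving relative forms against the page they came from with urlparse.urljoin:

import urlparse

ALLOWED_SCHEMES = ("http", "https")

def is_crawlable(link, base="http://freebuf.com/"):
    # resolve relative forms such as "path?ss=1" or "//freebuf.com/s/s/s"
    absolute = urlparse.urljoin(base, link)
    parts = urlparse.urlparse(absolute)
    # keep only http/https links that actually have a domain
    return parts.scheme in ALLOWED_SCHEMES and bool(parts.netloc)

print is_crawlable("javascript:void(0)")       # False
print is_crawlable("path?ss=1#arch")           # True -> http://freebuf.com/path?ss=1#arch
print is_crawlable("//freebuf.com/s/s/s")      # True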