A crawler is mainly used to filter out the useless information in a web page and crawl the useful information.
The typical crawler structure is:
Before writing a Python crawler you should have some understanding of web page structure, such as HTML tags and basic web languages; W3School is recommended:
W3School link for more information
A few tools need to be prepared before crawling:
1. First, a Python development environment: here I chose Python 2.7, and as the IDE I chose the Python plugin on VS2013 so that installation and debugging are easy, developing in VS (debugging a Python program there feels much like debugging C);
2. A tool for viewing web page source code: although every browser can view page source, here I still recommend Firefox with the Firebug plugin (these two are also among the tools a web developer must use);
The Firebug plugin can be installed from the Add-ons menu on the right;
Second, try looking at the source code of the web page; here I take the basketball data we want to crawl as the example:
As an example, I want to crawl the Team Comparison table in this web page:
First, right-click the 32-49 score I want to crawl and select Firebug's inspect element option (another benefit of Firebug is that, while viewing the source, it highlights on the rendered page the element you are positioned on); below the page, the source code appears together with the location and source of the 32-49 score, like this:
You can see that the source code for 32-49 in the web page is:
<td class="sdi-datacell" align="center">32-49</td>
Here td is the tag name, class is the class name, align is the alignment format, and 32-49 is the tag's content, which is the content we want to crawl;
But there are many tags with the same name and class name in the same page, so these two elements alone are not enough to pull out the data we need. We therefore also look at the tag's parent tag, or a tag one more level up, to get additional features of the data we want to crawl and filter out the data we do not want. Here we choose the tag of this table as the second feature we filter on:
<div class="sdi-so"><h3>Team Comparison</h3>
Then we'll analyze the URL of the webpage:
If the URL of the page we want to crawl is:
http://www.covers.com/pageLoader/pageLoader.aspx?page=/data/nba/matchups/g5_preview_12.html
Because we have some experience building sites, we can guess here that:
www.covers.com is the domain name;
/pageLoader/pageLoader.aspx?page=/data/nba/matchups/g5_preview_12.html: /pageLoader/ is probably the root directory on the server where the page is placed, pageLoader.aspx is the loader page, and page=/data/nba/matchups/g5_preview_12.html is the address of the page being loaded;
For ease of management, pages of the same type are usually placed under the same folder and named in a similar way: if this page is named g5_preview_12.html, then similar pages change the 5 in g5 or the 12 in _12; by changing these two numbers we find that similar pages can be reached, as sketched below.
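Based on this guess, a small sketch like the one below (the ranges of numbers are assumptions for illustration; only the URL pattern itself comes from the address above) can generate the addresses of similar pages by substituting those two numbers:

# Sketch: build candidate URLs by varying the two numbers in the file name.
base = ('http://www.covers.com/pageLoader/pageLoader.aspx'
        '?page=/data/nba/matchups/g%d_preview_%d.html')
urls = []
for g in range(1, 6):           # assumed range for the number after g
    for n in range(10, 15):     # assumed range for the number after _
        urls.append(base % (g, n))
print urls[0]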
Now on to learning the crawler itself:
The Python crawler here mainly uses two libraries:
urllib2
BeautifulSoup
BeautifulSoup's detailed documentation can be viewed at the following site:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
When crawling a webpage:
First open the web page, then call the BeautifulSoup library to parse it, then use the .find function to locate the features we analyzed, and use .text to get the content of the tag, which is the data we want to crawl.
For example, we analyze the following code:
response = urllib2.urlopen(url)
print response.getcode()
soup = BeautifulSoup(response, 'html.parser', from_encoding='utf-8')
links2 = soup.find_all('div', class_="sdi-so", limit=2)
cishu = 0
for i in links2:
    if cishu == 1:
        both = i.find_all('td', class_="sdi-datacell")
        for q in both:
            print q.text
            table.write(row, col, q.text)
            col = (col + 1) % 9
            if col == 0:
                row = row + 1
        row = row + 1
        file.save('NBA.xls')
    cishu = cishu + 1
urllib2.urlopen(url) opens the web page;
print response.getcode() tests whether the web page can be opened (a slightly more defensive version of this check is sketched after these notes);
soup = BeautifulSoup(response, 'html.parser', from_encoding='utf-8') calls BeautifulSoup to parse the web page;
links2 = soup.find_all('div', class_="sdi-so", limit=2) searches for tags matching the given features and returns them;
here we want to find the 'div' tags whose class_ is "sdi-so", and limit=2 limits the search to two matches (this filters out other similar tags);
for i in links2:
    if cishu == 1:
        both = i.find_all('td', class_="sdi-datacell")
        for q in both:
            print q.text
            table.write(row, col, q.text)
            col = (col + 1) % 9
            if col == 0:
                row = row + 1
        row = row + 1
this searches, inside each matched 'div', class_="sdi-so" tag, for the corresponding 'td', class_="sdi-datacell" tags;
q.text returns the data we want;
Here col = (col + 1) % 9 and the two row = row + 1 statements are used to keep the rows and columns organized when writing the data into the Excel file: col wraps back to 0 after 9 cells, at which point the writing moves down to the next row;
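On the response.getcode() check mentioned above: a slightly more defensive version of the open-and-parse step might look like the following sketch (the 200 status check and the URLError handling are my additions, not part of the original script):

import urllib2
from bs4 import BeautifulSoup

url = 'http://www.covers.com/pageLoader/pageLoader.aspx?page=/data/nba/matchups/g5_preview_12.html'
try:
    response = urllib2.urlopen(url)
except urllib2.URLError as e:
    print 'failed to open the page:', e
else:
    if response.getcode() == 200:        # 200 means the page opened normally
        soup = BeautifulSoup(response, 'html.parser', from_encoding='utf-8')
    else:
        print 'unexpected status code:', response.getcode()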
The next step is to save the crawled data:
Here we save the data to an Excel file, using the packages:
xdrlib, sys, xlwt
The functions used:
file = xlwt.Workbook()
table = file.add_sheet('shuju', cell_overwrite_ok=True)
table.write(0, 0, 'team')
table.write(0, 1, 'w/l')
table.write(row, col, q.text)
file.save('NBA.xls')
These are the most basic Excel write functions, so they are not described further here;
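To see these pieces working on their own, here is a self-contained sketch (the sample cell values are made up for illustration) that writes a header row and a few cells using the same row/col wrapping logic as the crawler above:

import xlwt

file = xlwt.Workbook()
table = file.add_sheet('shuju', cell_overwrite_ok=True)
table.write(0, 0, 'team')
table.write(0, 1, 'w/l')

row, col = 1, 0
sample = ['Hawks', '32-49', 'Celtics', '49-32']   # made-up values, just for illustration
for value in sample:
    table.write(row, col, value)
    col = (col + 1) % 9          # wrap to a new Excel row after 9 cells
    if col == 0:
        row = row + 1
file.save('NBA.xls')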
Finally, the saved data we crawled looks like this:
Nice