Some Experiences with Python Crawlers

Source: Internet
Author: User
Tags: xpath

First, installing Firefox plug-ins

1. XPath Checker: an interactive XPath expression editor; select 'View XPath' on an element in a web page to see its XPath path.

2. Firebug: Firebug adds a rich set of development tools to Firefox, readily available while you browse.

You can edit, debug, and monitor the CSS, HTML, and JavaScript of any web page in real time.


As long as you have the Firefox browser, you can download the XPI file and install it directly into the browser. If an error appears:



If this error occurs, enter about:config in the address bar and double-click the highlighted value to set it to false.


Second, Python Selenium

Here is an article with a detailed introduction to Selenium; read it first:

http://www.cnblogs.com/fnng/p/3160606.html


2.1 Installing Selenium

OS: CentOS 6.4

Python: 2.7.3

CentOS 6.4 ships with Python 2.6.6, which is too old for this setup,

so Python needs to be upgraded first:

```shell
# python -V
Python 2.6.6

# Download and unpack Python 2.7.3
wget http://python.org/ftp/python/2.7.3/Python-2.7.3.tar.bz2
tar -jxvf Python-2.7.3.tar.bz2
cd Python-2.7.3

# Configure, build, and install
./configure
make all
make install
make clean
make distclean

# Check the newly installed version
/usr/local/bin/python2.7 -V

# Create a soft link (i.e. a shortcut) so the system's default
# "python" points to 2.7
mv /usr/bin/python /usr/bin/python2.6.6
ln -s /usr/local/bin/python2.7 /usr/bin/python
python -V    # Python 2.7.3

# yum is incompatible with Python 2.7 and stops working, so point its
# shebang back at the old interpreter:
vim /usr/bin/yum
# change the first line from:
#     #!/usr/bin/python
# to:
#     #!/usr/bin/python2.6.6
```





Selenium installation:

```shell
# Download pip
wget https://bootstrap.pypa.io/get-pip.py --no-check-certificate
# Install pip
python get-pip.py
# Install Selenium
pip install selenium
```

With that, the basic environment is ready. Now the actual development process begins...


Objective: the report-generation feature in the bottom-right corner of the website is broken, but I still need the first 100 pages of the ranking, i.e. the top 2000 entries by traffic.



There are far too many entries to copy by hand, so there's no choice but to fetch them myself:

After reading the article linked at the beginning, you should have some idea of how Selenium is used, so let's start.

Import all the necessary libraries. Some of them are not actually used here, because the script originally contained other code that has since been deleted.

I imported them anyway; they may be useful later... I know, I know...

```python
from selenium import webdriver
import HTMLParser
import urlparse
import urllib
import urllib2
import cookielib
import string
import re
import sys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
import time
```

```python
browser = webdriver.Firefox()   # open a Firefox browser window; try it yourself
browser.get(hosturl)            # load the target site (hosturl is defined in the full script below)
```


Now we need to enter the username and password. How? Of course, first locate the name field, the password field, and the login button, and then fill in the corresponding credentials!

The next step is to start analyzing the HTML:

Point at the box where the username should be filled in, then click 'View Element':



The section highlighted in blue below is, of course, the HTML code for this text box:


We can see that the 'name' attribute of this text box is 'name', so we can locate it like this:

```python
name = browser.find_element_by_name("name")        # locate the text box and assign it to the variable name
passwd = browser.find_element_by_name("password")  # the rest follow the same steps, so no more detail
submit = browser.find_element_by_name("login")
name.send_keys("admin")                            # having located the box, type in the account name
                                                   # (it could also be fetched from a cache, not covered here)
passwd.send_keys("eb7bfb8a211ab7980191da0dcfaf5e00")  # the site's password is MD5-encrypted, so convert
                                                      # the password with an MD5 tool first
submit.submit()                                    # this is equivalent to submitting the form
# Success! We're one step closer to our goal of getting the ranking data.
```
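The password string sent above is an MD5 digest rather than plain text. If you need to produce such a digest yourself, Python's standard hashlib module can do it (a small sketch; the "admin" input below is purely illustrative and unrelated to the hash used above):

```python
import hashlib

def md5_hex(password):
    # Return the hex-encoded MD5 digest of a password string
    return hashlib.md5(password.encode('utf-8')).hexdigest()

print(md5_hex("admin"))  # -> 21232f297a57a5a743894a0e4a801fc3
```

A call like send_keys(md5_hex(plain_password)) would then submit the hashed form, matching what the site stores.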

```python
browser.implicitly_wait(3)   # wait up to 3 seconds, in case the page renders slowly
```

After logging in, we look at the main page's HTML again to find the traffic-ranking data. For example: we can see that the 'Website Access Traffic Statistics' entry is an href (hyperlink), and that it sits inside an iframe (an embedded web frame).
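As an aside, implicitly_wait(3) simply makes element lookups poll for a while instead of failing immediately. The same polling idea in plain Python might look like this (a generic sketch of the concept, not Selenium's actual implementation; the helper name is my own):

```python
import time

def wait_until(predicate, timeout=3.0, interval=0.1):
    # Poll predicate() until it returns a truthy value or the timeout
    # expires; return the value, or None on timeout.
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = predicate()
        if result:
            return result
        time.sleep(interval)
    return None
```

For example, wait_until(lambda: browser.find_elements_by_name("name")) would poll until the login box appears (find_elements returns an empty, falsy list while it is absent).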


In that case, what we need to do is first switch into this iframe, and then click the 'Website Access Traffic Statistics' hyperlink:

```python
# The iframe's name is 'main'; enter the iframe first:
browser.switch_to_frame("main")
# Then click the 'Website Access Traffic Statistics' hyperlink:
browser.find_element_by_partial_link_text("Website Access Traffic Statistics").click()
```

To summarize, the idea is: first locate the element you want to operate on, then perform the action you want, usually input, click, or submit. The final goal is to crawl the rankings, so we keep analyzing the HTML:

Take the site ranked first as an example:

With the XPath Checker tool we installed earlier, we can view the XPath of this row's rank field '1', for example:


Looking down in turn, the XPath of the 'URL host' field is id('top_table')/x:table/x:tbody/x:tr[2]/x:td[2]. See the pattern? The columns that follow are just td[3], td[4], td[5], and so on.


Then look at the 'rank' field of the 2nd entry: id('top_table')/x:table/x:tbody/x:tr[3]/x:td[1], so the pattern holds; for the 3rd entry it's just tr[4]. Pretty simple, isn't it?
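This row/column pattern can be captured in a tiny helper that builds the XPath for any cell (a sketch; `cell_xpath` is my own name, not part of the original script):

```python
def cell_xpath(row, col):
    # XPath of cell (row, col) in the ranking table, following the
    # tr[n]/td[m] pattern observed above
    return "//div[@id='top_table']/table/tbody/tr[%d]/td[%d]" % (row, col)

print(cell_xpath(2, 2))  # -> //div[@id='top_table']/table/tbody/tr[2]/td[2]
```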

Then we just need to write a loop, and all the data can be pulled down!


```python
def get():
    i = 2
    while i < 22:
        paiming = browser.find_element_by_xpath("//div[@id='top_table']/table/tbody/tr[%s]/td[1]" % i).text
        host = browser.find_element_by_xpath("//div[@id='top_table']/table/tbody/tr[%s]/td[2]" % i).text
        outl = browser.find_element_by_xpath("//div[@id='top_table']/table/tbody/tr[%s]/td[3]" % i).text
        inl = browser.find_element_by_xpath("//div[@id='top_table']/table/tbody/tr[%s]/td[4]" % i).text
        total = browser.find_element_by_xpath("//div[@id='top_table']/table/tbody/tr[%s]/td[5]" % i).text
        tt = paiming + ' ' + host + ' ' + outl + ' ' + inl + ' ' + total
        print tt
        i += 1
```

This function pulls the data out of the first page and prints it to the screen. Because there are 100 pages, we also write a loop to traverse all 100; the operation on each page is the same:

```python
def next_page():
    browser.find_element_by_partial_link_text("next Page").click()
```

The variable naming here is rather messy; please don't mind, the main point is to understand how it works. I'll pay attention to conventions later...
```python
j = 1
while j < 100:
    get()
    next_page()
    j += 1
```
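The same loop can also be written as a reusable function that stops early when advancing to the next page fails, e.g. on the last page (a generalization of the loop above; the function names and the boolean return of next_page are my own assumptions, not part of the original script):

```python
def crawl_pages(get_page, next_page, pages=100):
    # Call get_page() on each page, then advance with next_page();
    # stop early if next_page() reports there is no further page.
    results = []
    for _ in range(pages):
        results.append(get_page())
        if not next_page():
            break
    return results

# Simulated 3-page site: next_page succeeds twice, then fails
clicks = iter([True, True, False])
data = crawl_pages(lambda: "page-data", lambda: next(clicks), pages=100)
print(len(data))  # -> 3
```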

Since this is just for learning, and only meant to get the feature working, there is no exception handling, which admittedly isn't standard practice. The full script:

```python
#!/usr/bin/python
# encoding=utf-8
from selenium import webdriver

hosturl = 'http://172.16.5.248'

browser = webdriver.Firefox()
browser.get(hosturl)

name = browser.find_element_by_name("name")
passwd = browser.find_element_by_name("password")
submit = browser.find_element_by_name("login")
name.send_keys("admin")
passwd.send_keys("eb7bfb8a211ab7980191da0dcfaf5e00")
submit.submit()

browser.implicitly_wait(3)
browser.switch_to_frame("main")
browser.find_element_by_partial_link_text("Website Access Traffic Statistics").click()

# This function writes the data to a file; it wasn't described above:
def write(data):
    f = open('paiming', 'a')
    f.write(data)
    f.write('\n')
    f.close()

def get():
    i = 2
    while i < 22:
        paiming = browser.find_element_by_xpath("//div[@id='top_table']/table/tbody/tr[%s]/td[1]" % i).text
        host = browser.find_element_by_xpath("//div[@id='top_table']/table/tbody/tr[%s]/td[2]" % i).text
        outl = browser.find_element_by_xpath("//div[@id='top_table']/table/tbody/tr[%s]/td[3]" % i).text
        inl = browser.find_element_by_xpath("//div[@id='top_table']/table/tbody/tr[%s]/td[4]" % i).text
        total = browser.find_element_by_xpath("//div[@id='top_table']/table/tbody/tr[%s]/td[5]" % i).text
        tt = paiming + ' ' + host + ' ' + outl + ' ' + inl + ' ' + total
        write(tt)
        i += 1

def next_page():
    browser.find_element_by_partial_link_text("next Page").click()

j = 1
while j < 100:
    get()
    next_page()
    j += 1
```

OK, that's it. The code implementing this rough functionality isn't difficult; I just have no development background, so it still took me quite a while. PS: stick with it and get it done...
Let's look at the final result: run the script, and the data is saved to a file.
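Once the data is in the 'paiming' file, each line holds five space-separated fields, so reading it back is straightforward. A small parser sketch (the dictionary keys are my own labels, inferred from the crawler above; it assumes the host field itself contains no spaces):

```python
def parse_line(line):
    # One saved line: rank, URL host, outbound traffic, inbound traffic, total
    rank, host, outl, inl, total = line.split()
    return {'rank': rank, 'host': host, 'out': outl, 'in': inl, 'total': total}

print(parse_line("1 example.com 100 200 300")['host'])  # -> example.com
```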







