Python Data Mining (crawler hardening)

Source: Internet
Author: User
Tags: tag name, XPath

(I like rainy days, because on rainy days I can go back to my childhood and splash in the puddles! Ha!)

July 12th, 2018: Cloudy with heavy rain T_T

Objective

In this post I would like to introduce the ultimate crawler: as long as the data is visible to the naked eye, we can capture it, even data that cannot be obtained from the page source, from JSON, or via a POST request. Press F12 and the rendered source is laid bare in front of you!

This time we will use several members of the Selenium family, each with its own duties and its own talents.

Let's introduce the Selenium module: Selenium is a suite of tools dedicated to automating web browsers.

(A note here: whenever you need a large module like this, read the official documentation. Do not rely on Baidu results or content written by people quoting out of context; they are likely to lead you astray.)

The selenium module is mainly for cases where we need the program to browse web data automatically, executing semi-intelligently, doing whatever we teach it to do!

Here are the required family members, imported directly:

```python
from selenium import webdriver
import time
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
```

First, a quick explanation of each member, for easy reference:

1. webdriver, the main module: it controls the program, opening the browser and browsing web pages automatically.

2. time: as a developer, especially one building web automation tools, you must use the time module to throttle the program's access rate, because the site may simply block your IP.

3. Keys, a Selenium family member: it simulates keyboard operations, such as typing a username and password at login, or entering search terms for the data you want.

4. ActionChains, a Selenium family member: it simulates mouse operations, including double-clicks, single clicks, and the left and right buttons, which we need for turning pages and clicking the Search button.

5. By, a Selenium family member: this is how we teach the program what to find. It is one of the core functions of our data mining, handling the capture of the valuable data.
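To make the five roles concrete, here is a minimal sketch that uses each member once. The URL, the search field name `'q'`, and the search text are placeholder assumptions; the browser part needs a local chromedriver to actually run, so it is kept inside a function, while the throttling helper can be exercised on its own.

```python
import time
import random

def polite_delay(base=2.0, jitter=1.0):
    """Return a randomized delay so we don't hit the site at a fixed rate."""
    return base + random.uniform(0, jitter)

def demo():
    # A sketch: requires selenium installed and chromedriver on PATH.
    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.common.action_chains import ActionChains
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()                        # 1. webdriver opens the browser
    driver.get('https://www.xxx.com')                  # placeholder URL from the post
    time.sleep(polite_delay())                         # 2. time throttles our access rate
    box = driver.find_element(By.NAME, 'q')            # 5. By locates a field (name assumed)
    box.send_keys('selenium', Keys.ENTER)              # 3. Keys simulates typing plus Enter
    link = driver.find_element(By.TAG_NAME, 'a')
    ActionChains(driver).double_click(link).perform()  # 4. ActionChains drives the mouse
    driver.quit()
```

The randomized delay is a small courtesy to the target site; a fixed `sleep(5)` works too, but a jittered interval looks less like a bot.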

Second, getting started with development:

1. Have the program open the browser and navigate to the page we need:

```python
url = 'https://www.xxx.com'
driver = webdriver.Chrome()
driver.get(url)
time.sleep(5)
driver.quit()
```

You can test this yourself. I am using Google's Chrome browser; you can try Firefox instead. They have some differences, mainly in which driver executable each browser needs!
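If you want to switch between the two, a small helper (my own convenience function, not part of Selenium) can pick the driver by name. Chrome needs chromedriver and Firefox needs geckodriver on your PATH:

```python
def make_driver(browser='chrome'):
    """Return a Selenium driver for the chosen browser name."""
    if browser not in ('chrome', 'firefox'):
        raise ValueError('unsupported browser: %s' % browser)
    # Imported lazily so the name check above works even without selenium installed.
    from selenium import webdriver
    if browser == 'chrome':
        return webdriver.Chrome()    # requires chromedriver on PATH
    return webdriver.Firefox()       # requires geckodriver on PATH
```

Usage: `driver = make_driver('firefox')`, then the rest of the code is unchanged, which is the whole point of Selenium's common WebDriver interface.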

2. Locking onto tags after entering the page

HTML:

```html
<div id="AAA" class="BBB" name="CCC">
    <p></p>
    <p><a></a></p>
</div>
```

Python:

```python
element = driver.find_element_by_id("AAA")
frame = driver.find_element_by_tag_name("div")
cheese = driver.find_element_by_name("CCC")
cheeses = driver.find_elements_by_class_name("BBB")

# or

from selenium.webdriver.common.by import By
element = driver.find_element(by=By.ID, value="AAA")
frame = driver.find_element(By.TAG_NAME, "div")
cheese = driver.find_element(By.NAME, "CCC")
cheeses = driver.find_elements(By.CLASS_NAME, "BBB")
```

Each of these locks onto the tag tree, targeting a tag by its id, class, name, or tag name.

```python
xpath_class = driver.find_element_by_xpath('//div[@class="BBB"]/p')
xpath_id = driver.find_element_by_xpath('//div[@id="AAA"]/p')
```

This is the general-purpose method, the XPath method; all of these approaches parse the page content and lock onto tags.
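Since these XPath strings all follow one pattern, a small helper (my own convenience function, not part of Selenium) can build them and avoid quoting mistakes:

```python
def xpath_by_attr(tag, attr, value, child=None):
    """Build an XPath such as //div[@class="BBB"]/p from its parts."""
    path = '//%s[@%s="%s"]' % (tag, attr, value)
    if child:
        path += '/' + child        # optional child step appended at the end
    return path

# usage with selenium:
# driver.find_element_by_xpath(xpath_by_attr('div', 'class', 'BBB', 'p'))
```

This keeps the double quotes around the attribute value inside single-quoted Python strings, which is where typos most often creep in.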

3. Performing operations:

Once we have locked onto the tag of a functional element, we can operate on it further, for example changing pages or triggering a search. For simulated keyboard input, refer to my other blog post, "Python Automation Crawler".
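As a quick taste of simulated keyboard input, here is a sketch of a login helper. The field names `'username'` and `'password'` are assumptions about the page, and `'\ue007'` is the underlying value of `Keys.ENTER`, used as a fallback so the logic can be exercised without a browser:

```python
try:
    from selenium.webdriver.common.keys import Keys
    ENTER = Keys.ENTER
except ImportError:          # lets the helper be tested without selenium installed
    ENTER = '\ue007'         # the key code that Keys.ENTER stands for

def fill_login(driver, user, password):
    """Type credentials into a login form and submit with the Enter key."""
    driver.find_element_by_name('username').send_keys(user)   # assumed field name
    pw = driver.find_element_by_name('password')              # assumed field name
    pw.send_keys(password)
    pw.send_keys(ENTER)   # press Enter instead of hunting for a submit button
```

Pressing Enter in the password field is often simpler than locating the submit button, though it assumes the form submits on Enter.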

Here we describe simulating mouse operations:

```python
elem = driver.find_element_by_xpath('//a[@id="TagName"]')
ActionChains(driver).double_click(elem).perform()
time.sleep(3)
```

For time's sake, I have only introduced clicking on the page with the left mouse button; for everything else, refer to the official documentation: Selenium WebDriver.

ActionChains(driver) binds the action chain to the browser session, double_click(elem) queues a double-click on the locked tag, and .perform() executes the queued actions.
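This queue-then-fire behavior is worth internalizing. The toy class below is a stand-in I wrote to illustrate the pattern, not Selenium's real implementation: each method records an action and returns self (which is what makes the chained call style possible), and nothing actually happens until perform() is called.

```python
class MiniActionChain:
    """Toy model of ActionChains: methods queue actions, perform() runs them."""
    def __init__(self):
        self._queue = []
        self.performed = []

    def double_click(self, target):
        self._queue.append(('double_click', target))
        return self   # returning self enables .double_click(...).perform()

    def perform(self):
        for action, target in self._queue:
            self.performed.append('%s on %s' % (action, target))
        self._queue = []

chain = MiniActionChain()
chain.double_click('next-page-link')
print(chain.performed)   # [] -- nothing fired yet, the click is only queued
chain.perform()
print(chain.performed)   # ['double_click on next-page-link']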

4. Getting the valuable data

The operations here chain lookups, much like XPath syntax does:

```python
quantity = driver.find_elements_by_tag_name('td')[3].text
hrefs = [a.get_attribute('href') for a in driver.find_elements_by_tag_name('a')]
```

Note the word elements here: find_elements (plural) returns every matching tag as a list, so to read an attribute such as href from each a tag, you must traverse the list.
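A small loop makes the traversal explicit. The helper below is my own; it works on anything exposing a get_attribute method, which includes the WebElements in a find_elements result, and it filters out anchors with no href (which get_attribute reports as None):

```python
def extract_hrefs(anchors):
    """Pull the href from each element in a find_elements_* result list."""
    hrefs = []
    for a in anchors:                 # elements (plural) means we must iterate
        href = a.get_attribute('href')
        if href is not None:          # <a> tags without an href yield None
            hrefs.append(href)
    return hrefs

# usage with selenium:
# extract_hrefs(driver.find_elements_by_tag_name('a'))
```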

5. Finally, the complete code in one piece:

```python
from selenium import webdriver
import time
import lxml.html as HTML           # imported in the original; unused in this function
from bs4 import BeautifulSoup      # imported in the original; unused in this function
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from pymongo import MongoClient, ASCENDING, DESCENDING
from selenium.webdriver.common.by import By

def parser():
    url = 'https://www.xxx.com'
    driver = webdriver.Chrome()
    driver.get(url)
    time.sleep(5)
    for i in range(1, 675):        # page through the result pages
        a = driver.find_element_by_xpath('//div[@class="AAA"]')
        tr = a.find_elements_by_tag_name('tr')
        for j in range(1, len(tr)):
            quantity = tr[j].find_elements_by_tag_name('td')[3].text
            producturl = (tr[j].find_elements_by_tag_name('td')[0]
                          .find_elements_by_tag_name('div')[1]
                          .find_element_by_tag_name('ul')
                          .find_element_by_tag_name('li')
                          .find_element_by_tag_name('a')
                          .get_attribute('href'))
            producturl_db(producturl, quantity)   # save the pair to the database
        elem = driver.find_element_by_xpath('//a[@id="Elenextpage"]')
        ActionChains(driver).double_click(elem).perform()
        time.sleep(3)

    driver.quit()
```
Selenium has a small gotcha when using XPath: you may have found the parent tag, but there are many such parents, tr for example. When you traverse them looking for td, you should still call find_elements_by_tag_name on each row element itself, because searching from the driver again restarts from the document root and you lose track of that parent. Something to pay attention to!
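To make the pitfall concrete, here is a sketch of walking a table row by row (the table structure is assumed). The key detail is that the inner search is made relative to each row element; calling driver.find_elements_by_tag_name('td') instead would restart at the document root, return every td on the page, and lose track of which row each cell belongs to.

```python
def cells_per_row(container):
    """Collect the text of each <td>, grouped under its parent <tr>."""
    rows = container.find_elements_by_tag_name('tr')
    table = []
    for row in rows:
        # Relative search: row.find_elements_* only looks INSIDE this row.
        tds = row.find_elements_by_tag_name('td')
        table.append([td.text for td in tds])
    return table
```

The helper accepts any object with a find_elements_by_tag_name method, so it works on a Selenium driver, on a located div, or on stand-in objects for testing.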
Finally, I wish you all the best!!!
