Python Data Mining (crawler hardening)

Source: Internet
Author: User
Tags: tag name, XPath

(I like rainy days, because on rainy days I can go back to my childhood and splash in the puddles! Ha!)

July 12th, 2018: Cloudy with heavy rain T_T

Objective

In this post I would like to introduce the ultimate crawler: as long as the data is visible to the naked eye, we can capture it, even data that cannot be obtained from the page source, from JSON, or via a POST request. Press F12 and the rendered source is laid bare in front of you!

This time we will use several members of the Selenium family, each with its own duties and its own talents.

Let's introduce the Selenium module: Selenium is a suite of tools dedicated to automating web browsers.

(A note here: whenever you need a large module like this, read the official documentation. Do not rely on Baidu results or content written by people quoting out of context; they are likely to lead you astray.)

The selenium module is mainly for cases where we need the program to browse web data automatically, executing semi-intelligently, doing whatever we teach it to do!

Here are the required family members, imported directly:

```python
from selenium import webdriver
import time
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
```

First, a quick explanation of each member, for easy reference:

1. webdriver, the main module: it controls the program, opening the browser and browsing web pages automatically.

2. time: as a developer, especially one building web automation tools, you must use the time module to throttle the program's access rate, because the site may simply block your IP.

3. Keys, a Selenium family member: it simulates keyboard operations, such as typing a username and password at login, or entering search terms for the data you want.

4. ActionChains, a Selenium family member: it simulates mouse operations, including double-clicks, single clicks, and the left and right buttons, which we need for turning pages and clicking the Search button.

5. By, a Selenium family member: this is how we teach the program what to find. It is one of the core functions of our data mining, handling the capture of the valuable data.
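To make the five roles concrete, here is a minimal sketch that uses each member once. The URL, the search field name `'q'`, and the search text are placeholder assumptions; the browser part needs a local chromedriver to actually run, so it is kept inside a function, while the throttling helper can be exercised on its own.

```python
import time
import random

def polite_delay(base=2.0, jitter=1.0):
    """Return a randomized delay so we don't hit the site at a fixed rate."""
    return base + random.uniform(0, jitter)

def demo():
    # A sketch: requires selenium installed and chromedriver on PATH.
    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.common.action_chains import ActionChains
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()                        # 1. webdriver opens the browser
    driver.get('https://www.xxx.com')                  # placeholder URL from the post
    time.sleep(polite_delay())                         # 2. time throttles our access rate
    box = driver.find_element(By.NAME, 'q')            # 5. By locates a field (name assumed)
    box.send_keys('selenium', Keys.ENTER)              # 3. Keys simulates typing plus Enter
    link = driver.find_element(By.TAG_NAME, 'a')
    ActionChains(driver).double_click(link).perform()  # 4. ActionChains drives the mouse
    driver.quit()
```

The randomized delay is a small courtesy to the target site; a fixed `sleep(5)` works too, but a jittered interval looks less like a bot.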

Second, getting started with development:

1. Have the program open the browser and navigate to the page we need:

```python
url = 'https://www.xxx.com'
driver = webdriver.Chrome()
driver.get(url)
time.sleep(5)
driver.quit()
```

You can test this yourself. I am using Google's Chrome browser; you can try Firefox instead. They have some differences, mainly in which driver executable each browser needs!
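If you want to switch between the two, a small helper (my own convenience function, not part of Selenium) can pick the driver by name. Chrome needs chromedriver and Firefox needs geckodriver on your PATH:

```python
def make_driver(browser='chrome'):
    """Return a Selenium driver for the chosen browser name."""
    if browser not in ('chrome', 'firefox'):
        raise ValueError('unsupported browser: %s' % browser)
    # Imported lazily so the name check above works even without selenium installed.
    from selenium import webdriver
    if browser == 'chrome':
        return webdriver.Chrome()    # requires chromedriver on PATH
    return webdriver.Firefox()       # requires geckodriver on PATH
```

Usage: `driver = make_driver('firefox')`, then the rest of the code is unchanged, which is the whole point of Selenium's common WebDriver interface.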

2. Locking onto tags after entering the page

HTML:

```html
<div id="AAA" class="BBB" name="CCC">
    <p></p>
    <p><a></a></p>
</div>
```

Python:

```python
element = driver.find_element_by_id("AAA")
frame = driver.find_element_by_tag_name("div")
cheese = driver.find_element_by_name("CCC")
cheeses = driver.find_elements_by_class_name("BBB")

# or

from selenium.webdriver.common.by import By
element = driver.find_element(by=By.ID, value="AAA")
frame = driver.find_element(By.TAG_NAME, "div")
cheese = driver.find_element(By.NAME, "CCC")
cheeses = driver.find_elements(By.CLASS_NAME, "BBB")
```

Each of these locks onto the tag tree, targeting a tag by its id, class, name, or tag name.

```python
xpath_class = driver.find_element_by_xpath('//div[@class="BBB"]/p')
xpath_id = driver.find_element_by_xpath('//div[@id="AAA"]/p')
```

This is the general-purpose method, the XPath method; all of these approaches parse the page content and lock onto tags.
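Since these XPath strings all follow one pattern, a small helper (my own convenience function, not part of Selenium) can build them and avoid quoting mistakes:

```python
def xpath_by_attr(tag, attr, value, child=None):
    """Build an XPath such as //div[@class="BBB"]/p from its parts."""
    path = '//%s[@%s="%s"]' % (tag, attr, value)
    if child:
        path += '/' + child        # optional child step appended at the end
    return path

# usage with selenium:
# driver.find_element_by_xpath(xpath_by_attr('div', 'class', 'BBB', 'p'))
```

This keeps the double quotes around the attribute value inside single-quoted Python strings, which is where typos most often creep in.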

3. Performing operations:

Once we have locked onto the tag of a functional element, we can operate on it further, for example changing pages or triggering a search. For simulated keyboard input, refer to my other blog post, "Python Automation Crawler".
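As a quick taste of simulated keyboard input, here is a sketch of a login helper. The field names `'username'` and `'password'` are assumptions about the page, and `'\ue007'` is the underlying value of `Keys.ENTER`, used as a fallback so the logic can be exercised without a browser:

```python
try:
    from selenium.webdriver.common.keys import Keys
    ENTER = Keys.ENTER
except ImportError:          # lets the helper be tested without selenium installed
    ENTER = '\ue007'         # the key code that Keys.ENTER stands for

def fill_login(driver, user, password):
    """Type credentials into a login form and submit with the Enter key."""
    driver.find_element_by_name('username').send_keys(user)   # assumed field name
    pw = driver.find_element_by_name('password')              # assumed field name
    pw.send_keys(password)
    pw.send_keys(ENTER)   # press Enter instead of hunting for a submit button
```

Pressing Enter in the password field is often simpler than locating the submit button, though it assumes the form submits on Enter.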

Here we describe simulating mouse operations:

```python
elem = driver.find_element_by_xpath('//a[@id="TagName"]')
ActionChains(driver).double_click(elem).perform()
time.sleep(3)
```

For time's sake, I have only introduced clicking on the page with the left mouse button; for everything else, refer to the official documentation: Selenium WebDriver.

ActionChains(driver) binds the action chain to the browser session, double_click(elem) queues a double-click on the locked tag, and .perform() executes the queued actions.
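This queue-then-fire behavior is worth internalizing. The toy class below is a stand-in I wrote to illustrate the pattern, not Selenium's real implementation: each method records an action and returns self (which is what makes the chained call style possible), and nothing actually happens until perform() is called.

```python
class MiniActionChain:
    """Toy model of ActionChains: methods queue actions, perform() runs them."""
    def __init__(self):
        self._queue = []
        self.performed = []

    def double_click(self, target):
        self._queue.append(('double_click', target))
        return self   # returning self enables .double_click(...).perform()

    def perform(self):
        for action, target in self._queue:
            self.performed.append('%s on %s' % (action, target))
        self._queue = []

chain = MiniActionChain()
chain.double_click('next-page-link')
print(chain.performed)   # [] -- nothing fired yet, the click is only queued
chain.perform()
print(chain.performed)   # ['double_click on next-page-link']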

4. Getting the valuable data

The operations here chain lookups, much like XPath syntax does:

```python
quantity = driver.find_elements_by_tag_name('td')[3].text
hrefs = [a.get_attribute('href') for a in driver.find_elements_by_tag_name('a')]
```

Note the word elements here: find_elements (plural) returns every matching tag as a list, so to read an attribute such as href from each a tag, you must traverse the list.
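A small loop makes the traversal explicit. The helper below is my own; it works on anything exposing a get_attribute method, which includes the WebElements in a find_elements result, and it filters out anchors with no href (which get_attribute reports as None):

```python
def extract_hrefs(anchors):
    """Pull the href from each element in a find_elements_* result list."""
    hrefs = []
    for a in anchors:                 # elements (plural) means we must iterate
        href = a.get_attribute('href')
        if href is not None:          # <a> tags without an href yield None
            hrefs.append(href)
    return hrefs

# usage with selenium:
# extract_hrefs(driver.find_elements_by_tag_name('a'))
```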

5. Finally, the complete code in one piece:

```python
from selenium import webdriver
import time
import lxml.html as HTML           # imported in the original; unused in this function
from bs4 import BeautifulSoup      # imported in the original; unused in this function
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from pymongo import MongoClient, ASCENDING, DESCENDING
from selenium.webdriver.common.by import By

def parser():
    url = 'https://www.xxx.com'
    driver = webdriver.Chrome()
    driver.get(url)
    time.sleep(5)
    for i in range(1, 675):        # page through the result pages
        a = driver.find_element_by_xpath('//div[@class="AAA"]')
        tr = a.find_elements_by_tag_name('tr')
        for j in range(1, len(tr)):
            quantity = tr[j].find_elements_by_tag_name('td')[3].text
            producturl = (tr[j].find_elements_by_tag_name('td')[0]
                          .find_elements_by_tag_name('div')[1]
                          .find_element_by_tag_name('ul')
                          .find_element_by_tag_name('li')
                          .find_element_by_tag_name('a')
                          .get_attribute('href'))
            producturl_db(producturl, quantity)   # save the pair to the database
        elem = driver.find_element_by_xpath('//a[@id="Elenextpage"]')
        ActionChains(driver).double_click(elem).perform()
        time.sleep(3)

    driver.quit()
```
Selenium has a small gotcha when using XPath: you may have found the parent tag, but there are many such parents, tr for example. When you traverse them looking for td, you should still call find_elements_by_tag_name on each row element itself, because searching from the driver again restarts from the document root and you lose track of that parent. Something to pay attention to!
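To make the pitfall concrete, here is a sketch of walking a table row by row (the table structure is assumed). The key detail is that the inner search is made relative to each row element; calling driver.find_elements_by_tag_name('td') instead would restart at the document root, return every td on the page, and lose track of which row each cell belongs to.

```python
def cells_per_row(container):
    """Collect the text of each <td>, grouped under its parent <tr>."""
    rows = container.find_elements_by_tag_name('tr')
    table = []
    for row in rows:
        # Relative search: row.find_elements_* only looks INSIDE this row.
        tds = row.find_elements_by_tag_name('td')
        table.append([td.text for td in tds])
    return table
```

The helper accepts any object with a find_elements_by_tag_name method, so it works on a Selenium driver, on a located div, or on stand-in objects for testing.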
Finally, I wish you all the best!!!
