A quick write-up of a Python crawler for the New Third Board (新三板) section of the Zhongcai (CFI) data engine, which grabs the company profile of every listed stock. The URL is http://data.cfi.cn/data_ndkA0A1934A1935A1986A1995.html.
On simpler sites each page number has its own link, so you can work out the pattern from how the links change, generate the link for every page number, and crawl them one by one. On this site, however, the link does not change when you flip pages, so the plan was to watch the request that is sent when switching to the second page.
It turned out to be a GET request with a curpage parameter that seems to control the page number. But changing that parameter's value in the request link made no difference to the returned content, so I switched to the approach in the title: use selenium + BeautifulSoup to simulate clicking the page's "next page" button, flip through the pages, and crawl each page's content.
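For reference, a minimal sketch of that check, using the requests library (which the post itself does not use); the URL and its curpage parameter are the ones that appear in the crawl code further down:

# -*- coding: utf-8 -*-
# Rough check (not from the original post): fetch the list with two different
# curpage values and compare the responses; identical bodies would mean the
# GET parameter alone does not drive pagination for this request.
import requests

BASE = ('http://data.cfi.cn/cfidata.aspx?sortfd=&sortway=&curpage={page}'
        '&fr=content&ndk=A0A1934A1935A1986A1995&xztj=&mystock=')

r1 = requests.get(BASE.format(page=1))
r2 = requests.get(BASE.format(page=2))
print(r1.text == r2.text)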
First, the preparatory work: install the required packages. Open a command line and run pip install selenium and pip install beautifulsoup4.
Then download and install the ChromeDriver driver from https://sites.google.com/a/chromium.org/chromedriver/downloads. Remember to add it to the PATH environment variable, or simply put it in the working directory. (You can also use IE, PhantomJS, etc. instead of Chrome.)
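As a quick smoke test of the setup, a minimal sketch; the executable_path argument is the Selenium 3.x way of pointing at a local chromedriver, and the ./chromedriver path is only an example:

# -*- coding: utf-8 -*-
# Smoke test for the Selenium + ChromeDriver setup (Selenium 3.x style API).
from selenium import webdriver

# If chromedriver is on the PATH this is enough; otherwise pass an explicit
# path, e.g. webdriver.Chrome(executable_path='./chromedriver')
driver = webdriver.Chrome()
driver.get('http://data.cfi.cn/data_ndkA0A1934A1935A1986A1995.html')
print(driver.title)
driver.quit()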
Here we first crawl the homepage link of each stock; the code is as follows (written in Python 2):
# -*- coding: utf-8 -*-
from selenium import webdriver
from bs4 import BeautifulSoup
import time
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

def crawl(url):
    driver = webdriver.Chrome()
    driver.get(url)
    page = 0
    lst = []
    with open('./url.txt', 'a') as f:
        while page < 234:
            soup = BeautifulSoup(driver.page_source, "html.parser")
            print(soup)
            urls_tag = soup.find_all('a', target='_blank')
            print(urls_tag)
            for i in urls_tag:
                if i['href'] not in lst:
                    f.write(i['href'] + '\n')
                    lst.append(i['href'])
            # click the "下一页" (next page) button
            driver.find_element_by_xpath("//a[contains(text(), '下一页')]").click()
            time.sleep(2)
            page += 1
    return 'finished'

def main():
    url = 'http://data.cfi.cn/cfidata.aspx?sortfd=&sortway=&curpage=2&fr=content&ndk=A0A1934A1935A1986A1995&xztj=&mystock='
    crawl(url)

if __name__ == '__main__':
    main()
Running the code, however, always produced an error:
The error means that the button we are looking for cannot be found.
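For reference, a small sketch of guarding that click, assuming the failure is Selenium's NoSuchElementException (what an XPath that matches nothing raises) and reusing the driver created in crawl() above:

# Hypothetical defensive version of the click; driver is the instance
# created in crawl() above.
from selenium.common.exceptions import NoSuchElementException

try:
    driver.find_element_by_xpath("//a[contains(text(), '下一页')]").click()
except NoSuchElementException:
    print('Next-page button not found in the current page source')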
So we went to check the source code of the Web page:
We found that the page is split into several frames, so we guessed that we need to switch frames. The links we want to crawl live in the frame named "content", so we add one line of code: driver.switch_to.frame('content')
def crawl(url):
    driver = webdriver.Chrome()
    driver.get(url)
    # switch into the frame that holds the stock list
    driver.switch_to.frame('content')
    page = 0
    lst = []
    with open('./url.txt', 'a') as f:
        while page < 234:
            soup = BeautifulSoup(driver.page_source, "html.parser")
            print(soup)
            urls_tag = soup.find_all('a', target='_blank')
            print(urls_tag)
            for i in urls_tag:
                if i['href'] not in lst:
                    f.write(i['href'] + '\n')
                    lst.append(i['href'])
            driver.find_element_by_xpath("//a[contains(text(), '下一页')]").click()
            time.sleep(2)
            page += 1
    return 'finished'
At this point, running it gives:
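The code above stops at collecting the links. As a rough sketch of the follow-up step, we can read url.txt (the file written above) and fetch each company's profile page; the profile pages' structure and encoding are assumptions here, so the sketch only dumps the raw text:

# -*- coding: utf-8 -*-
# Hypothetical follow-up (not from the original post): fetch each saved link.
import requests
from bs4 import BeautifulSoup

with open('./url.txt') as f:
    links = [line.strip() for line in f if line.strip()]

for link in links[:5]:            # first few links as a demo
    # note: if the saved hrefs are relative, join them with the site root first
    resp = requests.get(link)
    resp.encoding = 'gb2312'      # assumption: the site serves GB-encoded pages
    soup = BeautifulSoup(resp.text, 'html.parser')
    print(soup.get_text()[:200])  # print the start of the page text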
Reference posts:
http://unclechen.github.io/2016/12/11/python%E5%88%A9%E7%94%A8beautifulsoup+selenium%E8%87%AA%E5%8A%A8%E7%bf%bb%e9%a1%b5%e6%8a%93%e5%8f%96%e7%bd%91%e9%a1%b5%e5%86%85%e5%ae%b9/
http://www.cnblogs.com/liyuhang/p/6661835.html
In short: use Selenium WebDriver + BeautifulSoup, plus a switch into the right frame, to simulate clicking the page's "next page" button and crawl the web data.