Using Selenium WebDriver + BeautifulSoup with frame switching: simulating clicks on a page's "next page" button to crawl web data

Source: Internet
Author: User

This post records a quick Python crawler implementation. The goal is to crawl the company profile of every stock in the New Third Board section of the Zhongcai network data engine; the URL is http://data.cfi.cn/data_ndkA0A1934A1935A1986A1995.html.

On simpler sites, each page number has its own link, so you can look at how the links change, find the pattern, generate the link for every page number, and crawl them all. On this site, however, the link does not change when you switch pages, so I decided to observe the request sent when loading the second page.
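For sites of that simpler kind, the usual approach is just to generate all the page links up front; a minimal sketch of that idea (the URL pattern here is purely hypothetical):

# Hypothetical URL pattern: the page number is embedded in the link itself.
page_urls = ['http://example.com/list_page%d.html' % p for p in range(1, 235)]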

It turned out to be a GET request with a curpage parameter that seemingly controls the page number. But changing that parameter's value in the request link did not change the returned page, so I switched to the approach in the title: use Selenium + BeautifulSoup to simulate clicking the page's "next page" button, flipping through the pages and crawling each one's content.
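To double-check that curpage really is ignored, one can fetch the same URL with two different values and compare the responses; a minimal sketch using the requests library (not used in the original post; the parameter names come from the observed request):

import requests

base = 'http://data.cfi.cn/cfidata.aspx'
params = {'sortfd': '', 'sortway': '', 'fr': 'content',
          'ndk': 'A0A1934A1935A1986A1995', 'xztj': '', 'mystock': ''}

pages = {}
for curpage in (2, 3):
    params['curpage'] = curpage
    pages[curpage] = requests.get(base, params=params).text

# If the two bodies are identical, the server is ignoring curpage.
print(pages[2] == pages[3])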

First, the preparatory work: install the required packages. Open a command line and run pip install selenium and pip install beautifulsoup4.

Next, download and install the ChromeDriver driver from https://sites.google.com/a/chromium.org/chromedriver/downloads. Remember to add it to the PATH environment variable, or just place it in your working directory. (You can also use IE, PhantomJS, etc. instead of Chrome.)
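If you would rather not edit environment variables, Selenium can also be pointed at the driver binary directly; a minimal sketch (the path is a placeholder, and executable_path is the keyword accepted by the Selenium 2/3-era bindings this post uses):

from selenium import webdriver

# Placeholder path: point this at wherever chromedriver was unpacked.
driver = webdriver.Chrome(executable_path='./chromedriver')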

First we crawl the homepage link for each stock. The code is as follows (written in Python 2):

# -*- coding: utf-8 -*-
from selenium import webdriver
from bs4 import BeautifulSoup
import time
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

def crawl(url):
    driver = webdriver.Chrome()
    driver.get(url)
    page = 0
    lst = []
    with open('./url.txt', 'a') as f:
        while page < 234:
            soup = BeautifulSoup(driver.page_source, "html.parser")
            print(soup)
            # collect every link that opens in a new tab
            urls_tag = soup.find_all('a', target='_blank')
            print(urls_tag)
            for i in urls_tag:
                if i['href'] not in lst:
                    f.write(i['href'] + '\n')
                    lst.append(i['href'])
            # the text must match the site's actual button label (likely Chinese on the live site)
            driver.find_element_by_xpath("//a[contains(text(), 'next page')]").click()
            time.sleep(2)
            page += 1  # advance the counter so the loop terminates
    return 'finished'

def main():
    url = 'http://data.cfi.cn/cfidata.aspx?sortfd=&sortway=&curpage=2&fr=content&ndk=A0A1934A1935A1986A1995&xztj=&mystock='
    crawl(url)

if __name__ == '__main__':
    main()

Running this code, we always get an error:

The error means that the button we are looking for cannot be found.

So we checked the source code of the web page:

We found that the page is divided into frames, so we guessed we needed to switch into the right frame. The links we want to crawl live in the frame named "content", so we add one line of code: driver.switch_to.frame('content')
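If it isn't obvious from the source which frame holds the data, one quick check is to list every frame on the page along with its name attribute; a minimal sketch using standard Selenium calls (not from the original post):

# Print the name of every frame/iframe on the current page.
for tag in ('frame', 'iframe'):
    for el in driver.find_elements_by_tag_name(tag):
        print(tag + ': ' + (el.get_attribute('name') or '(unnamed)'))

With the frame switch added, the crawl function becomes: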

def crawl(url):
    driver = webdriver.Chrome()
    driver.get(url)
    driver.switch_to.frame('content')  # jump into the frame that holds the data table
    page = 0
    lst = []
    with open('./url.txt', 'a') as f:
        while page < 234:
            soup = BeautifulSoup(driver.page_source, "html.parser")
            print(soup)
            urls_tag = soup.find_all('a', target='_blank')
            print(urls_tag)
            for i in urls_tag:
                if i['href'] not in lst:
                    f.write(i['href'] + '\n')
                    lst.append(i['href'])
            driver.find_element_by_xpath("//a[contains(text(), 'next page')]").click()
            time.sleep(2)
            page += 1  # advance the counter so the loop terminates
    return 'finished'

At this point, the code runs successfully.
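As a refinement over the fixed time.sleep(2), Selenium's explicit waits can block only until the "next page" button is actually clickable; a minimal sketch using the standard selenium.webdriver.support API (not part of the original post):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the button to become clickable, then click it.
wait = WebDriverWait(driver, 10)
next_btn = wait.until(
    EC.element_to_be_clickable((By.XPATH, "//a[contains(text(), 'next page')]")))
next_btn.click()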

For reference, see these posts:

http://unclechen.github.io/2016/12/11/python%E5%88%A9%E7%94%A8beautifulsoup+selenium%E8%87%AA%E5%8A%A8%E7%BF%BB%E9%A1%B5%E6%8A%93%E5%8F%96%E7%BD%91%E9%A1%B5%E5%86%85%E5%AE%B9/

http://www.cnblogs.com/liyuhang/p/6661835.html
