A quick write-up of a Python crawler for the New Third Board (新三板) section of the Zhongcai (CFI) data engine, which grabs the company profile of every listed stock. The URL is http://data.cfi.cn/data_ndkA0A1934A1935A1986A1995.html.
On simpler sites each page number has its own link, so you can work out the pattern from how the links change, generate the link for every page number, and crawl them one by one. On this site, however, the link does not change when you flip pages, so the plan was to watch the request that is sent when switching to the second page.
It turned out to be a GET request with a curpage parameter that seems to control the page number. But changing that parameter's value in the request link made no difference to the returned content, so I switched to the approach in the title: use selenium + BeautifulSoup to simulate clicking the page's "next page" button, flip through the pages, and crawl each page's content.
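For reference, a minimal sketch of that check, using the requests library (which the post itself does not use); the URL and its curpage parameter are the ones that appear in the crawl code further down:

# -*- coding: utf-8 -*-
# Rough check (not from the original post): fetch the list with two different
# curpage values and compare the responses; identical bodies would mean the
# GET parameter alone does not drive pagination for this request.
import requests

BASE = ('http://data.cfi.cn/cfidata.aspx?sortfd=&sortway=&curpage={page}'
        '&fr=content&ndk=A0A1934A1935A1986A1995&xztj=&mystock=')

r1 = requests.get(BASE.format(page=1))
r2 = requests.get(BASE.format(page=2))
print(r1.text == r2.text)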
First, the preparatory work: install the required packages. Open a command line and run pip install selenium and pip install beautifulsoup4.
Then download and install the ChromeDriver driver from https://sites.google.com/a/chromium.org/chromedriver/downloads. Remember to add it to the PATH environment variable, or simply put it in the working directory. (You can also use IE, PhantomJS, etc. instead of Chrome.)
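As a quick smoke test of the setup, a minimal sketch; the executable_path argument is the Selenium 3.x way of pointing at a local chromedriver, and the ./chromedriver path is only an example:

# -*- coding: utf-8 -*-
# Smoke test for the Selenium + ChromeDriver setup (Selenium 3.x style API).
from selenium import webdriver

# If chromedriver is on the PATH this is enough; otherwise pass an explicit
# path, e.g. webdriver.Chrome(executable_path='./chromedriver')
driver = webdriver.Chrome()
driver.get('http://data.cfi.cn/data_ndkA0A1934A1935A1986A1995.html')
print(driver.title)
driver.quit()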
Here we first crawl the homepage link of each stock; the code is as follows (written in Python 2):
# -*- coding: utf-8 -*-
from selenium import webdriver
from bs4 import BeautifulSoup
import time
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

def crawl(url):
    driver = webdriver.Chrome()
    driver.get(url)
    page = 0
    lst = []
    with open('./url.txt', 'a') as f:
        while page < 234:
            soup = BeautifulSoup(driver.page_source, "html.parser")
            print(soup)
            urls_tag = soup.find_all('a', target='_blank')
            print(urls_tag)
            for i in urls_tag:
                if i['href'] not in lst:
                    f.write(i['href'] + '\n')
                    lst.append(i['href'])
            # click the "下一页" (next page) button
            driver.find_element_by_xpath("//a[contains(text(), '下一页')]").click()
            time.sleep(2)
            page += 1
    return 'finished'

def main():
    url = 'http://data.cfi.cn/cfidata.aspx?sortfd=&sortway=&curpage=2&fr=content&ndk=A0A1934A1935A1986A1995&xztj=&mystock='
    crawl(url)

if __name__ == '__main__':
    main()
Running the code, however, always produced an error:
The error means that the button we are looking for cannot be found.
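For reference, a small sketch of guarding that click, assuming the failure is Selenium's NoSuchElementException (what an XPath that matches nothing raises) and reusing the driver created in crawl() above:

# Hypothetical defensive version of the click; driver is the instance
# created in crawl() above.
from selenium.common.exceptions import NoSuchElementException

try:
    driver.find_element_by_xpath("//a[contains(text(), '下一页')]").click()
except NoSuchElementException:
    print('Next-page button not found in the current page source')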
So we went to check the source code of the Web page:
We found that the page is split into several frames, so we guessed that we need to switch frames. The links we want to crawl live in the frame named "content", so we add one line of code: driver.switch_to.frame('content')
def crawl(url):
    driver = webdriver.Chrome()
    driver.get(url)
    # switch into the frame that holds the stock list
    driver.switch_to.frame('content')
    page = 0
    lst = []
    with open('./url.txt', 'a') as f:
        while page < 234:
            soup = BeautifulSoup(driver.page_source, "html.parser")
            print(soup)
            urls_tag = soup.find_all('a', target='_blank')
            print(urls_tag)
            for i in urls_tag:
                if i['href'] not in lst:
                    f.write(i['href'] + '\n')
                    lst.append(i['href'])
            driver.find_element_by_xpath("//a[contains(text(), '下一页')]").click()
            time.sleep(2)
            page += 1
    return 'finished'
At this point, running it gives:
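The code above stops at collecting the links. As a rough sketch of the follow-up step, we can read url.txt (the file written above) and fetch each company's profile page; the profile pages' structure and encoding are assumptions here, so the sketch only dumps the raw text:

# -*- coding: utf-8 -*-
# Hypothetical follow-up (not from the original post): fetch each saved link.
import requests
from bs4 import BeautifulSoup

with open('./url.txt') as f:
    links = [line.strip() for line in f if line.strip()]

for link in links[:5]:            # first few links as a demo
    # note: if the saved hrefs are relative, join them with the site root first
    resp = requests.get(link)
    resp.encoding = 'gb2312'      # assumption: the site serves GB-encoded pages
    soup = BeautifulSoup(resp.text, 'html.parser')
    print(soup.get_text()[:200])  # print the start of the page text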
Reference posts:
http://unclechen.github.io/2016/12/11/python%E5%88%A9%E7%94%A8beautifulsoup+selenium%E8%87%AA%E5%8A%A8%E7%bf%bb%e9%a1%b5%e6%8a%93%e5%8f%96%e7%bd%91%e9%a1%b5%e5%86%85%e5%ae%b9/
http://www.cnblogs.com/liyuhang/p/6661835.html
In short: use Selenium WebDriver + BeautifulSoup, plus a switch into the right frame, to simulate clicking the page's "next page" button and crawl the web data.