Python Crawler: Stock Data Crawling

Source: Internet
Author: User
Tags: xpath, sublime text editor

The previous article covered crawling dividend-related stock data; this article continues with crawling data spread across multiple web pages. The target is to capture the full financial data of Ping An Bank (000001) from 1989 to the present.

Data Source Analysis: Address Analysis

http://money.finance.sina.com.cn/corp/go.php/vFD_FinancialGuideLine/stockid/000001/ctrl/2017/displaytype/4.phtml

Open this address in a browser (on a PC) and the financial data is displayed. The address follows a common format:
(1) 000001 is the stock code; replace it with another code to get that stock's financial data;
(2) 2017 is the year of the financial data, so this address shows the figures for fiscal year 2017. The full-year 2017 data will not be released until around March 2018, so it is not available for the time being. Replace the year to view data for other years.
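As a small illustration of this format (a minimal sketch; the template name FINIANCE_SINA_URL matches the source code at the end of the article):

import urllib  # standard library only; nothing beyond string formatting is needed here

# Address template: first %s is the stock code, second %s is the year.
FINIANCE_SINA_URL = 'http://money.finance.sina.com.cn/corp/go.php/vFD_FinancialGuideLine/stockid/%s/ctrl/%s/displaytype/4.phtml'

url = FINIANCE_SINA_URL % ('000001', 2017)   # Ping An Bank, fiscal year 2017
print(url)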

Crawl Analysis

(1) Right-click to view the page source and you can see the target financial data directly in the HTML, so the core method is the same as before. The key content is located by the XPath: the table with id = "BalanceSheetNewTable0" (see the extraction sketch after point (2) below).

(2) Unlike the dividend data captured in the previous article, which fit on a single page, the financial data here spans multiple pages (one per year), and the number of pages differs between stocks. This crawl therefore has to solve two additional problems: first, splicing the data from different years together; second, determining the oldest available year so the crawler knows when to stop.
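A sketch of the extraction step described in point (1), assuming the table id is spelled BalanceSheetNewTable0 as in the source code at the end of the article:

from urllib.request import urlopen
import lxml.html
from lxml import etree

# Fetch one year's page (the Sina page is GBK-encoded) and pull out the
# financial-data table by its id.
url = ('http://money.finance.sina.com.cn/corp/go.php/vFD_FinancialGuideLine/'
       'stockid/000001/ctrl/2017/displaytype/4.phtml')
text = urlopen(url, timeout=5).read().decode('GBK')

doc = lxml.html.fromstring(text)
nodes = doc.xpath('//table[@id="BalanceSheetNewTable0"]')
table_html = ''.join(etree.tostring(node, encoding='unicode') for node in nodes)
print(table_html[:200])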

Crawler Program

Operating environment: Windows 10; Python 3; Sublime Text editor.
(1) First, the program (shown as a screenshot here). The relevant explanations are in the code comments, and the full source code is at the end of the article.

The green boxes in the screenshot mark the important differences from the previous article's code.
(i) Merging the previously accumulated DataFrame (dataarr) with the newly obtained DataFrame (df), using pandas' own concat function:

dataarr = [dataarr, df]
dataarr = pd.concat(dataarr, axis=1, join='inner')
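A tiny illustration of what join='inner' does in this snippet, using hypothetical toy DataFrames (not the real financial data): with axis=1, only the row labels (indicator names) present in both years survive.

import pandas as pd

# Two toy DataFrames indexed by indicator name, one column per year.
df_2017 = pd.DataFrame({'2017': [100, 200]}, index=['net profit', 'total assets'])
df_2016 = pd.DataFrame({'2016': [90, 180, 50]}, index=['net profit', 'total assets', 'goodwill'])

merged = pd.concat([df_2017, df_2016], axis=1, join='inner')
print(merged)   # only 'net profit' and 'total assets' are kept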

(ii) Concatenating when one side is empty, or when a year has no data, raises an exception, and this exception is used to judge the first and last available years. See the blue-box code; a simplified sketch of this judgment follows.
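A simplified sketch of that start/end judgment (same idea as the full source below, not a drop-in replacement; fetch_year is a hypothetical stand-in for the download-and-parse step):

import datetime

def crawl_backwards(code, fetch_year):
    # fetch_year(code, year) is assumed to raise when that year has no data table.
    this_year = datetime.date.today().year
    year = this_year
    frames = []
    has_data = True
    while has_data:
        try:
            frames.append(fetch_year(code, year))
        except Exception:
            # The current year may simply not be published yet (keep going);
            # any older failure means we are past the oldest available year (stop).
            has_data = (year == this_year)
        year -= 1
    return frames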

(2) The run result. Only part of it is shown here; more processing is needed before this data is really usable.
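As a hypothetical follow-up (not from the original article), one possible first processing step is coercing the crawled cells to numbers, assuming test is the DataFrame returned by read_html_sina_finiance1 in the source code below:

import pandas as pd

# Cells scraped from the HTML table may arrive as text; coerce them to numbers,
# turning anything non-numeric into NaN.
numeric = test.apply(pd.to_numeric, errors='coerce')
print(numeric.dtypes)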


Summary

Python crawler programming is very concise: the core code that actually crawls the desired data takes only a few lines. The rest is auxiliary, there to make the function easier to use or the data easier to read.
Merging DataFrames of the same format is realized with pandas' concat, and the exception raised when a year has no data is also used to judge the start and end years.
A better way to judge the start and end: the page source actually contains the addresses of all available years, so they could be collected first and then crawled one by one.
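A hypothetical sketch of that improvement, assuming the year links in the page source keep the same /ctrl/<year>/ pattern as the address template above (the exact link markup is not shown in this article, so this is an assumption):

import re
from urllib.request import urlopen

# Pull every '/ctrl/<year>/' address fragment out of one page's source, so each
# year can be fetched directly instead of probing backwards until an exception.
url = ('http://money.finance.sina.com.cn/corp/go.php/vFD_FinancialGuideLine/'
       'stockid/000001/ctrl/2017/displaytype/4.phtml')
text = urlopen(url, timeout=5).read().decode('GBK')
years = sorted(set(re.findall(r'/ctrl/(\d{4})/', text)))
print(years)   # all years listed in the page source, oldest first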

Source Code
import pandas as pd
import lxml.html
from lxml import etree
import numpy as np
from pandas.io.html import read_html
from io import StringIO
try:
    from urllib.request import urlopen, Request
except ImportError:
    from urllib2 import urlopen, Request
import time
import sys

# Address template: first %s is the stock code, second %s is the year
FINIANCE_SINA_URL = 'http://money.finance.sina.com.cn/corp/go.php/vFD_FinancialGuideLine/stockid/%s/ctrl/%s/displaytype/4.phtml'


def read_html_sina_finiance1(code):
    has_data = True
    # Get the current year
    today = pd.to_datetime(time.strftime("%x"))
    year = today.year
    # Store the data in a pandas DataFrame
    dataarr = pd.DataFrame()
    while has_data:
        # Sina Finance page for this stock and year
        furl = FINIANCE_SINA_URL % (code, year)
        # Fetch the data, standard processing
        request = Request(furl)
        text = urlopen(request, timeout=5).read()
        text = text.decode('GBK')
        html = lxml.html.parse(StringIO(text))
        # Separate out the target table
        res = html.xpath("//table[@id=\"BalanceSheetNewTable0\"]")
        sarr = [etree.tostring(node).decode('GBK') for node in res]
        # Join the fragments into a single HTML table string
        sarr = ''.join(sarr)
        sarr = '<table>%s</table>' % sarr
        # Roll back one year
        year -= 1
        # Judge the last page, based on whether it still has data
        try:
            # Read the data into a DataFrame and splice it onto the result
            df = read_html(sarr)[0]
            df.columns = range(0, df.shape[1])
            df = df.set_index(df.columns[0])
            dataarr = [dataarr, df]
            dataarr = pd.concat(dataarr, axis=1, join='inner')
        except:
            if (year + 1) == today.year:
                has_data = True
            else:
                has_data = False
    dataarr = dataarr.T
    try:
        dataarr = dataarr.set_index(dataarr.columns[0])
    except:
        dataarr = dataarr
    return dataarr


test = read_html_sina_finiance1('000001')
print(test)
