Python Crawler: Stock Data Crawling

Source: Internet
Author: User
Tags: xpath, sublime text editor

The previous article covered crawling dividend-related stock data; this article continues with crawling data spread across multiple web pages. The target is to capture the full financial data of Ping An Bank (000001) from 1989 to the present.

Data Source Analysis: Address Analysis

http://money.finance.sina.com.cn/corp/go.php/vFD_FinancialGuideLine/stockid/000001/ctrl/2017/displaytype/4.phtml

Open this address in a browser (on a PC) and the financial data is displayed. The address follows a common format:
(1) 000001 is the stock code; replace it with another code to get that stock's financial data;
(2) 2017 is the year of the financial data, so this address shows the figures for fiscal year 2017. The full-year 2017 data will not be released until around March 2018, so it is not available for the time being. Replace the year to view data for other years.
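As a small illustration of this format (a minimal sketch; the template name FINIANCE_SINA_URL matches the source code at the end of the article):

import urllib  # standard library only; nothing beyond string formatting is needed here

# Address template: first %s is the stock code, second %s is the year.
FINIANCE_SINA_URL = 'http://money.finance.sina.com.cn/corp/go.php/vFD_FinancialGuideLine/stockid/%s/ctrl/%s/displaytype/4.phtml'

url = FINIANCE_SINA_URL % ('000001', 2017)   # Ping An Bank, fiscal year 2017
print(url)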

Crawl Analysis

(1) Right-click to view the page source and you can see the target financial data directly in the HTML, so the core method is the same as before. The key content is located by the XPath: the table with id = "BalanceSheetNewTable0" (see the extraction sketch after point (2) below).

(2) Unlike the dividend data captured in the previous article, which fit on a single page, the financial data here spans multiple pages (one per year), and the number of pages differs between stocks. This crawl therefore has to solve two additional problems: first, splicing the data from different years together; second, determining the oldest available year so the crawler knows when to stop.
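A sketch of the extraction step described in point (1), assuming the table id is spelled BalanceSheetNewTable0 as in the source code at the end of the article:

from urllib.request import urlopen
import lxml.html
from lxml import etree

# Fetch one year's page (the Sina page is GBK-encoded) and pull out the
# financial-data table by its id.
url = ('http://money.finance.sina.com.cn/corp/go.php/vFD_FinancialGuideLine/'
       'stockid/000001/ctrl/2017/displaytype/4.phtml')
text = urlopen(url, timeout=5).read().decode('GBK')

doc = lxml.html.fromstring(text)
nodes = doc.xpath('//table[@id="BalanceSheetNewTable0"]')
table_html = ''.join(etree.tostring(node, encoding='unicode') for node in nodes)
print(table_html[:200])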

Crawler Program

Operating environment: Windows 10; Python 3; Sublime Text editor.
(1) First, the program (shown as a screenshot here). The relevant explanations are in the code comments, and the full source code is at the end of the article.

The green boxes in the screenshot mark the important differences from the previous article's code.
(i) Merging the previously accumulated DataFrame (dataarr) with the newly obtained DataFrame (df), using pandas' own concat function:

dataarr = [dataarr, df]
dataarr = pd.concat(dataarr, axis=1, join='inner')
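A tiny illustration of what join='inner' does in this snippet, using hypothetical toy DataFrames (not the real financial data): with axis=1, only the row labels (indicator names) present in both years survive.

import pandas as pd

# Two toy DataFrames indexed by indicator name, one column per year.
df_2017 = pd.DataFrame({'2017': [100, 200]}, index=['net profit', 'total assets'])
df_2016 = pd.DataFrame({'2016': [90, 180, 50]}, index=['net profit', 'total assets', 'goodwill'])

merged = pd.concat([df_2017, df_2016], axis=1, join='inner')
print(merged)   # only 'net profit' and 'total assets' are kept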

(ii) Concatenating when one side is empty, or when a year has no data, raises an exception, and this exception is used to judge the first and last available years. See the blue-box code; a simplified sketch of this judgment follows.
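A simplified sketch of that start/end judgment (same idea as the full source below, not a drop-in replacement; fetch_year is a hypothetical stand-in for the download-and-parse step):

import datetime

def crawl_backwards(code, fetch_year):
    # fetch_year(code, year) is assumed to raise when that year has no data table.
    this_year = datetime.date.today().year
    year = this_year
    frames = []
    has_data = True
    while has_data:
        try:
            frames.append(fetch_year(code, year))
        except Exception:
            # The current year may simply not be published yet (keep going);
            # any older failure means we are past the oldest available year (stop).
            has_data = (year == this_year)
        year -= 1
    return frames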

(2) The run result. Only part of it is shown here; more processing is needed before this data is really usable.
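As a hypothetical follow-up (not from the original article), one possible first processing step is coercing the crawled cells to numbers, assuming test is the DataFrame returned by read_html_sina_finiance1 in the source code below:

import pandas as pd

# Cells scraped from the HTML table may arrive as text; coerce them to numbers,
# turning anything non-numeric into NaN.
numeric = test.apply(pd.to_numeric, errors='coerce')
print(numeric.dtypes)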


Summary

Python crawler programming is very concise: the core code that actually crawls the desired data takes only a few lines. The rest is auxiliary, there to make the function easier to use or the data easier to read.
Merging DataFrames of the same format is realized with pandas' concat, and the exception raised when a year has no data is also used to judge the start and end years.
A better way to judge the start and end: the page source actually contains the addresses of all available years, so they could be collected first and then crawled one by one.
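A hypothetical sketch of that improvement, assuming the year links in the page source keep the same /ctrl/<year>/ pattern as the address template above (the exact link markup is not shown in this article, so this is an assumption):

import re
from urllib.request import urlopen

# Pull every '/ctrl/<year>/' address fragment out of one page's source, so each
# year can be fetched directly instead of probing backwards until an exception.
url = ('http://money.finance.sina.com.cn/corp/go.php/vFD_FinancialGuideLine/'
       'stockid/000001/ctrl/2017/displaytype/4.phtml')
text = urlopen(url, timeout=5).read().decode('GBK')
years = sorted(set(re.findall(r'/ctrl/(\d{4})/', text)))
print(years)   # all years listed in the page source, oldest first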

Source Code
import pandas as pd
import lxml.html
from lxml import etree
import numpy as np
from pandas.io.html import read_html
from io import StringIO
try:
    from urllib.request import urlopen, Request
except ImportError:
    from urllib2 import urlopen, Request
import time
import sys

# Address template: first %s is the stock code, second %s is the year
FINIANCE_SINA_URL = 'http://money.finance.sina.com.cn/corp/go.php/vFD_FinancialGuideLine/stockid/%s/ctrl/%s/displaytype/4.phtml'


def read_html_sina_finiance1(code):
    has_data = True
    # Get the current year
    today = pd.to_datetime(time.strftime("%x"))
    year = today.year
    # Store the data in a pandas DataFrame
    dataarr = pd.DataFrame()
    while has_data:
        # Sina Finance page for this stock and year
        furl = FINIANCE_SINA_URL % (code, year)
        # Fetch the data, standard processing
        request = Request(furl)
        text = urlopen(request, timeout=5).read()
        text = text.decode('GBK')
        html = lxml.html.parse(StringIO(text))
        # Separate out the target table
        res = html.xpath("//table[@id=\"BalanceSheetNewTable0\"]")
        sarr = [etree.tostring(node).decode('GBK') for node in res]
        # Join the fragments into a single HTML table string
        sarr = ''.join(sarr)
        sarr = '<table>%s</table>' % sarr
        # Roll back one year
        year -= 1
        # Judge the last page, based on whether it still has data
        try:
            # Read the data into a DataFrame and splice it onto the result
            df = read_html(sarr)[0]
            df.columns = range(0, df.shape[1])
            df = df.set_index(df.columns[0])
            dataarr = [dataarr, df]
            dataarr = pd.concat(dataarr, axis=1, join='inner')
        except:
            if (year + 1) == today.year:
                has_data = True
            else:
                has_data = False
    dataarr = dataarr.T
    try:
        dataarr = dataarr.set_index(dataarr.columns[0])
    except:
        dataarr = dataarr
    return dataarr


test = read_html_sina_finiance1('000001')
print(test)
