The stock market is the hot topic of 2015. A colleague's friend built a simple page for venting about stocks, and that single page reaches 300,000+ PV, roughly what this blog gets in a whole year. So, from a technical point of view, I will write a few articles here on the technical side of stocks. We start with obtaining the list of A-shares.
Objective: obtain the current A-share list for the Shanghai and Shenzhen stock exchanges.
I. Getting the list from the official sites
There are two official sites:
1. Shanghai Stock Exchange (SSE) official website
2. Shenzhen Stock Exchange (SZSE) official website
The difference is that the Shenzhen exchange offers an Excel export directly, while the SSE is more of a pain: it provides no download page, so the data has to be scraped from its pages. Analysing the pages shows that the whole stock list lives in JS files, as follows:
http://www.sse.com.cn/js/common/ssesuggestdata.js (A-shares + B-shares)
http://www.sse.com.cn/js/common/ssesuggestEbonddata.js (convertible bonds)
Because we only care about A-shares, we take only the entries in the first JS file whose code begins with 60. The JS file can be fetched with curl or wget and the list extracted with some simple shell processing:
# Data format in the JS file
function get_data(){
    var _t = new Array();
    _t.push({val:"600000",val2:"Pudong FA Bank",val3:"Pfyx"});
    _t.push({val:"600004",val2:"Baiyun Airport",val3:"BYJC"});
    _t.push({val:"600005",val2:"Wisco Shares",val3:"WGGF"});
    _t.push({val:"600006",val2:"Dongfeng Motor",val3:"DFQC"});
..............................
# Output after shell processing
# by the Road of Transportation (www.361way.com)
[root@361way ~]# wget http://www.sse.com.cn/js/common/ssesuggestdata.js
[root@361way ~]# grep push ssesuggestdata.js | sed -E 's/.*val: *"([^"]*)", *val2: *"([^"]*)", *val3: *"([^"]*)".*/\1 \2 \3/' | grep ^60
600000 Pudong FA Bank Pfyx
600004 Baiyun Airport BYJC
600005 Wisco Shares WGGF
600006 Dongfeng Motor DFQC
........................
This approach is simple and quick. You can of course also drive a browser with Selenium + Python to scrape the pages; that is covered separately in section III.
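The same extraction can also be sketched in Python rather than shell. This is only a minimal sketch: the function name `parse_sse_js` and the `prefix` parameter are made up for illustration, the entry format is assumed to match the `_t.push({val:...,val2:...,val3:...})` lines shown above, and the second sample entry is invented test data.

```python
import re

# Matches one entry such as {val:"600000",val2:"...",val3:"..."}.
# Optional whitespace after ':' and around ',' is tolerated.
ENTRY_RE = re.compile(
    r'val:\s*"(?P<code>[^"]*)"\s*,\s*'
    r'val2:\s*"(?P<name>[^"]*)"\s*,\s*'
    r'val3:\s*"(?P<abbr>[^"]*)"'
)

def parse_sse_js(js_text, prefix="60"):
    """Return (code, name, abbreviation) tuples whose code starts with prefix."""
    rows = []
    for m in ENTRY_RE.finditer(js_text):
        if m.group("code").startswith(prefix):
            rows.append((m.group("code"), m.group("name"), m.group("abbr")))
    return rows

if __name__ == "__main__":
    # The first entry is taken from the listing above; the second is hypothetical.
    sample = '''
    _t.push({val:"600000",val2:"Pudong FA Bank",val3:"Pfyx"});
    _t.push({val:"900901",val2:"Hypothetical B Share",val3:"HBS"});
    '''
    for row in parse_sse_js(sample):
        print(" ".join(row))
```

Matching the three fields with named groups avoids depending on the exact characters around them, so names that contain spaces or Latin letters survive intact.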
II. Getting the list from third-party sites
With the official sites, the data has to be collected from the two exchanges separately. Many third-party sites, however, pay both exchanges their "protection fee", so the data can be fetched from them directly, with the Shanghai and Shenzhen A-share data bundled together. The four domestic sites that do this relatively well are:
1. Tencent Securities -- http://stockapp.finance.qq.com/mstats/#mod=list
2. Sina Finance -- http://finance.sina.com.cn/data/#stock-schq-hsgs
3. Phoenix Finance -- http://app.finance.ifeng.com/list/stock.php?t=ha
4. Eastmoney -- http://quote.eastmoney.com/center/list.html#33
Of the four, the penguin (Tencent) is the most user-friendly: besides supporting several sort orders, it also offers an Excel export that covers all Shanghai and Shenzhen A-shares in one go. I have never been fond of the fat penguin, but the fact is that it does this well. The other three require web scraping.
III. Capturing the data with Selenium + Python
This is the clumsiest of the methods and the slowest way to get the data. I do not recommend it unless there is no alternative (use requests, urllib2 and similar modules wherever possible). Still, Selenium is a genuinely powerful module, widely used for automated testing and one of the best crawling environments, so it is included here as a learning exercise. The code first:
[root@localhost stock]# cat get_sh.py
# -*- encoding: utf-8 -*-
# by the Road of Transportation (361way.com)
import sys
import cPickle
#import pickle
import selenium
from pyvirtualdisplay import Display

# Start a virtual X display so Firefox can run on a headless server
display = Display(visible=0, size=(1024, 768))
display.start()

from selenium.webdriver.support.ui import WebDriverWait  # available since 2.4.0
# from selenium.common.exceptions import TimeoutException
# from selenium.webdriver.support import expected_conditions as EC  # available since 2.26.0

# Element IDs and class names are case-sensitive; verify them against the live page markup.

def wait_condition_01(driver):
    # The pagination input is present once the next page has been rendered
    return driver.find_element_by_id('Datelist_container_pageid')

def extract_table(driver, stocklist):
    # Append every table row except the header to stocklist
    tag_table = driver.find_element_by_class_name("TableStyle")
    tabletext = tag_table.text
    stocklist.extend(tabletext.split('\n')[1:])

driver = selenium.webdriver.Firefox()
driver.get("http://www.sse.com.cn/assortment/stock/list/name/")
stocklist = []
extract_table(driver=driver, stocklist=stocklist)
tag_meta = driver.find_element_by_id("Staticpagination")
attr_total = int(tag_meta.get_attribute("Total"))
attr_pagecount = int(tag_meta.get_attribute("PageCount"))

# Page through the remaining pages and extract their content
for pagenr in range(2, attr_pagecount + 1):
    id_input = 'Datelist_container_pageid' if pagenr > 2 else 'Xsgf_pageid'
    id_button = 'Datelist_container_togo' if pagenr > 2 else 'Xsgf_togo'
    tag_input = driver.find_element_by_id(id_input)
    tag_button = driver.find_element_by_id(id_button)
    tag_input.send_keys(str(pagenr))
    tag_button.click()
    # WebDriverWait requires a timeout in seconds; 10 is an assumed value
    WebDriverWait(driver, 10).until(wait_condition_01)
    extract_table(driver=driver, stocklist=stocklist)

# Send the results to the calling process via stdout
data = {
    'Total stocks': attr_total,
    'Stock List': stocklist,
}
driver.quit()
#pdata = pickle.dumps(data, protocol=2)
pdata = cPickle.dumps(data, protocol=2)
sys.stdout.write(pdata + b'\n')
The following issues may come up in use:
Issue 1: Selenium + Python fails when used directly
The error is as follows:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/site-packages/selenium/webdriver/firefox/webdriver.py", line, in __init__
    self.binary, timeout),
  File "/usr/lib/python2.6/site-packages/selenium/webdriver/firefox/extension_connection.py", line 51, in __init__
    self.binary.launch_browser(self.profile)
  File "/usr/lib/python2.6/site-packages/selenium/webdriver/firefox/firefox_binary.py", line, in launch_browser
    self._wait_until_connectable()
  File "/usr/lib/python2.6/site-packages/selenium/webdriver/firefox/firefox_binary.py", line, in _wait_until_connectable
    raise WebDriverException("The browser appears to have exited"
selenium.common.exceptions.WebDriverException: Message: The browser appears to have exited before we could connect. If you specified a log_file in the FirefoxBinary constructor, check it for details.
The workaround is to add the pyvirtualdisplay module and call it as follows:
#!/usr/bin/env python
from pyvirtualdisplay import Display
from selenium import webdriver
display = Display(visible=0, size=(1024, 768))
display.start()
browser = webdriver.Firefox()
browser.get('http://www.111cn.net/')
print browser.page_source
browser.close()
display.stop()
Issue 2: Selenium + Python + pyvirtualdisplay error
The error is as follows:
>>> from pyvirtualdisplay import Display
>>> from selenium import webdriver
>>> display = Display(visible=0, size=(1024, 768))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/site-packages/pyvirtualdisplay/display.py", line, in __init__
    self._obj = self.display_class(
  File "/usr/lib/python2.6/site-packages/pyvirtualdisplay/display.py", line 51, in display_class
    cls.check_installed()
  File "/usr/lib/python2.6/site-packages/pyvirtualdisplay/xvfb.py", line, in check_installed
    ubuntu_package=PACKAGE).check_installed()
  File "/usr/lib/python2.6/site-packages/easyprocess/__init__.py", line 209, in check_installed
    raise EasyProcessCheckInstalledError(self)
easyprocess.EasyProcessCheckInstalledError: cmd=['Xvfb', '-help']
OSError=[Errno 2] No such file or directory
Program install error!
According to the PyPI page, pyvirtualdisplay needs one of Xvfb, Xephyr or Xvnc as its back end. We use the first here, installed as follows:
# On CentOS
yum -y install xorg-x11-server-Xvfb
# On Ubuntu
sudo apt-get install xvfb
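Since the "No such file or directory" error above comes from the Xvfb binary being missing, it can help to check for it before creating a Display. The following is a minimal sketch, not part of pyvirtualdisplay: the helper names `find_executable` and `xvfb_available` are made up for illustration, and the check only looks for an executable file named Xvfb on the PATH.

```python
import os

def find_executable(name, path=None):
    """Return the full path of an executable called name on PATH, or None."""
    if path is None:
        path = os.environ.get("PATH", os.defpath)
    for directory in path.split(os.pathsep):
        candidate = os.path.join(directory, name)
        # Accept only regular files that the current user may execute
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None

def xvfb_available():
    """True if an Xvfb binary can be found on the current PATH."""
    return find_executable("Xvfb") is not None

if __name__ == "__main__":
    if xvfb_available():
        print("Xvfb found: " + find_executable("Xvfb"))
    else:
        print("Xvfb not found; install it before creating a Display")
```

The sketch sticks to the os module so that it runs on the old Python 2.6 shown in the tracebacks as well as on current versions.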
With Xvfb installed, running python get_sh.py retrieves the data normally. The extracted list is not directly usable, though, and needs further processing.
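One possible shape for that further processing is sketched below. It rests on an assumption about the row format: each raw row is taken to begin with a six-digit stock code followed by whitespace and a short name without spaces, with any remaining columns ignored. The function name `parse_stock_rows` and the sample rows are hypothetical.

```python
import re

# Assumed row layout: "<6-digit code> <short name> <other columns...>"
ROW_RE = re.compile(r'^\s*(\d{6})\s+(\S+)')

def parse_stock_rows(rows):
    """Turn raw table rows into (code, name) pairs, skipping rows that do not match."""
    parsed = []
    for row in rows:
        m = ROW_RE.match(row)
        if m:
            parsed.append((m.group(1), m.group(2)))
    return parsed

if __name__ == "__main__":
    # Hypothetical rows; real rows come from the stock list collected by get_sh.py
    sample_rows = ["600000 NAME_A col3 col4", "not a stock row", "600004 NAME_B col3"]
    for code, name in parse_stock_rows(sample_rows):
        print(code + " " + name)
```

Rows that do not start with a six-digit code, such as stray header or pagination text, are dropped rather than raising an error.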