This post takes the Hongyong China fund (fund code 968006) as an example: it crawls the fund's net-value data and uses the ARIMA model to forecast the resulting time series.
Crawl data
# Crawl the net-value data of the Hongyong China fund
from bs4 import BeautifulSoup
import requests

headers = {
    'Accept': 'text/javascript, application/javascript, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'Cookie': 'vjuids=148cf0186.15e03abf2ac.0.c311af0ddaa6c; Advs=358187b0bd1a65; asl=17431,000pn,7010519170105191; JRJ_UID=15060593555978DJCIWMVNB; jrj_z3_newsid=723; ADVC=35686F6CAEEDF3; wt_fpc=id=2ef30c6a0af7eaf3a501506059355507:lv=1506063782501:ss=1506063782501; Channelcode=3763bexx; ylbcode=24s2az96; vjlast=1503300154.1506059356.23; hm_lvt_a07bde197b7bf109a325eebaee445939=1506059356; hm_lpvt_a07bde197b7bf109a325eebaee445939=1506063783',
    'Host': 'fund.jrj.com.cn',
    'Referer': 'http://fund.jrj.com.cn/archives,968006,jjjz.shtml',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}
# Query parameters: fund code and the year whose history is requested
params = {'fundCode': '968006', 'obj': 'obj', 'date': 2017}
r = requests.get('http://fund.jrj.com.cn/json/archives/history/netvalue?', params=params, headers=headers)
r.encoding = 'utf-8'
mydata = r.text
Storing data
import json

# Skip the first 8 characters of the response, which precede the standard JSON payload
table = mydata[8:]
# Convert the string to JSON instead of parsing it by hand
myjson = json.loads(table)
# Extract the net-value data
myjson['fundHistoryNetValue']
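The fixed offset of 8 only works while the text in front of the JSON payload stays exactly eight characters long. A slightly more defensive sketch locates the outermost braces instead. The helper name `extract_json` and the sample string below are illustrative assumptions, not output captured from the real endpoint.

```python
import json

def extract_json(text):
    """Parse the JSON object embedded in a JavaScript-style assignment."""
    start = text.find('{')       # first character of the JSON payload
    end = text.rfind('}') + 1    # one past the last closing brace
    if start == -1 or end == 0:
        raise ValueError('no JSON object found in response text')
    return json.loads(text[start:end])

# Hypothetical response text: an 8-character prefix followed by JSON
sample = 'var obj={"fundHistoryNetValue": [{"enddate": "2017-09-21", "accum_net": "1.4060"}]}'
parsed = extract_json(sample)
print(parsed['fundHistoryNetValue'][0]['accum_net'])
```

For this sample the result is the same as `json.loads(sample[8:])`, but the helper keeps working if the prefix changes length.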
from pymongo import MongoClient

db = MongoClient('localhost', 27017)['Fund']
collect = db.get_collection('hjhy')
# insert_many replaces the deprecated Collection.insert for a list of documents
collect.insert_many(myjson['fundHistoryNetValue'])
print('Done')
Extract & Process data
from pymongo import MongoClient
import pandas as pd
import datetime

db = MongoClient('localhost', 27017)['Fund']
data = dict()
# Map each record's date (parsed as a datetime) to its accumulated net value
for item in db.get_collection('hjhy').find():
    data[datetime.datetime.strptime(item['enddate'], '%Y-%m-%d')] = item['accum_net']
Using the ARIMA model to predict
1. Build Time Series
# Build the time series
my_series = pd.Series(data, data.keys())
# Convert the values from str to float
my_series = my_series.apply(lambda x: float(x))
# Sort chronologically by date
my_series = my_series.sort_index()
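The three steps above (index by date, convert to float, sort) can also be tried without a database. The sketch below is a minimal stand-alone version; the three records are made-up placeholders that only mimic the `enddate` and `accum_net` fields, not real fund data.

```python
import pandas as pd

# Hypothetical records shaped like the stored documents (values are placeholders)
records = [
    {'enddate': '2017-09-21', 'accum_net': '1.4060'},
    {'enddate': '2017-09-19', 'accum_net': '1.3980'},
    {'enddate': '2017-09-20', 'accum_net': '1.4010'},
]

dates = pd.to_datetime([r['enddate'] for r in records], format='%Y-%m-%d')
values = [float(r['accum_net']) for r in records]    # str -> float

series = pd.Series(values, index=dates).sort_index()  # chronological order
print(series)
```

Parsing the dates with `pd.to_datetime` gives a `DatetimeIndex`, which is what the later plotting and differencing steps work best with.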
2. View Trend Chart
The chart shows the trend of the fund's net value since the fund was established.
%pylab
# plot(my_series)
my_series.plot()
Calling plot(my_series) directly draws an extra line connecting the first and last points. Use my_series.plot() instead, which calls the object's own plot method.
3. Perform differencing
from matplotlib import pyplot as plt

# First-order difference
fig = plt.figure()
diff1 = my_series.diff(1)
diff1.plot()
# Second-order difference: difference the first-order difference again
# (my_series.diff(2) would instead compute the lag-2 difference x[t] - x[t-2])
fig = plt.figure()
diff2 = my_series.diff(1).diff(1)
diff2.plot()
4. First-order difference
5. Second-order difference
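It is easy to confuse pandas' `Series.diff(2)` with a second-order difference. The `periods` argument sets the lag, so `diff(2)` computes x[t] - x[t-2], while a second-order difference applies the lag-1 difference twice. The toy series below (perfect squares, chosen only for illustration) makes the distinction visible.

```python
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0, 16.0, 25.0])  # squares: the second difference is constant

lag2 = s.diff(2)           # x[t] - x[t-2]
second = s.diff().diff()   # (x[t] - x[t-1]) - (x[t-1] - x[t-2])

print(lag2.tolist())    # [nan, nan, 8.0, 12.0, 16.0]
print(second.tolist())  # [nan, nan, 2.0, 2.0, 2.0]
```

Both results start with two NaN values, which is why they can look interchangeable at first glance.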
6. View descriptive statistics
# Descriptive statistics for the first-order difference
diff1.dropna(inplace=True)
diff1.describe()
Each differencing step produces an NA value, so remember to drop it. The following results are the descriptive statistics for diff1:
# Descriptive statistics for the second-order difference
diff2.dropna(inplace=True)
diff2.describe()
The following results are the descriptive statistics for diff2:
So differencing once is enough.
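One common rule of thumb for reading the two sets of descriptive statistics: if differencing again raises the standard deviation, the extra difference is probably over-differencing. The sketch below illustrates this on a simulated random walk rather than on the fund data; the series, the seed, and the helper `difference` are all assumptions made for the example.

```python
import random
import statistics

random.seed(0)

# Hypothetical random walk: its first difference is white noise by construction
walk = [0.0]
for _ in range(500):
    walk.append(walk[-1] + random.gauss(0, 1))

def difference(values):
    """Return consecutive differences x[t] - x[t-1] (one element shorter)."""
    return [b - a for a, b in zip(values, values[1:])]

d1 = difference(walk)   # first-order difference
d2 = difference(d1)     # second-order difference

std1 = statistics.stdev(d1)
std2 = statistics.stdev(d2)
print(f'std after 1 difference:  {std1:.3f}')
print(f'std after 2 differences: {std2:.3f}')
```

For a series like this the second difference has the larger spread, which is the pattern that suggests stopping at one difference. It is a heuristic, not a formal stationarity test.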
7. Determine p, q parameter values
import statsmodels.api as sm

fig = plt.figure()
# Autocorrelation plot in the upper panel
ax0 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(diff1, lags=30, ax=ax0)
# Partial autocorrelation plot in the lower panel
ax1 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(diff1, lags=30, ax=ax1)
These are the autocorrelation (ACF) and partial autocorrelation (PACF) plots of the first-order difference. Although the first-order difference is slightly more stationary than the second-order difference, p and q are read from where the plots cut off: the ACF of an MA(q) process cuts off after lag q, and the PACF of an AR(p) process cuts off after lag p.
We choose the second-order difference; its ACF and PACF plots are shown below:
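To make the cutoff idea concrete, the sketch below simulates an MA(1) process and computes its sample autocorrelations by hand. The coefficient 0.8, the seed, the sample size, and the helper `sample_acf` are illustrative assumptions; the 2/sqrt(n) band is the usual rough significance threshold.

```python
import random

def sample_acf(values, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    denom = sum(c * c for c in centered)
    return [
        sum(centered[t] * centered[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]

random.seed(1)
theta = 0.8                                   # hypothetical MA(1) coefficient
noise = [random.gauss(0, 1) for _ in range(2001)]
ma1 = [noise[t] + theta * noise[t - 1] for t in range(1, 2001)]

acf = sample_acf(ma1, 5)
band = 2 / len(ma1) ** 0.5                    # rough significance band
for lag, value in enumerate(acf):
    flag = 'significant' if abs(value) > band else 'within band'
    print(lag, round(value, 3), flag)
```

For an MA(1) process the lag-1 autocorrelation should stand out while higher lags stay close to zero, which is the pattern to look for when choosing q from an ACF plot.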
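8. Forecast
from statsmodels.tsa.arima_model import ARIMA  # this module requires statsmodels < 0.13

# order is (p, d, q); d = 2 is assumed here from the second-order differencing chosen above
model = ARIMA(history_price, (2, 2, 1)).fit()
# forecast() returns (forecast, stderr, conf_int); [0] selects the 10 predicted values
model.forecast(10)[0]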
Actual value
Forecast value
array([1.41013409, 1.4134152, 1.41570651, 1.41638723, 1.42131414, 1.42299673, 1.42647455, 1.42795939, 1.43099336, 1.43316138])