Python Stock Data Crawler: Learning requests, etree, and BeautifulSoup


I have recently been studying stock-data backtesting (really, I want to do quantitative trading), but the APIs that serve data directly are not very stable: tushare times out, and Yahoo needs a fix package before it works at all, and even then it is not very reliable.

# fix package for the Yahoo stock data API
from pandas_datareader import data as pdr
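
For the record, here is a minimal sketch of how that import is typically used. The ticker and date range are my own illustrative choices, and it assumes the patched Yahoo endpoint actually responds, which is exactly the instability complained about above:

# hypothetical usage sketch; depends on the Yahoo endpoint being up
from pandas_datareader import data as pdr

df = pdr.get_data_yahoo('AAPL', start='2017-01-01', end='2017-06-30')
print(df.head())   # daily OHLCV bars indexed by date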
 

In the end I decided to study Python crawlers myself. I had long heard of Python's reputation for web scraping, and after trying it out I think it works well.

import requests
from bs4 import BeautifulSoup
import re

# Step 1: get the stock list from East Money;
# Step 2: take the stock codes one by one, append each to the Baidu Stocks
#         link, then visit those links to get each stock's information;
# Step 3: store the results in a file.

def getHTMLText(url, code="utf-8"):
    try:
        r = requests.get(url)
        r.raise_for_status()   # raise an exception on a bad HTTP status
        r.encoding = code      # set the encoding
        return r.text
    except:
        return ""

def getStockList(lst, stockURL):
    html = getHTMLText(stockURL, "GB2312")      # fetch only the raw HTML text
    soup = BeautifulSoup(html, 'html.parser')   # parse the page source
    a = soup.find_all('a')                      # find all the <a> tags
    for i in a:
        # a[1] = <a href="http://finance.eastmoney.com/yaowen.html" target="_blank">News</a>
        # type(a[1]) = bs4.element.Tag
        try:
            # read the href attribute of the <a> tag, inspect the link in it,
            # and take the code at the end of the link
            href = i.attrs['href']
            # a[1].attrs['href'] = 'http://finance.eastmoney.com/yaowen.html'
            # Shenzhen Exchange codes start with sz, Shanghai Exchange codes
            # start with sh, and the stock number has 6 digits, so the
            # regular expression can be written as [s][hz]\d{6}
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except:
            continue   # try...except handles tags without a usable href

def getStockInfo(lst, stockURL, fpath):
    count = 0
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)   # operate on a single stock
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            # find() pulls out the whole <div class="stock-bets"> block
            stockInfo = soup.find('div', attrs={'class': 'stock-bets'})
            # ...
        except:
            continue
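
As a quick sanity check on that [s][hz]\d{6} pattern, here is a tiny standalone test; the sample href is made up for illustration:

# illustrative check of the code-extraction regex on a made-up href
import re

href = "http://quote.eastmoney.com/sh600000.html"
print(re.findall(r"[s][hz]\d{6}", href))   # prints ['sh600000']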
There is one big problem here: it only crawls one day's worth of data.

Still, as the first crawler program I ever practiced on, I recorded the intermediate steps of each stage as comments and kept it as study notes.

Next comes the code that gets the historical data.

import time
import requests
from lxml import etree
import re
import pandas as pd

class StockCode(object):
    def __init__(self):
        self.start_url = "http://quote.eastmoney.com/stocklist.html#sh"
        self.headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) "
                                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                                      "Chrome/59.0.3071.115 Safari/537.36"}

    def parse_url(self):
        # send the request and get the response
        response = requests.get(self.start_url, headers=self.headers)
        if response.status_code == 200:
            return etree.HTML(response.content)

    def get_code_list(self, response):
        # extract the list of stock codes
        node_list = response.xpath('//*[@id="quotesearch"]/ul[1]/li')
        code_list = []
        for node in node_list:
            try:
                code = re.match(r'.*?\((\d+)\)',
                                etree.tostring(node).decode()).group(1)
                print(code)
                code_list.append(code)
            except:
                continue
        return code_list

    def run(self):
        html = self.parse_url()
        return self.get_code_list(html)

# download the historical trading records
class Download_historystock(object):
    def __init__(self, code):
        self.code = code
        self.start_url = "http://quotes.money.163.com/trade/lsjysj_" + self.code + ".html"
        print(self.start_url)
        self.headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) "
                                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                                      "Chrome/59.0.3071.115 Safari/537.36"}

    def parse_url(self):
        response = requests.get(self.start_url)
        print(response.status_code)
        if response.status_code == 200:
            return etree.HTML(response.content)
        return False

    def get_date(self, response):
        # get the start and end dates of the trading history
        start_date = ''.join(response.xpath('//input[@name="date_start_type"]/@value')[0].split('-'))
        end_date = ''.join(response.xpath('//input[@name="date_end_type"]/@value')[0].split('-'))
        return start_date, end_date

    def download(self, start_date, end_date):
        # 163 prefixes Shanghai codes with 0 (Shenzhen ones would take 1)
        download_url = ("http://quotes.money.163.com/service/chddata.html?code=0"
                        + self.code + "&start=" + start_date + "&end=" + end_date
                        + "&fields=TCLOSE;HIGH;LOW;TOPEN;LCLOSE;CHG;PCHG;"
                          "TURNOVER;VOTURNOVER;VATURNOVER;TCAP;MCAP")
        data = requests.get(download_url)
        with open('e:/data/historystock/' + self.code + '.csv', 'wb') as f:
            for chunk in data.iter_content(chunk_size=10000):
                if chunk:
                    f.write(chunk)
        print('Stock ---', self.code, 'historical data is downloading')

    def run(self):
        try:
            html = self.parse_url()
            start_date, end_date = self.get_date(html)
            self.download(start_date, end_date)
        except Exception as e:
            print(e)

if __name__ == '__main__':
    code = StockCode()
    code_list = code.run()
    # dcodes (the not-yet-downloaded difference set) is built in the resume
    # snippet further down; on a first full run, iterate code_list instead
    for temp_code in dcodes:
        time.sleep(1)
        download = Download_historystock(temp_code)
        download.run()
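
Before looping over the whole list, it is worth smoke-testing the downloader on a single stock. A minimal sketch, assuming the 163 endpoint is reachable; the code 600000 and the directory creation are my own example choices:

# smoke test: download one stock's history
# (assumes the site is up; creates the target directory if missing)
import os
os.makedirs('e:/data/historystock/', exist_ok=True)

d = Download_historystock('600000')   # an example Shanghai code
d.run()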

Then come some extra operations, kept here as a record.

# keep only the Shanghai main-board codes (600000 and up)
code_df = pd.Series(code_list).astype('int')
code_list = code_df[code_df >= 600000].astype('str').tolist()

# Resume from a breakpoint: list the file names already in the download
# directory and take the difference set against code_list
import os

dir = os.fsencode('e:/data/historystock/')
codes = []
for file in os.listdir(dir):
    filename = os.fsdecode(file)
    code = str(filename[0:6])   # each file name starts with the 6-digit code
    codes.append(code)

dcodes = list(set(code_list).difference(set(codes)))
# read the local CSVs back and write them to MySQL
dfs = []
for code in codes:
    everydf = pd.read_csv('e:/data/historystock/%s.csv' % code,
                          encoding='gbk').sort_values(by='Date')
    dfs.append(everydf)
stock = pd.concat(dfs)
stock.to_csv('e:/data/stock.csv')
stock = pd.read_csv('e:/data/stock.csv', encoding='gbk')

import MySQLdb as mdb
from sqlalchemy import create_engine

# sec_user:password@localhost/securities_master is
# user:password@localhost/database_name
engine = create_engine('mysql://sec_user:password@localhost/securities_master?charset=utf8')

# store everything in the database
stock.to_sql('historystock', engine)
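
To sanity-check the insert, a few rows can be read straight back through the same engine; this is a sketch that assumes the table name used above:

import pandas as pd

# read a handful of rows back to confirm the write succeeded
check = pd.read_sql('SELECT * FROM historystock LIMIT 5', engine)
print(check)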