Today my boss on the big data team gave me a task: grab historical stock data. I searched online for a tool, found wget, and discovered that it is a really powerful Linux download utility. I was deeply impressed. Below is a record of today's process, which was a bit bumpy.
First, I used the company's existing stock data to query all stock codes with Hive and exported them to local files:
" Use stock;select distinct secucode from T_stock_tick_shsz where type= ' sz '; " >>"usestock;select distinct secucode from T_stock_tick_shsz where type= ' sh '; " >> sh_secucode.txt
PS: In the step above I hit a small problem: at first I ran the query without the DISTINCT keyword, so the later crawl picked up a lot of duplicate stock codes.
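If you forget DISTINCT, or already have code lists full of duplicates, the files can also be deduplicated in place before crawling. A minimal sketch using sort, assuming each file holds one stock code per line:

```shell
# Deduplicate the stock code lists in place before crawling.
# sort -u sorts and drops duplicate lines; -o writes back to the input file.
sort -u sh_secucode.txt -o sh_secucode.txt
sort -u sz_secucode.txt -o sz_secucode.txt
```

This avoids re-running the Hive query when only the exported files are at hand.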
At first I was lazy and wanted to just paste a one-line wget command, but there were too many stock codes, so I wrote a script instead. The shell scripts are as follows:
#!/bin/bash
# Download Shanghai Stock Exchange stock history records
for i in `cat sh_secucode.txt`
do
    wget --user-agent="Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.204 Safari/534.16" -nv --tries=5 --timeout=5 -O /home/bigdata/script/zj/sh_history/history_data/$i.csv "http://quotes.money.163.com/service/chddata.html?code=0$i&end=20130430"
    sleep 1s
done

#!/bin/bash
# Download Shenzhen Stock Exchange stock history records
for i in `cat sz_secucode.txt`
do
    wget --user-agent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3 (.NET CLR 3.5.30729)" -nv --tries=5 --timeout=5 -O /home/bigdata/script/zj/sz_history/history_data/$i.csv "http://quotes.money.163.com/service/chddata.html?code=1$i&end=20130430"
    sleep 1s
done
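The two scripts differ only in the code list, the digit prefixed to the URL's code parameter, and the output directory, so they could be folded into a single function. A hedged sketch of that refactor (the function name download_exchange is my own, not from the original post):

```shell
#!/bin/bash
# Sketch: one function covering both exchanges.
# Arguments: code list file, URL code prefix (0 = Shanghai, 1 = Shenzhen), output directory.
download_exchange() {
    local list=$1 prefix=$2 outdir=$3
    mkdir -p "$outdir"
    while read -r code; do
        wget --user-agent="Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.204 Safari/534.16" \
             -nv --tries=5 --timeout=5 \
             -O "$outdir/$code.csv" \
             "http://quotes.money.163.com/service/chddata.html?code=$prefix$code&end=20130430"
        sleep 1s
    done < "$list"
}

# Usage (same paths as the original scripts):
# download_exchange sh_secucode.txt 0 /home/bigdata/script/zj/sh_history/history_data
# download_exchange sz_secucode.txt 1 /home/bigdata/script/zj/sz_history/history_data
```

Reading the list with `while read` instead of `for i in \`cat ...\`` also behaves better if a line ever contains whitespace.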
PS: About the code above: why does wget get a user-agent parameter? Anyone who has written a crawler knows that when you download from a site too frequently, the site may recognize you as a crawler and refuse to serve its resources. So you set a fake User-Agent to disguise the request as coming from a browser, which lowers the chance of being detected. And why add a sleep? Because some files are larger, and without a pause the connection may be cut off after a few hundred milliseconds before the download finishes. In my case every file is only a few hundred KB, so 1 second is enough.
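Even with --tries=5, some downloads can still fail, and wget signals this with a nonzero exit status. Failed codes can be logged for a second pass instead of being silently lost. A minimal sketch of the idea (the fetch_code name and failed.txt log file are my own choices, not from the original scripts):

```shell
#!/bin/bash
# Sketch: wrap the download so that codes whose download fails are logged for a retry.
# wget exits nonzero when the download still fails after all --tries attempts.
fetch_code() {
    local code=$1
    if ! wget -nv --tries=5 --timeout=5 -O "$code.csv" \
         "http://quotes.money.163.com/service/chddata.html?code=0$code&end=20130430"; then
        echo "$code" >> failed.txt   # keep the code for a second pass
    fi
}

# Usage inside the original loop:
# for i in `cat sh_secucode.txt`; do fetch_code $i; sleep 1s; done
```

A later run can then loop over failed.txt only, rather than re-crawling every code.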
Finally, I ran the scripts. As I write this article they are still running. Hope it goes smoothly! o(∩_∩)o