The Path to Python Crawler Growth: Scraping Stock Data from Securities Star
Acquiring data is an essential part of data analysis, and web crawling is an important channel for obtaining it. With that in mind, I picked up Python and set off down the web crawler path.
This article uses Python 3.5 and aims to capture all of the day's A-share data from Securities Star. The program is divided into three parts: obtaining the web page source code, extracting the required content, and sorting out the results.
I. Obtaining the webpage source code
One reason many people like Python for crawling is that it is easy to get started: the few lines of code below are enough to capture the source code of most web pages.
import urllib.request

url = 'http://quote.stockstar.com/stock/ranklist_a_3_1_1.html'    # target website
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64)"}  # request header disguised as a browser
request = urllib.request.Request(url=url, headers=headers)        # build the request to the server
response = urllib.request.urlopen(request)                        # server response
content = response.read().decode('gbk')                           # read the source code in the page's encoding
print(content)                                                    # print the page source code
Capturing the source code of a single page is easy, but when a large number of pages are captured from the same website, the requests are often blocked by the server, and the world suddenly feels full of malice. So I began to learn how to get around anti-crawler restrictions.
1. Disguise the browser headers
Many servers use the headers sent by the browser to decide whether the visitor is a human user, so we can imitate the browser's behavior and construct a request header to send to the server. The server inspects certain parameters in that header: many websites check the User-Agent, so it is best to always include it. More vigilant websites may also check other parameters, for example Accept-Language to tell whether you are a human user, and websites with anti-leeching protection require the Referer parameter.
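As a sketch of this idea, a fuller header might look like the following; the Accept-Language and Referer values here are illustrative examples, not the exact values the target site checks.

import urllib.request

# illustrative header values; which parameters a site actually inspects varies
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64)",   # identifies the "browser"
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",            # language preference some sites check
    "Referer": "http://quote.stockstar.com/",                # required by sites with anti-leeching checks
}
request = urllib.request.Request(url='http://quote.stockstar.com/stock/ranklist_a_3_1_1.html',
                                 headers=headers)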
2. Randomly generate the User-Agent
Including only the User-Agent parameter is enough to capture page information, but after several consecutive pages the server started blocking again. So I decided to simulate a different browser each time data is captured: since the server tells browsers apart through the User-Agent, every page request can be built with a randomly chosen User-Agent in its header.
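A minimal sketch of this idea follows; the three User-Agent strings are just a small sample, and the complete program at the end of the article uses a much longer list.

import random
import urllib.request

# a small sample of User-Agent strings; the full program uses a longer list
user_agent = [
    'Mozilla/5.0 (Windows NT 10.0; WOW64)',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0',
    'Opera/9.27 (Windows NT 5.2; U; zh-cn)',
]

url = 'http://quote.stockstar.com/stock/ranklist_a_3_1_1.html'
# pick a different "browser" identity for every request
request = urllib.request.Request(url=url, headers={'User-Agent': random.choice(user_agent)})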
3. Slow down crawling
Even while simulating different browsers, I found that in some periods hundreds of pages could be crawled, while in others only a dozen or so; apparently the server also judges whether you are a human user or a crawler by your access frequency. So I made the program rest for a random number of seconds after each page. With this added, a large amount of stock data could be crawled in every time period.
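In code this is just a random pause after each page request, for example as below; the three-to-six-second range is an arbitrary choice, not a value from the original experiment.

import random
import time

# rest for a random number of seconds after each page is crawled;
# the range is arbitrary and can be tuned to the site's tolerance
time.sleep(random.randrange(3, 7))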
4. Use proxy IP addresses
The program ran fine when tested at the office, but back in the dormitory it was detected and blocked by the server. In a panic I turned to Baidu and learned that the server can identify your IP address and record how many times that IP has visited. The remedy is to use high-anonymity proxy IP addresses and keep switching them during the crawl, so the server cannot tell who is really behind the requests. This part is not finished yet; I will report how it goes later.
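Since this part was left unfinished, the following is only a sketch of how a proxy could be wired into urllib; the proxy address is a placeholder, not a working proxy.

import urllib.request

# '1.2.3.4:8080' is a placeholder; substitute a real high-anonymity proxy address
proxy_handler = urllib.request.ProxyHandler({'http': 'http://1.2.3.4:8080'})
opener = urllib.request.build_opener(proxy_handler)
urllib.request.install_opener(opener)   # later urlopen calls will go through the proxy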
5. Other methods to break through anti-crawler restrictions
Many servers send the browser a cookie when they receive a request and then track your visits through it, so crawling with cookie support is recommended to avoid being identified as a crawler. For websites that require simulated login, you can register a batch of accounts to avoid getting your own account banned before crawling; this involves simulated login and CAPTCHA recognition... In short, some crawlers are genuinely annoying to website owners, who therefore come up with many ways to restrict them. So even after breaking in, mind your manners and don't drag other people's websites down.
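For the cookie part, a minimal sketch with the standard library looks like this; it simply stores whatever cookies the server sends and returns them automatically on later requests.

import http.cookiejar
import urllib.request

# keep cookies in memory and send them back on subsequent requests
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
response = opener.open('http://quote.stockstar.com/')   # cookies from this response are retained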
II. Extracting the required content
After obtaining the web page source code, we can extract the data we need. There are many ways to pull the required information out of the source code, and regular expressions are one of the classic ones. Let's first look at part of the collected page source code.
To reduce interference, I first use a regular expression to match the body part out of the whole page source, and then match the information of each stock from that body. The code is as follows.
import re

pattern = re.compile(r'<tbody[\s\S]*</tbody>')
body = re.findall(pattern, str(content))       # match all code between <tbody and </tbody>
pattern = re.compile(r'>(.*?)<')
stock_page = re.findall(pattern, body[0])      # match all information between > and <
The compile method compiles the matching pattern, and the findall method uses that pattern to match the required information and returns it as a list. Regular expression syntax is extensive; below I only list the meanings of the symbols used here, and a small toy example after the table shows them in action.
| Syntax | Description |
| --- | --- |
| . | Matches any character except the newline "\n" |
| * | Matches the preceding character zero or more times |
| ? | Matches the preceding character zero or one time |
| \s | A whitespace character: [<space>\t\r\n\f\v] |
| \S | A non-whitespace character: [^\s] |
| [...] | A character set; matches any single character in the set |
| (...) | Groups the enclosed expression; this is usually the content we want to extract |
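To see these symbols at work, here is a toy example on a made-up fragment of table HTML; the fragment is invented for illustration, and real pages are much larger.

import re

# an invented <tbody> fragment for illustration only
sample = '<tbody class="t"><tr><td>600000</td><td>11.05</td></tr></tbody>'

body = re.findall(re.compile(r'<tbody[\s\S]*</tbody>'), sample)   # grab the table body
cells = re.findall(re.compile(r'>(.*?)<'), body[0])               # grab text between > and <
print(cells)   # ['', '', '600000', '', '11.05', '', '']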
There is a lot more to regular expression syntax; perhaps a single well-crafted expression could extract everything I want in one go. When extracting the stock body, I also noticed that an XPath expression can look more concise, so there is clearly still a long way to go in page parsing.
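For comparison, here is a rough sketch of the XPath approach using the third-party lxml library (assumed to be installed); the expression below guesses that the data sits in <tbody>/<tr>/<td> cells and is not verified against the live page.

from lxml import etree   # third-party library, assumed installed (pip install lxml)

# parse the page source fetched earlier and pull the text of every cell in the table body;
# the XPath expression is a guess at the page structure and may need adjusting
tree = etree.HTML(content)
cells = tree.xpath('//tbody//td/text()')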
III. Sorting out the results
Because the non-greedy pattern (.*?) is used to match all the data between > and <, it also picks up some empty strings where the two characters are adjacent, so the code below removes those blanks.
stock_last = stock_total[:]   # stock_total: matched stock data; stock_last: cleaned stock data
for data in stock_total:
    if data == '':
        stock_last.remove('')
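The same cleanup can also be written as a single list comprehension, which avoids removing items from a copy while looping over the original:

# equivalent one-line cleanup of the blank entries
stock_last = [data for data in stock_total if data != '']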
Finally, we can print a few columns of data to see the effect. The code is as follows:
print('code', '\t', 'abbreviation', '\t', 'latest price', '\t', 'change', '\t', 'change %', '\t', '5-minute gain')
for i in range(0, len(stock_last), 13):   # the web page has 13 columns of data in total
    print(stock_last[i], '\t', stock_last[i+1], '\t', stock_last[i+2], '\t',
          stock_last[i+3], '\t', stock_last[i+4], '\t', stock_last[i+5])
The printed result is as follows:
The complete program for capturing all of the day's A-share data from Securities Star is as follows:
import urllib
import urllib.request
import re
import random
import time

# User-Agent strings used to disguise the request header
user_agent = ['Mozilla/5.0 (Windows NT 10.0; WOW64)',
              'Mozilla/5.0 (Windows NT 6.3; WOW64)',
              'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
              'Mozilla/5.0 (Windows NT 5.0; WOW64; Trident/7.0; rv:11.0) like Gecko',
              'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36',
              'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; rv:11.0) like Gecko)',
              'Mozilla/5.0 (Windows; U; Windows NT 5.2) Gecko/2008070208 Firefox/3.0.1',
              'Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070309 Firefox/2.0.0.3',
              'Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070803 Firefox/1.5.0.12',
              'Opera/9.27 (Windows NT 5.2; U; zh-cn)',
              'Mozilla/5.0 (Macintosh; PPC Mac OS X; U; en) Opera 8.0',
              'Opera/8.0 (Macintosh; PPC Mac OS X; U; en)',
              'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.12) Gecko/20080219 Firefox/2.0.0.12 Navigator/9.0.0.6',
              'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Win64; x64; Trident/4.0)',
              'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)',
              'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; .NET4.0C; .NET4.0E)',
              'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Maxthon/4.0.6.2000 Chrome/26.0.1410.43 Safari/537.1',
              'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; .NET4.0C; .NET4.0E; QQBrowser/7.3.9825.400)',
              'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0',
              'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.92 Safari/537.1 LBBROWSER',
              'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0; BIDUBrowser 2.x)',
              'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/3.0 Safari/536.11']

stock_total = []   # stock_total: stock data from all pages; stock_page: stock data from one page
for page in range(1, 8):   # number of ranking-list pages to crawl; adjust as needed
    url = 'http://quote.stockstar.com/stock/ranklist_a_3_1_' + str(page) + '.html'
    request = urllib.request.Request(url=url,
                                     headers={"User-Agent": random.choice(user_agent)})  # pick a random element from user_agent
    try:
        response = urllib.request.urlopen(request)
    except urllib.error.HTTPError as e:
        print('page=', page, '', e.code)
        continue   # skip this page if the request fails
    except urllib.error.URLError as e:
        print('page=', page, '', e.reason)
        continue
    content = response.read().decode('gbk')   # read the page content
    print('get page', page)                   # print the page number just fetched
    pattern = re.compile(r'<tbody[\s\S]*</tbody>')
    body = re.findall(pattern, str(content))
    pattern = re.compile(r'>(.*?)<')
    stock_page = re.findall(pattern, body[0])  # regex match
    stock_total.extend(stock_page)
    time.sleep(random.randrange(3, 7))   # sleep a random number of seconds after each page; the value can be changed as needed

# delete the blank strings
stock_last = stock_total[:]   # stock_last is the final stock data
for data in stock_total:
    if data == '':
        stock_last.remove('')

# print part of the result
print('code', '\t', 'abbreviation', '\t', 'latest price', '\t', 'change', '\t', 'change %', '\t', '5-minute gain')
for i in range(0, len(stock_last), 13):   # the original page has 13 columns of data, so the step is 13
    print(stock_last[i], '\t', stock_last[i+1], '\t', stock_last[i+2], '\t',
          stock_last[i+3], '\t', stock_last[i+4], '\t', stock_last[i+5])