How to Write an Internet Finance Crawler - Lesson 2: Snowball Stock Crawler (Introduction to Regular Expressions)

Source: Internet
Author: User
Tags: stock prices

Tutorial Series:

How to Write an Internet Finance Crawler - Lesson 1: Peer-to-Peer Lending Crawler (XPath Primer)

In the last lesson, we worked through a peer-to-peer lending crawler and gained an in-depth, hands-on understanding of XPath and how to write it. It is no exaggeration to say that the two most important skills for a simple crawler are writing good XPath and writing the regular expressions covered in this lesson.

Regular expression, also known as regular notation or conventional notation (English: Regular Expression, often abbreviated as regex, RegExp, or re in code).

Regular expressions appear in almost every programming language and have an extremely wide range of applications, for example checking whether a user has entered a valid email address on a web page. The regular expression syntax itself is basically consistent across languages, though the way you invoke it may differ slightly. In our crawler tutorials, regular expressions are mainly used to define the formats of list URLs and content URLs: which URLs are list pages, which are content pages, and which should be discarded outright. This mainly improves the crawler's overall efficiency and prevents it from wasting time on irrelevant URLs; of course, if you want to crawl the whole web, you can skip this step.
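
To make this concrete, here is a minimal Python sketch of how that URL triage might work. The patterns are simplified previews of the ones we derive later in this lesson, and the classify function is purely illustrative, not part of any particular crawler framework:

```python
import re

# Simplified preview patterns (the full versions are derived below).
# The crawler checks every URL it discovers and decides whether it is
# a list page, a content page, or noise to drop.
LIST_URL = re.compile(r"https?://xueqiu\.com/hq#.*page=\d+")
CONTENT_URL = re.compile(r"https?://xueqiu\.com/S/SH\d{6}")

def classify(url):
    """Return 'list', 'content', or 'discard' for a discovered URL."""
    if LIST_URL.match(url):
        return "list"
    if CONTENT_URL.match(url):
        return "content"
    return "discard"

print(classify("https://xueqiu.com/hq#exchange=CN&page=2"))  # list
print(classify("https://xueqiu.com/S/SH600104"))             # content
print(classify("https://xueqiu.com/about"))                  # discard
```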

For those with a little spare money, the most common investment is probably stocks. Although China's stock market is notoriously fickle, with "demon stocks" appearing all the time, compared with investment products that have poor liquidity and high entry thresholds, a stock market backed by national credit is still hard to beat as an investment option. Stock data is available in many places; today we will crawl the day's stock prices of various listed companies through the Snowball Market Center.


Open Snowball Market Center:

[Screenshot: Snowball Market Center page]

Wow, this suddenly feels like the fanciest page in our tutorial series so far. First of all, this page makes a good entry URL, since it contains plenty of links. In terms of efficiency, though, even if the crawler itself can do a lot of the work for us, it is faster to find the list URL directly. Digging further in, we can see this interface:

https://xueqiu.com/hq#exchange=CN&plate=1_1_0&firstName=1&secondName=1_1&type=sha&page=1

[Screenshot: list of stock quotes on the Snowball Market Center]


Forgive me, I really don't understand the stock market; I just assume this is the list of all stock prices. Experts, please don't flame me ~

Okay, let's look at the pattern as we page forward.

https://xueqiu.com/hq#exchange=CN&plate=1_1_0&firstName=1&secondName=1_1&type=sha&page=2

https://xueqiu.com/hq#exchange=CN&plate=1_1_0&firstName=1&secondName=1_1&type=sha&page=3

Looking at the structure of these URLs, honestly, the pattern practically announces itself.


OK, let's extract the regular expression from these links. First we pick one of the URLs and write it out verbatim:

https://xueqiu.com/hq#exchange=CN&plate=1_1_0&firstName=1&secondName=1_1&type=sha&page=2


First we need to escape the characters that have special meaning in a regular expression. In regex, . matches any single character and ? makes the preceding token optional (0 or 1 occurrences), so if we want to match these two characters literally, we must remember to escape them. Of course, many other characters also need escaping, but these two are the most common in URLs and the easiest place to slip up.
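
Here is a quick Python illustration of why that escaping matters; the sample strings are made up:

```python
import re

# Unescaped, "." matches ANY character, so the pattern accepts
# strings we never intended to match:
print(re.fullmatch("xueqiu.com", "xueqiuXcom") is not None)    # True - wrong!

# Escaped, "\." matches only a literal dot:
print(re.fullmatch(r"xueqiu\.com", "xueqiuXcom") is not None)  # False
print(re.fullmatch(r"xueqiu\.com", "xueqiu.com") is not None)  # True
```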

After escaping, the string looks like this:

https://xueqiu\\.com/hq#exchange=CN&plate=1_1_0&firstName=1&secondName=1_1&type=sha&page=2


As you can see, isn't there only one such character in this URL? There is just a single dot, so that is what we escape. The reason we write two backslashes (\\) is that this pattern has to be written inside a string, and the string itself also requires escaping. With the escaping done, look at what the different URLs have in common: the only difference between them is the number after page=; everything else is identical. That means we just need to rewrite the number after page= in regular-expression form. Regex provides some handy shorthand classes, such as \w for letters and digits and \d for digits, which are very common; you can also use a character class like [0-5] to express an explicit range. What we have here is a number of one or more digits, so we rewrite it as \d+. Note again that the backslash must be escaped once more inside the string, giving the following:

https://xueqiu\\.com/hq#exchange=CN&plate=1_1_0&firstName=1&secondName=1_1&type=sha&page=\\d+
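
One aside: the doubled backslash is an artifact of embedding the pattern in an ordinary string literal, not part of the regex itself. In a language with raw strings, such as Python, you can write single backslashes; a small sketch:

```python
import re

# In an ordinary string literal "\\d" collapses to the regex token \d.
# A Python raw string lets us write the pattern with single backslashes:
pattern = r"https://xueqiu\.com/hq#exchange=CN&plate=1_1_0&firstName=1&secondName=1_1&type=sha&page=\d+"

for page in (1, 2, 37):
    url = f"https://xueqiu.com/hq#exchange=CN&plate=1_1_0&firstName=1&secondName=1_1&type=sha&page={page}"
    print(bool(re.match(pattern, url)))  # True for any page number
```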


Finally, one empirical note: a website served over HTTPS will generally also support HTTP, and some links may even be written with HTTP, so for the robustness of the program it is best to make this part of the pattern HTTP-compatible. The way to do that is to allow the s to be present or absent. Regex provides three symbols for expressing a range of repetition counts: ? means 0 or 1 occurrences, + means 1 or more, and * means 0 or more. Here, obviously, we should use ?:


https?://xueqiu\\.com/hq#exchange=CN&plate=1_1_0&firstName=1&secondName=1_1&type=sha&page=\\d+


Note that this question mark is regex's own quantifier and does not need to be escaped.
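
These three quantifiers are easy to verify for yourself; a tiny Python sketch:

```python
import re

# ? = 0 or 1 occurrence, + = 1 or more, * = 0 or more
print(bool(re.fullmatch(r"https?", "http")))   # True: the s may be absent
print(bool(re.fullmatch(r"https?", "https")))  # True: or present once
print(bool(re.fullmatch(r"\d+", "")))          # False: + needs at least one digit
print(bool(re.fullmatch(r"\d*", "")))          # True: * accepts zero digits
```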

And with that, we have written the regular expression for the list page URL.

In the same way, we write the regular expression for the content pages:

https?://xueqiu\\.com/S/SH\\d{6}


The {6} here means exactly 6 digits; whenever the repetition count is fixed or falls within a known range, you can express it with curly braces. Again, given my own limited knowledge of stocks, I'm assuming here that all stock codes are 6 digits.
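
A quick sanity check of the content pattern in Python (the sample stock code and the trailing $ anchor are my own additions for illustration):

```python
import re

# {6} pins the stock code to exactly six digits; the $ is an extra
# end anchor added here so nothing can trail after the code.
content = re.compile(r"https?://xueqiu\.com/S/SH\d{6}$")

print(bool(content.match("https://xueqiu.com/S/SH600104")))  # True
print(bool(content.match("http://xueqiu.com/S/SH60010")))    # False: only 5 digits
```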


Writing this far, it feels like we're almost done. However, when we test it, we find that all the page URLs are actually generated by JS through AJAX requests. All that work seemingly for naught; but fortunately, we've learned something along the way. Don't lose heart: dawn comes after the darkest hour. In the next lesson we'll talk about how to handle these AJAX requests.

Anyone interested in crawlers is welcome to join the QQ group for discussion: 342953471.

