I want to use BeautifulSoup or some other Python package to grab the portfolio data from the Xueqiu (Snowball) website, because Xueqiu gives no notification when a portfolio's positions change. For example, I want to crawl this page: http://xueqiu.com/p/ZH010389. The basic idea is to have a program track its positions, and when they change, the program alerts me.
In short, what I want to do is: open the page, open its rebalancing history, record the current positions, and compare them against the previous positions.
The problem is: since I don't know much about HTML, even with Chrome's developer tools open I can't work out how to get my program to open the rebalancing history...
This may be a rather newbie question... thanks in advance!
Reply content:
Many people have pointed out that following a portfolio already gives you a reminder... well, that is obviously not what the question is about. I wrote this purely as an exercise while learning to write crawlers. I don't trade stocks and I'm not on Xueqiu...
Since this got a lot of upvotes, let me plug my own answer to "How to get started with Python crawlers?" - Chaixiao's answer.
I'll write the code as I go ~
#start coding
First, you need to know what you are crawling ~ The OP's idea of hunting through the HTML code is actually wrong, because the content we want is not in the original HTML. But it is certainly somewhere in the communication between the browser and the server; we just need to find that piece of data.
# I used Firefox's Firebug
Select the Network panel (it works similarly in Chrome) and click "Rebalancing history".
You can see a request go from the browser to the server. Copy the URL it requested and open it.
It looks like a mess, but if you look carefully you will find...
...that the data we want is right here. So we just need to fetch that page's content and extract the data.
```python
import urllib.request

# `headers` needs at least a browser-like User-Agent,
# or the site is likely to reject the request.
headers = {'User-Agent': 'Mozilla/5.0'}

url = ('http://xueqiu.com/cubes/rebalancing/history.json'
       '?cube_symbol=ZH010389&count=20&page=1')
req = urllib.request.Request(url, headers=headers)
html = urllib.request.urlopen(req).read().decode('utf-8')
print(html)
```
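The response above is plain JSON, so once fetched it can be parsed with the standard `json` module. A minimal sketch, using a hand-written sample response (the field names are taken from the other answers in this thread and may not match the live API exactly):

```python
import json

# Hand-written sample mimicking the structure of the history endpoint's JSON;
# the field names here are inferred from this thread, not confirmed live.
sample = ('{"count": 20, "page": 1, "list": [{"id": 1, '
          '"rebalancing_histories": [{"stock_name": "SampleStock", '
          '"prev_weight": 10.0, "target_weight": 15.0}]}]}')

data = json.loads(sample)
changes = []
for record in data['list']:
    for h in record['rebalancing_histories']:
        # One readable line per position change
        changes.append('%s: %s -> %s'
                       % (h['stock_name'], h['prev_weight'], h['target_weight']))
print(changes)
```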
Following a portfolio now already gives you a notification of position changes. But I think this is still interesting: for example, you could capture a lot of public portfolio data and do some aggregate analysis, to see which stock is held most widely on the site right now, or which stock saw the most rebalancing on a given day.
So I decided to give it a try, and document along the way the process I usually follow when writing an automated crawler.
STEP.1 Analyze the page
To scrape a web page, you naturally have to "study" that page first. I usually do it in one of two ways:
One is Chrome's developer tools. The Network panel shows every request the page makes, and most data requests sit under the XHR tab. Click a request to see its details and the server's response. Many websites have a dedicated request interface for their data, which returns a set of JSON- or XML-formatted data for the front end to display.
The other is viewing the page source directly, usually available in the browser's right-click menu. Find your data straight in the HTML source and analyze its format, to prepare for the crawl.
For a portfolio page on Xueqiu, a cursory look at the requests it made did not directly turn up a data interface as I expected. Looking at the source instead, I found a fragment like this:
SNB.cubeInfo = {"id": 10289, "name": "Sworn to take the old knife down", "symbol": "ZH010389", ... some 3,000 characters omitted ... "created_date": "2014.11.25"};
SNB.cubePieData = [{"name": "Car", "weight": …, "color": "#537299"}];
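Since the data sits in the page source as a JavaScript assignment rather than behind a JSON endpoint, one way to get at it is to cut the object literal out of the HTML with a regular expression and parse it as JSON. A rough sketch under that assumption (the variable name and the sample source below are illustrative, modeled on the fragment quoted above):

```python
import json
import re

# Illustrative page source modeled on the quoted fragment.
page_source = '''
<script>
SNB.cubeInfo = {"id": 10289, "name": "demo", "symbol": "ZH010389",
                "created_date": "2014.11.25"};
</script>
'''

# Grab the object literal assigned to SNB.cubeInfo. A non-greedy match up to
# the first "};" works for this flat sample; a real page with nested objects
# would need a more careful brace-matching extraction.
match = re.search(r'SNB\.cubeInfo\s*=\s*(\{.*?\});', page_source, re.S)
cube_info = json.loads(match.group(1))
print(cube_info['symbol'])
```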
Xueqiu has changed its rules a lot, so much of the earlier code probably no longer works.
I just wrote a simulated login for Xueqiu: fuck-login/012 xueqiu.com at master · xchaoinfo/fuck-login · GitHub
Modifying it slightly can achieve what the OP wants, and more simply.
To handle cookies so that you don't have to log in every single time, you can refer to how fuck-login/001 zhihu at master · xchaoinfo/fuck-login on GitHub handles it. It takes two modules working together:
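For the cookie part, the standard library alone can persist cookies between runs: `LWPCookieJar` can save to and load from a file, so the login only has to happen when no saved cookie exists. A minimal sketch (the file name is arbitrary, and the login step itself is elided):

```python
import http.cookiejar
import os
import urllib.request

COOKIE_FILE = 'xueqiu_cookies.txt'  # arbitrary local path

# LWPCookieJar can persist itself to disk, unlike the plain CookieJar.
cookiejar = http.cookiejar.LWPCookieJar(COOKIE_FILE)
if os.path.exists(COOKIE_FILE):
    # Reuse the cookies from an earlier session instead of logging in again.
    cookiejar.load(ignore_discard=True, ignore_expires=True)

opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookiejar))
urllib.request.install_opener(opener)

# ... log in with `opener` here only if no valid cookie was loaded ...

# Save whatever cookies we now have for the next run.
cookiejar.save(ignore_discard=True, ignore_expires=True)
```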
The simple flow of the crawler module:
- Visit the target page on a timer
- Extract the data from the current target page and store it in the database
The simple flow of the data-processing module:
- Query the database on a timer
- When the data in the database meets some condition, perform your preset action
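The two modules above can be sketched in a few lines of standard-library Python; here `fetch_positions()` is a stand-in for the actual page-fetching code discussed elsewhere in this thread, and SQLite plays the role of the database:

```python
import sqlite3

# Stand-in for the crawler's fetch step; the real program would hit the
# rebalancing-history JSON endpoint described in the other answers.
def fetch_positions():
    return {'SampleStock': 15.0}

def poll_once(conn):
    """One tick of the loop: store the latest snapshot, report weight changes."""
    conn.execute('CREATE TABLE IF NOT EXISTS positions '
                 '(stock TEXT PRIMARY KEY, weight REAL)')
    changes = []
    for stock, weight in fetch_positions().items():
        row = conn.execute('SELECT weight FROM positions WHERE stock = ?',
                           (stock,)).fetchone()
        if row is None or row[0] != weight:
            # (stock, old weight or None, new weight)
            changes.append((stock, None if row is None else row[0], weight))
        conn.execute('INSERT OR REPLACE INTO positions VALUES (?, ?)',
                     (stock, weight))
    conn.commit()
    return changes

conn = sqlite3.connect(':memory:')
print(poll_once(conn))  # first run reports every position as new
# The real crawler would repeat this on a timer, e.g.
# while True: poll_once(conn); time.sleep(60)
```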
Scraping Xueqiu's data? What a coincidence, I just saw an article about exactly this and recommend it to everyone: "How to write an internet-finance crawler". Also, following a portfolio will get you position-change notifications.
# Techie types are so hardcore; couldn't you just follow the portfolio instead of scraping it... # I made a few small optimizations on top of @Chaixiao's code; here is the current version.
Please fill in your account and password before testing.
Updates:
- Added automatic cookie acquisition
- Modified the code that displays portfolio changes
```python
import http.cookiejar
import json
import urllib.parse
import urllib.request

# Set up the cookie jar so the login cookie is reused on later requests
cookiejar = http.cookiejar.CookieJar()
cookieprocessor = urllib.request.HTTPCookieProcessor(cookiejar)
opener = urllib.request.build_opener(cookieprocessor)
urllib.request.install_opener(opener)

# Log in to obtain the cookie (fill in your own username and password)
params = urllib.parse.urlencode(
    {'username': '******', 'password': '******'}).encode('utf-8')
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:38.0) '
                         'Gecko/20100101 Firefox/38.0'}
request = urllib.request.Request('http://xueqiu.com/user/login',
                                 headers=headers)
httpf = opener.open(request, params)

# Fetch the portfolio's rebalancing history
url = ('http://xueqiu.com/cubes/rebalancing/history.json'
       '?cube_symbol=ZH340739&count=20&page=1')
req = urllib.request.Request(url, headers=headers)
html = urllib.request.urlopen(req).read().decode('utf-8')
data = json.loads(html)

stockdata = data['list'][0]['rebalancing_histories']
for i in range(len(stockdata)):
    print('Stock name', end=': ')
    print(stockdata[i]['stock_name'], end=' position change ')
    print(stockdata[i]['prev_weight'], end=' -> ')
    print(stockdata[i]['target_weight'])
```
First, you need three libraries: urllib2, cookielib, and json (this answer is written for Python 2, where urllib2 and cookielib live).
Then use Firefox to open the "Sworn to take the old knife down" portfolio, log in, and locate the cookie.
The address of the rebalancing history is: http://xueqiu.com/cubes/rebalancing/history.json?cube_symbol=ZH010389&count=20&page=1. Forge the headers with urllib2 and cookielib, and with the cookie you can fetch the records in JSON format; after that it is just a matter of processing the JSON. As for the push notification the OP wants after a change is detected, that can be done with the shell.