How can I scrape Xueqiu (Snowball) webpages using Python?

Source: Internet
Author: User
Tags: network, function
I want to use BeautifulSoup or another Python package to scrape some portfolios on the Xueqiu (Snowball) website, because Xueqiu does not send a notification when a portfolio's positions change. For example, I want to scrape http://xueqiu.com/P/ZH010389 . The basic idea is to use a program to track its positions and prompt me when there is a change.

# In short, what needs to be done is: open this page, open its rebalancing history, record the current positions, and compare them with the previous ones. #

The problem is: I don't know much about HTML. When I open Chrome's developer tools, I don't know how to make my program open the rebalancing history...

This question may be too trivial... Sorry for the trouble!!!

Reply content:

// A lot of people are saying that following a portfolio now gives you a reminder...... well, that feature clearly did not exist when the OP asked the question. I wrote this only as an exercise while learning to write crawlers; I do not trade stocks and do not want to dig any deeper into Xueqiu......

// It got quite a few likes: How do I get started with Python crawlers? - Duan Xiaochen's answer

Written and debugged as I went ~
# Start coding
First, you have to know what you are scraping ~ Looking for it in the HTML source is actually the wrong idea here, because the content we want is not in the original HTML but in the data exchanged between the browser and the server; we only need to find that piece of data.
# I used Firefox's Firebug.
Select the Network panel (in Chrome it is also called Network), click the rebalancing history, and you will see the request that returns the data:

import urllib.request

# 'headers' was left undefined in the original snippet; a browser-style User-Agent
# (the same one used in the later answer) is assumed here.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0'}

url = 'http://xueqiu.com/cubes/rebalancing/history.json?cube_symbol=ZH010389&count=20&page=1'
req = urllib.request.Request(url, headers=headers)
html = urllib.request.urlopen(req).read().decode('utf-8')
print(html)
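
The request returns raw JSON. As a minimal follow-up sketch for parsing it (the 'list', 'rebalancing_histories' and weight field names are taken from the answer further down; Xueqiu may change them at any time):

import json

data = json.loads(html)
# Walk the most recent rebalancing records and print each position change.
for item in data['list'][0]['rebalancing_histories']:
    print(item['stock_name'], item['prev_weight'], '-->', item['target_weight'])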

Now, when you follow a portfolio, you do get a prompt when its positions change. But I still find this interesting: for example, you could scrape the position data of many portfolios and do some aggregate analysis, such as which stock is held by the most portfolios on the site, or which stock was bought into the most in a single day.

So I decided to scrape it, and as usual, to write a program that crawls it automatically.

Step 1: Analyze the page

To scrape a webpage, you first have to "study" it. Generally there are two approaches:

One is Chrome's developer tools. The Network panel shows every network request the page sends, and most data requests appear under the XHR tag. Click a request to see its details and the result the server returns. Many websites have dedicated request endpoints for their data and return JSON or XML for the front end to process and display.


The other is to view the page source directly, which is usually available in the browser's context menu. Search the page's HTML source for the data you want, analyze its format, and prepare to extract it.

For a Xueqiu portfolio page, I skimmed the requests and, contrary to expectation, did not find a data endpoint directly. Looking at the source code instead, I found the following snippet:

SNB.cubeInfo = {"id":10289, "name":"swear to take the old knife to the next position", "symbol":"ZH010389", ...three thousand words omitted..., "created_date":"2014.11.25"};
SNB.cubePieData = [{"name":"car", "weight":100, "color":"#537299"}];
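
Since the data is embedded in the page source as JavaScript assignments, one rough sketch is to fetch the HTML and cut the JSON out with a regular expression. Everything here is an assumption for illustration: the SNB.cubePieData variable name comes from the snippet above, the portfolio symbols are placeholders, and the array is assumed to be valid JSON. It also covers the aggregate analysis mentioned earlier (which holdings show up most often):

import json
import re
import urllib.request
from collections import Counter

headers = {'User-Agent': 'Mozilla/5.0'}  # a browser-like UA, as in the other snippets

def fetch_pie_data(cube_symbol):
    # Fetch a portfolio page and pull the SNB.cubePieData array out of the HTML.
    url = 'http://xueqiu.com/P/' + cube_symbol
    req = urllib.request.Request(url, headers=headers)
    html = urllib.request.urlopen(req).read().decode('utf-8')
    match = re.search(r'SNB\.cubePieData\s*=\s*(\[.*?\]);', html, re.S)
    return json.loads(match.group(1)) if match else []

# Count which holdings appear most often across a few portfolios (symbols are placeholders).
counter = Counter()
for symbol in ['ZH010389', 'ZH340739']:
    for holding in fetch_pie_data(symbol):
        counter[holding['name']] += 1
print(counter.most_common(10))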

Xueqiu has changed its rules many times, so a lot of earlier code may no longer work.
I just wrote a simulated login for Xueqiu: fuck-login/012 xueqiu.com at master · xchaoinfo/fuck-login · GitHub.
Modifying it on that basis can achieve what the OP wants, and it can be made even simpler.
If you handle the cookies properly, you do not need to log in on every run; for details see fuck-login/001 zhihu at master · xchaoinfo/fuck-login · GitHub.
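
A minimal sketch of that kind of cookie persistence, assuming an arbitrary local cookie file (the file name and flow are illustrative, not the repository's actual code):

import http.cookiejar
import urllib.request

# Keep cookies in a file so the login survives between runs.
cookie_jar = http.cookiejar.LWPCookieJar('xueqiu_cookies.txt')
try:
    cookie_jar.load(ignore_discard=True)   # reuse cookies saved by an earlier run
except OSError:
    pass                                   # first run: no cookie file yet

opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
urllib.request.install_opener(opener)

# ... log in through `opener` only if the saved cookies are missing or expired ...
cookie_jar.save(ignore_discard=True)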

Two modules are required:

  • Crawler module: only responsible for capturing and storing data

  • Data processing module: processes the data stored by the crawler; if a portfolio's position data changes, it sends you a notification


The simple process of the crawler is as follows:

  1. Scheduled access to the target page

  2. Capture the data on the current target page and save it to the database

Simple process of the data processing module:

  1. Read the database on a schedule

  2. When the data in the database meets a certain condition, perform the action you set (a sketch of both modules follows below).
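
A bare-bones sketch of the two modules under stated assumptions: sqlite for storage, the history.json endpoint and field names reused from the other answers in this thread, a placeholder portfolio symbol, and a plain print() standing in for a real notification:

import json
import sqlite3
import time
import urllib.request

DB = sqlite3.connect('positions.db')
DB.execute('CREATE TABLE IF NOT EXISTS positions (cube TEXT, stock TEXT, weight REAL, fetched_at REAL)')

HEADERS = {'User-Agent': 'Mozilla/5.0'}  # browser-like UA, as in the other snippets
URL = 'http://xueqiu.com/cubes/rebalancing/history.json?cube_symbol=ZH010389&count=20&page=1'

def crawl():
    # Crawler module: fetch the latest rebalancing records and store them.
    req = urllib.request.Request(URL, headers=HEADERS)
    data = json.loads(urllib.request.urlopen(req).read().decode('utf-8'))
    now = time.time()
    for rec in data['list'][0]['rebalancing_histories']:
        DB.execute('INSERT INTO positions VALUES (?, ?, ?, ?)',
                   ('ZH010389', rec['stock_name'], rec['target_weight'], now))
    DB.commit()

def check():
    # Data processing module: compare the two latest snapshots and "notify" on change.
    stamps = DB.execute('SELECT DISTINCT fetched_at FROM positions ORDER BY fetched_at DESC LIMIT 2').fetchall()
    if len(stamps) < 2:
        return
    snap = lambda t: dict(DB.execute('SELECT stock, weight FROM positions WHERE fetched_at = ?', (t,)).fetchall())
    new, old = snap(stamps[0][0]), snap(stamps[1][0])
    for stock, weight in new.items():
        if old.get(stock) != weight:
            print('Position changed:', stock, old.get(stock), '-->', weight)  # stand-in for a real alert

while True:           # run both modules on a schedule
    crawl()
    check()
    time.sleep(3600)  # once an hour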

Scraping Xueqiu data? Coincidentally, I just read an article dedicated to this and recommend it to everyone: How to write Internet finance crawlers?

Portfolios you have followed already push a rebalancing notification.

# The techie way is brute force: if you do not get told about rebalancing, just scrape it directly... # I made some optimizations on top of @Duan Xiaochen's version; at the moment it looks like this.

Enter your account and password before testing.

Updates:
Added automatic cookie retrieval
Modified the code that displays the changed portfolio

import urllib.request
import urllib.parse
import json
import http.cookiejar

# Set up cookie handling
cookie_jar = http.cookiejar.CookieJar()
cookie_processor = urllib.request.HTTPCookieProcessor(cookie_jar)
opener = urllib.request.build_opener(cookie_processor)
urllib.request.install_opener(opener)

# Log in to get the cookie (fill in your own account and password)
params = urllib.parse.urlencode({'username': '******', 'password': '******'}).encode(encoding='utf8')
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0'}
request = urllib.request.Request('http://xueqiu.com/user/login', headers=headers)
httpf = opener.open(request, params)

# Fetch the portfolio's rebalancing history
url = 'http://xueqiu.com/cubes/rebalancing/history.json?cube_symbol=ZH340739&count=20&page=1'
req = urllib.request.Request(url, headers=headers)
html = urllib.request.urlopen(req).read().decode('utf-8')
data = json.loads(html)

# Print each position change
stockdata = data['list'][0]['rebalancing_histories']
for i in range(len(stockdata)):
    print('Stock name', end=': ')
    print(stockdata[i]['stock_name'], end='  position change ')
    print(stockdata[i]['prev_weight'], end=' --> ')
    print(stockdata[i]['target_weight'])
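
To get the prompt the OP asked for, a small follow-up sketch (continuing from the stockdata variable above) can compare the latest target weights with whatever was saved on the previous run; the file name last_positions.json is just an example, not part of the original answer:

import json
import os

STATE_FILE = 'last_positions.json'   # illustrative file for the previous snapshot

# Build {stock name: target weight} from the records fetched above.
current = {rec['stock_name']: rec['target_weight'] for rec in stockdata}

previous = {}
if os.path.exists(STATE_FILE):
    with open(STATE_FILE, encoding='utf8') as f:
        previous = json.load(f)

# Any difference from the last run is the prompt the OP wants.
for name, weight in current.items():
    if previous.get(name) != weight:
        print('Position changed:', name, previous.get(name), '-->', weight)

with open(STATE_FILE, 'w', encoding='utf8') as f:
    json.dump(current, f, ensure_ascii=False)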

Three libraries are required: urllib2, cookielib, and json.
Then open "swear to take the old knife down" (the portfolio) in Firefox, log in, and grab the cookie.
The rebalancing history address is: http://xueqiu.com/cubes/rebalancing/history.json?cube_symbol=ZH010389&count=20&page=1
Use urllib2 and cookielib to forge the headers, attach the cookie, and fetch the rebalancing records as JSON; then the json module can handle the rest. I do not know whether following the portfolio pushes a notification now......

Use shell.
