Web crawler technology is widely used on the internet, and Python makes writing crawlers very convenient. Last year, out of personal need, the author wrote a crawler to scrape animation images from P station (Pixiv), and would now like to use it as an example to introduce how to write a web crawler.
The main Python library used is urllib, and the Python version is 3.5.2. The browser used is Firefox.
2017.3.22 Update: the source code has been uploaded to GitHub; anyone interested can download it.
Entering the P station homepage, we can see a wide variety of images that can be viewed online. Sometimes we want to download them from P station and look at them in our free time. The author had this idea, but saving the images one at a time is far too slow, so why not write a program to automate the process?
First, let's set a clear goal: crawl the images that appear in P station's daily ranking, because the quality of the ranked pictures is generally higher.
Next we need to identify the URL of the page to crawl. Since we are scraping the daily ranking, the link is http://www.pixiv.net/ranking.php?mode=daily. Before writing any code, we use the browser to inspect the page's source, because the program we write essentially imitates what the browser does.
Press F12 on the open page to enter developer mode and use the element-selection feature. This lets us find the div in which a picture is located.
So where is the URL of the picture we want? Click the selection icon mentioned above, then click the corresponding picture, and you can find its div.
We look at the first picture; its div is shown below:
The red line identifies this picture's source address on P station: http://i3.pixiv.net/c/240x480/img-master/img/2017/03/12/18/02/13/61873774_p0_master1200.jpg.
Opening this address, sure enough, we see the picture displayed in the ranking.
But there is a problem: if we download from this link, we find that the picture is very small. Its resolution is only 240*139, which is obviously not right. This means the ranking page does not reference the original image directly. So where is the original?
Clicking an entry in the ranking takes you to that picture's detail page.
On the detail page, we find the original image uploaded by the artist displayed in front of us.
What is the URL of this picture? Again, press F12 to enter developer mode and view its URL.
The address is obvious.
Downloading with this URL gives a normal, full-size result.
This address is different from the one we got from the ranking page. Let's put them side by side for comparison.
The link from the ranking: http://i3.pixiv.net/c/240x480/img-master/img/2017/03/12/18/02/13/61873774_p0_master1200.jpg
The original link: http://i3.pixiv.net/img-original/img/2017/03/12/18/02/13/61873774_p0.jpg
Clearly, P station puts the original images in the img-original folder, while the ranking shows processed versions. After analyzing many pictures, we find that the two links are related in the same way for most of them.
First, their domains are the same: P station has many servers, but the two versions of a picture are basically on the same server. Second, the path prefixes differ only between "/c/240x480/img-master/" and "/img-original/". Third, the suffixes differ only after "p0": the ranking version has "_master1200" appended, while the original does not. In addition, most ranking links end in .jpg, but some originals are not .jpg but .png. That is, original links come in both .jpg and .png varieties, while the pictures in the ranking are always .jpg.
With this analysis done, it is clear what we should do: transform the strings, piecing together the original links from the links on the ranking page.
All right, let's get to it. First, the urllib library. This is the library for network connections in Python 3; in Python 2.7 the corresponding functionality lived in urllib2.
The urllib.request.urlopen function accepts a URL parameter, simulates opening the web page, and returns the response. Calling the read method on the return value gives the returned data, which is binary-encoded and can usually be decoded into a UTF-8 string with decode.
Use the urllib library to send a request to the server and obtain the corresponding HTML. We write a function gethtml:
import urllib.request

def gethtml(url):
    page = urllib.request.urlopen(url)
    html = page.read().decode('utf-8')
    return html
You can see that the Python code is very simple and easy to use.
Next, a regular expression is used to extract the picture links from the page. Here we use the browser's view-source feature to analyze the HTML.
This is the source code we see. Using keyword search, we find that the links are all written on line 353 of the source.
Based on the markup of the first picture, we construct the following regular expression:
r'data-filter="thumbnail-filter lazy-image" data-src="(.+?\.jpg)" data-type'
Using Python's regular expression module, we can then extract all the required picture links from that HTML.
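For example, combined with the gethtml function above, the extraction might look like this (a minimal sketch; the exact regular expression depends on how the page's HTML is laid out at the time):

import re

html = gethtml('http://www.pixiv.net/ranking.php?mode=daily')
reg = r'data-filter="thumbnail-filter lazy-image" data-src="(.+?\.jpg)" data-type'
urls = re.findall(reg, html)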
After getting these links, the next thing to do is transform them. Using str's replace method, we make the following substitutions:
url = url.replace('c/240x480/img-master', 'img-original')
url = url.replace('_master1200', '')
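Applied to the earlier example, this transformation turns
http://i3.pixiv.net/c/240x480/img-master/img/2017/03/12/18/02/13/61873774_p0_master1200.jpg
into
http://i3.pixiv.net/img-original/img/2017/03/12/18/02/13/61873774_p0.jpg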
Now comes the main event: downloading the pictures to local storage.
Because the original link may be either .png or .jpg, we have to prepare for both. First request the picture with the .jpg version; if that fails, try the .png version instead, and then there is no problem.
try:
    req = urllib.request.Request(url, None, headers)
    res = urllib.request.urlopen(req, timeout=1)
except urllib.error.HTTPError:
    url = url.replace('.jpg', '.png')
    req = urllib.request.Request(url, None, headers)
    res = urllib.request.urlopen(req, timeout=1)
In addition, there is a referer issue when requesting pictures from the server. To prevent hotlinking, P station fails any request whose headers do not carry a proper Referer, so we need to set one on the request. Its format is "http://www.pixiv.net/member_illust.php?mode=manga_big&illust_id=" plus the picture's ID number plus "&page=0"; in fact, that is the link of the page we were on before clicking through to the original picture. We construct a getreferer function to deal with this:
def getreferer(url):
    reference = "http://www.pixiv.net/member_illust.php?mode=manga_big&illust_id="
    reg = r'.+/(\d+)_p0'
    return reference + re.findall(reg, url)[0] + "&page=0"
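The headers passed to urllib.request.Request earlier can then be assembled with this function (a sketch; the User-Agent string here is just an illustrative value, not one from the original code):

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; rv:52.0) Gecko/20100101 Firefox/52.0',  # illustrative value
    'Referer': getreferer(url),
}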
The data obtained after the request is in res; use res.read() to fetch it, then write it to a local file, and the picture is downloaded.
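A sketch of that write step (deriving the filename from the URL is an illustrative choice, not the original code):

data = res.read()
res.close()
filename = url.split('/')[-1]   # e.g. 61873774_p0.jpg
with open(filename, 'wb') as f:
    f.write(data)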
Looping over this process downloads all the pictures in that day's ranking.
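Putting the pieces together, the loop can be sketched like this (download_image here is a hypothetical wrapper around the request-with-fallback and file-writing logic shown above, not a function from the original code):

html = gethtml('http://www.pixiv.net/ranking.php?mode=daily')
for url in re.findall(reg, html):
    url = url.replace('c/240x480/img-master', 'img-original')
    url = url.replace('_master1200', '')
    download_image(url)   # hypothetical: request with .jpg/.png fallback, then save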
You can also choose to crawl the rankings for different dates. P station stores this data on the main server, where it can be accessed directly; the date parameter in the link identifies which day's data is being requested.
For example, http://www.pixiv.net/ranking.php?mode=daily&date=20170312 requests the data for March 12, 2017. The rest of the operation is the same as above, only the link changes, and you can construct these links yourself however you like. In this way we can get P station's ranking pictures for any number of days.
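For instance, a small helper to build the URL for any given date might look like this (a sketch; ranking_url is not a function from the original code):

from datetime import date

def ranking_url(d):
    # build the daily-ranking URL for a given datetime.date
    return 'http://www.pixiv.net/ranking.php?mode=daily&date=' + d.strftime('%Y%m%d')

print(ranking_url(date(2017, 3, 12)))
# http://www.pixiv.net/ranking.php?mode=daily&date=20170312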
Finally, this article has only outlined the basic idea of a crawler and the basic method of implementing one, using the crawling of P station pictures as an example. Of course, the explanation here is relatively simple, and some topics were not expanded in depth; for example, the downloads could use multiprocessing to increase efficiency, and the crawl could be adapted to fetch different pictures for different needs.