I just signed up for this blog and I'm very excited, so to fill out a new category here's a casual write-up, and also so I don't forget what I've just learned. Because this blogger is very lazy, this article does not cover installing Python (and its various modules) or Python syntax.
Goal
A few days ago I saw a very funny show on Bilibili called "Stupid Girl", which is actually an anime adapted from a comic of the same name. We all know an anime usually updates only once a week, which is hardly enough for us, so we'll write a crawler and grab the comic instead.
So the goal of this article is to crawl the "Hatsune Mix" comic (because I've already crawled "Stupid Girl" >_<). I remember reading this comic back in primary school; it was my first contact with Japanese manga.
There are a lot of comic sites on the Internet; let's go with dmzj (动漫之家, "Comic House")!
Required Modules
requests, for downloading web pages
re, the regular-expression module that ships with Python
BeautifulSoup, for parsing HTML so elements are easier to find (though it goes unused this time)
Begin!
STEP1: Determine which Web page address to crawl
Using the site's search tool, we reach the search page http://manhua.dmzj.com/tags/search.shtml?s=初音mix
and find the target address http://manhua.dmzj.com/chuyinmix
Next, open ch001 to see the pictures we are going to crawl.
Analyze the image page (press F12 to open the developer tools) so that we can find the link to the picture.
We quickly locate the image link and where it sits in the page's code.
But when we view the page source, that tag is nowhere to be found.
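A tiny offline sketch shows what the crawler actually sees (the HTML snippet here is made up for illustration, not taken from dmzj): an element that a script inserts at runtime simply does not exist in the raw HTML that gets downloaded.

```python
import re

# A made-up page as the crawler downloads it: the <img> element is
# added later by JavaScript, so the raw HTML contains no <img> tag.
static_html = '<div id="viewer"></div><script>/* builds the image element at runtime */</script>'

# Searching the static source for an image tag finds nothing.
match = re.search(r'<img\s[^>]*>', static_html)
print(match)  # None: the tag only exists after the script runs
```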
That's because the page a crawler downloads is the static HTML file, before any other scripts have "processed" it. Content generated by dynamic scripts is therefore invisible to us (unless we bring in tools like PhantomJS or Selenium). But since this is a crawler primer, let's "bypass" this annoyance instead; the bypass is a little trick:
Tip 1: if a web page is hard to crawl, try crawling its mobile version instead.
The mobile search address is easy to find: http://m.dmzj.com/search/初音mix.html
The interface looks like this.
(Mobile sites really do look low-tech on a computer...)
The detail page is just as easy to find: http://m.dmzj.com/info/chuyinmix.html
Then open ch001, view the page source, and find something like this:
mReader.initData({"id": 12777, "comic_id": 6132, "chapter_name": "ch001", "chapter_order": 10, "createtime": 1284436621,
"folder": "c\/\u521d\u97f3miku\/ch001", "page_url": ["https:\/\/images.dmzj.com\/c\/\u521d\u97f3miku\/ch001\/001.jpg",
"https:\/\/images.dmzj.com\/c\/\u521d\u97f3miku\/ch001\/002.png", ...
Such an obvious address, completely defenseless...
So we've found where the pictures live, and we can start crawling.
STEP2: Crawl from the search page to the detail page to the picture page
A simple bit of work gets us this Python code:
#!/usr/bin/env python
# coding=utf-8
import requests, re
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
           "Referer": "http://m.dmzj.com/"}
url = "http://m.dmzj.com/info/chuyinmix.html"
html = requests.get(url, headers=headers)
soup = BeautifulSoup(html.content, "lxml")
print(soup.prettify())
Here is a quick introduction to the most common, most basic, and most easily bypassed anti-crawler measure.
In the HTTP protocol, the client requests a web page by sending HTTP request headers.
In request headers sent by programs or scripts such as Python, the User-Agent doesn't look like a browser's; it looks like "python-requests/2.14.2".
A typical browser User-Agent looks like "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:55.0) Gecko/20100101 Firefox/55.0".
This lets the server tell crawlers and browsers apart by their User-Agent.
But we can modify the request headers to get around that. The server isn't silly either: it also checks the Referer header.
Referer tells the server which page the browser was on when it opened the current one.
For example, if you search Baidu for "Hatsune Miku my wife" and click any result, the Referer records the address of the Baidu results page.
Of course, we can modify Referer too, hehe.
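The header spoofing described above can be sketched with just the standard library's urllib, no requests (or network access) needed merely to build the request object; the URL and header values are the ones used in this article:

```python
from urllib.request import Request

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Referer": "http://m.dmzj.com/",  # pretend we came from the mobile site
}

# Build (but don't send) the request; the spoofed headers ride along with it.
# Note that urllib normalizes header names, so User-Agent becomes "User-agent".
req = Request("http://m.dmzj.com/info/chuyinmix.html", headers=headers)
print(req.get_header("Referer"))     # http://m.dmzj.com/
print(req.get_header("User-agent"))  # the browser-like string above
```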
Run the requests script and you can see the source of this page.
Then we use regular expressions to search out the content we want.
#!/usr/bin/env python
# coding=utf-8
import requests, re

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
           "Referer": "http://m.dmzj.com/"}
urlRoot = "http://m.dmzj.com/info/chuyinmix.html"
urlPre = "http://m.dmzj.com/view/6132/"
html = requests.get(urlRoot, headers=headers)
nameList = re.findall(r'(?<="chapter_name":")[^"]+', html.text)
idList = re.findall(r'(?<="id":)[^,]+', html.text)
for i in range(15):
    url = urlPre + idList[i] + ".html"
    html = requests.get(url, headers=headers)
    print(html.text)
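As a sanity check that can run without the network, the two lookbehind patterns behave like this on an abridged copy of the JSON shown in STEP 1:

```python
import re

# An abridged copy of the mReader.initData(...) payload from STEP 1.
sample = ('{"id":12777,"comic_id":6132,"chapter_name":"ch001",'
          '"chapter_order":10,"createtime":1284436621}')

# Lookbehind keeps the match itself clean: only the value is captured.
names = re.findall(r'(?<="chapter_name":")[^"]+', sample)
ids = re.findall(r'(?<="id":)[^,]+', sample)
print(names)  # ['ch001']
print(ids)    # ['12777'] -- "comic_id" is not matched, because the
              # characters before its value are '_id":', not '"id":'
```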
The output looks fine, so let's try crawling a single picture:
#!/usr/bin/env python
# coding=utf-8
import requests, re, os

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
           "Referer": "http://m.dmzj.com/"}
urlRoot = "http://m.dmzj.com/info/chuyinmix.html"
urlPre = "http://m.dmzj.com/view/6132/"
html = requests.get(urlRoot, headers=headers)
nameList = re.findall(r'(?<="chapter_name":")[^"]+', html.text)
idList = re.findall(r'(?<="id":)[^,]+', html.text)
for i in range(15):
    url = urlPre + idList[i] + ".html"
    html = requests.get(url, headers=headers)
    print(html.text)
    urlList = re.findall(r'https:\\/\\/[^"]+', html.text)
    for idx, string in enumerate(urlList):
        img = requests.get(string.replace(r"\/", "/").encode().decode("unicode-escape"), headers=headers)
        ext = string.split(".")[-1]
        if not os.path.exists(nameList[i]):
            os.mkdir(nameList[i])
        file = open(nameList[i] + "/%03d.%s" % (idx, ext), "ab")
        file.write(img.content)
        file.close()
        break
    break
Tip 2: use str.encode().decode("unicode-escape") to convert the Unicode escapes in the URL. It seems to work. Hehe >_<
So let it loose and crawl everything:
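The tip can be checked offline against one of the escaped URLs copied from the STEP 1 page source:

```python
# The URL exactly as it appears (escaped) inside the page's JavaScript.
raw = r"https:\/\/images.dmzj.com\/c\/\u521d\u97f3miku\/ch001\/001.jpg"

# First un-escape the slashes, then let unicode-escape decode the \uXXXX runs.
url = raw.replace("\\/", "/").encode().decode("unicode-escape")
print(url)  # https://images.dmzj.com/c/初音miku/ch001/001.jpg
```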
#!/usr/bin/env python
# coding=utf-8
import requests, re, os

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
           "Referer": "http://m.dmzj.com/"}
urlRoot = "http://m.dmzj.com/info/chuyinmix.html"
urlPre = "http://m.dmzj.com/view/6132/"
html = requests.get(urlRoot, headers=headers)
nameList = re.findall(r'(?<="chapter_name":")[^"]+', html.text)
idList = re.findall(r'(?<="id":)[^,]+', html.text)
for i in range(15):
    url = urlPre + idList[i] + ".html"
    html = requests.get(url, headers=headers)
    urlList = re.findall(r'https:\\/\\/[^"]+', html.text)
    for idx, string in enumerate(urlList):
        img = requests.get(string.replace(r"\/", "/").encode().decode("unicode-escape"), headers=headers)
        ext = string.split(".")[-1]
        if not os.path.exists(nameList[i]):
            os.mkdir(nameList[i])
        file = open(nameList[i] + "/%03d.%s" % (idx, ext), "ab")
        file.write(img.content)
        file.close()
A minute later, the crawl succeeds >_<
Summary: this crawl made simple use of Python's requests, re, and os libraries, and briefly introduced HTTP request headers and the most common anti-crawler mechanism. A purely-for-fun article, haha.
Python crawler simple Introduction and Tips