A Simple Introduction to Python Crawlers, with Tips

Source: Internet
Author: User

I just got my blog approved and I'm extremely excited. So, to fill out the categories, here is a casual essay, written down before I forget what I've just learned. Because this blogger is very lazy, this article does not cover installing Python (and its various modules) or Python syntax.

Goal

A few days ago I saw a very funny show on Bilibili called "Stupid Girl", which is actually the animated adaptation of the comic of the same name. As we all know, anime usually updates only once a week, which hardly satisfies us, so let's write a crawler and fetch the comic itself.

So the goal of this article is to crawl the comic "Hatsune Mix" (初音mix), because I've already crawled "Stupid Girl" >_<. I remember reading this comic back in primary school; it was my first contact with Japanese manga.

There are plenty of comic sites on the Internet; let's go with Dmzj (动漫之家)!

Required Modules

requests: for downloading web pages

re: Python's built-in regular expression module

BeautifulSoup: for parsing HTML, which makes finding elements easier (though it goes unused this time)

Begin!

STEP 1: Find the page address to crawl

Make skilled use of the site's search tool to reach the search page: http://manhua.dmzj.com/tags/search.shtml?s=初音mix

There we find the target address: http://manhua.dmzj.com/chuyinmix

Next, open ch001 and see the pictures we want to crawl.

Analyze the image page (press F12 to open the developer tools) so we can find the link to the picture.

We quickly find the image link and where it sits in the rendered code.

But when we view the page source, this tag is nowhere to be found.

That's because what a crawler fetches is the static HTML file, before any scripts have "processed" it. Dynamically generated content is therefore invisible to it (unless you bring in PhantomJS or Selenium). But since this is a crawler primer, we'll "bypass" this annoyance instead, and the bypass makes for a nice little trick:

Tip 1: If a web page is hard to crawl, try its mobile version.

It's easy to find the mobile search address: http://m.dmzj.com/search/初音mix.html

The interface is like this.

(The mobile site really does look low-budget on a computer...)

It's very easy to find the detail page: http://m.dmzj.com/info/chuyinmix.html

Then open ch001 and look at the page source, where we find something like this:

mReader.initData({"id":12777, "comic_id":6132, "chapter_name":"ch001", "chapter_order":10, "createtime":1284436621,
"folder":"c\/\u521d\u97f3miku\/ch001", "page_url":["https:\/\/images.dmzj.com\/c\/\u521d\u97f3miku\/ch001\/001.jpg",
"https:\/\/images.dmzj.com\/c\/\u521d\u97f3miku\/ch001\/002.png", ...

Such an obvious address; the site really is defenseless...
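Incidentally, rather than plucking fields out with regexes one at a time, the whole initData payload can be handed to json.loads, which also resolves the \/ and \uXXXX escapes for free. Here is a minimal sketch on a hand-trimmed sample of the payload above; this is an alternative approach, not what the rest of this article does:

```python
import json
import re

# Hand-trimmed sample of the mReader.initData(...) call found in the page source
sample = ('mReader.initData({"id":12777, "comic_id":6132, "chapter_name":"ch001", '
          '"folder":"c\\/\\u521d\\u97f3miku\\/ch001", '
          '"page_url":["https:\\/\\/images.dmzj.com\\/c\\/\\u521d\\u97f3miku\\/ch001\\/001.jpg"]})')

# Grab the {...} argument and let json.loads handle the \/ and \uXXXX escapes
match = re.search(r'initData\((\{.*\})\)', sample, re.S)
data = json.loads(match.group(1))
print(data["page_url"][0])  # https://images.dmzj.com/c/初音miku/ch001/001.jpg
```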

With the image locations found, we can start crawling.

STEP 2: Crawl from the search page to the detail page to the pictures

A simple bit of Python gets us the page source:

```python
#!/usr/bin/env python
# coding=utf-8
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
                         "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
           "Referer": "http://m.dmzj.com/"}
url = "http://m.dmzj.com/info/chuyinmix.html"

html = requests.get(url, headers=headers)
soup = BeautifulSoup(html.content, "lxml")
print(soup.prettify())
```

Here is a quick introduction to the most common, most basic, and most easily bypassed anti-crawler measure.

In the HTTP protocol, the client requests a web page by sending HTTP request headers.

In requests sent by programs or scripts such as Python, the User-Agent doesn't look like a browser's; it looks like "python-requests/2.14.2".

A typical browser User-Agent looks like "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:55.0) Gecko/20100101 Firefox/55.0".

This lets the server tell crawlers apart from browsers by the User-Agent alone.

We can bypass that by modifying the request headers, but the server isn't stupid either: it may also check the Referer header.

Referer tells the server which page the browser was on when it opened the current page.

For example, if you search Baidu for "Hatsune my wife" and click open any result, the Referer records the address of the Baidu search results page.

Of course, we can modify the Referer too, hehe.
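To make this concrete, here is a small sketch comparing the User-Agent that requests announces by default with a spoofed browser one. It uses requests.Request(...).prepare() to build the request locally, so nothing is actually sent over the network:

```python
import requests

# What requests would announce by default: "python-requests/<version>"
print(requests.utils.default_user_agent())

# Spoofed headers: pretend to be a browser, and claim we came from the mobile site
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
                         "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
           "Referer": "http://m.dmzj.com/"}

# prepare() builds the request, headers included, without sending it
prepared = requests.Request("GET", "http://m.dmzj.com/info/chuyinmix.html",
                            headers=headers).prepare()
print(prepared.headers["User-Agent"])
print(prepared.headers["Referer"])
```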

Run the script above and you can see the source of the page.

Then we use regular expressions to pull out the content we need:

```python
#!/usr/bin/env python
# coding=utf-8
import re
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
                         "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
           "Referer": "http://m.dmzj.com/"}
urlroot = "http://m.dmzj.com/info/chuyinmix.html"
urlpre = "http://m.dmzj.com/view/6132/"

html = requests.get(urlroot, headers=headers)
# Pull every chapter name and chapter id out of the embedded JSON
namelist = re.findall('(?<=chapter_name":")[^"]+', html.text)
idlist = re.findall('(?<="id":)[^,]+', html.text)

for i in range(15):
    url = urlpre + idlist[i] + ".html"
    html = requests.get(url, headers=headers)
    print(html.text)
```
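The two findall patterns above rely on lookbehind assertions: (?<=...) matches at a position only if it is preceded by the given text, so the prefix itself stays out of the captured result. A quick demonstration on a made-up sample string mimicking the embedded JSON:

```python
import re

# Made-up sample shaped like the JSON embedded in the detail page
text = '{"id":12777, "chapter_name":"ch001"},{"id":12778, "chapter_name":"ch002"}'

# (?<=chapter_name":") matches right after the literal prefix, capturing only the value
names = re.findall('(?<=chapter_name":")[^"]+', text)
ids = re.findall('(?<="id":)[^,]+', text)
print(names)  # ['ch001', 'ch002']
print(ids)    # ['12777', '12778']
```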

The output looks fine, so let's try grabbing a single picture:

```python
#!/usr/bin/env python
# coding=utf-8
import os
import re
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
                         "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
           "Referer": "http://m.dmzj.com/"}
urlroot = "http://m.dmzj.com/info/chuyinmix.html"
urlpre = "http://m.dmzj.com/view/6132/"

html = requests.get(urlroot, headers=headers)
namelist = re.findall('(?<=chapter_name":")[^"]+', html.text)
idlist = re.findall('(?<="id":)[^,]+', html.text)

for i in range(15):
    url = urlpre + idlist[i] + ".html"
    html = requests.get(url, headers=headers)
    print(html.text)
    # The image URLs appear as https:\/\/... inside the page's embedded JSON
    urllist = re.findall(r'https:\\/\\/[^"]+', html.text)
    for idx, string in enumerate(urllist):
        # Undo the \/ escapes, decode the \uXXXX escapes, then download
        img = requests.get(string.replace(r"\/", "/").encode().decode("unicode-escape"),
                           headers=headers)
        ext = string.split(".")[-1]
        if not os.path.exists(namelist[i]):
            os.mkdir(namelist[i])
        with open(namelist[i] + "/%03d.%s" % (idx, ext), "ab") as f:
            f.write(img.content)
        break  # just the first image for now
    break  # just the first chapter for now
```


Tip 2: Use str.encode().decode("unicode-escape") to convert the \uXXXX escapes in the URL. It seems to work. Hehe >_<
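Here is that trick in isolation, applied to an escaped URL of the kind the regex captures. One caveat worth knowing: unicode-escape treats non-escape bytes as Latin-1, so this only works cleanly on pure-ASCII input with \uXXXX escapes, which is exactly what we have here:

```python
# The raw string exactly as it appears in the page source (backslashes intact)
raw = r"https:\/\/images.dmzj.com\/c\/\u521d\u97f3miku\/ch001\/001.jpg"

# First undo the JSON-style \/ escapes, then decode the \uXXXX escapes
url = raw.replace(r"\/", "/").encode().decode("unicode-escape")
print(url)  # https://images.dmzj.com/c/初音miku/ch001/001.jpg
```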

Now turn it loose and crawl everything:

```python
#!/usr/bin/env python
# coding=utf-8
import os
import re
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
                         "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
           "Referer": "http://m.dmzj.com/"}
urlroot = "http://m.dmzj.com/info/chuyinmix.html"
urlpre = "http://m.dmzj.com/view/6132/"

html = requests.get(urlroot, headers=headers)
namelist = re.findall('(?<=chapter_name":")[^"]+', html.text)
idlist = re.findall('(?<="id":)[^,]+', html.text)

for i in range(15):
    url = urlpre + idlist[i] + ".html"
    html = requests.get(url, headers=headers)
    urllist = re.findall(r'https:\\/\\/[^"]+', html.text)
    for idx, string in enumerate(urllist):
        img = requests.get(string.replace(r"\/", "/").encode().decode("unicode-escape"),
                           headers=headers)
        ext = string.split(".")[-1]
        if not os.path.exists(namelist[i]):
            os.mkdir(namelist[i])
        with open(namelist[i] + "/%03d.%s" % (idx, ext), "ab") as f:
            f.write(img.content)
```

A minute later, the crawl succeeds >_<

To summarize: this crawl made simple use of Python's requests, re, and os libraries, and briefly introduced HTTP request headers and the most common anti-crawler measures. A purely-for-fun article, haha.

