Crawler in Action [1]: Crawling a blog post from Cnblogs using Python

Source: Internet
Author: User

For our first crawler, we take Cnblogs as the example.

Cnblogs is a typical static website: if you look at the source code of a blog post, you will see very little JavaScript, and even the CSS is relatively simple. This makes it very suitable for beginners practicing crawlers.

Using Cnblogs as our example, our goal is to fetch all the blog posts of a given blogger. Today we take the first step.

Step one: given the URL of an article, how do we get its body?

As an example, let's use a post from the "Farmer Uncle" blog, haha. He is a blogger I follow.

http://www.cnblogs.com/over140/p/4440137.html

This is his article titled "[Reading Notes] The Long Tail".

If we want to store this article, the first thing we need to save is its title, and then the body of the article.

How is the title of the article obtained?

Let's take a look at how the title of the article is positioned in the source code of the page.

As you can see, the title text is contained inside an <a> tag. Let's print out this tag:

<a id="cb_post_title_url" class="postTitle2" href="http://www.cnblogs.com/over140/p/4440137.html">[Reading Notes] The Long Tail</a>

The tag has an id attribute with the value "cb_post_title_url", a class attribute with the value "postTitle2", and an href attribute that points to the URL of this article.

This tag is easy to locate, so we can find the article title quickly.

The code is as follows:

```python
import urllib.request
import re

url = 'http://www.cnblogs.com/over140/p/4440137.html'
req = urllib.request.Request(url)
resp = urllib.request.urlopen(req)
html_page = resp.read().decode('utf-8')

title_pattern = r'(<a.*id="cb_post_title_url".*>)(.*)(</a>)'
title_match = re.search(title_pattern, html_page)
title = title_match.group(2)
# print(title)
```

The variable title above now holds the title of the article we want to crawl.

How do we get the body of the article?

Look at the structure of the article: everything in the body sits inside a single div tag, but that div contains many other tags, not just a run of plain text. For example, there are many <p></p> tags, as well as <strong> tags.

How do I get all the content?

My first guess: if we grab everything that sits between a '>' and a '<', we should get all the text content. Let's try the following code:

```python
div_pattern = r'<div>(.*)</div>'
div_match = re.search(div_pattern, html_page)
div = div_match.group(1)
# print(div)

result_pattern = r'>(.*)<'
result_match = re.findall(result_pattern, div)
result = ''
for i in result_match:
    result += str(i)
print(result)
```

Unfortunately, this fails... The printed output contains not only text but also leftover tags, such as <span>.
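To see why the greedy pattern keeps tag fragments, here is a minimal, self-contained sketch; the HTML snippet is made up for illustration, not taken from the actual page:

```python
import re

# A made-up snippet mimicking a post body with nested tags.
html = '<p>Hello <strong>long tail</strong> world</p>'

# The greedy pattern from the attempt above matches from the first '>'
# to the last '<', so nested tag markup survives inside the capture.
fragments = re.findall(r'>(.*)<', html)
print(fragments)
# ['Hello <strong>long tail</strong> world']
```

A non-greedy pattern like r'>(.*?)<' would behave better on this toy case, but regular expressions still stumble over attributes, comments, and entities in real pages, which is why a proper HTML parser is the safer choice.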

This shows the limitations of regular expressions for parsing HTML. Let's use BeautifulSoup to parse the document instead.

For an introduction to parsing content with BeautifulSoup, please review my earlier article on getting started with crawlers.

The code to get the div tag where the body is located is as follows:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_page, 'html.parser')
# print(soup.prettify())
div = soup.find(id="post_body")
# print(div.text)
print(div.get_text())
```

Haha, done! We got the body text. For convenience, we save the article to the current directory.

```python
filename = title + '.txt'
with open(filename, 'w', encoding='utf-8') as f:
    f.write(div.text)
```

OK, so far, we have obtained and saved this article.
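One caveat when saving: a post title can contain characters that are illegal in filenames (for example '?', '"', or '/'). A small helper to sanitize the title first; this is not part of the original code, and the name safe_filename is my own:

```python
import re

def safe_filename(title, ext='.txt'):
    # Replace characters that are invalid in Windows/Unix filenames
    # with underscores, then append the extension.
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip() + ext

print(safe_filename('"Reading Notes" The Long Tail'))
# _Reading Notes_ The Long Tail.txt
```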

Putting it all together, the full code looks like this:

```python
import urllib.request
import re
from bs4 import BeautifulSoup

url = 'http://www.cnblogs.com/over140/p/4440137.html'
req = urllib.request.Request(url)
resp = urllib.request.urlopen(req)
html_page = resp.read().decode('utf-8')

title_pattern = r'(<a.*id="cb_post_title_url".*>)(.*)(</a>)'
title_match = re.search(title_pattern, html_page)
title = title_match.group(2)
# print(title)

'''
# The failed regex attempt, kept for reference:
div_pattern = r'<div>(.*)</div>'
div_match = re.search(div_pattern, html_page)
div = div_match.group(1)
# print(div)
result_pattern = r'>(.*)<'
result_match = re.findall(result_pattern, div)
result = ''
for i in result_match:
    result += str(i)
print(result)
'''

soup = BeautifulSoup(html_page, 'html.parser')
# print(soup.prettify())
div = soup.find(id="post_body")
# print(div.text)
print(div.get_text())

filename = title + '.txt'
with open(filename, 'w', encoding='utf-8') as f:
    f.write(div.text)
```
