Crawler in Action [1]: Crawling a blog post from Cnblogs using Python

Source: Internet
Author: User

For our first crawler, we take Cnblogs as the example.

Cnblogs is a typical static website: if you look at the source code of a blog post, you will see very little JavaScript, and even the CSS is relatively simple. This makes it very suitable for beginners practicing crawlers.

Using Cnblogs as our example, our goal is to fetch all the blog posts of a given blogger. Today we take the first step.

Step one: given the URL of an article, how do we get its body?

As an example, let's use a post from the "Farmer Uncle" blog, haha. He is a blogger I follow.

http://www.cnblogs.com/over140/p/4440137.html

This is his article titled "[Reading Notes] The Long Tail".

If we want to store this article, the first thing we need to save is its title, and then the body of the article.

How is the title of the article obtained?

Let's take a look at how the title of the article is positioned in the source code of the page.

As you can see, the title text is contained inside an <a> tag. Let's print out this tag:

<a id="cb_post_title_url" class="postTitle2" href="http://www.cnblogs.com/over140/p/4440137.html">[Reading Notes] The Long Tail</a>

The tag has an id attribute with the value "cb_post_title_url", a class attribute with the value "postTitle2", and an href attribute that points to the URL of this article.

This tag is easy to locate, so we can find the article title quickly.

The code is as follows:

```python
import urllib.request
import re

url = 'http://www.cnblogs.com/over140/p/4440137.html'
req = urllib.request.Request(url)
resp = urllib.request.urlopen(req)
html_page = resp.read().decode('utf-8')

title_pattern = r'(<a.*id="cb_post_title_url".*>)(.*)(</a>)'
title_match = re.search(title_pattern, html_page)
title = title_match.group(2)
# print(title)
```

The variable title above now holds the title of the article we want to crawl.

How do we get the body of the article?

Look at the structure of the article: everything in the body sits inside a single div tag, but that div contains many other tags, not just a run of plain text. For example, there are many <p></p> tags, as well as <strong> tags.

How do I get all the content?

My first guess: if we grab everything that sits between a '>' and a '<', we should get all the text content. Let's try the following code:

```python
div_pattern = r'<div>(.*)</div>'
div_match = re.search(div_pattern, html_page)
div = div_match.group(1)
# print(div)

result_pattern = r'>(.*)<'
result_match = re.findall(result_pattern, div)
result = ''
for i in result_match:
    result += str(i)
print(result)
```

Unfortunately, this fails... The printed output contains not only text but also leftover tags, such as <span>.
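To see why the greedy pattern keeps tag fragments, here is a minimal, self-contained sketch; the HTML snippet is made up for illustration, not taken from the actual page:

```python
import re

# A made-up snippet mimicking a post body with nested tags.
html = '<p>Hello <strong>long tail</strong> world</p>'

# The greedy pattern from the attempt above matches from the first '>'
# to the last '<', so nested tag markup survives inside the capture.
fragments = re.findall(r'>(.*)<', html)
print(fragments)
# ['Hello <strong>long tail</strong> world']
```

A non-greedy pattern like r'>(.*?)<' would behave better on this toy case, but regular expressions still stumble over attributes, comments, and entities in real pages, which is why a proper HTML parser is the safer choice.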

This shows the limitations of regular expressions for parsing HTML. Let's use BeautifulSoup to parse the document instead.

For an introduction to parsing content with BeautifulSoup, please review my earlier article on getting started with crawlers.

The code to get the div tag where the body is located is as follows:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_page, 'html.parser')
# print(soup.prettify())
div = soup.find(id="post_body")
# print(div.text)
print(div.get_text())
```

Haha, done! We got the body text. For convenience, we save the article to the current directory.

```python
filename = title + '.txt'
with open(filename, 'w', encoding='utf-8') as f:
    f.write(div.text)
```

OK, so far, we have obtained and saved this article.
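One caveat when saving: a post title can contain characters that are illegal in filenames (for example '?', '"', or '/'). A small helper to sanitize the title first; this is not part of the original code, and the name safe_filename is my own:

```python
import re

def safe_filename(title, ext='.txt'):
    # Replace characters that are invalid in Windows/Unix filenames
    # with underscores, then append the extension.
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip() + ext

print(safe_filename('"Reading Notes" The Long Tail'))
# _Reading Notes_ The Long Tail.txt
```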

Putting it all together, the full code looks like this:

```python
import urllib.request
import re
from bs4 import BeautifulSoup

url = 'http://www.cnblogs.com/over140/p/4440137.html'
req = urllib.request.Request(url)
resp = urllib.request.urlopen(req)
html_page = resp.read().decode('utf-8')

title_pattern = r'(<a.*id="cb_post_title_url".*>)(.*)(</a>)'
title_match = re.search(title_pattern, html_page)
title = title_match.group(2)
# print(title)

'''
# The failed regex attempt, kept for reference:
div_pattern = r'<div>(.*)</div>'
div_match = re.search(div_pattern, html_page)
div = div_match.group(1)
# print(div)
result_pattern = r'>(.*)<'
result_match = re.findall(result_pattern, div)
result = ''
for i in result_match:
    result += str(i)
print(result)
'''

soup = BeautifulSoup(html_page, 'html.parser')
# print(soup.prettify())
div = soup.find(id="post_body")
# print(div.text)
print(div.get_text())

filename = title + '.txt'
with open(filename, 'w', encoding='utf-8') as f:
    f.write(div.text)
```
