A Python Web Crawler for Sina Blog

Tags: python, web crawler

Last time I wrote a crawler for the dating site Jiayuan; today I am following it up with a crawler for Sina Blog. After finishing it I hesitated over whether to post it on cnblogs at all, because honestly the code adds very little: it is essentially the previous crawler, slightly trimmed and pointed at a different site. Crawling someone else's blog also leaves me a little uneasy, for fear that fellow bloggers will take me for a peeping tom. Still, this code cost me real effort, and I don't want it buried on my hard disk (the computer is old; in two years the drive may die and the code vanish with it), so I am putting it out here as a note.

As for why Sina Blog in particular, there is a reason, and a simple one. A couple of days ago I downloaded a book on natural language processing with Python and wanted to try out the theory in it (truth be told, I haven't actually read it yet, ha). Merely replaying the book's own examples gives no sense of accomplishment, so I wanted real data to experiment on. You may think that has no necessary connection to crawling Sina Blog, and indeed the preamble is rather hollow: what I really wrote this for is to satisfy my own curiosity and download every post of a one-time goddess of mine (don't flame me), and then, as a bonus, throw the lofty NLP theory at them (which will probably prove too hard and get abandoned).

With the rambling out of the way, let's talk about how the crawler actually works. Forgive the rough description of the technique; I really don't want a repeat of my last post, where everything had to be hashed out in the comments before anyone knew what they were looking at. First, find the homepage of the user you want to download on Sina Blog; the address in the URL bar usually ends with a number, which is the user's ID. Because each page can hold only a limited number of posts, Sina Blog paginates the article list: a page=2 in the URL means the second page. On each list page, regular expressions can pull out every article's publication time and its "View the original" link, which leads to the full article.
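As a rough illustration, here is a minimal sketch of that list-page walk (Python 2, to match the urllib2 used below). The LIST_URL template and the link regex are assumptions made for the example; the real Sina markup and pagination parameters may differ.

# -*- coding: utf-8 -*-
# Walk the paginated article list and collect article links (sketch).
import re
import urllib2

UID = '1234567890'  # hypothetical user ID taken from the homepage URL
LIST_URL = 'http://blog.sina.com.cn/u/%s?page=%d'  # assumed pagination URL

link_pattern = re.compile(
    ur'<a[^>]*href="(http://blog\.sina\.com\.cn/s/blog_[^"]+\.html)"', re.I)

def article_links(uid, max_pages=5):
    # Yield the article URLs found on each list page.
    for page in range(1, max_pages + 1):
        html = urllib2.urlopen(LIST_URL % (uid, page)).read()
        for url in link_pattern.findall(html.decode('utf-8', 'ignore')):
            yield url

if __name__ == '__main__':
    for url in article_links(UID):
        print url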

Because Sina Blog is fairly open, articles can be read without logging in, so plain urllib2 is enough and there is no cookie handling to set up, which saves a lot of trouble. The only fiddly part is extracting the text when an article contains images or other multimedia links; that is tedious and repetitive. The page structure here is fairly simple, so you could also parse the HTML directly, but I am used to regular expressions, and troublesome as they are I bit the bullet and finished with them. The string-substitution facility of regular expressions proved extremely useful, so let's focus on that.
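For the fetch itself, a minimal sketch (Python 2; the article URL below is a placeholder, not a real post). Note there is no cookielib or login step at all:

# -*- coding: utf-8 -*-
# Fetch one article page with plain urllib2: no cookies, no login needed.
import urllib2

url = 'http://blog.sina.com.cn/s/blog_xxxxxxxx.html'  # hypothetical URL
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib2.urlopen(req).read().decode('utf-8', 'ignore')
print html[:200]  # quick sanity check that we got markup back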

The sub function in Python's re module makes complex string substitutions very easy. Here is a simple example:

# -*- encoding: utf-8 -*-
import re

s = u'<div class="Class1">this is String1</div><div class="Class2">this is String2</div>'

def div_func(m):
    # Replacement callback: swap each <div>...</div> for its inner text.
    if m is None:
        return ''
    return m.group(1)

if __name__ == '__main__':
    # .*? after <div lets attributes such as class="..." match too.
    pattern = re.compile(u'<div.*?>(.*?)</div>', re.U | re.S | re.I)
    print s
    sss = pattern.sub(div_func, s)
    print
    print sss

Web pages come in many slightly varied formats, like the string s in the code above. When you want to pull out just the text, or the substitution rule is too complex for a plain replacement string, you can define a function and pass it to sub for processing; running the example prints the original string and then the extracted text. Convenient, isn't it?

Now a word about Python's Chinese-encoding headaches. The simplest policy is to use str as little as possible and unicode as much as possible: decode data read from a file (or the network) into unicode first and do all the processing on unicode, which eliminates perhaps 90% of garbled-text problems.
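A minimal sketch of that decode-at-the-boundary habit (Python 2; the filenames are placeholders):

# -*- coding: utf-8 -*-
# Decode bytes to unicode on input, work in unicode, encode only on output.
raw = open('post.txt', 'rb').read()      # str, i.e. raw bytes
text = raw.decode('utf-8', 'ignore')     # unicode from here on
cleaned = text.replace(u'\u3000', u' ')  # e.g. normalise full-width spaces
open('out.txt', 'wb').write(cleaned.encode('utf-8'))  # encode at the edge

Oh, and today I also found a very handy function for downloading files: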

import urllib

urllib.urlretrieve(url, path)

This downloads the file at url to the local path; couldn't be simpler. Finally, the results. There isn't much data anyway: the goddess has only a hundred-odd posts, so a database would be overkill, and writing everything straight out to a text file is perfectly convenient, ha!
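To tie the pieces together, a hedged sketch of that output step (the articles list is assumed to come from the parsing steps above; the filenames are made up):

# -*- coding: utf-8 -*-
# Write article text to one file and save images alongside it (Python 2).
import urllib

articles = []  # (title, text, image_urls) tuples produced by the parser

out = open('blog.txt', 'w')
for i, (title, text, image_urls) in enumerate(articles):
    out.write(title.encode('utf-8') + '\n')
    out.write(text.encode('utf-8') + '\n\n')
    for j, img_url in enumerate(image_urls):
        urllib.urlretrieve(img_url, 'img_%d_%d.jpg' % (i, j))
out.close()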

Finally, here is the source code for anyone who wants it. (I am really not good at describing details, so the write-up is rough; please bear with me!)

Link: http://files.cnblogs.com/files/lrysjtu/xlblog.rar
