Python collects Tencent news example _python

Source: Internet
Author: User

The goal is to crawl all the news on the Tencent News homepage and get the name, time, source, and text of each news story.

Next, break down the target and do it step-by-step.

Step 1: crawl all the links on the home page and write them to the file.

Python is handy for getting HTML, and a few lines of code can do what we need.

Copy Code code as follows:
def gethtml (URL):
page = Urllib.urlopen (URL)
html = Page.read ()
Page.close ()
return HTML

We all know that the tag of the HTML link is "a" and the linked attribute is "href", which is to get all the TAG=A,ATTRS=HREF values in the HTML.

Having consulted the information, I intended to use Htmlparser at first, and I wrote it. However, it has a problem, that is, when the Chinese characters are encountered can not be processed.

Copy Code code as follows:

Class parser (Htmlparser.htmlparser):
def handle_starttag (self, Tag, attrs):
if tag = = ' A ':
For attr, value in Attrs:
if attr = = ' href ':
Print value

Later, when Sgmlparser was used, it did not have the problem.
Copy Code code as follows:

Class Urlparser (Sgmlparser):
def reset (self):
Sgmlparser.reset (self)
Self.urls = []

def start_a (self,attrs):
href = [V for k,v in attrs if k== ' href ']
If href:
Self.urls.extend (HREF)

Sgmlparser you need to overload its function for a label, here's where you put all the links in the URLs of that class.

Copy Code code as follows:

Lparser = Urlparser () #分析器来的
Socket = Urllib.urlopen ("http://news.qq.com/") #打开这个网页

Fout = File (' Urls.txt ', ' W ') #要把链接写到这个文件里
Lparser.feed (Socket.read ()) #分析啦

reg = ' http://news.qq.com/a/.* ' #这个是用来匹配符合条件的链接, using regular expression matching
Pattern = Re.compile (reg)

For URL in Lparser.urls: #链接都存在urls里
If Pattern.match (URL):
Fout.write (url+ ' \ n ')

Fout.close ()

In this way, all qualified links are saved to the Urls.txt file.

Step 2: for each link, get the content of its web page.

It's easy to just open the Urls.txt file and read it one line at a time.

It may seem superfluous here, but based on my strong desire for decoupling, I have decisively written in the document. Later, if object-oriented programming is used, refactoring is very convenient.

Getting the content part of a Web page is also relatively straightforward, but you need to keep the content of the Web page in a folder.

Here are a few new uses:

Copy Code code as follows:

OS.GETCWD () #获得当前文件夹路径
os.path.sep# Current system path Separator (is this the name?) Under Windows is "\", Linux is "/"

#判断文件夹是否存在, create a new folder if it does not exist
If Os.path.exists (' newsdir ') = = False:
Os.makedirs (' Newsdir ')

#str () used to convert a number to a string
i = 5
STR (i)

With these methods, it is no longer difficult to save the string to a different file in a folder.

Step 3: enumerate each Web page and get the target data based on a regular match.

The following method is used to traverse the folder.

Copy Code code as follows:

#这个是用来遍历某个文件夹的
For parent, Dirnames, filenames in Os.walk (dir):
For dirname in Dirnames
Print parent, dirname
For filename in filenames:
Print parent, filename

Traversal, read, match, the result came out.

The regular expression that I use to extract the data is this:

Copy Code code as follows:

reg = ' <div class= ' HD ' >.*?

In fact, this does not match the Tencent network of all the news, because the above news has two formats, the label has a little difference, so can only extract one.

Another point is that through regular expression extraction is definitely not the mainstream extraction method, if you need to collect other sites, you need to change the regular expression, this is a more troublesome thing.

After the extraction of observation, the body will always be mixed with some irrelevant information, such as "<script>...</script>" "<p></p>" and so on. So I'll slice the body again through regular expressions.

Copy Code code as follows:

def func (str): #谁起的这个名字
STRs = Re.split ("<style>.*?</style>|<script.*?>.*?</script>|&#[0-9]+;|<!--\[if! Ie\]>.+?<!\[endif\]-->|<.*?> ", str" #各种匹配, through "|" Separated
Ans = '
#将切分的结果组合起来
For each in STRs:
Ans = each
return ans

So Tencent net above the text basically all can extract out.

The whole collection is over.

Show me the results I extracted (do not use automatic wrapping, hidden on the right):

Attention:

1, open a Web site, if the URL is bad (not open), if not processed will be an error. I simply use the way to handle exceptions, and there should be other ways to do that.

Copy Code code as follows:

Try
Socket = Urllib.urlopen (URL)
Except
Continue

2, "." In the python regular expression. Number, you can match any character, but except "\ n".

3, how to remove the end of the string "\ n"? Python's handling is so graceful that you die!

Copy Code code as follows:
If line[-1] = = ' \ n ':
line = Line[0:-1]

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.