Python example: collecting Tencent news

Source: Internet
Author: User

The goal is to crawl all the news on the Tencent News homepage and get the name, time, source, and text of each news story.

Next, break down the target and do it step-by-step.

Step 1: crawl all the links on the home page and write them to the file.

Python is handy for getting HTML, and a few lines of code can do what we need.

Copy the code as follows:

import urllib

def gethtml(url):
    page = urllib.urlopen(url)
    html = page.read()
    page.close()
    return html

We all know that the HTML tag for a link is "a" and the link attribute is "href", so the goal is to get the href value of every a tag in the HTML.

After consulting some references, I first intended to use HTMLParser, and wrote it that way. However, it had a problem: it could not handle Chinese characters.

Copy the code as follows:

import HTMLParser

class parser(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for attr, value in attrs:
                if attr == 'href':
                    print value

Later I switched to SGMLParser, which did not have this problem.

Copy the code as follows:

from sgmllib import SGMLParser

class URLParser(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k == 'href']
        if href:
            self.urls.extend(href)

With SGMLParser you override the handler for the tag you care about; here, every link found by start_a is collected into the class's urls list.
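SGMLParser (and its sgmllib module) no longer exists in Python 3. For readers on Python 3, here is a minimal sketch of the same link collector built on html.parser instead; the class name and the sample HTML are my own illustration:

```python
# A Python 3 equivalent of the SGMLParser-based collector above.
from html.parser import HTMLParser

class URLCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag
        if tag == 'a':
            self.urls.extend(v for k, v in attrs if k == 'href')

parser = URLCollector()
parser.feed('<a href="http://news.qq.com/a/1.htm">新闻</a>'
            '<a href="http://news.qq.com/a/2.htm">更多</a>')
print(parser.urls)
```

Because feed() receives an already-decoded str in Python 3, Chinese text in the page causes no trouble here.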

Copy the code as follows:

import re
import urllib

lparser = URLParser()                            # the parser defined above
socket = urllib.urlopen("http://news.qq.com/")   # open the page

fout = file('urls.txt', 'w')                     # the links will be written to this file
lparser.feed(socket.read())                      # parse

reg = 'http://news.qq.com/a/.*'                  # regular expression matching the links we want
pattern = re.compile(reg)

for url in lparser.urls:                         # all the links are stored in urls
    if pattern.match(url):
        fout.write(url + '\n')

fout.close()

In this way, all the qualifying links are saved to the urls.txt file.
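The filtering step can be tried on its own without fetching the page; this self-contained sketch (the URL list is invented) shows how pattern.match keeps only the article links:

```python
# Keep only the links that look like article pages; the prefix comes
# from the article, the sample URLs are made up for illustration.
import re

pattern = re.compile(r'http://news\.qq\.com/a/.*')

urls = [
    'http://news.qq.com/a/20130523/001.htm',
    'http://www.qq.com/',                      # not an article page
    'http://news.qq.com/a/20130523/002.htm',
]
matched = [u for u in urls if pattern.match(u)]
print(matched)
```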

Step 2: for each link, get the content of its web page.

It's easy: just open the urls.txt file and read it one line at a time.

Writing to a file may seem superfluous here, but out of a strong desire for decoupling I decisively did so. If the code is later refactored into an object-oriented style, this makes the change very convenient.

Getting the content of each web page is also relatively straightforward, but the content needs to be saved into a folder.

Here are a few new uses:

Copy the code as follows:

import os

os.getcwd()   # get the current folder path
os.path.sep   # the current system's path separator (is that what it's called?): "\" on Windows, "/" on Linux

# check whether the folder exists; create it if it does not
if os.path.exists('newsdir') == False:
    os.makedirs('newsdir')

# str() converts a number to a string
i = 5
str(i)

With these methods, it is no longer difficult to save each page's string to its own file in a folder.
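Putting those pieces together, a Python 3 sketch of the saving step might look like this (the folder name, page contents, and use of tempfile are my own illustration):

```python
# Write each page into its own numbered file inside a fresh folder.
import os
import tempfile

outdir = os.path.join(tempfile.mkdtemp(), 'newsdir')
if os.path.exists(outdir) == False:   # create the folder only if missing
    os.makedirs(outdir)

pages = ['first page', 'second page']
for i, content in enumerate(pages):
    path = os.path.join(outdir, str(i) + '.txt')   # str(i) names the file
    with open(path, 'w') as f:
        f.write(content)
```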

Step 3: enumerate each Web page and get the target data based on a regular match.

The following method is used to traverse the folder.

Copy the code as follows:

# used to traverse a folder
for parent, dirnames, filenames in os.walk(dir):
    for dirname in dirnames:
        print parent, dirname
    for filename in filenames:
        print parent, filename

Traverse, read, match, and the results come out.
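As a self-contained, Python 3 version of the traversal, the sketch below first builds a tiny folder tree (the names are invented) and then walks it:

```python
# Build a small tree, then walk it and collect every file path.
import os
import tempfile

root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'sub'))
open(os.path.join(root, 'a.txt'), 'w').close()
open(os.path.join(root, 'sub', 'b.txt'), 'w').close()

found = []
for parent, dirnames, filenames in os.walk(root):
    for filename in filenames:
        found.append(os.path.join(parent, filename))

names = sorted(os.path.basename(p) for p in found)
print(names)
```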

The regular expression that I use to extract the data is this:

Copy the code as follows:

reg = '<div class="hd">.*?
In fact, this does not match all of the news on Tencent's site, because the news pages come in two formats whose tags differ slightly, so only one of them can be extracted.

Another point: extraction via regular expressions is definitely not the mainstream approach. If you need to collect from other sites, you have to change the regular expression, which is a rather troublesome thing.

After extraction, I observed that the body text is always mixed with irrelevant fragments such as "<script>...</script>", "<p></p>", and so on, so I split the body again with a regular expression.

Copy the code as follows:

def func(str):  # who came up with this name?
    # the pattern matches several kinds of fragments, separated by "|"
    strs = re.split("<style>.*?</style>|<script.*?>.*?</script>|&#[0-9]+;"
                    "|<!--\[if !IE\]>.+?<!\[endif\]-->|<.*?>", str)
    ans = ''
    # glue the split pieces back together
    for each in strs:
        ans += each
    return ans

With this, essentially all the body text on Tencent's site can be extracted.
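The same stripping idea in a self-contained Python 3 form; the function name and the sample HTML below are my own, chosen to exercise the pattern:

```python
# Split on scripts, styles, numeric entities, conditional comments,
# and any remaining tags, then join the plain-text pieces back together.
import re

def strip_markup(text):
    parts = re.split(r'<style>.*?</style>|<script.*?>.*?</script>'
                     r'|&#[0-9]+;|<!--\[if !IE\]>.+?<!\[endif\]-->|<.*?>',
                     text)
    return ''.join(parts)

html = '<p>Hello<script type="text/javascript">var x = 1;</script> world</p>'
print(strip_markup(html))   # prints "Hello world"
```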

The whole collection is over.

Here is a look at the results I extracted (viewed without automatic line wrapping, so long lines are hidden on the right):

A few notes:

1. When opening a URL, if the URL is bad (cannot be opened), an error occurs unless it is handled. I simply handle it with an exception; there should be other ways as well.

Copy the code as follows:

try:
    socket = urllib.urlopen(url)
except:
    continue

2. In Python regular expressions, "." matches any character except "\n".
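A quick way to see this, and the re.DOTALL flag that lifts the restriction (the sample text is my own):

```python
# "." stops at newlines by default; re.DOTALL lets it cross them,
# which matters when a news body spans multiple lines.
import re

text = '<div>line one\nline two</div>'
print(re.search(r'<div>.*</div>', text))   # prints None: "." stops at "\n"
print(re.search(r'<div>.*</div>', text, re.DOTALL).group())
```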

3. How do you remove a trailing "\n" from a string? Python handles it gracefully:

Copy the code as follows:

if line[-1] == '\n':
    line = line[0:-1]
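The slice works, though rstrip('\n') is the more common idiom and is also safe on a line that has no trailing newline; a small sketch:

```python
# rstrip('\n') removes a trailing newline and is a no-op if there is none.
line = 'http://news.qq.com/a/1.htm\n'
line = line.rstrip('\n')
print(line)
```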
