Python Collection Example 2

Source: Internet
Author: User

As the previous article explained, we want to collect http://www.gg4493.cn/data. The next steps are:

Step 2: For each link, get its web page content.

This is simple: open the Urls.txt file and read it one line at a time.
Writing the links to a file first may seem superfluous, but I strongly prefer decoupled steps, so I did it anyway. It also makes refactoring easy if you switch to object-oriented code later.
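Reading the link file can be sketched as follows. This is only a minimal illustration: the function name read_urls is hypothetical, and it assumes Urls.txt holds one URL per line in UTF-8.

```python
def read_urls(path):
    """Return the non-empty lines of a text file, one URL per line."""
    urls = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            url = line.strip()  # drop the trailing newline and stray spaces
            if url:             # skip blank lines
                urls.append(url)
    return urls
```

Calling read_urls("Urls.txt") would then give the list of links to fetch in this step.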
Getting the content of a web page is relatively simple, but each page's content then has to be saved to a file in a folder.
This calls for a few functions we have not used before.
The code is as follows:


import os

os.getcwd()  # get the path of the current folder
os.path.sep  # path separator of the current system: "\" on Windows, "/" on Linux
# check whether the folder exists; create it if it does not
if not os.path.exists('newsdir'):
    os.makedirs('newsdir')
# str() converts a number to a string
i = 5
str(i)
With these functions, saving each string to its own file in a folder is no longer difficult.
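Putting these pieces together, saving the pages might look like the sketch below. It is not the author's code: the names fetch, save_pages and fetch_page, the ".html" file naming, and the 10-second timeout are all assumptions made for illustration.

```python
import os
from urllib.request import urlopen


def fetch(url):
    """Download a page and return its raw bytes."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def save_pages(urls, out_dir="newsdir", fetch_page=fetch):
    """Save each page as out_dir/<index>.html and return the paths written."""
    os.makedirs(out_dir, exist_ok=True)  # create the folder if it is missing
    saved = []
    for i, url in enumerate(urls):
        # str(i) turns the running index into the file name
        path = os.path.join(out_dir, str(i) + ".html")
        with open(path, "wb") as f:
            f.write(fetch_page(url))
        saved.append(path)
    return saved
```

Passing the download function in as fetch_page keeps fetching and saving decoupled, so the saving logic can be tried out without touching the network.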
Step 3: Go through each saved page and extract the target data with a regular expression match.
The following code traverses a folder.
The code is as follows:


import os

root_dir = 'newsdir'  # the folder to traverse
# this traverses a folder
for parent, dirnames, filenames in os.walk(root_dir):
    for dirname in dirnames:
        print(parent, dirname)
    for filename in filenames:
        print(parent, filename)
Traverse the files, read each one, apply the match, and the results come out.
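The traverse, read, match loop could be sketched like this. The function name extract_from_folder is hypothetical, and the sketch assumes the pattern contains exactly one capturing group and that the saved pages can be decoded as UTF-8 (undecodable bytes are ignored).

```python
import os
import re


def extract_from_folder(root_dir, pattern):
    """Walk root_dir, read every file, and collect the first match per file."""
    regex = re.compile(pattern, re.S)  # re.S lets "." match across line breaks
    results = {}
    for parent, dirnames, filenames in os.walk(root_dir):
        for filename in filenames:
            path = os.path.join(parent, filename)
            with open(path, encoding="utf-8", errors="ignore") as f:
                match = regex.search(f.read())
            if match:
                results[path] = match.group(1)
    return results
```

Files in which the pattern finds nothing are simply left out of the result.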
The regular expression I use for data extraction is this:
The code is as follows:


reg = '<div class="HD">.*? [...]
In fact, this does not match all of the content: the news pages above come in two formats with slightly different tags, so this expression can extract only one of them.
Another point is that regular expressions are certainly not the mainstream way to extract data. To collect from other sites you have to rewrite the expression each time, which is rather troublesome.
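One way to cope with several page formats is to keep one pattern per format and try them in order. The sketch below only illustrates that idea: both patterns and the name extract_first are made up for the example and are not the author's expressions.

```python
import re

# Hypothetical patterns, one per page layout; not the author's expressions.
PATTERNS = [
    re.compile(r'<div class="hd">.*?<h1>(.*?)</h1>', re.S),
    re.compile(r'<div class="title">(.*?)</div>', re.S),
]


def extract_first(html, patterns=PATTERNS):
    """Return the captured group of the first pattern that matches, else None."""
    for regex in patterns:
        match = regex.search(html)
        if match:
            return match.group(1)
    return None
```

Supporting another site or layout then means adding a pattern to the list rather than editing the extraction code.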
After extraction, the body text is still mixed with irrelevant markup such as "<script>...</script>" and "<p></p>", so I split the body once more with regular expressions.
The code is as follows:


import re

def func(str):  # who came up with this name
    strs = re.split(r"<style>.*?</style>|<script.*?>.*?</script>|&#[0-9]+;|<!--\[if !IE\]>.+?<!\[endif\]-->|<.*?>", str)  # several patterns, separated by "|"
    ans = ""
    # join the split pieces back together
    for each in strs:
        ans += each
    return ans
In this way, most of the text on the page can be extracted.
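To see the effect on a small input, here is a self-contained sketch of the same split-and-join idea. It is a simplified variant, not the author's exact expression: the names TAG_PATTERN and strip_markup and the sample string are made up, the conditional-comment alternative is left out, and the re.S flag is an added assumption so that multi-line style and script blocks are removed as well.

```python
import re

# Simplified variant of the splitting pattern; re.S is an assumption.
TAG_PATTERN = re.compile(
    r"<style.*?>.*?</style>|<script.*?>.*?</script>|&#[0-9]+;|<.*?>",
    re.S,
)


def strip_markup(html):
    """Remove style/script blocks, numeric entities, and remaining tags."""
    # split() returns the text between matches; joining it drops the markup
    return "".join(TAG_PATTERN.split(html))


sample = '<p>Hello <b>world</b>&#160;</p><script type="text/javascript">var x = 1;</script>'
print(strip_markup(sample))  # prints "Hello world"
```

Because the pattern has no capturing groups, re.split returns only the text between the matches, so joining the pieces removes everything the pattern recognises.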
That completes the whole collection process.

Source: http://www.m4493.com
