A simple implementation of document extraction and parsing in Python

Source: Internet
Author: User

First, the background

A previous article described building a web crawler that scraped Sina News URLs, titles, article bodies, and comments. The requirement has since changed slightly: the fields within each comment entry must now be separated by '\t'. For example, this:

<comment>
2014-12-10 18:53:20 1004400533 abandon their own flesh and blood pig dog, is not afraid to be condemned by conscience? Can you sleep soundly?
2014-12-10 17:17:07 3294923134 is this father a man?
</comment>

should become this (fields separated by '\t'):

<comment>
2014-12-10 18:53:20 1004400533 abandon their own flesh and blood pig dog, is not afraid to be condemned by conscience? Can you sleep soundly?
2014-12-10 17:17:07 3294923134 is this father a man?
</comment>

I tried to re-crawl, but Sina made some changes after 2015: the comment API now returns Chinese characters as '\u' escape sequences instead of the raw characters. A solution is described in another post of mine on the pitfalls of converting '\u' escapes back to Chinese characters in Python crawlers. Also, the crawler needs to simulate a browser convincingly, or it is easily blocked; and for large-scale crawls you need multithreading or a distributed setup. With multithreading, keep an eye on the system's thread limit.
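As a rough sketch of the '\u' escape problem (in Python 3; the code in this post is Python 2, where you would call .decode('unicode_escape') on a byte string), note that json.loads handles the escapes automatically when the API payload is JSON. The payload below is a made-up example, not the real Sina API response:

```python
import json

# Hypothetical payload for illustration; the real Sina comment API response differs.
raw = '{"content": "\\u8fd9\\u662f\\u8bc4\\u8bba"}'
data = json.loads(raw)        # json.loads decodes \uXXXX escapes automatically
print(data['content'])        # prints the original Chinese characters

# For a bare escaped string (not JSON), the unicode_escape codec works too:
s = b'\\u4e2d\\u6587'.decode('unicode_escape')
print(s)
```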


Second, extraction and parsing

Because I wanted to crawl a year's worth of content from this channel (Society), more than 20,000 items (counting only news items with comments), I would often come back the next day to find the job had failed halfway. I am not ready to learn the distributed-computing tools yet, so this was a good opportunity to review Python instead.

This short program uses the SGMLParser class for parsing. The class encapsulates the machinery for processing a string that contains tags; it also works with custom tags, so the input does not need to be real HTML. Many examples on the web explain HTMLParser and SGMLParser by parsing an x.html file. I use a single-threaded approach here.

SGMLParser:

1. A switching variable controls when tag contents are extracted and parsed. The variable must be initialized beforehand, in either __init__() or reset(); reset() is the usual choice.

2. start_x(), end_x(), and handle_data(): the first is called when <x> is read, the second when </x> is read, and the third whenever text content between tags is encountered. See the referenced blog post for more detail.
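The switching-variable pattern from points 1 and 2 can be sketched as follows. Since sgmllib was removed in Python 3, this sketch uses the closely analogous html.parser.HTMLParser; with sgmllib the handlers would be the tag-specific start_comment()/end_comment() instead of the generic handle_starttag()/handle_endtag():

```python
from html.parser import HTMLParser  # sgmllib is Python 2 only; this is the Python 3 analog

class CommentParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.found_comment = False   # the "switching variable"
        self.comment = ''

    def handle_starttag(self, tag, attrs):   # runs when <comment> is read
        if tag == 'comment':
            self.found_comment = True

    def handle_endtag(self, tag):            # runs when </comment> is read
        if tag == 'comment':
            self.found_comment = False

    def handle_data(self, data):             # runs on text between tags
        if self.found_comment:
            self.comment += data

p = CommentParser()
p.feed('<doc><comment>hello world</comment></doc>')
print(p.comment)  # hello world
```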

Variable Description:

rootdir: directory of the original files

rootdir2: directory where the converted files are stored.

Method Description:

1. Read the contents of <comment></comment> with SGMLParser and call the solve() function, which uses a regular expression to replace the nonconforming runs of spaces with '\t'; the replaced text is stored in the comment attribute of the Parse object.

2. Use a regular expression to locate <comment></comment> in the original text and substitute in the comment value of the Parse object.

(For the regular-expression details, see the comments in the code.)
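A minimal sketch of both steps on a made-up comment string (the sample line paraphrases the examples above):

```python
import re

s = '<comment>\n2014-12-10 18:53:20   1004400533   some comment\n</comment>'

# Step 1: collapse runs of three or more whitespace characters into '\t'
inner = re.search(r'<comment>([\s\S]*)</comment>', s).group(1)
comment = re.sub(r'\s{3,}', '\t', inner)

# Step 2: substitute the cleaned text back between the tags;
# [\s\S]* matches all characters, including line breaks
result = re.sub(r'<comment>[\s\S]*</comment>',
                '<comment>' + comment + '</comment>', s)
print(repr(result))
```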

Full code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from sgmllib import SGMLParser   # Python 2 only; removed in Python 3
import re
import os

rootdir = 'F:\\Python27\\pythonproject\\fuck\\file1\\'    # original files
rootdir2 = 'F:\\Python27\\pythonproject\\fuck\\file2\\'   # converted files


class Parse(SGMLParser):
    def __init__(self, filename):
        self.filename = filename
        self.comment = ''
        SGMLParser.__init__(self)

    def reset(self):
        self.found_comment = False   # the switching variable
        SGMLParser.reset(self)

    def start_comment(self, attrs):  # called when <comment> is read
        self.found_comment = True

    def end_comment(self):           # called when </comment> is read
        self.found_comment = False

    def handle_data(self, text):     # called on text between tags
        if self.found_comment:
            self.comment = solve(self.filename, text)


def solve(filename, text):
    # Replace runs of whitespace (the pattern matches two or more \s) with '\t'
    strinfo = re.compile('(\s\s\s*)')
    return strinfo.sub('\t', text)


if __name__ == '__main__':
    # os.walk yields three values per iteration:
    # 1. the parent directory  2. folder names (without paths)  3. file names
    for parent, dirnames, filenames in os.walk(rootdir):
        for filename in filenames:
            filesource = open(rootdir + filename, 'r')
            s = filesource.read()
            filesource.close()

            # Initialize the parser instance p; feed() supplies the string to process
            p = Parse(filename)
            p.feed(s)

            # Some of the original files do not end with </comment>, so fix that first
            if re.findall('</comment>', s) == []:
                s = s + '\n</comment>'

            # Locate the comment tags; ([\s\S]*) matches all characters,
            # including line breaks, then substitute the cleaned text
            strinfo = re.compile('<comment>([\s\S]*)</comment>')
            str_result = strinfo.sub('<comment>' + p.comment + '</comment>', s)

            filedes = open(rootdir2 + filename, 'w')
            filedes.write(str_result)
            filedes.close()


Results:


Note: on Windows 7, the total number of entries in an ordinary folder (including subfolders) handled this way cannot exceed 21,845, or an error is raised.
