A simple implementation of document extraction and parsing in Python

Source: Internet
Author: User

First, the background

A previous article described building a web crawler that scraped Sina News URLs, titles, article bodies, and comments. The requirement has since changed slightly: the fields within each comment entry must now be separated by '\t'. For example, this:

<comment>
2014-12-10 18:53:20 1004400533 abandon their own flesh and blood pig dog, is not afraid to be condemned by conscience? Can you sleep soundly?
2014-12-10 17:17:07 3294923134 is this father a man?
</comment>

should become this (fields separated by '\t'):

<comment>
2014-12-10 18:53:20 1004400533 abandon their own flesh and blood pig dog, is not afraid to be condemned by conscience? Can you sleep soundly?
2014-12-10 17:17:07 3294923134 is this father a man?
</comment>

I tried to re-crawl, but Sina made some changes after 2015: the comment API now returns Chinese characters as '\u' escape sequences instead of the raw characters. A solution is described in another post of mine on the pitfalls of converting '\u' escapes back to Chinese characters in Python crawlers. Also, the crawler needs to simulate a browser convincingly, or it is easily blocked; and for large-scale crawls you need multithreading or a distributed setup. With multithreading, keep an eye on the system's thread limit.
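As a rough sketch of the '\u' escape problem (in Python 3; the code in this post is Python 2, where you would call .decode('unicode_escape') on a byte string), note that json.loads handles the escapes automatically when the API payload is JSON. The payload below is a made-up example, not the real Sina API response:

```python
import json

# Hypothetical payload for illustration; the real Sina comment API response differs.
raw = '{"content": "\\u8fd9\\u662f\\u8bc4\\u8bba"}'
data = json.loads(raw)        # json.loads decodes \uXXXX escapes automatically
print(data['content'])        # prints the original Chinese characters

# For a bare escaped string (not JSON), the unicode_escape codec works too:
s = b'\\u4e2d\\u6587'.decode('unicode_escape')
print(s)
```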


Second, extraction and parsing

Because I wanted to crawl a year's worth of content from this channel (Society), more than 20,000 items (counting only news items with comments), I would often come back the next day to find the job had failed halfway. I am not ready to learn the distributed-computing tools yet, so this was a good opportunity to review Python instead.

This short program uses the SGMLParser class for parsing. The class encapsulates the machinery for processing a string that contains tags; it also works with custom tags, so the input does not need to be real HTML. Many examples on the web explain HTMLParser and SGMLParser by parsing an x.html file. I use a single-threaded approach here.

SGMLParser:

1. A switching variable controls when tag contents are extracted and parsed. The variable must be initialized beforehand, in either __init__() or reset(); reset() is the usual choice.

2. start_x(), end_x(), and handle_data(): the first is called when <x> is read, the second when </x> is read, and the third whenever text content between tags is encountered. See the referenced blog post for more detail.
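The switching-variable pattern from points 1 and 2 can be sketched as follows. Since sgmllib was removed in Python 3, this sketch uses the closely analogous html.parser.HTMLParser; with sgmllib the handlers would be the tag-specific start_comment()/end_comment() instead of the generic handle_starttag()/handle_endtag():

```python
from html.parser import HTMLParser  # sgmllib is Python 2 only; this is the Python 3 analog

class CommentParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.found_comment = False   # the "switching variable"
        self.comment = ''

    def handle_starttag(self, tag, attrs):   # runs when <comment> is read
        if tag == 'comment':
            self.found_comment = True

    def handle_endtag(self, tag):            # runs when </comment> is read
        if tag == 'comment':
            self.found_comment = False

    def handle_data(self, data):             # runs on text between tags
        if self.found_comment:
            self.comment += data

p = CommentParser()
p.feed('<doc><comment>hello world</comment></doc>')
print(p.comment)  # hello world
```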

Variable Description:

rootdir: directory of the original files

rootdir2: directory where the converted files are stored.

Method Description:

1. Read the contents of <comment></comment> with SGMLParser and call the solve() function, which uses a regular expression to replace the nonconforming runs of spaces with '\t'; the replaced text is stored in the comment attribute of the Parse object.

2. Use a regular expression to locate <comment></comment> in the original text and substitute in the comment value of the Parse object.

(For the regular-expression details, see the comments in the code.)
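A minimal sketch of both steps on a made-up comment string (the sample line paraphrases the examples above):

```python
import re

s = '<comment>\n2014-12-10 18:53:20   1004400533   some comment\n</comment>'

# Step 1: collapse runs of three or more whitespace characters into '\t'
inner = re.search(r'<comment>([\s\S]*)</comment>', s).group(1)
comment = re.sub(r'\s{3,}', '\t', inner)

# Step 2: substitute the cleaned text back between the tags;
# [\s\S]* matches all characters, including line breaks
result = re.sub(r'<comment>[\s\S]*</comment>',
                '<comment>' + comment + '</comment>', s)
print(repr(result))
```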

Full code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from sgmllib import SGMLParser   # Python 2 only; removed in Python 3
import re
import os

rootdir = 'F:\\Python27\\pythonproject\\fuck\\file1\\'    # original files
rootdir2 = 'F:\\Python27\\pythonproject\\fuck\\file2\\'   # converted files


class Parse(SGMLParser):
    def __init__(self, filename):
        self.filename = filename
        self.comment = ''
        SGMLParser.__init__(self)

    def reset(self):
        self.found_comment = False   # the switching variable
        SGMLParser.reset(self)

    def start_comment(self, attrs):  # called when <comment> is read
        self.found_comment = True

    def end_comment(self):           # called when </comment> is read
        self.found_comment = False

    def handle_data(self, text):     # called on text between tags
        if self.found_comment:
            self.comment = solve(self.filename, text)


def solve(filename, text):
    # Replace runs of whitespace (the pattern matches two or more \s) with '\t'
    strinfo = re.compile('(\s\s\s*)')
    return strinfo.sub('\t', text)


if __name__ == '__main__':
    # os.walk yields three values per iteration:
    # 1. the parent directory  2. folder names (without paths)  3. file names
    for parent, dirnames, filenames in os.walk(rootdir):
        for filename in filenames:
            filesource = open(rootdir + filename, 'r')
            s = filesource.read()
            filesource.close()

            # Initialize the parser instance p; feed() supplies the string to process
            p = Parse(filename)
            p.feed(s)

            # Some of the original files do not end with </comment>, so fix that first
            if re.findall('</comment>', s) == []:
                s = s + '\n</comment>'

            # Locate the comment tags; ([\s\S]*) matches all characters,
            # including line breaks, then substitute the cleaned text
            strinfo = re.compile('<comment>([\s\S]*)</comment>')
            str_result = strinfo.sub('<comment>' + p.comment + '</comment>', s)

            filedes = open(rootdir2 + filename, 'w')
            filedes.write(str_result)
            filedes.close()


Results:


Note: on Windows 7, the total number of entries in an ordinary folder (including subfolders) handled this way cannot exceed 21,845, or an error is raised.
