Python Text Parser
I. Course Introduction
This course walks through a small Python program that parses a plain text file and generates an HTML page from it.
II. Related Technology
Python: An object-oriented, interpreted programming language used for web development, graphics processing, text processing, mathematical computing, and more.
HTML: Hypertext Markup Language, used primarily to build web pages.
III. The Project
Plain text file:
    Welcome to ShiYanLou

    ShiYanLou is the first experiment with IT as the core of online education platform.*Our aim is to do the experiment, easy to learn IT*.

    Course

    -Basic Course

    -Project Course

    -Evaluation Course

    Contact us

    -Web:http://www.shiyanlou.com

    -QQ Group:241818371

    -E-mail:[email protected]
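After parsing, the program generates an HTML page in which the first line becomes the title, short lines become headings, lines starting with "-" become list items, and the remaining text becomes paragraphs.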
IV. Project Explanation
1. Text Block Generator
First we need a text block generator that splits the plain text into separate blocks, so that each block can be parsed on its own. The util.py code is as follows:
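    #!/usr/bin/python
    # encoding: utf-8

    def lines(file):
        """Generator: yield every line, then one extra blank line."""
        for line in file:
            yield line
        # The trailing blank line guarantees that the last block is flushed.
        yield '\n'

    def blocks(file):
        """Generator: yield each separate text block."""
        block = []
        for line in lines(file):
            if line.strip():
                # Non-blank line: it belongs to the current block.
                block.append(line)
            elif block:
                # Blank line after some content: the block is complete.
                # The source listing is cut off after "yield"; joining and
                # stripping the lines is assumed, because rules.py treats
                # each block as a single string.
                yield ''.join(block).strip()
                block = []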
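To see what the generator produces, the sketch below restates the two functions in self-contained form and feeds them a short in-memory sample. It assumes each block is the joined, stripped text of consecutive non-blank lines; io.StringIO stands in for the input file, and the sample text is made up for illustration.

```python
import io

def lines(file):
    """Yield every line of the file, then one extra blank line."""
    for line in file:
        yield line
    yield '\n'

def blocks(file):
    """Yield runs of consecutive non-blank lines as single stripped strings."""
    block = []
    for line in lines(file):
        if line.strip():
            block.append(line)
        elif block:
            yield ''.join(block).strip()
            block = []

# Hypothetical sample: a one-line block, a two-line block, extra blank
# lines, and a final block with no blank line after it.
sample = io.StringIO("Welcome\n\nfirst line\nsecond line\n\n\n-item\n")
result = list(blocks(sample))
print(result)
```

Consecutive blank lines do not produce empty blocks, and the extra blank line appended by lines() is what lets the final block be emitted even when the file does not end with one.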
2. Handlers
The text block generator gives us blocks of text. Next we need a handler that wraps each kind of block in the corresponding HTML tags. The handlers.py code is as follows:
    #!/usr/bin/python
    # encoding: utf-8

    class Handler:
        """Handler parent class."""
        def callback(self, prefix, name, *args):
            # Look up a method such as start_paragraph or sub_url by name.
            method = getattr(self, prefix + name, None)
            if callable(method):
                return method(*args)

        def start(self, name):
            self.callback('start_', name)

        def end(self, name):
            self.callback('end_', name)

        def sub(self, name):
            # Return a replacement function suitable for re.sub.
            def substitution(match):
                result = self.callback('sub_', name, match)
                if result is None:
                    result = match.group(0)
                return result
            return substitution

    class HTMLRenderer(Handler):
        """HTML handler: adds the corresponding HTML tags to each text block."""
        def start_document(self):
            # The opening markup is missing from the source listing;
            # a minimal '<html><body>' is assumed.
            print('<html><body>')

        def end_document(self):
            # The source shows only '</body>'; the closing '</html>' is assumed.
            print('</body></html>')

        def start_paragraph(self):
            print('<p style="color: #444;">')

        def end_paragraph(self):
            print('</p>')

        def start_heading(self):
            # The heading tags are missing from the source; plain <h2> is assumed.
            print('<h2>')

        def end_heading(self):
            print('</h2>')

        def start_list(self):
            print('<ul style="color: #363736;">')

        def end_list(self):
            print('</ul>')

        def start_listitem(self):
            print('<li>')

        def end_listitem(self):
            print('</li>')

        def start_title(self):
            # The title tags are missing from the source; plain <h1> is assumed.
            print('<h1>')

        def end_title(self):
            print('</h1>')

        def sub_emphasis(self, match):
            return '<em>%s</em>' % match.group(1)

        def sub_url(self, match):
            return '<a target="_blank" style="text-decoration: none;color: #BC1A4B;" href="%s">%s</a>' % (match.group(1), match.group(1))

        def sub_mail(self, match):
            return '<a style="text-decoration: none;color: #BC1A4B;" href="mailto:%s">%s</a>' % (match.group(1), match.group(1))

        def feed(self, data):
            print(data)
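The handler relies on two ideas: looking up a method by name with getattr, and passing a function to re.sub as the replacement. The sketch below isolates both. DemoRenderer is a hypothetical subclass written only for this illustration, and the sample sentence is made up.

```python
import re

class Handler:
    def callback(self, prefix, name, *args):
        # Resolve e.g. 'sub_' + 'emphasis' to the method sub_emphasis, if any.
        method = getattr(self, prefix + name, None)
        if callable(method):
            return method(*args)

    def sub(self, name):
        def substitution(match):
            result = self.callback('sub_', name, match)
            if result is None:
                # No such method: leave the matched text unchanged.
                result = match.group(0)
            return result
        return substitution

class DemoRenderer(Handler):  # hypothetical, for illustration only
    def sub_emphasis(self, match):
        return '<em>%s</em>' % match.group(1)

handler = DemoRenderer()
text = 'easy to *learn* IT'
emphasized = re.sub(r'\*(.+?)\*', handler.sub('emphasis'), text)
untouched = re.sub(r'\*(.+?)\*', handler.sub('missing'), text)
print(emphasized)
print(untouched)
```

Because callback returns None when no matching method exists, a handler that does not implement a given substitution simply passes the text through, so new handlers only need to define the tags they care about.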
3. Rules
With the handler and the text block generator in place, we need rules that decide which markup the handler applies to each block. The rules.py code is as follows:
    #!/usr/bin/python
    # encoding: utf-8

    class Rule:
        """Rule parent class."""
        def action(self, block, handler):
            """Wrap the block in the tags for this rule's type."""
            handler.start(self.type)
            handler.feed(block)
            handler.end(self.type)
            # True tells the parser to stop trying further rules.
            return True

    class HeadingRule(Rule):
        """Heading rule."""
        type = 'heading'

        def condition(self, block):
            """Decide whether the text block matches this rule."""
            # A heading is a single line of at most 70 characters
            # that does not end with a colon.
            return '\n' not in block and len(block) <= 70 and not block[-1] == ':'

    class TitleRule(HeadingRule):
        """Title rule: only the first block of the document can be the title."""
        type = 'title'
        first = True

        def condition(self, block):
            if not self.first:
                return False
            self.first = False
            return HeadingRule.condition(self, block)

    class ListItemRule(Rule):
        """List item rule."""
        type = 'listitem'

        def condition(self, block):
            return block[0] == '-'

        def action(self, block, handler):
            handler.start(self.type)
            # Drop the leading '-' before handing the text over.
            handler.feed(block[1:].strip())
            handler.end(self.type)
            return True

    class ListRule(ListItemRule):
        """List rule: opens and closes the list around its items."""
        type = 'list'
        inside = False

        def condition(self, block):
            return True

        def action(self, block, handler):
            if not self.inside and ListItemRule.condition(self, block):
                handler.start(self.type)
                self.inside = True
            elif self.inside and not ListItemRule.condition(self, block):
                handler.end(self.type)
                self.inside = False
            # False lets the parser continue with the remaining rules.
            return False

    class ParagraphRule(Rule):
        """Paragraph rule: matches any block the other rules did not claim."""
        type = 'paragraph'

        def condition(self, block):
            return True
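As a rough illustration of how the conditions sort blocks, the sketch below keeps only the condition methods and labels a few sample blocks with the type of the first rule that matches. The classify helper and the rule order are assumptions made for this example, and ListRule is left out because it only opens and closes the surrounding list.

```python
class HeadingRule:
    type = 'heading'
    def condition(self, block):
        # One line, at most 70 characters, not ending with a colon.
        return '\n' not in block and len(block) <= 70 and block[-1] != ':'

class TitleRule(HeadingRule):
    type = 'title'
    first = True
    def condition(self, block):
        # Only the first block checked can become the title.
        if not self.first:
            return False
        self.first = False
        return HeadingRule.condition(self, block)

class ListItemRule:
    type = 'listitem'
    def condition(self, block):
        return block[0] == '-'

class ParagraphRule:
    type = 'paragraph'
    def condition(self, block):
        return True

def classify(blocks, rules):
    """Hypothetical helper: label each block with the first matching rule."""
    labels = []
    for block in blocks:
        for rule in rules:
            if rule.condition(block):
                labels.append(rule.type)
                break
    return labels

# Assumed order: specific rules first, the catch-all paragraph rule last.
rules = [ListItemRule(), TitleRule(), HeadingRule(), ParagraphRule()]
sample = [
    'Welcome to ShiYanLou',
    'Course',
    '-Basic Course',
    'A longer block\nthat spans two lines.',
]
labels = classify(sample, rules)
print(labels)
```

The order matters: ParagraphRule accepts everything, so placing it anywhere but last would prevent the later rules from ever being reached.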
4. Parsing
Finally, we can do the parsing. The markup.py code is as follows:
    #!/usr/bin/python
    # encoding: utf-8

    import sys, re
    from handlers import *
    from util import *
    from rules import *

    class Parser:
        """Parser parent class."""
        def __init__(self, handler):
            self.handler = handler
            self.rules = []
            self.filters = []

        def addRule(self, rule):
            """Add a rule."""
            self.rules.append(rule)

        def addFilter(self, pattern, name):
            """Add a filter."""
            def filter(block, handler):
                return re.sub(pattern, handler.sub(name), block)
            self.filters.append(filter)

        def parse(self, file):
            """Parse the file block by block."""
            self.handler.start('document')
            for block in blocks(file):
                # Apply the inline filters first, then the block rules.
                for filter in self.filters:
                    block = filter(block, self.handler)
                for rule in self.rules:
                    if rule.condition(block):
                        last = rule.action(block, self.handler)
                        if last:
                            break
            self.handler.end('document')

    class BasicTextParser(Parser):
        """Plain text parser."""
        def __init__(self, handler):
            Parser.__init__(self, handler)
            # The rule registrations are missing from the source listing.
            # This order is inferred from rules.py: ListRule first (its
            # action returns False), ParagraphRule last as the catch-all.
            self.addRule(ListRule())
            self.addRule(ListItemRule())
            self.addRule(TitleRule())
            self.addRule(HeadingRule())
            self.addRule(ParagraphRule())

            self.addFilter(r'\*(.+?)\*', 'emphasis')
            self.addFilter(r'(http://[\.a-zA-Z/]+)', 'url')
            # The mail pattern is garbled in the source; this form is assumed.
            self.addFilter(r'([\.a-zA-Z]+@[\.a-zA-Z]+[a-zA-Z]+)', 'mail')

    # Run the program
    handler = HTMLRenderer()
    parser = BasicTextParser(handler)
    parser.parse(sys.stdin)
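The filter step can be tried in isolation. In the sketch below, addFilter stores a closure that remembers its pattern and name, and a hypothetical apply_filters method runs only the filter part of parse(). LinkHandler is a made-up stand-in for the HTML handler, and the character class a-zA-Z in the URL pattern is an assumption about the original casing.

```python
import re

class LinkHandler:
    """Hypothetical stand-in for the HTML handler; it only knows 'url'."""
    def sub(self, name):
        def substitution(match):
            if name == 'url':
                url = match.group(1)
                return '<a href="%s">%s</a>' % (url, url)
            return match.group(0)
        return substitution

class Parser:
    def __init__(self, handler):
        self.handler = handler
        self.filters = []

    def addFilter(self, pattern, name):
        # The closure captures pattern and name for later use.
        def apply_pattern(block, handler):
            return re.sub(pattern, handler.sub(name), block)
        self.filters.append(apply_pattern)

    def apply_filters(self, block):
        """Hypothetical helper: run every registered filter in order."""
        for apply_pattern in self.filters:
            block = apply_pattern(block, self.handler)
        return block

parser = Parser(LinkHandler())
parser.addFilter(r'(http://[\.a-zA-Z/]+)', 'url')
filtered = parser.apply_filters('-Web:http://www.shiyanlou.com')
print(filtered)
```

Filters run before the block rules, so the leading "-" is still present in the filtered text and the block can still be recognized as a list item afterwards.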
Run the program (the plain text file is test.txt; the generated HTML file is test.html):
    python markup.py < test.txt > test.html
V. Code Download
You can download the code for this course with the following command:
    git clone http://git.shiyanlou.com/shiyanlou/python_markup
VI. Summary
In this small program we used Python to parse a plain text file and generate an HTML file. It is only a simple implementation; as a next step, you can try extending it to parse Markdown files.