Processing TXT-format novels with Python

Source: Internet
Author: User
Tags: python

Vim is indeed a superb tool, but sed and Vim cannot do everything. This article was inspired by the post "Re-typesetting TXT-format novels with Vim"; thanks to its author!

I often download TXT e-books whose formatting is not to my liking, so I have to clean them up myself. The first task is to deal with the line breaks inside paragraphs.

My original plan was to define a custom Vim mode: switch into it when working on a novel and then use a set of dedicated shortcut keys, so that the TXT shortcuts would not interfere with day-to-day programming. It turned out that Vim, unlike Emacs, does not let you define your own modes. (A dedicated vimrc might do the job, but I have not tried it.)

So I turned to scripting. sed and awk are the usual first choices, so I tried them first. Unfortunately, I have forgotten most of the sed I once knew and could not come up with a concise, clear solution. sed, like grep, works line by line: it reads a line, strips the trailing \n, processes the line, writes it out, and adds the \n back. The N command appends the next line to the current pattern space so the two can be processed together, but I need to match across the whole file and could not find a way to do that for the moment.
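The line-by-line limitation described above can be reproduced in Python as well. The sketch below is only an illustration: the sample text and the pattern (a newline followed by any other character) are hypothetical and not taken from the original scripts.

```python
import re

text = "第一行\n第二行\n"  # illustrative two-line input

# Hypothetical pattern: a newline that is followed by another character.
pattern = re.compile(r"\n(?=.)")

# Line by line: each piece ends at its own newline, so nothing follows it
# and the pattern never matches.
per_line = "".join(
    pattern.sub("", line) for line in text.splitlines(keepends=True)
)

# Whole file: the pattern can see across the line boundary.
whole = pattern.sub("", text)

print(per_line == text)
print(repr(whole))
```

Reading the whole file into one string, as the Python scripts in this article do, is what makes the cross-line match possible.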

So I turned to Python once more. Python ships with the re module, which should be up to the job: re.sub performs the replacement. Matching Chinese characters took some time. In Vim, double-byte characters can be matched with [^\x00-\xff], but that did not work in Python. After some Googling, I found that Chinese characters can be matched with [\x80-\xff] (Perl seems to support both forms). This works on byte strings because, in UTF-8, every byte of a Chinese character lies in the range 0x80-0xff.
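A small sketch of the two character classes, assuming Python 3 and UTF-8 text (the sample string and variable names are illustrative). On bytes, `[\x80-\xff]` matches each byte of a multi-byte character; on decoded `str` text, the Vim-style class `[^\x00-\xff]` matches whole characters above U+00FF.

```python
import re

sample = "小说 novel"         # illustrative text
raw = sample.encode("utf-8")  # bytes, as read from a UTF-8 file in binary mode

# Bytes pattern: matches any single byte in the range 0x80-0xff.
high_bytes = re.findall(rb"[\x80-\xff]", raw)

# Str pattern: matches any character above U+00FF.
wide_chars = re.findall(r"[^\x00-\xff]", sample)

print(len(high_bytes))  # each of the two characters takes 3 bytes in UTF-8
print(wide_chars)
```

Which form to use depends on whether the file is processed as raw bytes or as decoded text.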

With that, the problem is basically solved:


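#!/usr/bin/env python3
# encoding=utf-8
import re
import sys

if __name__ == '__main__':
    if len(sys.argv) != 2:
        print('Usage: filename')
    else:
        # Read raw bytes so that the byte range \x80-\xff can match UTF-8 text.
        with open(sys.argv[1], 'rb') as fh:
            content = fh.read()
        # Drop every newline that is directly followed by a non-ASCII byte,
        # i.e. join a line break that falls inside a Chinese paragraph.
        out = re.sub(rb'\n([\x80-\xff])', rb'\1', content)
        sys.stdout.buffer.write(out)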
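The same substitution can be wrapped in a function and tried on a short sample. This is a sketch assuming Python 3 and UTF-8 bytes; `join_wrapped_lines` and the sample text are hypothetical names chosen for illustration.

```python
import re

def join_wrapped_lines(data: bytes) -> bytes:
    """Remove each newline that is directly followed by a non-ASCII byte."""
    return re.sub(rb"\n([\x80-\xff])", rb"\1", data)

# A paragraph broken in the middle, a blank line, then an ASCII line.
sample = "这是一段被\n硬换行的文字。\n\nEnd\n".encode("utf-8")
joined = join_wrapped_lines(sample)
print(joined.decode("utf-8"))
```

Note that any newline directly followed by a Chinese character is joined, so a new paragraph that starts at the first column, with no indentation or blank line before it, would be merged into the previous one as well.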


Normalizing the start of each paragraph:


#!/usr/bin/env python3
# encoding=utf-8
import re
import sys

if __name__ == '__main__':
    if len(sys.argv) != 2:
        print('Usage: filename')
    else:
        # Read raw bytes so that the byte range \x80-\xff can match UTF-8 text.
        with open(sys.argv[1], 'rb') as fh:
            content = fh.read()
        # Remove runs of spaces that directly precede a non-ASCII byte.
        out = re.sub(rb' +([\x80-\xff])', rb'\1', content)
        sys.stdout.buffer.write(out)
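A sketch of this second substitution on sample data, assuming it is meant to strip ASCII spaces in front of a Chinese character and that the input is UTF-8 bytes in Python 3; `strip_spaces_before_cjk` is a hypothetical name.

```python
import re

def strip_spaces_before_cjk(data: bytes) -> bytes:
    """Delete runs of ASCII spaces that come right before a non-ASCII byte."""
    return re.sub(rb" +([\x80-\xff])", rb"\1", data)

sample = "    第一段\n  第二段\n".encode("utf-8")
cleaned = strip_spaces_before_cjk(sample)
print(cleaned.decode("utf-8"))
```

One caveat: the full-width space U+3000, often used to indent Chinese paragraphs, is itself encoded in UTF-8 as three bytes in the 0x80-0xff range, so this pattern leaves it in place.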


Of course, downloaded files are usually encoded in GB2312 and need to be converted to UTF-8 before further processing; see my notes on Chinese encodings in Python.
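A minimal sketch of that conversion step, assuming Python 3. The function name `gb_to_utf8` is hypothetical, and decoding with the `gb18030` codec rather than `gb2312` is a choice made here (GB18030 is a superset of GB2312 and GBK), not something taken from the original notes.

```python
def gb_to_utf8(data: bytes) -> bytes:
    """Decode GB-encoded bytes and re-encode them as UTF-8."""
    # gb18030 is a superset of gb2312, so it also decodes GBK-encoded files.
    return data.decode("gb18030").encode("utf-8")

sample_gb = "中文小说".encode("gb2312")  # stand-in for a downloaded file's bytes
converted = gb_to_utf8(sample_gb)
print(converted.decode("utf-8"))
```

The converted bytes can then be passed to the substitutions shown earlier, which rely on the UTF-8 byte layout.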
