Use Python to work with TXT-formatted novels
Vim is a legendary tool, but sed and Vim cannot do everything. This article was inspired by "Re-typesetting TXT-format novels with Vim". Thanks!
I often download TXT ebooks whose formatting is not to my liking, so I have to fix them myself. The first thing to fix is the line breaks inside paragraphs.
My original plan was to define a custom Vim mode: switch into it when working on a novel and use a set of dedicated shortcut keys there, so that the TXT shortcuts would not interfere with day-to-day programming. It turned out that Vim, unlike Emacs, does not let you define your own modes. (A dedicated vimrc might solve this; I have not tried.)
So I turned to scripting. sed and awk are the obvious candidates, so I tried them first. Unfortunately I have forgotten most of the sed I once learned, and I could not come up with a concise, clear solution. sed, like grep, works line by line: it reads a line, strips the trailing \n, processes the line, writes the result, and adds the \n back. The N command appends the next line to the current pattern space so that it can be processed again, but I need to match across the whole file and could not find a way to do that for the moment.
So I had to turn to Python again. Python has its own re module, which should be up to the job: re.sub performs the replacement. Matching Chinese took some time. In Vim you can match double-byte characters with [^\x00-\xff], but that does not work in Python. After some Googling I found that [\x80-\xff] matches Chinese characters (Perl seems to support both forms).
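A small sketch of how this character class behaves, assuming Python 3, where text (str) and raw bytes are separate types. On bytes, [\x80-\xff] matches every byte of a Chinese character in both UTF-8 and GB2312; on a decoded str it only covers U+0080 to U+00FF and matches no Chinese at all. The range [\u4e00-\u9fff] (the basic CJK Unified Ideographs block) is one possible str-side alternative, not something the original approach uses. The helper names and the sample string are made up for the example.

```python
import re

# Byte-level class from the text: any byte with the high bit set.
HIGH_BYTE = re.compile(rb"[\x80-\xff]")

# Assumed alternative for decoded text: the basic CJK Unified Ideographs block.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")


def count_high_bytes(data):
    """Count bytes in the range 0x80-0xff in raw (undecoded) data."""
    return len(HIGH_BYTE.findall(data))


def count_cjk_chars(text):
    """Count characters of decoded text that fall in U+4E00-U+9FFF."""
    return len(CJK_CHAR.findall(text))


if __name__ == "__main__":
    sample = "中文 abc"  # hypothetical sample text
    print(count_high_bytes(sample.encode("utf-8")))   # 3 bytes per character in UTF-8
    print(count_high_bytes(sample.encode("gb2312")))  # 2 bytes per character in GB2312
    print(count_cjk_chars(sample))
    # On a str, the byte-style class finds nothing in this sample.
    print(re.findall(r"[\x80-\xff]", sample))
```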
At this point the problem is solved, at least as a first pass:
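#!/usr/bin/env python3
# encoding=utf-8
# Join hard-wrapped lines: drop a newline when the next line starts with a high byte.
import re
import sys

if __name__ == '__main__':
    if len(sys.argv) != 2:
        print('Usage: filename')
    else:
        # Read raw bytes so that [\x80-\xff] matches the bytes of Chinese characters.
        with open(sys.argv[1], 'rb') as fh:
            content = fh.read()
        out = re.sub(rb'\n([\x80-\xff])', rb'\1', content)
        # Write the bytes back unchanged, without re-encoding them.
        sys.stdout.buffer.write(out)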
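Normalizing headings (removing the spaces in front of Chinese characters):
#!/usr/bin/env python3
# encoding=utf-8
# Remove runs of spaces that come directly before a high byte.
import re
import sys

if __name__ == '__main__':
    if len(sys.argv) != 2:
        print('Usage: filename')
    else:
        # Read raw bytes so that [\x80-\xff] matches the bytes of Chinese characters.
        with open(sys.argv[1], 'rb') as fh:
            content = fh.read()
        out = re.sub(rb' +([\x80-\xff])', rb'\1', content)
        # Write the bytes back unchanged, without re-encoding them.
        sys.stdout.buffer.write(out)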
Of course, downloaded files are usually encoded in GB2312 and need to be converted to UTF-8 before further processing; see my note on Chinese encodings in Python.
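The conversion itself might look roughly like the sketch below. It assumes Python 3; the function names and the paths passed to them are hypothetical. The default codec is gb18030 rather than gb2312, on the assumption that files labelled GB2312 sometimes contain characters from the larger GBK/GB18030 sets, which gb18030 can also decode. Pass "gb2312" explicitly if strict decoding is preferred.

```python
def gb_to_utf8(data, source_encoding="gb18030"):
    """Decode GB-family bytes and re-encode them as UTF-8.

    Raises UnicodeDecodeError if the data is not valid in source_encoding.
    """
    return data.decode(source_encoding).encode("utf-8")


def convert_file(src_path, dst_path, source_encoding="gb18030"):
    """Read src_path as raw bytes and write a UTF-8 copy to dst_path."""
    with open(src_path, "rb") as src:
        raw = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(gb_to_utf8(raw, source_encoding))
```

For example, `convert_file("novel.txt", "novel.utf8.txt")` would write a UTF-8 copy next to the original; both file names are placeholders.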