0. Introduction
This came up in real application development: storing unformatted file data in a database. For conventional formatted data such as ini, json, or xml, ready-made libraries do the parsing. But what about unformatted data like this? Below are my thoughts and my implementation.
An excerpt of the data looks like this:
# head -n Input.txt
[url]http://epaper. Tianjinwe. com/mrxb/mrxb/ -- Geneva/ +/content_7566593. htm
New News "reporter Wang Jing correspondent Zhaoziqiang" to31.66Million square meters of old buildings for renovation, completed the two kindergarten, the full-year new employment3600People, urban and rural residents health insurance coverage rate up to -%...... Jinnan District Xian shui gu zhen .The annual service for the people of the 10 projects identified, involving infrastructure, education, environmental governance, the increase in residents ' protection, difficult mass life and many other aspects. This year, salt water will speed up the demonstration town construction process, the start of the four-Li Village residential demolition, the completion of the three period of Boya fashion16.5Million square meters also moved to complete supporting work, completed the east Zhang Zhuang, Beiyang Village also moved to work, the start of Jinfeng 四、五号 Library project31.15Million square meters also moved to work, to ensure that Wu Shiti, Li Zhangzi, Panzhong masses smooth still move. with
[url]http://epaper. Tianjinwe. com/mrxb/mrxb/ -- Geneva/ +/content_7566617. htm
2Month +Day of the week Two days Tianjin TV (101) +: -Happy Life Theatre: Starlight (5、6) +: -Happiness to knock on the door Day view1Nested102) -: -Urban reporting -Score of +: -The1Observation +:xxNews extension +: -Hit1Hour Day View2Nested103) -: +Colorful Theater: Husbands ' money ( -); our (1、2) +:TenMusic Vision Day Vision3Nested104) -:xxEight o'clock in the evening Theatre: the youth of the God of War ( +- at) A: +Evening Theatre: The Mountains and rivers ( the、 +) Day Vision5Nested106) -:xxHit +: -I am Chess ( at) +: *Scientific fitness a little pass +: $Lead A:xxThe king sees the brand day6Sets
1. Discussion of Ideas
1) Convert to formatted data.
The problem is how to format a large block of text that contains newlines and arbitrary special characters.
2) Read the file and store it in two separate variables. Given how the file is laid out, that simply means the URL as the key and the Chinese text as the value, stored in a map or HashMap. Whether the file is read in C++ or Java, this is a non-trivial amount of work, and I only had about one hour.
Given that, I chose a shell script to do the formatting.
The general idea is:
1) Keep the URL lines as they are, so they are easy to extract.
2) For the remaining unformatted text, delete blank lines, remove line breaks, and add a content tag so it is easy to extract.
3) Extract one URL at a time together with its content and build the required SQL statement.
2. Core Implementation Steps
Step 1: Format the text file
Prefix the line that follows each URL line with content= so it is easy to retrieve later.
sed -i '/^\[url/ { n; s/^/content=/; }' $RST_FILE
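To see what this step does, here is a minimal sketch on a made-up sample. The URLs and the text are placeholders, not the real data. It assumes GNU sed and reads from a pipe instead of editing a file in place.

```shell
#!/bin/bash
# Hypothetical sample: each [url] line is followed by its text.
sample='[url]http://example.com/a.htm
first article text
[url]http://example.com/b.htm
second article text'

# On every line that starts with "[url", print it, move to the next
# line (n), and prefix that next line with "content=".
tagged=$(printf '%s\n' "$sample" | sed '/^\[url/ { n; s/^/content=/; }')
printf '%s\n' "$tagged"
```

Only the first line after each URL gets the tag. If an article spans several lines, the remaining lines are joined to it later, when the newlines are replaced.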
Step 2: Delete empty lines
sed -i '/^$/d' $RST_FILE
Step 3: Extract the URLs
cat $RST_FILE | grep url > $URL_FILE
Step 4: Delete the URL lines that have been processed
sed -i '/url/d' $RST_FILE
Step 5: Replace newline characters with spaces
sed -i ':a;N;$ s/\n/ /g;ba' $RST_FILE
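The sed loop in this step is the least obvious command of the whole procedure, so here is a small sketch of how it behaves, again on made-up text and assuming GNU sed (other sed implementations may treat the failing N at end of input differently).

```shell
#!/bin/bash
# Hypothetical sample: one article spread over several lines.
sample='content=first line
second line
third line'

# :a defines a label; N appends the next input line to the pattern space;
# once the last line has been read ($), every embedded newline is
# replaced by a space; ba jumps back to the label.
joined=$(printf '%s\n' "$sample" | sed ':a;N;$ s/\n/ /g;ba')
printf '%s\n' "$joined"
```

The whole input ends up on a single line, which is why a line break has to be put back in front of each content tag afterwards.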
Step 6: Add a line break before each content tag
sed -i 's#content#\ncontent#g' $RST_FILE
Step 7: Extract the content into content.txt
cat $RST_FILE | grep content > $CONTENT_FILE
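Putting the seven steps together, the following sketch runs them on a throwaway copy of a made-up input. The sample records and the temporary directory are assumptions for illustration; the commands and variable names follow the steps above, and GNU sed is assumed for -i and for \n in the replacement.

```shell
#!/bin/bash
# Work in a temporary directory so no real file is touched.
workdir=$(mktemp -d)
RST_FILE=$workdir/input.txt
URL_FILE=$workdir/url.txt
CONTENT_FILE=$workdir/content.txt

# Hypothetical input: two records, one with a blank line and two text lines.
cat > "$RST_FILE" <<'EOF'
[url]http://example.com/a.htm
first article, line one

first article, line two
[url]http://example.com/b.htm
second article
EOF

sed -i '/^\[url/ { n; s/^/content=/; }' "$RST_FILE"   # step 1: tag the text
sed -i '/^$/d' "$RST_FILE"                              # step 2: drop blank lines
grep url "$RST_FILE" > "$URL_FILE"                      # step 3: save the URLs
sed -i '/url/d' "$RST_FILE"                             # step 4: remove URL lines
sed -i ':a;N;$ s/\n/ /g;ba' "$RST_FILE"                 # step 5: join all lines
sed -i 's#content#\ncontent#g' "$RST_FILE"              # step 6: one record per line
grep content "$RST_FILE" > "$CONTENT_FILE"              # step 7: save the contents

urls=$(cat "$URL_FILE")
contents=$(cat "$CONTENT_FILE")
printf '%s\n' "$urls" "$contents"
rm -rf "$workdir"
```

With this input, url.txt and content.txt each hold two lines, in matching order. Note that steps 3, 4, and 6 match the bare words url and content, so article text that happens to contain either word would be split or dropped incorrectly; a stricter pattern such as ^\[url\] would be one way to narrow that.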
3. Script Source Code
The URLs and the contents are split into two files, and each file is traversed line by line.
#!/bin/bash
# bash is required: the script uses the function keyword and a C-style for loop.
CONTENT_FILE=./content.txt
URL_FILE=./url.txt
RST_FILE=./input.txt

# Format the file
function format_process()
{
    sed -i '/^\[url/ { n; s/^/content=/; }' $RST_FILE
    sed -i '/^$/d' $RST_FILE
    cat $RST_FILE | grep url > $URL_FILE
    # Delete the url lines that have been processed
    sed -i '/url/d' $RST_FILE
    sed -i ':a;N;$ s/\n/ /g;ba' $RST_FILE
    sed -i 's#content#\ncontent#g' $RST_FILE
    cat $RST_FILE | grep content > $CONTENT_FILE
}

# Generate the SQL
function build_rstdate()
{
    mkdir -p ./output
    icnt=1
    cat $CONTENT_FILE | while read -r line
    do
        # Write each content to its own file and strip the content= tag
        echo "$line" > ./output/content_${icnt}.txt
        sed -i 's#content\=##g' ./output/content_${icnt}.txt
        icnt=$((icnt + 1))
        echo icnt=$icnt
    done

    export gcnt=0
    iurlcnt=0
    cat $URL_FILE | while read -r line
    do
        iurlcnt=$((iurlcnt + 1))
        # The loop runs in a subshell because of the pipe, so the counter
        # is passed back through a file.
        echo $iurlcnt > ./output/.cnts_rst.txt
        # Write each url to its own file and strip the [url] tag
        echo "$line" > ./output/url_${iurlcnt}.txt
        sed -i 's#\[url\]##g' ./output/url_${iurlcnt}.txt
    done
    gcnt=$(cat ./output/.cnts_rst.txt)
    echo gcnt=$gcnt

    # Build the SQL file
    cat /dev/null > update_sql.sql
    for ((i = 1; i <= gcnt; i++))
    do
        url=$(cat ./output/url_${i}.txt)
        content=$(cat ./output/content_${i}.txt)
        echo "update gather_rst set content='$content' where url='$url';" >> update_sql.sql
    done
}

format_process
build_rstdate
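One case the script does not cover: if the text contains a single quote (the sample data does, for example "residents ' protection"), the generated update statement is no longer valid SQL. A minimal sketch of one way to guard against that is to double every single quote before building the statement. The helper name sql_escape and the sample values are hypothetical; the table gather_rst comes from the script.

```shell
#!/bin/bash
# Hypothetical helper: double every single quote so the value can be
# placed inside a single-quoted SQL string literal.
sql_escape() {
    printf '%s' "$1" | sed "s/'/''/g"
}

# Made-up values for illustration.
url="http://example.com/a.htm"
content="residents' protection"

stmt="update gather_rst set content='$(sql_escape "$content")' where url='$(sql_escape "$url")';"
printf '%s\n' "$stmt"
```

Doubling the quote is the standard SQL way to embed it in a string literal. Depending on the database and its settings, other characters such as backslashes may need escaping too, so treat this as a starting point rather than a complete solution.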
Script for formatting an XML file
# cat build_input.sh
#!/bin/sh
sed -i 's#</content>#</contentsize>#g' input.xml
sed -i 's#<content>#<contentsize>#g' input.xml
sed -i 's#</snapshot>#</snapshotsize>#g' input.xml
sed -i 's#<snapshot>#<snapshotsize>#g' input.xml
# Turn the second <is_site_homepage> on each line into a closing tag
sed -i 's#<is_site_homepage>#</is_site_homepage>#2' input.xml
# Insert the XML declaration and the root element at the top of the file
# (double quotes inside the expression, so the shell does not strip them)
sed -i '1i\<?xml version="1.0" encoding="UTF-8"?>' input.xml
sed -i '2i\<HotNewsList>' input.xml
# Append the closing root tag at the end of the file
sed -i '$a \</HotNewsList>' input.xml
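As a quick check of the wrapping part, here is a minimal sketch on a made-up one-record file in a temporary directory, assuming GNU sed (the one-line forms of the i and a commands are GNU extensions). The record content is a placeholder. The XML declaration uses double quotes inside the single-quoted sed expression, because single quotes nested there would be consumed by the shell.

```shell
#!/bin/bash
# Temporary file with one hypothetical record.
workdir=$(mktemp -d)
xml=$workdir/input.xml
printf '%s\n' '<item><content>12</content></item>' > "$xml"

# Rename the tags, closing tag first.
sed -i 's#</content>#</contentsize>#g' "$xml"
sed -i 's#<content>#<contentsize>#g' "$xml"

# Wrap the records: declaration on line 1, root element on line 2,
# closing root tag after the last line.
sed -i '1i\<?xml version="1.0" encoding="UTF-8"?>' "$xml"
sed -i '2i\<HotNewsList>' "$xml"
sed -i '$a \</HotNewsList>' "$xml"

result=$(cat "$xml")
printf '%s\n' "$result"
rm -rf "$workdir"
```

The result is a four-line file: the declaration, the opening root tag, the renamed record, and the closing root tag.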
4. Summary
The shell is really powerful at text processing. I am still not fluent with some of the commands and need to keep practicing.
2017-02-22 22:36, at home, by the bed
Ming Yi World
When reprinting, please credit the source. Original address:
http://blog.csdn.net/laoyang360/article/details/56510665
"Lazy shell script" seven--format processing data into the database implementation