Objective
Last week I crawled my web articles with Java, but I could not get the HTML-to-Markdown (MD) conversion working in Java. It took a full week to solve.
I do not have many blog posts, but I still dislike converting them by hand. Manual conversion wastes time that would be better spent on something else.
Design Ideas
Java Implementation
My first idea was to parse the HTML in Java myself: tag parsing, symbol parsing, regular-expression replacement, and so on. Before starting I searched GitHub, found that someone had already implemented it, and was delighted.
Code address
After downloading the project, you can run the HTMLTOHEXOMD method to test it.
The author probably defined the paths for a Linux server, so my tests kept failing with path errors. In the end I had to change the path-handling code in the converter.
After debugging, I ran it to generate the MD file, started the Hexo service locally, uploaded the generated file, and viewed it in the browser. I was not satisfied with the result, so I abandoned this approach.
Node.js Implementation
Why the sudden switch to Node.js? I happened to be reading a book on Node that covers crawlers and parsing crawled content, and it recommends the Cheerio module. I went straight to its API documentation. Cheerio is essentially a replica of jQuery, which makes this job very convenient, and I was delighted.
Implementation ideas
Convert a single article
Add custom parsing
Convert articles in batches
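The batch step can be sketched roughly as below. This is only an illustration using Node's built-in fs and path modules: convertDirectory and the convertOne callback are hypothetical names, and convertOne stands in for whatever single-article converter you already have (it takes an HTML string and returns a Markdown string).

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: convert every .html file in srcDir into a .md file in outDir.
// convertOne is a placeholder for the single-article converter.
function convertDirectory(srcDir, outDir, convertOne) {
  fs.mkdirSync(outDir, { recursive: true }); // create the output folder if needed
  const written = [];
  for (const file of fs.readdirSync(srcDir).sort()) {
    const ext = path.extname(file);
    if (ext.toLowerCase() !== '.html') continue; // skip anything that is not HTML
    const html = fs.readFileSync(path.join(srcDir, file), 'utf8');
    const target = path.join(outDir, path.basename(file, ext) + '.md');
    fs.writeFileSync(target, convertOne(html), 'utf8');
    written.push(target);
  }
  return written; // paths of the generated Markdown files
}
```

Keeping the single-article converter as a callback means the batch loop does not need to change when the parsing rules do.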
Analysis of difficulties
Custom parsing is the real headache: you have to analyze the format of the HTML being converted and decide which content needs to be read. I handle the h1, h2, h3, div, img, and a tags; you can extend this yourself.
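One way to keep the per-tag rules easy to extend is a lookup table of handlers. The sketch below is not my actual code, just an illustration: textOf, handlers, and toMarkdown are made-up names, and the nodes are plain objects with the type, name, data, attribs, and children fields that Cheerio exposes for parsed elements.

```javascript
// Collect the text of a node and all of its descendants.
function textOf(node) {
  if (node.type === 'text') return node.data;
  return (node.children || []).map(textOf).join('');
}

// One handler per tag name; add entries here to support more tags.
const handlers = new Map([
  ['h1', (n) => '# ' + textOf(n)],
  ['h2', (n) => '## ' + textOf(n)],
  ['h3', (n) => '### ' + textOf(n)],
  ['a', (n) => '[' + textOf(n) + '](' + n.attribs.href + ')'],
  ['img', (n) => '![image](' + n.attribs.src + ')'],
]);

// Convert one node; tags without a handler fall back to their plain text.
function toMarkdown(node) {
  const handler = handlers.get(node.name);
  return handler ? handler(node) : textOf(node);
}
```

Supporting another tag then only means adding one more entry to the table.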
The HTML parsing code is as follows
    if ('p' === name) {
        if (e_children.type === 'text') {
            if (e.children.length > 1) {
                // A paragraph with several children: emit links and text line by line
                for (var j = 0, c_len = e.children.length; j < c_len; j++) {
                    if (e.children[j]['name'] === 'a')
                        writeData = writeData + '(' + e.children[j].attribs.href + ')\r\n';
                    else if (e.children[j]['type'] === 'text')
                        writeData = writeData + e.children[j].data + '\r\n';
                }
            } else
                writeData = writeData + e.children[0].data + '\r\n';
        } else if (e_children.name === 'img')
            writeData = writeData + '![image](' + e.children[0].attribs.src + ')\r\n';
    } else if ('div' === name) {
        // Code blocks: take the next <pre> and strip the leading line numbers
        var codes = $('#cnblogs_post_body .cnblogs_code pre').eq(code_idx++).text();
        codes = codes.replace(/^(\s*)\d+/gm, '');
        writeData = writeData + '```bash\r\n' + codes + '\r\n```\r\n';
    } else if ('h1' === name)
        writeData = writeData + '# ' + e_children.data + '\r\n';
    else if ('h2' === name)
        writeData = writeData + '## ' + e_children.data + '\r\n';
    else if ('h3' === name)
        writeData = writeData + '### ' + e_children.data + '\r\n';
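The div branch above is the least obvious part, so here it is in isolation as a small sketch. stripLineNumbers and toFencedBlock are names I made up for illustration; the regular expression is the one from the code above, which removes the leading whitespace and digits at the start of each line.

```javascript
// Three backticks, built with repeat() so this example is easy to embed in Markdown.
const FENCE = '`'.repeat(3);

// Remove leading whitespace plus a run of digits at the start of every line.
function stripLineNumbers(code) {
  return code.replace(/^(\s*)\d+/gm, '');
}

// Wrap the cleaned code in a Markdown fence with the given language tag.
function toFencedBlock(code, lang) {
  return FENCE + lang + '\r\n' + stripLineNumbers(code) + '\r\n' + FENCE + '\r\n';
}
```

Note that the pattern cannot tell a line number from a line of code that genuinely starts with digits, so those digits would be removed as well.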
Conclusion
For the complete code, please see my GitHub. If this article was useful to you, please do not hesitate to give it a star.