HTML2MD for a Web Crawler

Objective

Last week I crawled some web articles with Java, but I could not get Java to convert the HTML into Markdown; it took a full week to solve.

Although I do not have many blog posts, I refuse to convert them by hand; after all, manual conversion wastes time that would be better spent on something else.

Design Ideas

Java Implementation

At the beginning of the idea is to use Java to parse the HTML, thinking of a variety of label parsing, symbolic parsing, regular replacement, and so on, decided to search for a wave on GitHub, it is true that there are predecessors realized, immediately ecstatic;

Code address

After downloading it, the HTMLTOHEXOMD method can be used for a test run.

Perhaps the author defined the paths for a Linux server; my tests kept failing with path errors, so I was forced to change the path-handling code in the converter.

After debugging I ran it and generated the MD file, started the Hexo service locally, uploaded the freshly generated MD file, and browsed the result. Unsatisfied, I discarded this approach.

Node.js Implementation

Why suddenly switch to Node.js? I had recently been reading Node books, which mentioned writing crawlers in Node and parsing the crawled content with the Cheerio module. I decisively browsed its API documentation: Cheerio is essentially a server-side replica of jQuery, which makes things very convenient. My heart exulted.
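To illustrate the jQuery-style extraction Cheerio offers (real usage would be `const $ = cheerio.load(html); $('h1').text()`, but Cheerio is an npm module, so this dependency-free sketch emulates that one call with a regex; the `textOf` helper and sample HTML are my own illustration, not from the original project):

```javascript
// Dependency-free sketch of what Cheerio's $(tag).text() returns
// for the first matching element.
function textOf(html, tag) {
  // Non-greedy match of the first <tag ...>...</tag> pair.
  var re = new RegExp('<' + tag + '[^>]*>([\\s\\S]*?)</' + tag + '>', 'i');
  var m = html.match(re);
  return m ? m[1].trim() : '';
}

var sample = '<h1>Hello Hexo</h1><p>body</p>';
console.log(textOf(sample, 'h1')); // prints "Hello Hexo"
console.log(textOf(sample, 'p'));  // prints "body"
```

A regex is enough for a demo, but real HTML nests arbitrarily, which is exactly why a DOM-style parser like Cheerio is the right tool for the actual converter.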

Implementation ideas

Implement single-file conversion

Custom parsing

Implement batch conversion

Difficulty Analysis

Custom parsing is the headache: you have to analyze the format of the HTML to be converted and work out which content needs to be extracted. I handle the h1, h2, h3, div, img, and a tags; you can extend this yourself.

The HTML parsing code is as follows:

if ('p' === name) {
    // e_children is assumed to be e.children[0], the first child node.
    if (e_children.type === 'text') {
        if (e.children.length > 1) {
            for (var j = 0, c_len = e.children.length; j < c_len; j++) {
                if (e.children[j]['name'] === 'a')
                    writeData = writeData + '(' + e.children[j].attribs.href + ')\r\n';
                else if (e.children[j]['type'] === 'text')
                    writeData = writeData + e.children[j].data + '\r\n';
            }
        } else
            writeData = writeData + e.children[0].data + '\r\n';
    } else if (e_children.name === 'img')
        writeData = writeData + '![Image](' + e.children[0].attribs.src + ')\r\n';
} else if ('div' === name) {
    // Pull the matching code block, strip leading line numbers, and fence it.
    var codes = $('#cnblogs_post_body .cnblogs_code pre').eq(code_idx++).text();
    codes = codes.replace(/^(\s*)\d+/gm, '');
    writeData = writeData + '```bash\r\n' + codes + '\r\n```\r\n';
} else if ('h1' === name)
    writeData = writeData + '# ' + e_children.data + '\r\n';
else if ('h2' === name)
    writeData = writeData + '## ' + e_children.data + '\r\n';
else if ('h3' === name)
    writeData = writeData + '### ' + e_children.data + '\r\n';
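The tag rules above (h1/h2/h3 to #/##/###, img to ![Image](src), a to its href) can be restated as a small self-contained converter; this is my own regex-based sketch so it runs on bare Node without Cheerio, not the project's actual DOM-walking code:

```javascript
// Minimal HTML-to-Markdown rules mirroring the handlers above.
function htmlToMd(html) {
  return html
    .replace(/<h1[^>]*>([\s\S]*?)<\/h1>/gi, '# $1\r\n')
    .replace(/<h2[^>]*>([\s\S]*?)<\/h2>/gi, '## $1\r\n')
    .replace(/<h3[^>]*>([\s\S]*?)<\/h3>/gi, '### $1\r\n')
    .replace(/<img[^>]*src="([^"]*)"[^>]*>/gi, '![Image]($1)\r\n')
    .replace(/<a[^>]*href="([^"]*)"[^>]*>[\s\S]*?<\/a>/gi, '($1)\r\n')
    // Drop bare <p> wrappers (but not <pre>): their text is kept as-is.
    .replace(/<\/?p(\s[^>]*)?>/gi, '');
}

console.log(htmlToMd('<h2>Setup</h2><p><a href="https://example.com">link</a></p>'));
// prints "## Setup" and "(https://example.com)" on separate lines
```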
Conclusion

The complete code is on my GitHub; if this article was useful to you, please do not hesitate to star it.
