Use C# to deserialize HTML and extract specific content from it

Source: Internet
Author: User

Recently I had a project that needed to pull some data down from someone else's website. (In fact, it was for looking up a schedule.)

It is not difficult to get the HTML source via HTTP GET. The difficulty is that the content of this web page is very messy and its format sometimes changes.

I started with Python, where I could build a DOM directly from the page. The simplest approach of all is jQuery, which makes it easy to pick out specific content on a site.

But this project is written in C# with ASP.NET. I looked at a lot of approaches, and all of them were laborious.

Here are a couple of those laborious approaches, described as a warning not to copy them.

Scenario One: Create classes that mirror the website, then use XML deserialization

This method is very rigid. Just defining the corresponding classes can take a whole day. The painful part is that C# is a strongly typed language: there is no way to manipulate the parsed object directly, so you have to create the classes first.

Some tools, such as XPath, seem to make this process a little more comfortable.
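To make the cost of Scenario One concrete, here is a minimal sketch using the standard XmlSerializer. The ScheduleTable class and the tiny input are hypothetical, and the input is assumed to be well-formed XML, which real-world HTML usually is not.

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

// Hypothetical class that mirrors one small, well-formed fragment of a page.
// Every tag you care about needs a class or member like this.
[XmlRoot("table")]
public class ScheduleTable
{
    // Each <tr> element becomes one string in the array.
    [XmlElement("tr")]
    public string[] Rows { get; set; }
}

public static class XmlScenario
{
    // Deserialize the markup into the strongly typed class defined above.
    public static ScheduleTable Parse(string xml)
    {
        var serializer = new XmlSerializer(typeof(ScheduleTable));
        using (var reader = new StringReader(xml))
        {
            return (ScheduleTable)serializer.Deserialize(reader);
        }
    }

    public static void Main()
    {
        var table = Parse("<table><tr>OneLine</tr><tr>TwoLines</tr></table>");
        foreach (var row in table.Rows)
        {
            Console.WriteLine(row);
        }
    }
}
```

Even this two-row table needs its own class, and any change to the page layout means changing the class as well.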

Scenario Two: Use Substring and regular expressions to process the HTML

This is a genuinely practical way to write it. But the regular expressions and Substring calls I wrote were too difficult to maintain. The code was ugly and inefficient, and it often threw baffling exceptions.
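As an illustration of Scenario Two, the sketch below pulls the text of every table row out of a string with one regular expression. The ExtractRows helper and its pattern are hypothetical, not code from the project.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexScenario
{
    // Collect the inner text of every <tr>...</tr> pair.
    // The pattern breaks as soon as rows are nested or the markup is malformed.
    public static List<string> ExtractRows(string html)
    {
        var rows = new List<string>();
        var pattern = @"<tr\b[^>]*>(.*?)</tr>";
        var options = RegexOptions.Singleline | RegexOptions.IgnoreCase;
        foreach (Match match in Regex.Matches(html, pattern, options))
        {
            rows.Add(match.Groups[1].Value.Trim());
        }
        return rows;
    }

    public static void Main()
    {
        var html = "<table><tr>OneLine</tr><tr class=\"x\"> TwoLines </tr></table>";
        foreach (var row in ExtractRows(html))
        {
            Console.WriteLine(row);
        }
    }
}
```

It works for flat markup like this, but every new layout quirk tends to need another special case in the pattern, which is where the maintenance trouble comes from.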

A more efficient solution

Finally, I found a more workable method: a library called HSharp, which lets C# handle HTML the way a weakly typed language would.

HSharp on GitHub: https://github.com/Obisoft2017/HSharp

It looks pretty good: open source, free, and the source code is very short.

A similar library is HtmlAgilityPack, but it is a lot more complicated. C# also comes with its own HTML document class, which is more official but relies on the WebBrowser object and is painfully slow.

I mainly use HSharp for HTML "deserialization".

(Strictly speaking, it is not strongly typed deserialization; it puts the tags into a dictionary instead. At least the result can still be queried.)

First, install the HSharp package from the Package Manager Console in Visual Studio.

Then call its HtmlConvert.DeserializeHtml method directly to deserialize the HTML. The example on the official page is quite clear.

    using Obisoft.HSharp.Models;
    using System;

    namespace Obisoft.HSharp
    {
        class Example
        {
            public static void Main(string[] args)
            {
                // Parse an HTML string into a queryable document.
                var NewDocument = HtmlConvert.DeserializeHtml($@"
    <html>
    <head>
        <meta charset={"\"utf-8\""}>
        <meta name={"\"viewport\""}>
        <title>Example</title>
    </head>
    <body>
    Some Text
    <table>
        <tr>OneLine</tr>
        <tr>TwoLines</tr>
        <tr>ThreeLines</tr>
    </table>
    Other Text
    </body>
    </html>");

                // Index by tag name; the optional number picks the n-th tag with that name (zero-based).
                Console.WriteLine(NewDocument["html"]["head"]["meta", 0].Properties["charset"]);
                Console.WriteLine(NewDocument["html"]["head"]["meta", 1].Properties["name"]);

                // Iterate over the children of the table and print the content of each one.
                foreach (var Line in NewDocument["html"]["body"]["table"])
                {
                    Console.WriteLine(Line.Son);
                }
            }
        }
    }

The code above deserializes the HTML and prints the charset property of the first meta tag and the name property of the second. It then iterates through all the child elements of the table and prints the content of each one.

Output:

    utf-8
    viewport
    OneLine
    TwoLines
    ThreeLines
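The name-and-index lookups in the example can be understood through a small sketch of a weakly typed node tree. The Node class below is hypothetical and written only for illustration; it is not HSharp's actual implementation, and it assumes a missing tag should simply throw.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical node type: every tag is the same class, so no per-page classes are needed.
public class Node
{
    public string Tag { get; }
    public Dictionary<string, string> Properties { get; } = new Dictionary<string, string>();
    public List<Node> Children { get; } = new List<Node>();

    public Node(string tag) { Tag = tag; }

    // node["head"] returns the first child with that tag name.
    public Node this[string tag] => this[tag, 0];

    // node["meta", 1] returns the second child with that tag name (zero-based).
    // ElementAt throws if there are not enough matching children.
    public Node this[string tag, int index] =>
        Children.Where(child => child.Tag == tag).ElementAt(index);
}

public static class NodeDemo
{
    // Build by hand the tree that a parser would produce for a small page.
    public static Node BuildSample()
    {
        var first = new Node("meta");
        first.Properties["charset"] = "utf-8";
        var second = new Node("meta");
        second.Properties["name"] = "viewport";

        var head = new Node("head");
        head.Children.Add(first);
        head.Children.Add(second);

        var html = new Node("html");
        html.Children.Add(head);

        var root = new Node("document");
        root.Children.Add(html);
        return root;
    }

    public static void Main()
    {
        var doc = BuildSample();
        Console.WriteLine(doc["html"]["head"]["meta", 0].Properties["charset"]);
        Console.WriteLine(doc["html"]["head"]["meta", 1].Properties["name"]);
    }
}
```

Because every lookup is by string, the compiler cannot check tag names; a typo or a changed page layout only shows up at run time.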

Back to the project

I tried to deserialize a real site, taking Obisoft's homepage as the example. Their website is quite complex.

Suppose we want to get some arbitrary value from their site. Here I picked an ordinary number on the page.

After examining their markup carefully, I found that this number sits in a span nested several divs deep inside the section tag with index 5. The code:

    // Download and deserialize the page, then walk down to the span that holds the number.
    var WebsiteResult = HtmlConvert.DeserializeHtml(new Uri("http://www.obisoft.com.cn/"));
    Console.WriteLine(WebsiteResult["html"]["body"]["section", 5]["div"]["div"]["div"]["div"]["div"]["div", 2]["span"].Son);

Sure enough, the program printed 10, and it ran quickly.

This library is really useful and powerful enough for my needs, so I plan to study it further and write some interfaces around it. It seems to offer many more methods, and it should also be able to build HTML.
