1. Preface
I believe many developers share this need: accurately extracting the required content from a web page. As far as I can tell, there are only the following methods:
1. Use a regular expression to match the required elements. (Disadvantage: if elements of the same type have different attributes, such as <div class = 'first'> aaa </div> <div class = 'last'> bbb </div>, matching all div elements becomes troublesome; it is easy to pick up unwanted results and to miss expected ones.)
2. Convert the web page into an XML document and query it with LINQ to XML. (Disadvantage: a conversion step is required, and it is not very efficient.)
3. Use the web services or Web API provided by the website to obtain the required data directly. (Disadvantage: you need the interface documentation first, which is generally not available to anonymous users.)
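The drawback of method 1 can be illustrated with a small sketch. It is written in JavaScript purely for illustration; the patterns and variable names are hypothetical and are not taken from the demo project.

```javascript
// Sample markup like the one in method 1.
const html = "<div class = 'first'> aaa </div> <div class = 'last'> bbb </div>";

// A pattern written for one specific attribute value misses the other div.
const onlyFirst = html.match(/<div class = 'first'>(.*?)<\/div>/g) || [];

// A looser pattern finds both divs here...
const anyDiv = html.match(/<div\b[^>]*>(.*?)<\/div>/g) || [];

// ...but on nested markup it stops at the first closing tag and returns
// a truncated fragment instead of the whole outer element.
const nested = "<div><div>x</div></div>".match(/<div\b[^>]*>(.*?)<\/div>/g) || [];

console.log(onlyFirst.length, anyDiv.length, nested[0]);
```

A selector such as `div` has neither problem, which is the motivation for the approach that follows.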
With the rise of front-end development in recent years, more and more people have come to know jQuery, a powerful tool, and have been won over by it. Its most important feature is the jQuery selector, which is concise, efficient, and easy to learn, and which greatly improves the productivity of front-end engineers. Think about it: extracting web page content means dealing with the front end. If we could use the jQuery selector for that, it would be perfect!
2. Theoretical preparation
Do we need to write a selector engine in .NET? No, that is beyond my abilities. Since jQuery already exists, why not simply use its selector?
1. Getting the web page content in .NET
Here we can use the WebBrowser control. It is essentially a miniature IE: whatever IE can do, it can do. Some may ask why we don't just download the page content directly with WebClient. See the second point.
2. Interaction between .NET and JS
The WebBrowser control not only provides the page content but also lets us interact with the page. Through its built-in Document property, we can inject the JS code we need into the page and execute it.
3. Extract and return the required content
In .NET, we can call the InvokeScript method of Document to execute the corresponding JS function and obtain its return value.
Now that all theories are ready, let's implement it.
3. Implementation
Test page: http://www.mmeinv.com/ (a "benefits" site, but with no indecent content; editors, please judge fairly!)
Functional requirement: extract all the "benefits"!
First, the result:
As the figure shows, the "benefits" are extracted accurately. You can also extract just the attribute values you need, and all you have to type is a string 15 characters long.
Next, let's look at the code:
wb is the WebBrowser control. The following method injects the jQuery library into pages that do not already include it.
void InjectJQuery()
{
    // Create a <script> element that loads jQuery and append it to the page body.
    HtmlElement jquery = wb.Document.CreateElement("script");
    jquery.SetAttribute("src", "http://ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js");
    wb.Document.Body.AppendChild(jquery);
    JQueryInjected = true;
}
Next comes the script element that holds the JS function to be executed. Because different requirements need different code, the element itself is created only once rather than injected repeatedly; when the requirement changes, only the function inside it is replaced.
// Reuse the script element if it was injected earlier; otherwise create it once.
JQScript = wb.Document.GetElementById("JQScript");
if (JQScript == null)
{
    JQScript = wb.Document.CreateElement("script");
    JQScript.SetAttribute("id", "JQScript");
    JQScript.SetAttribute("type", "text/javascript");
    wb.Document.Body.AppendChild(JQScript);
}
Here is the key code. Different JS is generated depending on whether an attribute is to be extracted. The injected code is very simple, and anyone familiar with the front end will understand it at a glance. The last statement executes the injected function and reads its return value.
string selector = txtSelector.Text;
string attribute = txtAttribute.Text.Trim();

// JS suffix that reads a value from a matched element:
// its outer HTML when no attribute is given, otherwise that attribute's value.
string read = attribute == string.Empty
    ? "[0].outerHTML"
    : ".attr('" + attribute + "')";

// Note: the selector and attribute are pasted between single quotes as-is,
// so input that itself contains a single quote will break the generated JS.
JQScript.SetAttribute("text",
    "function GetJQValue() {" +
    " if ($('" + selector + "').length == 1) {" +
    " return $('" + selector + "')" + read + "; }" +
    " else if ($('" + selector + "').length > 1) {" +
    " var allhtml = '';" +
    " $('" + selector + "').each(function() { allhtml = allhtml + $(this)" + read + " + '\\r\\n'; });" +
    " return allhtml; }" +
    " else return 'no item found.'; }");

// InvokeScript returns null when the script yields no value (for example,
// a missing attribute), so guard before converting to a string.
object result = wb.Document.InvokeScript("GetJQValue");
textBox2.Text = result == null ? string.Empty : result.ToString();
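Because the injected function is built by string concatenation, it is hard to read and breaks when the selector contains a single quote (for example `img[alt='a']`). The following is a minimal sketch, in plain JavaScript, of the same function with the quoting handled. The helper name `buildExtractorSource` is hypothetical, and the sketch simplifies the original slightly: multiple results are joined with `\r\n` without a trailing line break. In the C# host, an equivalent escaping step would be needed before the text is assigned to the script element.

```javascript
// Hypothetical helper: returns the source text of GetJQValue for a selector
// and an optional attribute name. JSON.stringify yields a quoted, escaped
// string literal, so quotes inside the selector cannot break the script.
function buildExtractorSource(selector, attribute) {
  const sel = JSON.stringify(selector);
  // What to read from each matched element.
  const read = attribute
    ? "$(this).attr(" + JSON.stringify(attribute) + ")"
    : "this.outerHTML";
  return [
    "function GetJQValue() {",
    "  var items = $(" + sel + ");",
    "  if (items.length === 0) return 'no item found.';",
    "  var values = [];",
    "  items.each(function () { values.push(" + read + "); });",
    "  return values.join('\\r\\n');",
    "}"
  ].join("\n");
}

// Print the generated function for a selector that contains single quotes.
console.log(buildExtractorSource("img[alt='a']", "src"));
```

The generated text is what would be placed in the injected script element; the call that runs it stays the same.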
I believe it is now clear to everyone: with just a few lines of code we get the powerful jQuery selectors, which I find several times more efficient than the older approaches. Why not use them?
4. Going further
1. If your front-end knowledge is solid enough, you can inject more complex functions to perform more complex content extraction.
2. In theory, the same technique can also be implemented on Android and iOS.
3. Maybe one day we will have a similar selector that replaces SQL for efficient database queries?
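As an illustration of point 1, here is a minimal sketch of a slightly more complex function that could be injected: it reads several attributes from each matched element and returns one tab-separated line per element. The name `extractRows` and its parameters are hypothetical; `$` is passed in as a parameter only so that the sketch can run outside a browser, whereas in an injected page it would be the global jQuery function.

```javascript
// Hypothetical extraction function: one line per matched element,
// one tab-separated field per requested attribute.
function extractRows($, selector, attributes) {
  var rows = [];
  $(selector).each(function () {
    var el = $(this);
    var fields = [];
    for (var i = 0; i < attributes.length; i++) {
      var value = el.attr(attributes[i]);
      // attr() returns undefined for a missing attribute; emit an empty field.
      fields.push(value === undefined ? "" : String(value));
    }
    rows.push(fields.join("\t"));
  });
  return rows.join("\r\n");
}
```

Returning plain delimited text keeps the result easy to read from InvokeScript and avoids depending on features that older IE document modes may lack.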
PS: My understanding is limited. If you spot any mistakes, please point them out!
PPS: One more thing: the copyright of this article belongs to me. If you reprint it, please credit the source and keep the link to the original article. Thank you!
Demo and source code: http://files.cnblogs.com/XiaoFaye/JQuerySelector.zip