I originally planned to analyze the page inside the event handler, but after half a day of trying, the handler was never called. A search finally solved the problem, and I would like to thank the original author. The key point is that the WebBrowser control must Navigate to a page before you can obtain wb.Document (an HtmlDocument) and then analyze its elements and tags.

In practice, scraping rarely involves a single page. A single page can be handled in the main form. But if you collect several list pages, each with N or more pages, and loop through them, the form appears to hang while WebBrowser is responding. At this point we naturally think of using multiple threads. As we all know, C# threads have two apartment models, STA and MTA. However, the WebBrowser control has an awkward characteristic: it can only run on an STA thread, as in the following code:
Thread tread = new Thread(new ParameterizedThreadStart(BeginCatch));
tread.SetApartmentState(ApartmentState.STA);
tread.Start(url);
Code
private void BeginCatch(object obj)
{
    string url = obj.ToString();
    WebBrowser wb = new WebBrowser();
    wb.ScriptErrorsSuppressed = true;
    // Subscribe before navigating so the event cannot be missed.
    wb.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(wb_DocumentCompleted);
    wb.Navigate(url);
}
When you need to analyze the HtmlDocument generated by WebBrowser, you must do it in the DocumentCompleted event, because only then has WebBrowser finished loading. But this is a trap! When WebBrowser runs on a separate STA thread as above, the thread method returns without waiting for DocumentCompleted, so the handler never runs and the follow-up work cannot be done. What should we do in this case? Someone may think of the wb.Document.Write(string) method, as follows:
Code
private void BeginCatch(object obj)
{
    string url = obj.ToString();
    WebBrowser wb = new WebBrowser();
    wb.ScriptErrorsSuppressed = true;
    string htmlCode = GetHtmlSource(url);
    wb.Document.Write(htmlCode);
    // Perform the analysis operation
}
// Obtain the web page source code with WebClient
private string GetHtmlSource(string url)
{
    string text1 = "";
    try
    {
        System.Net.WebClient wc = new System.Net.WebClient();
        text1 = wc.DownloadString(url);
    }
    catch (Exception exception1)
    {
        // Download errors are ignored; an empty string is returned.
    }
    return text1;
}
But at this point we find that wb.DocumentText is always empty. I was quite frustrated. I searched community blog articles and MSDN, and all of them said that a value can be assigned through DocumentText, yet none of the many suggestions I found online worked for me. Eventually another search turned up a useful article that mentioned an example. After testing, I found that WebBrowser must Navigate before a Document is generated. With that, the multithreaded operation can finally be implemented. The final code is as follows:
Code
private void ThreadWebBrowser(string url)
{
    Thread tread = new Thread(new ParameterizedThreadStart(BeginCatch));
    tread.SetApartmentState(ApartmentState.STA);
    tread.Start(url);
}
private void BeginCatch(object obj)
{
    string url = obj.ToString();
    WebBrowser wb = new WebBrowser();
    wb.ScriptErrorsSuppressed = true;
    // Navigate to a blank page here so that wb.Document is created
    wb.Navigate("about:blank");
    string htmlCode = GetHtmlSource(url);
    wb.Document.Write(htmlCode);
    // Perform the analysis ...... (Omitted)
}
// Obtain the web page source code with WebClient
private string GetHtmlSource(string url)
{
    string text1 = "";
    try
    {
        System.Net.WebClient wc = new System.Net.WebClient();
        text1 = wc.DownloadString(url);
    }
    catch (Exception exception1)
    {
        // Download errors are ignored; an empty string is returned.
    }
    return text1;
}
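For comparison, here is a different technique from the one in the listing above: instead of downloading the HTML separately, the STA thread can be kept alive with its own message loop so that DocumentCompleted has a chance to fire. This is only a minimal sketch under assumptions: the class name StaBrowserSketch, the method Load, and the onLoaded callback are hypothetical names introduced for illustration, and the code has to run on Windows with Windows Forms available.

```csharp
using System;
using System.Threading;
using System.Windows.Forms;

static class StaBrowserSketch
{
    // Hypothetical helper: loads url on a dedicated STA thread and passes the
    // loaded document to onLoaded from inside the DocumentCompleted handler.
    public static Thread Load(string url, Action<HtmlDocument> onLoaded)
    {
        Thread thread = new Thread(() =>
        {
            using (WebBrowser wb = new WebBrowser())
            {
                wb.ScriptErrorsSuppressed = true;
                wb.DocumentCompleted += (sender, e) =>
                {
                    // The event can fire once per frame; wait for the whole page.
                    if (wb.ReadyState != WebBrowserReadyState.Complete) return;
                    onLoaded(wb.Document);
                    Application.ExitThread(); // end this thread's message loop
                };
                wb.Navigate(url);
                // Pump messages on this thread until ExitThread is called.
                Application.Run();
            }
        });
        thread.SetApartmentState(ApartmentState.STA);
        thread.Start();
        return thread;
    }
}
```

The callback runs on the browser's own thread, so any analysis of the HtmlDocument should happen inside it, before the message loop ends and the control is disposed.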
Of course, when each node is processed and database operations are performed in the thread, using ThreadPool may give better results and performance. Finally, one suggestion for improvement: for web pages that contain Chinese characters, the data returned by the GetHtmlSource(string url) method given above may be garbled. I download the data with WebClient and set the appropriate character set, and the garbled characters disappear.
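The character set fix can be sketched roughly as follows. This is an illustrative variant, not the author's exact code: the class name HtmlFetcher and the charset parameter are assumptions added here, and the right charset (for example "gb2312" or "utf-8") depends on the page being downloaded.

```csharp
using System;
using System.Net;
using System.Text;

static class HtmlFetcher
{
    // Hypothetical variant of GetHtmlSource that takes an explicit character set.
    public static string GetHtmlSource(string url, string charset)
    {
        try
        {
            using (WebClient wc = new WebClient())
            {
                // Tell WebClient how to decode the downloaded bytes.
                // Note: on .NET Core and later, code pages such as gb2312 may need
                // an extra encoding provider to be registered first.
                wc.Encoding = Encoding.GetEncoding(charset);
                return wc.DownloadString(url);
            }
        }
        catch (WebException)
        {
            // As in the original method, a failed download yields an empty string.
            return "";
        }
    }
}
```

A call such as HtmlFetcher.GetHtmlSource(url, "gb2312") would then replace the original GetHtmlSource(url) call for pages served in that encoding.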