(Several readers mentioned that the program would freeze. The network environment I tested in was probably in good shape and the machine was configured well, so I never saw the freezing myself. As soon as I changed environments, the hang was obvious, hence this multi-threaded version.)
First of all, thanks to everyone for the guidance. The program has been reworked into a multi-threaded version, which solved the freezing problem, and the code has been optimized in a few places as well. Previously the first page was captured with a separate request; while making the changes I realized the first-page data can also be captured with the method described below, so that is what the program does now.
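The project's actual threading code is not reproduced here, but the idea behind the fix is simply to keep the crawl off the UI thread so the form stays responsive. Below is a minimal sketch of that idea (requires using System.Threading); the handler name btnStart_Click, the lblProgress label, the totalPages field and the CrawlPage method are all my own placeholders, not names from the real project.

// Minimal sketch of the multi-threading idea, not the project's actual code.
// The crawl runs on a background thread so the WinForms UI no longer freezes,
// and progress is marshalled back to the UI with Control.Invoke.
private void btnStart_Click(object sender, EventArgs e)      // hypothetical handler name
{
    Thread worker = new Thread(delegate()
    {
        for (int page = 1; page <= totalPages; page++)       // totalPages: assumed field
        {
            CrawlPage(page);                                  // assumed method that captures one page
            this.Invoke((MethodInvoker)delegate
            {
                lblProgress.Text = "Captured page " + page;   // hypothetical progress label
            });
        }
    });
    worker.IsBackground = true;
    worker.Start();
}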
After almost two weeks of on-and-off work it is finally done. It may hold no value or challenge for most people, but for someone who has only just come to WinForms it brings a small sense of accomplishment.
First, a couple of screenshots of the program.
The interface is fairly simple and not much to look at yet. There is also a radio button for other people's spaces; that feature is not implemented yet and is meant as a later extension. Of the two text boxes, the first is the QQ number and the second is the total number of log pages. I spent a long time trying to work out the log page count from the page itself and could not, so for now the user has to enter it. A space with access restrictions cannot be captured, even if you know the answer to its access question; that would require adding a password input box and simulating the owner's permissions to crawl it.
Enter the QQ number and the number of log pages (neither the QQ number nor the page count is validated yet -_-!!), and the page automatically jumps to that space's log list page. Click Start Crawling at the top and the program runs on its own; the gray area below shows the capture progress.
The coding itself was relatively quick; most of the time went into analysis. QQ space is a rather unusual blog: the whole page is essentially one big log list in the middle of the master page, and the log content, photo albums and so on are all loaded dynamically with Ajax requests. That also answered something I had always wondered about, namely why a QQ space with so many pictures and so much content does not seem to consume that many resources; the Ajax-loaded content is held in memory.
At first I planned to solve this with a console program, because I had written a small spider for work and thought a few changes would be enough. But the old program's logic was fairly complex and I did not know where to start changing it, so after talking it over with fellow bloggers I chose to use WebBrowser instead.
The WebBrowser control's DocumentText property can return the page text directly, but the problem is that it cannot see the dynamically loaded content at all. In Firebug the loaded content sits inside an IFRAME, and at the time I assumed that grabbing the IFRAME's SRC would be enough, so I spent two evenings trying all sorts of ways to read the IFRAME's content through WebBrowser. It had not yet occurred to me that the IFRAME itself is added dynamically by Ajax, which is why there was no way to get at it.
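To make the limitation concrete, here is a tiny illustration of my own (not code from the project), assuming a WebBrowser control named webBrowser1: DocumentText only reflects the markup the control currently holds, so an IFRAME that Ajax injects after load simply does not show up in a plain read of the source.

// Tiny illustration only: webBrowser1 and the "blogTitle" marker are made up for this example.
// DocumentText returns the page source the WebBrowser currently holds, so content that Ajax
// injects later, such as the log IFRAME, does not appear in a simple read like this.
string html = webBrowser1.DocumentText;
bool found = html.IndexOf("blogTitle") >= 0;
Console.WriteLine(found ? "log markup present" : "log markup not in DocumentText");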
I kept searching online and happened to see a question about how to get all of the dynamically loaded content. The question had no answer, but it made me certain that the log section of QQ space is loaded dynamically. So I calmed down, carefully analysed the requests shown in Firebug, assembled the parameters from Firebug into a query string, and wrote a GET request of my own. That gave me the log content, except that what came back was Gzip-compressed, so the next round of searching was for how to decompress it. Once the decompression worked, what remained was a pile of regular-expression matching. Enough talk; here is a flowchart first.
Personally I think the most important part of the code is sending the request and then running regular expressions over the returned HTML to pull out the parts I want.
/// <summary>
/// Send a GET request
/// </summary>
/// <param name="url">Requested URL</param>
/// <param name="parameters">Collection of request parameters</param>
/// <param name="reqEncode">Request encoding</param>
/// <param name="resEncode">Response encoding</param>
/// <returns>The response body as a string</returns>
public static string SendGetRequest(string url, NameValueCollection parameters, Encoding reqEncode, Encoding resEncode)
{
    // Build the query string from the parameter collection.
    StringBuilder parasSb = new StringBuilder();
    if (null != parameters)
    {
        foreach (string key in parameters.Keys)
        {
            if (parasSb.Length > 0)
                parasSb.Append("&");
            parasSb.AppendFormat("{0}={1}", HttpUtility.UrlEncode(key), HttpUtility.UrlEncode(parameters[key]));
        }
    }
    if (parasSb.Length > 0)
    {
        url += "?" + parasSb;
    }
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
    req.Method = "GET";
    req.Accept = "*/*";
    req.Headers.Add("Accept-Encoding: gzip");
    req.KeepAlive = true;
    req.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)";
    req.MaximumAutomaticRedirections = 3;
    req.Timeout = 600000;
    string result = string.Empty;
    // Read the response.
    HttpWebResponse response = (HttpWebResponse)req.GetResponse();
    // Read the whole response body into a MemoryStream.
    MemoryStream ms = new MemoryStream();
    Stream res = response.GetResponseStream();
    byte[] buffer = new byte[8192];
    while (true)
    {
        int read = res.Read(buffer, 0, 8192);
        if (read == 0)
        {
            // With this server's gzip response, the only way to tell the body has
            // ended is that no more data can be read.
            Console.WriteLine("Response ended (zero-byte read).");
            break;
        }
        else
        {
            ms.Write(buffer, 0, read);
        }
    }
    // Rewind ms before decompressing.
    ms.Seek(0, SeekOrigin.Begin);
    // Wrap it in a GZipInputStream.
    GZipInputStream gzip = new GZipInputStream(ms);
    MemoryStream ms2 = new MemoryStream();
    try
    {
        // Read one byte at a time: the gzip stream from this server has no footer, so an
        // error occurs at the very end. Byte-by-byte reads keep everything up to the last byte.
        byte[] buffer1 = new byte[1];
        while (true)
        {
            int read = gzip.Read(buffer1, 0, 1);
            if (read == 0) break;
            ms2.Write(buffer1, 0, read);
        }
    }
    catch (Exception sa)
    {
        Console.WriteLine("Exception! " + sa.ToString());
    }
    Console.WriteLine("Unzipped.");
    // Save ms2 (the decompressed content) to a file.
    Stream fs;
    fs = File.Create("R00000000000.txt");
    ms2.Seek(0, SeekOrigin.Begin);
    ms2.WriteTo(fs);
    fs.Close();
    // using (StreamReader reader = new StreamReader(req.GetResponse().GetResponseStream(), resEncode))
    // {
    //     result = reader.ReadToEnd();
    // }
    result = System.Text.Encoding.GetEncoding("gb2312").GetString(ms2.ToArray());
    return result;
}
The code above is the method that sends a GET request and obtains the HTML, including the decompression step. I have not studied the decompression code closely; it feels too cumbersome and I will optimize it further when I have time. This same method is used for both the log list and the main body of each log. For the decompression, add the following references:
using System.IO.Compression;
using ICSharpCode.SharpZipLib.Zip;
using ICSharpCode.SharpZipLib.Checksums;
using ICSharpCode.SharpZipLib.GZip;
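As for the "pile of regular-expression matching", the real patterns depend on the exact HTML QQ space returns and are not reproduced here. The following is only a hedged sketch of how the string returned by SendGetRequest might be fed to Regex.Matches; the listUrl, the parameter names, the qqNumber variable and the pattern are all placeholders of my own, not the real values used against QQ space (it assumes using System.Text.RegularExpressions, System.Collections.Specialized and System.Text are in place).

// Hedged sketch only: listUrl, the parameter names and the regex pattern are placeholders,
// not the real values sent to QQ space.
NameValueCollection ps = new NameValueCollection();
ps.Add("uin", qqNumber);                          // qqNumber: the QQ number entered in the UI
ps.Add("pos", "0");                               // placeholder paging parameter

string html = SendGetRequest(listUrl, ps, Encoding.UTF8, Encoding.GetEncoding("gb2312"));

// Pull out whatever piece is wanted, e.g. log titles, with a regular expression.
foreach (Match m in Regex.Matches(html, "\"title\"\\s*:\\s*\"(.*?)\"", RegexOptions.Singleline))
{
    Console.WriteLine(m.Groups[1].Value);         // one matched log title
}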
One remaining problem with this log grabber is that if the other person's space has access restrictions, for example a password is required to visit it, there is no way to capture it: without permission the IDs on the list page cannot be obtained, and the returned HTML simply says there is no permission. Solving this will probably take some more time to analyse.
Finally, the source code is released below; download it if you are interested. If you find problems or have suggestions, please let me know, and I would welcome pointers from the experts. The application still has many shortcomings, but for lack of time I have only summarised the analysis of the main part. I sincerely hope for plenty of comments so I can learn some valuable experience from them. Many thanks.
Source code download: /files/think_fish/qqzoneapplicantionthread.rar
Fish Song's QQ space log capture source code SP1 (new: comment capture) download: /files/think_fish/qqzoneapplicantionsp1.rar