Node Crawler, Continued: A Hundred-Line Automatic AC Robot to Conquer HDOJ

Preface

No more talk; first, open the ranklist and take a look at my ranking.

That is the "record" of about half a day of automatically grinding through problems with Node. This article explains how to build an "automatic AC machine" with Node.

Process

First, a word about OJs (online judges). Computer science students should be familiar with ACM: ACM contests are competitions in algorithms and data structures, and an OJ is the "practice ground" for ACM. Well-known domestic OJs include POJ, ZOJ, and HDOJ; here I chose HDOJ (entirely because HDOJ is fast for me locally).

Submitting to an OJ is very easy. Take HDOJ as an example: first register an account (http://bestcoder.hdu.edu.cn/register.php), then open any problem (http://acm.hdu.edu.cn/showproblem.php?pid=1000), click the Submit button below it (http://acm.hdu.edu.cn/submit.php?pid=1000), choose the submission language (Language), paste in the answer, click the Submit button, and then check whether it got AC (Accepted) at http://acm.hdu.edu.cn/status.php.

Using Node to simulate this user flow is really just simulated login plus simulated submission. From experience, the submission POST will certainly require cookies. And where does the submitted code come from? Just crawl it from a search engine.

The whole idea is very clear:

1. Simulated login (POST)
2. Crawl code from a search engine (GET)
3. Simulated submission (POST)

First, look at the simulated login. From experience, this is probably a POST that sends the username and password to the server. Open Chrome, hit F12, capture the requests, and, if necessary, tick the Preserve log option.

The request header also carries cookies; after testing, the cookie whose key is PHPSESSID is the one the request must have. Where does this cookie come from? In fact, as soon as you open any address under the http://acm.hdu.edu.cn/ domain, the server "plants" this cookie in the browser. Normally, to log in you must first open the login page, so the cookie is already there when you open it, and the login request carries it. Once the request succeeds, the server establishes a session with the client: the server marks the cookie as one it knows, and every later request carrying that cookie passes. Once the user logs out, the session ends and the server removes the cookie from its known list; even if you submit with that same cookie again, the server says "I don't know you."

So simulated login splits into two steps: first request any address under the http://acm.hdu.edu.cn/ domain and save the PHPSESSID cookie from the response header (in key=value form), then POST the login request carrying that cookie.
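The cookie-saving step above is just string handling on the Set-Cookie header value; a minimal sketch (the header string below is a made-up example):

```javascript
// Keep only the "key=value" part of a Set-Cookie header value,
// dropping the path/domain attributes that follow the first ';'.
function extractSessionCookie(setCookieValue) {
  var pos = setCookieValue.indexOf(';');
  return pos === -1 ? setCookieValue : setCookieValue.substr(0, pos);
}

var header = 'PHPSESSID=abc123def456; path=/; domain=acm.hdu.edu.cn';
console.log(extractSessionCookie(header)); // "PHPSESSID=abc123def456"
```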

// Simulated login
function login() {
  superagent
    // GET any URL under the acm.hdu.edu.cn domain
    // to obtain the cookie whose key is PHPSESSID
    .get('http://acm.hdu.edu.cn/status.php')
    .end(function (err, sres) {
      // Extract the cookie
      var str = sres.header['set-cookie'][0];

      // Filter out the path
      var pos = str.indexOf(';');

      // Store the cookie in a global variable; both login and the code POST need it
      globalCookie = str.substr(0, pos);

      // Simulated login
      superagent
        // Login URL
        .post('http://acm.hdu.edu.cn/userloginex.php?action=login')
        // POST username & password
        .send({"username": "hanzichi"})
        .send({"userpass": "hanzichi"})
        // This request header is required
        .set("Content-Type", "application/x-www-form-urlencoded")
        // The request must carry the cookie
        .set("Cookie", globalCookie)
        .end(function (err, sres) {
          // Login done; start the program
          start();
        });
    });
}

When simulating an HTTP request, some request headers are required while others can be ignored. For example, the simulated-login POST must carry the Content-Type header; finding that out took me a long time. If your program just won't work, try setting every request header you captured, then removing them one by one.

Crawling Code from the Search Engine

This is the roughest part of my work, and it's why my crawler's AC accuracy rate is low.

I chose Baidu to crawl answers from. Take hdu1004 as an example: to search for AC code for this problem, we would normally type hdu1004 into the Baidu search box, and the result page's URL is https://www.baidu.com/s?ie=UTF-8&wd=hdu1004. This URL is very regular: https://www.baidu.com/s?ie=UTF-8&wd= plus the keyword.
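The URL pattern above can be sketched as a one-line helper (encodeURIComponent is a no-op here since the keyword is plain ASCII, but it keeps the helper safe for other keywords):

```javascript
// Build the Baidu search URL for an HDOJ problem id,
// following the "base + keyword" pattern described above.
function buildSearchUrl(problemId) {
  return 'https://www.baidu.com/s?ie=UTF-8&wd=' + encodeURIComponent('hdu' + problemId);
}

console.log(buildSearchUrl(1004)); // "https://www.baidu.com/s?ie=UTF-8&wd=hdu1004"
```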

A Baidu page shows 10 search results. I chose to take code from ACMers' posts on CSDN, because CSDN's code blocks are so easy to locate. Don't believe me? See for yourself.

CSDN puts the code entirely inside a DOM element with class cpp, which is simply too friendly. By contrast, places like Cnblogs would require extra string filtering, so for simplicity I went straight for CSDN's code.
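As a dependency-free illustration of how easy that markup makes extraction, here is a tiny regex standing in for cheerio's $('.cpp') lookup; the sample HTML string and the exact tag (a pre with class "cpp") are assumptions for this sketch, not CSDN's verified markup:

```javascript
// Pull the text of the first element with class "cpp" out of a page.
// A regex stands in for a real DOM parser to keep the sketch self-contained.
function extractCppBlock(html) {
  var m = html.match(/<pre class="cpp">([\s\S]*?)<\/pre>/);
  return m ? m[1].trim() : '';
}

var page = '<html><body><pre class="cpp">int main() { return 0; }</pre></body></html>';
console.log(extractCppBlock(page)); // "int main() { return 0; }"
```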

At first I thought: a search results page has 10 results, and each result clearly has a detail-page URL; check whether that URL contains csdn, and if it does, go to the detail page and grab the code. But Baidu actually encrypts this URL.

Then I noticed that each search result also comes with a small-print URL that isn't encrypted; see the figure below.

So I decided to parse that URL and, if it contains the text csdn, jump to the search result's detail page to crawl the code. In fact, a csdn URL doesn't guarantee code can be grabbed (CSDN has other second-level domains, such as the download channel http://download.csdn.net/), so in the getCode() function I wrote a try {} catch () {} to guard against errors.
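The filtering problem just described can be sketched as a small predicate; looksLikeCsdnBlog is a hypothetical helper of mine, not code from the project, and it only illustrates the "csdn is necessary but not sufficient" point:

```javascript
// "csdn" in the URL is not enough: CSDN has second-level domains such as
// the download channel, whose pages carry no solution code at all.
function looksLikeCsdnBlog(url) {
  var lower = url.toLowerCase();
  return lower.indexOf('csdn') !== -1 && lower.indexOf('download.csdn.net') === -1;
}

console.log(looksLikeCsdnBlog('http://blog.csdn.net/foo/article/1'));   // true
console.log(looksLikeCsdnBlog('http://download.csdn.net/detail/bar')); // false
```

Even with such a filter, the detail page can still lack a .cpp element, which is why the try/catch in getCode() remains necessary.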

// Simulated Baidu search
function bdSearch(problemId) {
  var searchUrl = 'https://www.baidu.com/s?ie=UTF-8&wd=hdu' + problemId;

  superagent
    .get(searchUrl)
    // Required request header
    .set("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36")
    .end(function (err, sres) {
      var $ = cheerio.load(sres.text);
      var lis = $('.t a');

      for (var i = 0; i < 10; i++) {
        var node = lis.eq(i);

        // Get that small-print URL
        var text = node.parent().next().next().children("a").text();

        // If the URL doesn't contain "csdn", skip it
        if (text.toLowerCase().indexOf("csdn") === -1)
          continue;

        // Solution detail page URL
        var solutionUrl = node.attr('href');
        getCode(solutionUrl, problemId);
      }
    });
}

The bdSearch() function takes the HDOJ problem number as its parameter, then crawls Baidu to get detail-page URLs. Testing showed that crawling Baidu requires the UA header. The rest is very simple, and the comments in the code make it clear.

// Get the code
function getCode(solutionUrl, problemId) {
  superagent.get(solutionUrl, function (err, sres) {
    // Guard: this solutionUrl may not be a solution detail page
    // and may have no DOM element with class "cpp"
    try {
      var $ = cheerio.load(sres.text);
      var code = $('.cpp').eq(0).text();

      if (!code)
        return;

      post(code, problemId);
    } catch (e) {

    }
  });
}

The getCode() function gets the code from a solution detail page. As I said, CSDN's code blocks are very straightforward: all inside a DOM element with class name cpp.

Simulating the Submission

The final step is the simulated submission. We can capture this POST and see what it looks like.

Obviously cookies are necessary, and we already got this cookie in the first step, the simulated login. Because this is a form submission, the Content-Type request header also needs to be carried. As for the request data, problemid is obviously the problem number, and usercode is obviously the code grabbed above.

// Simulated code submission
function post(code, problemId) {
  superagent
    .post('http://acm.hdu.edu.cn/submit.php?action=submit')
    .set('Content-Type', 'application/x-www-form-urlencoded')
    .set("Cookie", globalCookie)
    .send({"problemid": problemId})
    .send({"usercode": code})
    .end(function (err, sres) {
    });
}

Complete Code

For the complete code, see GitHub.

There, singleSubmit.js submits a single problem (the sample code submits hdu1004), while allSubmit.js submits them all. In allSubmit.js I set a 10s delay, i.e. one Baidu search every 10 seconds, because the crawler hits three sites (Baidu, CSDN, and HDOJ), and any one of them banning my IP would stop the whole grinding machine. The pressure is still considerable, but after setting the 10s delay there have been no problems.
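The pacing idea can be sketched like this; submitAllSlowly and its arguments are made-up names for illustration, not the actual code in allSubmit.js:

```javascript
// Process one problem id every `delayMs` milliseconds (10s in the real
// script), so no site sees a burst of requests from the same IP.
function submitAllSlowly(problemIds, submitFn, delayMs) {
  var i = 0;
  (function next() {
    if (i >= problemIds.length) return; // all done
    submitFn(problemIds[i++]);          // kick off one search/submit cycle
    setTimeout(next, delayMs);          // schedule the next one
  })();
}

// Usage (bdSearch as the per-problem entry point):
// submitAllSlowly([1000, 1001, 1002], bdSearch, 10 * 1000);
```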

I learned Node mainly out of interest in crawlers, and I have kept completing simple crawls; you can head over to the Node.js series on my blog. Before, I threw code all over GitHub; when people starred and forked it I was flattered, so I decided to build a dedicated directory for my crawler projects to record my Node learning. The project address is https://github.com/1335661317/funny-node/tree/master/auto-ac-machine. I will sync my Node crawler code there and record each crawler's implementation in a README.md inside each small directory.

Subsequent Optimizations

Looking closely, my crawler is actually very "weak": the accuracy rate is low, and it couldn't even AC hdu1001. I think several things could be done to improve it:

Filter by title when crawling CSDN solution detail pages. For example, when crawling hdu5300 via https://www.baidu.com/s?ie=UTF-8&wd=hdu5300, a search result may actually be for hdu4389; the program obviously did not anticipate this and submits the code anyway, which is obviously WA'd. Filtering on the title in the detail page would effectively avoid this, because when ACMers write up a solution, the title generally carries the words hdu5300 or hdoj5300.
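That title filter could look roughly like this; titleMatchesProblem is a hypothetical helper sketching the idea, and the "hdu"/"hdoj" prefixes are the ones the paragraph above mentions:

```javascript
// Accept a solution page only if its title mentions the problem id,
// with either the "hdu" or the "hdoj" prefix.
function titleMatchesProblem(title, problemId) {
  var t = title.toLowerCase();
  return t.indexOf('hdu' + problemId) !== -1 ||
         t.indexOf('hdoj' + problemId) !== -1;
}

console.log(titleMatchesProblem('HDU5300 Hiking solution', 5300)); // true
console.log(titleMatchesProblem('hdu4389 solution', 5300));        // false
```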

Crawl specific sites. Crawling Baidu is not a wise move; my actual AC accuracy is around 50%, good grief, meaning half of the submitted code is wrong. Some submissions may have chosen the wrong language (the POST carries a language parameter, 0 for G++ by default, and the program always submits as G++), and in any case we cannot tell whether the code Baidu finds by keyword is really correct. How to improve the accuracy? We could crawl specific sites, such as http://accepted.com.cn/ or http://www.acmerblog.com/, or even crawl the AC code from http://acm.hust.edu.cn/vjudge/problem/status.action.

Get submission results in real time. My code is rather crude: it crawls the CSDN code from the first page of Baidu results and submits all of it, 10 pieces if there are 10, none if there are none. A better strategy is to get the result of each submission in real time: submit the first one, fetch the verdict, keep submitting if it is WA, and break once one gets AC. As for getting the submission result, I haven't found the query interface for the time being.
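The submit-until-AC strategy above can be sketched as follows; both submit and fetchVerdict are hypothetical stand-ins, since HDOJ's real status interface hasn't been found yet:

```javascript
// Try candidate codes one by one, stopping at the first Accepted verdict.
// `submit` sends one piece of code; `fetchVerdict` returns its judge result.
function submitUntilAccepted(codes, submit, fetchVerdict) {
  for (var i = 0; i < codes.length; i++) {
    submit(codes[i]);
    if (fetchVerdict() === 'Accepted') return true; // break on AC
  }
  return false; // every candidate was rejected
}
```

In the real crawler fetchVerdict would have to poll the status page asynchronously, but the control flow would follow this shape.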
