Formally introducing my open source project for web crawling and page parsing

Source: Internet
Author: User

The project is Autogrammerspider. I tested it successfully against [www.taobao.com] today. It can greatly ease the pain of analyzing web pages when you crawl them.

There is still plenty of room to improve its features and efficiency, but basic operation works with few problems. I am formally introducing the project today in the hope that anyone interested will try it and offer advice. If you really need a particular feature, tell me and I will add it as soon as I can.

It is used as follows.

First, write a feature file and place it under resource, in the Autospider directory.

The contents of the feature file are as follows:

Three places in this feature file need attention.

The first is the namespace at the top, a name of your own choosing; here it is tb-category.

The second is handler, which names a handler function that you define yourself. It is called when the feature has been matched in full.

The third is the <_list> tag. Suppose you want to match <div someattr> <a someattr> <a someattr> ... </div>, and you neither know nor care how many <a> tags are inside. Adding a <_list> indicates that the sibling element in front of it, at the same level, may repeat without limit.
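The repetition rule can be illustrated with a minimal Java sketch. It is not Autogrammerspider's code: the class ListTagSketch, the Slot type and the childrenMatch method are hypothetical names, and the sketch assumes that a <_list> means "one or more" of the preceding sibling. It works on tag names only and ignores attributes.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the <_list> rule; not the project's real classes.
class ListTagSketch {

    // One child position in a feature.
    static final class Slot {
        final String tag;        // tag name the child must have
        final boolean repeated;  // true when a <_list> follows this child in the feature

        Slot(String tag, boolean repeated) {
            this.tag = tag;
            this.repeated = repeated;
        }
    }

    // Returns true when the child tag names satisfy the slots in order.
    // Assumption: a repeated slot consumes one or more consecutive children
    // with the same tag. The greedy loop is only correct when two adjacent
    // slots do not share a tag name.
    static boolean childrenMatch(List<Slot> slots, List<String> childTags) {
        int i = 0;
        for (Slot slot : slots) {
            if (i >= childTags.size() || !childTags.get(i).equals(slot.tag)) {
                return false;
            }
            i++;
            if (slot.repeated) {
                while (i < childTags.size() && childTags.get(i).equals(slot.tag)) {
                    i++;
                }
            }
        }
        return i == childTags.size();
    }

    public static void main(String[] args) {
        // Feature: <div> <a> <_list> </div>
        List<Slot> feature = Arrays.asList(new Slot("a", true));
        System.out.println(childrenMatch(feature, Arrays.asList("a")));
        System.out.println(childrenMatch(feature, Arrays.asList("a", "a", "a")));
        System.out.println(childrenMatch(feature, Arrays.asList("a", "span")));
    }
}
```

With this feature, a <div> holding one <a> or three <a> tags satisfies the rule, while a <div> holding an <a> followed by a <span> does not.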

So what does a handler look like?

The first parameter runs through the entire matching process. No matter how many handlers you register or how many times they are called, the Executeparam is the same object for as long as a single document is being matched.

The second parameter is the element that carries the handler, and the third is the element that matches the entire feature. In this example they are the <a> tag and the <dd> element that is its parent's parent.
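To make the three parameters concrete, here is a rough Java sketch of what such a callback could look like. Every name in it (HandlerSketch, Element, ExecuteParam, Handler, simulateTwoMatches) is a hypothetical stand-in rather than the project's real API. The only behavior it illustrates is the one described above: a single shared parameter object is passed to every handler call of one parse.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; Autogrammerspider's real types may differ.
class HandlerSketch {

    // Minimal stand-in for a parsed HTML element.
    static final class Element {
        final String tag;
        final String href; // null when the element has no href

        Element(String tag, String href) {
            this.tag = tag;
            this.href = href;
        }
    }

    // Stand-in for Executeparam: state shared by every handler call of one parse.
    static final class ExecuteParam {
        final List<String> collected = new ArrayList<>();
    }

    // Assumed callback shape: shared state, element carrying the handler, feature root.
    interface Handler {
        void handle(ExecuteParam param, Element handlerElement, Element featureRoot);
    }

    // Simulates one parse in which the feature matches twice under the same <dd>.
    static ExecuteParam simulateTwoMatches(Handler handler) {
        ExecuteParam param = new ExecuteParam(); // created once per parse
        Element dd = new Element("dd", null);
        handler.handle(param, new Element("a", "/cat/1"), dd);
        handler.handle(param, new Element("a", "/cat/2"), dd);
        return param;
    }

    public static void main(String[] args) {
        Handler collectLinks = (param, a, root) -> param.collected.add(root.tag + " -> " + a.href);
        System.out.println(simulateTwoMatches(collectLinks).collected);
    }
}
```

Because the same object reaches both calls, the handler can accumulate results across matches without any global state.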

Here is the entire test code:

You need to configure a bean and then call Createexecutor ("namespace inside the config file") to obtain an execution context for parsing text. The bean configuration follows; alternatively, you can create the object yourself with new and call its Init method:

Result:

This example matches only one feature. The program supports multiple features, and if several HTML elements satisfy a feature, it matches once for each of them.


Finally, the open source software, including the Git code for the test project, is available here:

Http://git.oschina.net/notebook

Since I have not yet uploaded it to the Maven repository, I have put the 2 jar packages in the published project.

A brief word on how it works:

The functionality described in this article is provided directly by Autogrammerspider, with Autogrammer as the layer underneath. Autogrammer is a parser with customizable syntax, customizable processing flow, custom error handling, and so on. (In fact, because I wrote it entirely myself, it may not conform to any formal grammar rules, so it might be more accurate to call it a state machine generator.) It can do much more than what this article covers, and I will build other projects on it in the future.

Autogrammerspider is itself a compiler. It has a built-in rule grammar, which it uses to compile the user's configuration file. The result of that compilation is another compiler.

The user then uses this second compiler to compile the text. Its logic is as follows. It recognizes only the attributes and tags defined in the user's configuration file. It tries to read the text according to the configuration; when a read fails, it discards the text that failed to match and starts matching again from the beginning of the feature. When the entire feature matches, it calls the handler, and it repeats this until the end of the text.
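The read, discard, restart loop described above can be sketched in Java as follows. This is a simplified illustration, not the generated compiler: ScanLoopSketch and scan are hypothetical names, the input is assumed to be already reduced to a flat list of recognized tokens, and on failure the sketch assumes that exactly one token is discarded before matching restarts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of the matching loop; not Autogrammerspider's implementation.
class ScanLoopSketch {

    // Scans tokens left to right, calls the handler with the start index of
    // every full match of the feature, and returns the number of matches.
    static int scan(List<String> tokens, List<String> feature, Consumer<Integer> handler) {
        if (feature.isEmpty()) {
            throw new IllegalArgumentException("feature must not be empty");
        }
        int matches = 0;
        int pos = 0;
        while (pos + feature.size() <= tokens.size()) {
            if (tokens.subList(pos, pos + feature.size()).equals(feature)) {
                handler.accept(pos);   // whole feature matched: call the handler
                matches++;
                pos += feature.size(); // continue after the matched text
            } else {
                pos++;                 // assumed: drop one token, restart from scratch
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("html", "dd", "div", "a", "p", "dd", "div", "a");
        List<String> feature = Arrays.asList("dd", "div", "a");
        List<Integer> hits = new ArrayList<>();
        int count = scan(tokens, feature, hits::add);
        System.out.println(count + " matches at " + hits);
    }
}
```

A real state machine would not re-compare whole sublists at every position, but the control flow is the same: fail, discard, restart, and call the handler on each complete match.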


Finally, a few areas that still need work.

Matching an element's inner text is not yet supported. For <a>ww</a>, only the tag and its attributes can be matched.

There is also the following case. With the matching rule above, if the input is

<div class="J_cathook cathook" class="c-2"> <a class="c-3"> Something </a></div>

the match succeeds, because class="c-2" does not appear in the feature and is ignored. But

<div class="J_cathook cathook" class="c-3"> <a class="c-3"> Something </a></div>

fails to match, because class="c-3" appears in the feature and cannot be ignored.
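The two cases above can be read as a filter-then-compare step, sketched below in Java. The sketch is an interpretation, not the project's code: AttributeFilterSketch and elementMatches are hypothetical names, and it assumes the feature expects class="J_cathook cathook" on the <div> and class="c-3" on the <a>, so both attributes are known to the parser.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of why a known attribute in the wrong place breaks a match.
class AttributeFilterSketch {

    // Keeps only the attributes that appear somewhere in the feature, then
    // requires what is left to equal the attributes expected on this element.
    static boolean elementMatches(List<String> expected, List<String> actual, Set<String> known) {
        List<String> recognized = new ArrayList<>();
        for (String attr : actual) {
            if (known.contains(attr)) {
                recognized.add(attr); // unknown attributes are silently skipped
            }
        }
        return recognized.equals(expected);
    }

    public static void main(String[] args) {
        // Assumed feature vocabulary: one attribute from the <div>, one from the <a>.
        Set<String> known = new HashSet<>(Arrays.asList("class=J_cathook cathook", "class=c-3"));
        List<String> divFeature = Arrays.asList("class=J_cathook cathook");

        // Extra class="c-2" is unknown, so it is skipped and the <div> matches.
        System.out.println(elementMatches(divFeature,
                Arrays.asList("class=J_cathook cathook", "class=c-2"), known));
        // Extra class="c-3" is known, so it is kept and the <div> no longer matches.
        System.out.println(elementMatches(divFeature,
                Arrays.asList("class=J_cathook cathook", "class=c-3"), known));
    }
}
```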

There are certainly other flaws, but for a first version I see no need to cover every case. If you need something while using it, tell me and I will update it.



