Introducing my open source project for web crawling and page parsing

Last Update:2015-09-01 Source: Internet

Author: User

The Autogrammerspider project was tested successfully against www.taobao.com today. It can greatly reduce the pain of analyzing web pages when you crawl them.

There is still a lot of room for improvement in features and efficiency, but basic operation works without major problems. I am formally introducing the project today. If you are interested, I hope you will try it and give me your feedback. If you really need a particular feature, tell me and I will add it as soon as I can.

The project is used as follows.

First, write a feature file and put it under resource, in the Autospider directory.

The contents of the feature file are as follows:

Three things in this feature file need attention.

The first is the namespace at the top. This is the name you use to refer to the feature; here it is tb-category.

The second is the handler. This is a handler function that you define yourself, and it is called when the feature has been matched completely.

The third is the <_list> tag. Suppose you want to match <div someattr> <a someattr> <a someattr> ... </div>, but you neither know nor care how many <a> tags are inside. Add a <_list> after the <a> element to indicate that the sibling element just before it, at the same level, may repeat any number of times.
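The repetition rule for <_list> can be illustrated with a small sketch. This is not Autogrammerspider's code: the class name, the method name, and the use of plain tag names as strings are all assumptions made for illustration, and the sketch assumes the repeated element must appear at least once.

```java
import java.util.Arrays;
import java.util.List;

class ListMarkerSketch {
    // Hypothetical marker name, mirroring the <_list> tag from the feature file.
    static final String LIST_MARKER = "_list";

    // Returns true when the children fit the pattern. A pattern entry followed
    // by the marker may occur one or more times in a row (greedy matching).
    static boolean childrenMatch(List<String> pattern, List<String> children) {
        int c = 0; // index of the next unread child
        for (int p = 0; p < pattern.size(); p++) {
            String expected = pattern.get(p);
            boolean repeats = p + 1 < pattern.size()
                    && LIST_MARKER.equals(pattern.get(p + 1));
            // The element itself must be present at least once.
            if (c >= children.size() || !children.get(c).equals(expected)) {
                return false;
            }
            c++;
            if (repeats) {
                // Consume every further sibling with the same tag.
                while (c < children.size() && children.get(c).equals(expected)) {
                    c++;
                }
                p++; // skip the marker entry itself
            }
        }
        // Every child must have been consumed by the pattern.
        return c == children.size();
    }

    public static void main(String[] args) {
        List<String> pattern = Arrays.asList("a", LIST_MARKER);
        System.out.println(childrenMatch(pattern, Arrays.asList("a", "a", "a")));
        System.out.println(childrenMatch(pattern, Arrays.asList("a", "span")));
    }
}
```

Because the sketch consumes repeats greedily, a pattern that lists the same tag again right after the marker would never match; a real implementation would need backtracking or a state machine for that case.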

So what does a handler look like?

The first parameter persists across an entire matching run. No matter how many handlers you define or how many times they are called, the Executeparam is the same object for as long as you are matching a single document.

The second parameter is the element that carries the handler. The third parameter is the element that matches the whole feature. Here they are, respectively, the <a> tag and the <dd> element that is its parent's parent.
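To make the three parameters concrete, here is a minimal sketch of a handler with that shape. Everything in it is hypothetical: the ExecuteParam, Element, and Handler types below are stand-ins written for this illustration and are not the project's real classes or signatures.

```java
import java.util.HashMap;
import java.util.Map;

class HandlerSketch {
    // Hypothetical stand-in for the shared object the article calls Executeparam.
    static class ExecuteParam {
        final Map<String, Integer> counters = new HashMap<>();
    }

    // Hypothetical minimal element: a tag name plus a link to its parent.
    static class Element {
        final String tag;
        final Element parent;

        Element(String tag, Element parent) {
            this.tag = tag;
            this.parent = parent;
        }
    }

    // Hypothetical handler shape with the three parameters described above.
    interface Handler {
        void handle(ExecuteParam param, Element handlerElement, Element featureRoot);
    }

    // Calls one handler twice with the same ExecuteParam and returns the count it kept.
    static int runTwice() {
        Element dd = new Element("dd", null);    // element matching the whole feature
        Element middle = new Element("div", dd); // intermediate parent (tag is arbitrary)
        Element a = new Element("a", middle);    // element that carries the handler

        Handler countLinks = (param, handlerElement, featureRoot) -> {
            // State stored in param is still there on the next call.
            if (handlerElement.parent.parent == featureRoot) {
                param.counters.merge(featureRoot.tag, 1, Integer::sum);
            }
        };

        ExecuteParam shared = new ExecuteParam();
        countLinks.handle(shared, a, dd);
        countLinks.handle(shared, a, dd);
        return shared.counters.get("dd");
    }

    public static void main(String[] args) {
        System.out.println(runTwice()); // prints 2
    }
}
```

The point of the shared first parameter is that a handler can accumulate results, such as a list of extracted links, across every match in one document.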

Here is the entire test code:

You need to configure a bean and then call Createexecutor("namespace inside the config file") to get an execution context for parsing text. The bean configuration follows. Alternatively, you can create the object yourself with new and then call the Init method:

Output:

This example matches only one set of features. The program supports matching multiple sets, and if several HTML elements satisfy a feature, it matches each of them.

Finally, the open source code, including the Git repository for the test project, is here:

http://git.oschina.net/notebook

Since I have not yet uploaded it to the Maven repository, I have included the 2 jar packages in the published project.

A brief explanation of how it works:

The functionality described in this article is provided directly by Autogrammerspider. Underneath it sits Autogrammer, a parser with customizable syntax, customizable processing flow, customizable error handling, and so on. (In fact, because I wrote it entirely myself, it may not follow any formal grammar theory, so "state machine generator" might be a more accurate name.) It can do much more than what this article covers, and I plan to build other projects on it in the future.

Autogrammerspider is itself a compiler. It has a built-in rule grammar, and it uses that grammar to compile the configuration file written by the user. The result of this compilation is another compiler.

The user then uses this second compiler to process the page text. Its logic is as follows. It recognizes only the tags and attributes defined in the user's configuration file. It tries to read the input according to that configuration; as soon as a read fails, it discards the text that failed and starts matching again from the beginning of the feature. When the whole feature matches, it calls the handler, and it repeats this until the end of the text.
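The scan-and-restart behaviour described above can be sketched over a flat list of tokens. This is only an illustration under stated assumptions, not the generated compiler: tokens are plain strings, the feature is a flat sequence with no nesting or <_list>, and a token that breaks a partial match is retried as a possible new start.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.IntConsumer;

class ScanSketch {
    // Scans tokens for the pattern and calls the handler with the index of the
    // token that completed each match. Returns the number of matches found.
    static int scan(List<String> tokens, List<String> pattern, IntConsumer handler) {
        Set<String> known = new HashSet<>(pattern); // only configured tokens are recognized
        int matched = 0; // how many pattern entries the current attempt has matched
        int hits = 0;
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            if (!known.contains(token)) {
                continue; // not in the configuration: ignored entirely
            }
            if (!token.equals(pattern.get(matched))) {
                matched = 0; // read failure: discard the partial match
            }
            if (token.equals(pattern.get(matched))) {
                matched++; // token fits the current position (or restarts the match)
            }
            if (matched == pattern.size()) {
                handler.accept(i); // whole feature matched: call the handler
                hits++;
                matched = 0; // continue scanning for further matches
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> page = Arrays.asList(
                "<div>", "<span>", "<a>", "<div>", "<a>", "</div>",
                "<p>", "<div>", "<a>", "</div>");
        List<String> feature = Arrays.asList("<div>", "<a>", "</div>");
        List<Integer> ends = new ArrayList<>();
        int hits = scan(page, feature, i -> ends.add(i));
        System.out.println(hits + " matches ending at " + ends);
    }
}
```

In the sample input, the first attempt is abandoned when a second <div> arrives before </div>, and matching restarts from that <div>; the <span> and <p> tokens are skipped because the feature never mentions them.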

Finally, here are a few areas that still need work.

Matching on the text inside an element is not supported yet. For <a>ww</a>, only the tag and its attributes can be matched, not the text ww.

Also, with the matching rule above, if the input is

<div class="J_cathook cathook" class="c-2"> <a class="c-3">Something</a></div>

the match succeeds, because class="c-2" does not appear in the feature and is therefore ignored. But

<div class="J_cathook cathook" class="c-3"> <a class="c-3">Something</a></div>

fails to match, because class="c-3" does appear in the feature and cannot be ignored.

There are certainly other flaws, but for a first version I do not think it is necessary to cover every case. If you need something while using it, tell me and I will update the project.
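The attribute rule in this example can be sketched as a filter followed by a comparison. The model below is an assumption for illustration: attributes are compared as whole strings, the vocabulary is every attribute mentioned anywhere in the feature, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class AttributeRuleSketch {
    // Keeps only the attributes that the feature file mentions somewhere;
    // anything else is invisible to the matcher.
    static List<String> recognized(List<String> attrs, Set<String> vocabulary) {
        List<String> kept = new ArrayList<>();
        for (String attr : attrs) {
            if (vocabulary.contains(attr)) {
                kept.add(attr);
            }
        }
        return kept;
    }

    // An element matches when its recognized attributes are exactly the expected ones.
    static boolean matches(List<String> expected, List<String> actual, Set<String> vocabulary) {
        return recognized(actual, vocabulary).equals(expected);
    }

    public static void main(String[] args) {
        // Attributes assumed to appear in the feature: one on the <div>, one on the <a>.
        Set<String> vocabulary = new HashSet<>(Arrays.asList(
                "class=\"J_cathook cathook\"", "class=\"c-3\""));
        List<String> expectedDiv = Arrays.asList("class=\"J_cathook cathook\"");

        // class="c-2" is unknown to the feature, so it is ignored.
        System.out.println(matches(expectedDiv,
                Arrays.asList("class=\"J_cathook cathook\"", "class=\"c-2\""), vocabulary)); // true

        // class="c-3" is known to the feature, so it counts against the <div>.
        System.out.println(matches(expectedDiv,
                Arrays.asList("class=\"J_cathook cathook\"", "class=\"c-3\""), vocabulary)); // false
    }
}
```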

This article is an English version of an article originally written in Chinese on aliyun.com.
