C# reads RSS feeds and leverages the Solr index

Last Update: 2014-11-27 Source: Internet

Author: User

Tags solr solr query

A problem that had been bothering me for several days was finally solved today, so I am sharing some of my recent experience with Solr.

The project originally crawled pages with Nutch, but the customer also needed RSS feeds crawled, and needed to be able to tell which pages had been crawled through RSS feeds. Nutch does have a plug-in for parsing RSS, but it cannot parse some feeds, it is hard to control, and, more importantly, the crawled results look much the same as ordinary pages, so there is no way to tell which RSS source a page came from. For these reasons I wrote a C# project that crawls RSS feeds and works together with Solr.

Everything worked well, the customer was very satisfied, and I was pleased too. After a while, however, the solrdedup step in Nutch started failing, which made Nutch unusable. Below I describe how the RSS crawling is implemented, the problems that came up, and how they were solved. These Solr and Nutch problems are not hard to understand; the main difficulty is that answers to them are hard to find on the Internet. Because my company allows neither Internet access nor copying, everything here is written from memory.

The RSS + Solr implementation uses WebRequest to read the XML content of an RSS feed and then indexes it into Solr directly with a POST request. To meet the customer's requirements, I added two fields, rss and isrss, to Solr's schema: rss holds the URL of the RSS feed, and isrss is fixed to true. Since pages crawled by Nutch do not have these two fields, querying with isrss:"true" filters out every page that did not come from RSS.
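The tagging and filtering idea can be sketched as follows. This is a minimal sketch in Python rather than the C# used in the project, and the helper names `tag_as_rss` and `rss_query_url`, the `/select` path, and the base URL are assumptions made for illustration; only the rss and isrss fields come from the description above.

```python
from urllib.parse import urlencode


def tag_as_rss(doc: dict, feed_url: str) -> dict:
    """Return a copy of a document with the two extra RSS fields set."""
    tagged = dict(doc)
    tagged["rss"] = feed_url   # which feed the page came from
    tagged["isrss"] = "true"   # fixed marker for RSS-sourced pages
    return tagged


def rss_query_url(solr_base: str) -> str:
    """Build a query URL that returns only RSS-sourced documents.

    solr_base is a hypothetical address such as http://localhost:8983/solr.
    """
    return solr_base.rstrip("/") + "/select?" + urlencode({"q": "isrss:true"})
```

Documents indexed by Nutch never receive the isrss field, so they simply do not match the query.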

The main points to note during implementation are the following:

1. An RSS source is not necessarily a file with an .xml suffix. Some feeds are returned as ordinary page responses, and some require a login.

2. RSS currently comes in two formats. The common XML structure is rss/channel/item; the other is the feed/entry structure (the Atom format) used by the Blog Park RSS.

The element names inside each format are fixed, so the value each Solr field needs can be read from the corresponding element, which differs between the two formats.
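Reading both structures into one common shape might look like the sketch below. It is written in Python for brevity, not the project's C#; the output keys (title, url, date, content) and the choice of source elements are assumptions for illustration, since the article does not list the exact mapping.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # namespace used by feed/entry documents


def parse_feed(xml_text: str) -> list:
    """Parse rss/channel/item or feed/entry XML into a list of dicts."""
    root = ET.fromstring(xml_text)
    entries = []
    if root.tag == "rss":
        for item in root.findall("channel/item"):
            entries.append({
                "title": item.findtext("title", default=""),
                "url": item.findtext("link", default=""),
                "date": item.findtext("pubDate", default=""),
                "content": item.findtext("description", default=""),
            })
    elif root.tag == ATOM + "feed":
        for entry in root.findall(ATOM + "entry"):
            link = entry.find(ATOM + "link")
            entries.append({
                "title": entry.findtext(ATOM + "title", default=""),
                "url": link.get("href", "") if link is not None else "",
                "date": entry.findtext(ATOM + "updated", default=""),
                # Fall back to <summary> when <content> is absent or empty.
                "content": entry.findtext(ATOM + "content", default="")
                or entry.findtext(ATOM + "summary", default=""),
            })
    else:
        raise ValueError("unrecognized feed root element: %s" % root.tag)
    return entries
```

Detecting the format from the root element rather than the URL also covers point 1, since the URL suffix says nothing reliable about the content.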

3. The three fields id, digest, and tstamp are required.

This is the cause of the error mentioned above, which occurred when Nutch reached the solrdedup step. At first I thought it was caused by the two new fields in Solr, so I tried adding the two fields on the Nutch side as well, using Nutch's static field plug-in and extra field plug-in, but that did not help. Eventually I noticed that Nutch ran normally again once the data in Solr was cleared. I also searched Google a lot, but found basically nothing useful, and I assumed it was a Solr caching problem. Today I had some free time, so I located the Nutch source file that raised the error, SolrDeleteDuplicates.java. Reading it, I found that Nutch's duplicate removal takes the id, digest, and tstamp fields directly from Solr and uses them without checking whether they are empty.

The program I wrote did not add digest and tstamp. Sure enough, once I added those two fields and filled them in for all the documents that lacked them, Nutch ran normally again.

4. digest is a 32-character hash of the web page, which Nutch compares when removing duplicates.
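Points 3 and 4 can be combined into a small guard that fills in the missing fields before a document is sent to Solr. This is a sketch in Python, not the author's C# code. It assumes that an MD5 hex digest (32 characters) is an acceptable digest and that tstamp is a UTC timestamp; the function name `add_dedup_fields` is hypothetical.

```python
import hashlib
from datetime import datetime, timezone

REQUIRED = ("id", "digest", "tstamp")


def add_dedup_fields(doc: dict, page_bytes: bytes, now: datetime = None) -> dict:
    """Fill digest and tstamp if absent, then verify all required fields."""
    now = now or datetime.now(timezone.utc)
    filled = dict(doc)
    # Assumption: an MD5 hex digest, which is 32 characters long.
    filled.setdefault("digest", hashlib.md5(page_bytes).hexdigest())
    # Assumption: tstamp is stored as a UTC date.
    filled.setdefault("tstamp", now.strftime("%Y-%m-%dT%H:%M:%SZ"))
    missing = [name for name in REQUIRED if not filled.get(name)]
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))
    return filled
```

Rejecting a document with an empty id, digest, or tstamp at indexing time avoids the situation described above, where the failure only shows up later in the deduplication step.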

5. Solr content can be added, modified, and deleted directly through a request URL. A modification uses the same format as an add.

The core of it is: xx.xx.xx.xx/solr/update?stream.body=<add><doc><field name="XX">xx</field></doc></add>&stream.contentType=text/xml;charset=utf-8&commit=true

If the content contains URL or HTML markup, it must be encoded first.
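The encoding step matters because the XML travels inside a URL. The sketch below, in Python rather than the project's C#, escapes field values for XML and then URL-encodes the whole body; the function names and the base URL are assumptions for illustration, while the stream.body, stream.contentType, and commit parameters come from the URL shown above.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import escape, quoteattr


def build_add_xml(fields: dict) -> str:
    """Build an <add><doc> body, escaping &, < and > in every value."""
    parts = [
        "<field name=%s>%s</field>" % (quoteattr(name), escape(value))
        for name, value in fields.items()
    ]
    return "<add><doc>" + "".join(parts) + "</doc></add>"


def build_update_url(solr_base: str, fields: dict) -> str:
    """Build the update URL; urlencode percent-encodes the XML body."""
    params = {
        "stream.body": build_add_xml(fields),
        "stream.contentType": "text/xml;charset=utf-8",
        "commit": "true",
    }
    return solr_base.rstrip("/") + "/update?" + urlencode(params)
```

Because an add with an existing id replaces the stored document, the same function serves for modifications.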

6. Solr times are in GMT, so be careful to convert them correctly. Some RSS feeds also carry malformed GMT dates; I ran into many of them, and a wrong day of the week, for example, causes Solr indexing to fail.
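One way to make the date handling more tolerant is to drop the weekday before parsing and always emit UTC. The following is a sketch in Python, not the project's C# code; the function name `to_solr_date` is hypothetical, and treating a date without a time zone as GMT is an assumption.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def to_solr_date(rss_date: str) -> str:
    """Convert an RSS pubDate string to Solr's UTC form YYYY-MM-DDThh:mm:ssZ."""
    # Discard a leading "Thu," style weekday so a wrong weekday cannot matter.
    _, _, rest = rss_date.rpartition(",")
    parsed = parsedate_to_datetime(rest.strip())
    if parsed.tzinfo is None:
        # Assumption: a date without zone information is already GMT.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```

For example, `to_solr_date("Mon, 27 Nov 2014 18:00:00 +0800")` yields `2014-11-27T10:00:00Z` even though 27 November 2014 was a Thursday.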

The indexed information is now fairly complete and searching is convenient, but many problems remain. The main ones are that pages generated by JavaScript cannot be read, and that the information on a page needs filtering. Extracting the main content of an article while filtering out navigation and similar elements is still fairly hard. I found that other people filter based on statistical patterns, for example that navigation bars consist of many separated items, or the spacing of whitespace in the content. Every site has a different HTML layout and the tags are hard to unify. I do not know how Baidu and Google do it, or whether they have actually done it at all.

Nutch is fairly powerful, but I always find it hard to maintain and modify; the last time I compiled the source code it took a long time. Solr queries are quite efficient. I may write a .NET version of the crawler and search, but that is only a plan for now, because too many problems are involved ...

This article is an English version of an article originally published in Chinese on aliyun.com.