The problem that had afflicted me for a few days was finally solved today, so I'm sharing some of my recent experience with Solr.
We were crawling pages with Nutch, but the customer needed to crawl RSS feeds and to be able to identify which pages had been crawled through RSS. Although Nutch has a plugin for parsing RSS, some feeds cannot be parsed by it, it is hard to control, and, more importantly, a page crawled this way ends up no different from a normal page, so you cannot tell which RSS source it came from. For these reasons, I wrote a C# project that crawls RSS and indexes into Solr.
Everything worked well and the customer was satisfied; I felt good too, but after a while I found that Nutch failed at the solrdedup step, making Nutch unusable. Below I describe the principle of the RSS implementation and the solution to this problem. These Solr and Nutch problems are not hard to understand; the issue is that answers for them are hard to find on the Internet. Because the company has no external network access and no copy permission, I have to write all of this from memory.
The RSS + Solr implementation uses WebRequest to read the XML content of an RSS feed and posts it directly to Solr's update handler for indexing. To meet the customer's requirements, I added rss and isrss fields to Solr's schema: rss holds the URL address of the RSS feed, and isrss is fixed to true. Since documents crawled by Nutch do not have these two fields, querying isrss:"true" filters out all pages that did not come from RSS.
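A minimal sketch of that flow, assuming a local Solr core; the id and title mapping and the method names are my own illustration here, and only the rss and isrss fields come from the schema change described above:

```csharp
using System;
using System.IO;
using System.Net;
using System.Security;
using System.Text;

class RssToSolr
{
    // Hypothetical Solr core URL; substitute your own server address.
    const string SolrUpdateUrl = "http://localhost:8983/solr/update";

    // Fetch the raw XML of an RSS feed with WebRequest. Some feeds are served as
    // normal page responses, and some require login (see point 1 below).
    static string FetchFeed(string feedUrl)
    {
        var request = WebRequest.Create(feedUrl);
        // request.Credentials = new NetworkCredential("user", "pass"); // if login is required
        using (var response = request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
            return reader.ReadToEnd();
    }

    // POST one document to Solr's update handler, tagged with the rss / isrss fields.
    static void PostToSolr(string id, string title, string feedUrl)
    {
        string doc = "<add><doc>"
            + "<field name=\"id\">" + SecurityElement.Escape(id) + "</field>"
            + "<field name=\"title\">" + SecurityElement.Escape(title) + "</field>"
            + "<field name=\"rss\">" + SecurityElement.Escape(feedUrl) + "</field>"
            + "<field name=\"isrss\">true</field>"
            + "</doc></add>";

        var request = (HttpWebRequest)WebRequest.Create(SolrUpdateUrl + "?commit=true");
        request.Method = "POST";
        request.ContentType = "text/xml; charset=utf-8";
        byte[] body = Encoding.UTF8.GetBytes(doc);
        using (var stream = request.GetRequestStream())
            stream.Write(body, 0, body.Length);
        request.GetResponse().Close(); // Solr answers with a small XML status document
    }
}
```

With that in place, a Solr query on isrss:"true" returns only the documents produced by this crawler.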
The main points to note during the implementation are the following:
1. An RSS source is not necessarily a file with an .xml suffix; some are served as normal page responses, and some require login permissions.
2. RSS currently comes in two formats: the common XML structure is rss/channel/item, and the other is the feed/entry structure used by the cnblogs ("Blog Park") RSS.
The tags inside each format are fixed, so you just read a different tag per format to get the value each Solr field needs; a parsing sketch follows.
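Here is how the two layouts can be told apart; the XPath expressions and console output are illustrative, and the real project maps these values onto the Solr fields before posting:

```csharp
using System;
using System.Xml;

// Sketch: detect which of the two feed layouts we have and read the per-item values.
class FeedParser
{
    static void Parse(string feedXml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(feedXml);

        // Classic RSS: rss/channel/item, with fixed child tags like title and link.
        foreach (XmlNode item in doc.SelectNodes("/rss/channel/item"))
        {
            XmlNode title = item.SelectSingleNode("title");
            XmlNode link = item.SelectSingleNode("link");
            Console.WriteLine((title != null ? title.InnerText : "") + " | "
                            + (link != null ? link.InnerText : ""));
        }

        // Atom-style feed/entry (e.g., the cnblogs feed). Entries sit in a namespace,
        // so match on local names rather than registering a namespace manager.
        foreach (XmlNode entry in doc.SelectNodes("/*[local-name()='feed']/*[local-name()='entry']"))
        {
            XmlNode title = entry.SelectSingleNode("*[local-name()='title']");
            XmlNode href = entry.SelectSingleNode("*[local-name()='link']/@href");
            Console.WriteLine((title != null ? title.InnerText : "") + " | "
                            + (href != null ? href.Value : ""));
        }
    }
}
```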
3. The id, digest, and tstamp fields are required.
Regarding the error mentioned above when Nutch ran solrdedup: at first I thought it was caused by the two new fields added in Solr, so I tried adding the same two fields to Nutch, making use of Nutch's static-field and extra-field plugins, but it was no use. Eventually I found that, strangely, Nutch ran normally once the Solr data was cleaned out. I Googled a lot too, with basically no help, and thought it was a Solr cache issue. Today I had a little free time, so I found the file behind the Nutch error, SolrDeleteDuplicates.java, and studying it I discovered that Nutch's deduplication reads the id, digest, and tstamp fields straight from Solr and uses them without checking whether they are empty.
But the documents written by my program had no digest or tstamp. Sure enough, after I added those two columns so that every document had all the fields, Nutch ran normally again.
4. digest is a 32-character hash of the page content that Nutch uses to compare differences when removing duplicates.
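An MD5 over the content gives a 32-hex-character digest of that kind. A sketch of filling in the three required columns when posting, with method names of my own (the tstamp format is covered in point 6):

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Sketch: fill the three fields Nutch's SolrDeleteDuplicates expects on every document.
class NutchFields
{
    static string Md5Hex(string content)
    {
        using (var md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(content));
            var sb = new StringBuilder(32);
            foreach (byte b in hash)
                sb.Append(b.ToString("x2")); // two hex chars per byte -> 32 chars total
            return sb.ToString();
        }
    }

    static string BuildRequiredFields(string url, string pageContent)
    {
        string digest = Md5Hex(pageContent);
        // Solr date fields take ISO 8601 UTC ("Zulu") timestamps; see point 6.
        string tstamp = DateTime.UtcNow.ToString("yyyy-MM-dd'T'HH:mm:ss'Z'");
        return "<field name=\"id\">" + url + "</field>"
             + "<field name=\"digest\">" + digest + "</field>"
             + "<field name=\"tstamp\">" + tstamp + "</field>";
    }
}
```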
5. Solr content can be added, modified, and deleted directly through a request URL; the format for modifying is the same as for adding.
The core of it is: http://xx.xx.xx.xx/solr/update?stream.body=<add><doc><field name="xx">xx</field></doc></add>&stream.contenttype=text/xml;charset=utf-8&commit=true
If the content contains URL or HTML special characters, it must be encoded first, as in the sketch below.
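A sketch of driving the update handler through the URL alone, assuming the server accepts the stream.body parameter; Uri.EscapeDataString does the URL encoding, and XML-special characters in field values still need escaping (&amp; and so on):

```csharp
using System;
using System.Net;

// Sketch: send an add through Solr's update handler purely via the request URL.
class SolrUrlUpdate
{
    static void Main()
    {
        string body = "<add><doc>"
                    + "<field name=\"id\">http://example.com/post/1</field>"
                    + "<field name=\"title\">hello &amp; world</field>"
                    + "</doc></add>";
        string url = "http://localhost:8983/solr/update"
                   + "?stream.body=" + Uri.EscapeDataString(body)
                   + "&stream.contenttype=" + Uri.EscapeDataString("text/xml;charset=utf-8")
                   + "&commit=true";
        using (var client = new WebClient())
            Console.WriteLine(client.DownloadString(url)); // Solr replies with an XML status
    }
}
```

Deletion goes through the same door, e.g. stream.body=<delete><query>id:xxx</query></delete>.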
6. Solr time is in GMT-based format, so be careful not to get it wrong; also, the GMT format in some RSS feeds is itself wrong. I ran into many feeds where the day of the week was incorrect, which causes the Solr index to fail.
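Concretely, Solr accepts ISO 8601 UTC timestamps for date fields (e.g. 2011-09-06T16:45:00Z), while RSS pubDate values are RFC 822 strings like "Tue, 06 Sep 2011 16:45:00 GMT", and a feed that states the wrong weekday breaks strict parsing. Since the weekday is redundant anyway, a tolerant converter can simply throw it away before parsing; a sketch:

```csharp
using System;
using System.Globalization;

// Sketch: convert an RSS pubDate into the UTC timestamp Solr accepts, tolerating
// feeds that state the wrong day of the week.
class RssDate
{
    static string ToSolrTime(string pubDate)
    {
        string s = pubDate.Trim();
        int comma = s.IndexOf(',');
        if (comma >= 0)
            s = s.Substring(comma + 1).Trim(); // discard the (possibly wrong) weekday

        DateTime dt = DateTime.Parse(s, CultureInfo.InvariantCulture,
                                     DateTimeStyles.AdjustToUniversal);
        return dt.ToString("yyyy-MM-dd'T'HH:mm:ss'Z'");
    }

    static void Main()
    {
        // Wrong weekday (6 Sep 2011 was a Tuesday, not a Monday), but it still parses:
        Console.WriteLine(ToSolrTime("Mon, 06 Sep 2011 16:45:00 GMT")); // 2011-09-06T16:45:00Z
    }
}
```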
Now the indexed information is fairly complete and searching is quite convenient, but there are still many problems to solve: mainly that pages generated by JavaScript cannot be read, and then there is filtering the information on a page. Extracting the main content of an article while filtering out navigation and other noise is still quite difficult. I have found that others do this filtering based on statistical rules, such as the many separators in a navigation bar, or the whitespace spacing within the content. Every site's HTML layout style is different and the tags are hard to unify; I do not know how Baidu and Google do it, or perhaps they have not actually solved it either.
Nutch is still quite powerful, but it always feels hard to maintain and modify; compiling the whole source last time took a long time. Solr's queries, on the other hand, are quite efficient. I am planning to write a .NET version of the crawler and search, but it is only a plan, because there are too many problems involved...