C#: Reading RSS feeds and indexing them with Solr

Source: Internet
Author: User
Tags: solr


The problem that had plagued me for several days is finally solved today, so I would like to share some of my recent experience with Solr.

The pages were originally crawled with Nutch, but the customer also needed RSS feeds to be crawled, and wanted to be able to identify which pages came from which RSS source. Although Nutch ships with a plug-in for parsing RSS, some feeds cannot be parsed and the process is hard to control. More importantly, after crawling there is no real difference between feed items and ordinary pages, so you cannot tell which RSS source a page was crawled from. For these reasons I wrote a C# project that fetches RSS feeds and indexes them in Solr.

Once everything was done the customer was very satisfied and I thought the job was finished. After a while, however, I found that Nutch started failing in the solrdedup step and became unusable. Below I describe how the RSS indexing works and how this problem arose. Neither Solr nor Nutch is particularly hard to understand here; the real difficulty is that answers to these problems are hard to find on the Internet. Because the company machine has no Internet access and no copy permission, everything below is written from memory.

The RSS + Solr implementation uses WebRequest to read the XML of the RSS feed, then posts documents directly to Solr to build the index. To meet the customer's requirement I added two fields, rss and isrss, to Solr's schema: rss holds the URL of the feed an item came from, and isrss is fixed to true. Because neither field exists in documents indexed by Nutch, querying isrss:"true" filters out all non-RSS pages.
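
A minimal C# sketch of this flow follows. The Solr URL, the feed URL, and the helper names are placeholders of my own; only the rss and isrss fields and the post-to-/update approach come from the setup described above.

    using System;
    using System.IO;
    using System.Net;
    using System.Text;

    // Sketch: fetch an RSS feed and post one document to Solr's /update handler.
    // The host, feed URL, and Esc helper are assumptions; rss/isrss are the two
    // custom fields added to the schema.
    class RssToSolr
    {
        static string Fetch(string url)
        {
            var request = (HttpWebRequest)WebRequest.Create(url);
            request.UserAgent = "RssIndexer/1.0";
            using (var response = request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
                return reader.ReadToEnd();
        }

        static void PostToSolr(string solrUpdateUrl, string docXml)
        {
            var request = (HttpWebRequest)WebRequest.Create(solrUpdateUrl + "?commit=true");
            request.Method = "POST";
            request.ContentType = "text/xml; charset=utf-8";
            byte[] body = Encoding.UTF8.GetBytes("<add>" + docXml + "</add>");
            using (var stream = request.GetRequestStream())
                stream.Write(body, 0, body.Length);
            request.GetResponse().Close();
        }

        static string Esc(string s) // escape text placed inside the XML body
        {
            return System.Security.SecurityElement.Escape(s ?? "");
        }

        static void Main()
        {
            string feedUrl = "http://example.com/rss.xml";   // placeholder feed
            string xml = Fetch(feedUrl);                     // raw RSS XML
            // ... parse items out of xml (see point 2 below), then for each item:
            string doc =
                "<doc>" +
                "<field name=\"id\">" + Esc("http://example.com/post/1") + "</field>" +
                "<field name=\"title\">" + Esc("Example title") + "</field>" +
                "<field name=\"rss\">" + Esc(feedUrl) + "</field>" +   // source feed URL
                "<field name=\"isrss\">true</field>" +                 // fixed flag
                "</doc>";
            PostToSolr("http://localhost:8983/solr/update", doc);
        }
    }

Each parsed feed item becomes one <doc> element; using the item's link as the id means re-posting the same item simply overwrites the old index entry.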

Pay attention to the following points during implementation:

1. RSS sources are not always files with an .xml suffix; some are returned as ordinary page responses, and some require login permissions (see the sketch below).
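
For feeds that need a login, something along these lines works when the feed is behind HTTP basic authentication; the URL and account here are made up.

    using System;
    using System.IO;
    using System.Net;

    // Fetching a feed that sits behind HTTP basic authentication (placeholders only).
    class ProtectedFeed
    {
        public static string Fetch()
        {
            var request = (HttpWebRequest)WebRequest.Create("http://example.com/protected/rss");
            request.Credentials = new NetworkCredential("user", "password");
            request.PreAuthenticate = true;
            using (var response = request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
                return reader.ReadToEnd();
        }
    }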

2. There are currently two common feed structures: the classic layout rss/channel/item, and the feed/entry layout used, for example, by the feeds of our blog platform (Cnblogs).

The tag names are fixed, so you can read the different tags to find the values needed for each field in Solr.
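
A hedged sketch of a parser that copes with both layouts; matching on local element names avoids having to care about the Atom namespace (the element and attribute choices are the common ones, not necessarily what every feed uses).

    using System;
    using System.Linq;
    using System.Xml.Linq;

    // Handles both layouts from point 2: rss/channel/item and feed/entry.
    class FeedParser
    {
        public static void PrintItems(string xml)
        {
            XDocument doc = XDocument.Parse(xml);
            bool isAtom = doc.Root.Name.LocalName == "feed";
            string itemName = isAtom ? "entry" : "item";

            foreach (XElement item in doc.Descendants().Where(e => e.Name.LocalName == itemName))
            {
                XElement titleEl = item.Elements().FirstOrDefault(e => e.Name.LocalName == "title");
                XElement linkEl = item.Elements().FirstOrDefault(e => e.Name.LocalName == "link");

                string title = titleEl == null ? "" : titleEl.Value;
                // RSS puts the URL in the element text, Atom puts it in the href attribute.
                string link = linkEl == null ? ""
                            : isAtom ? (string)linkEl.Attribute("href")
                                     : linkEl.Value;
                Console.WriteLine(title + " -> " + link);
            }
        }
    }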

3. Three fields are required: id, digest, and tstamp.

As mentioned above, Nutch reported an error when it reached the solrdedup step. At first I suspected the two new fields I had added to Solr, so I tried adding the same fields on the Nutch side, and also tried Nutch's static-field and additional-field plug-ins, but none of that helped. Eventually I noticed that after clearing the Solr data Nutch could run normally again. I searched a lot on Google but found basically nothing useful, and at first I blamed the Solr cache. Then I found the file SolrDeleteDuplicates.java and studied it: for deduplication it reads the id, digest, and tstamp fields straight from Solr and uses them without checking whether they are null.

The documents created by my program, however, did not include digest and tstamp. After I added those two fields and back-filled them for every document that lacked them, Nutch ran properly again.

4. digest is a 32-character hash of the page content; it is what the deduplication step compares to detect identical content.
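
A small sketch of how the two missing fields can be filled from the C# side; the MD5-hex digest and the timestamp format are my assumptions about what the dedup step will accept, based on points 3, 4, and 6.

    using System;
    using System.Security.Cryptography;
    using System.Text;

    // Fill the digest and tstamp fields that SolrDeleteDuplicates reads without null checks.
    class DedupFields
    {
        public static string Digest(string content)
        {
            using (MD5 md5 = MD5.Create())
            {
                byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(content ?? ""));
                var sb = new StringBuilder();
                foreach (byte b in hash)
                    sb.Append(b.ToString("x2"));   // 16 bytes -> 32 hex characters
                return sb.ToString();
            }
        }

        public static string Tstamp(DateTime utcNow)
        {
            // Solr expects an ISO 8601 UTC timestamp such as 2012-05-01T08:30:00Z.
            return utcNow.ToString("yyyy-MM-dd'T'HH:mm:ss'Z'");
        }
    }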

5. Solr lets you add, modify, and delete content directly through a request URL; a modification uses the same format as an add.

Core: http://xx.xx/solr/update?stream.body=<add><doc><field name="xx">xx</field></doc></add>&stream.contentType=text/xml;charset=UTF-8&commit=true

If the content contains URLs or HTML, you need to transcode (URL-encode) it before putting it into the request, as in the sketch below.
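
A hedged example of sending such an update from C#, with Uri.EscapeDataString doing the transcoding; the host and field name are placeholders.

    using System;
    using System.Net;

    // Sends an add/update through the URL itself using stream.body (point 5).
    class SolrUrlUpdate
    {
        public static void Send(string title)
        {
            string doc = "<add><doc><field name=\"title\">"
                       + System.Security.SecurityElement.Escape(title)
                       + "</field></doc></add>";
            string url = "http://localhost:8983/solr/update"
                       + "?stream.body=" + Uri.EscapeDataString(doc)
                       + "&stream.contentType=" + Uri.EscapeDataString("text/xml;charset=UTF-8")
                       + "&commit=true";

            using (var client = new WebClient())
                client.DownloadString(url);   // Solr answers with a small XML status response
        }
    }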

6. Solr expects times in GMT (UTC) format, so be careful not to get this wrong. In addition, some GMT dates in RSS feeds are themselves malformed, which caused me a lot of trouble: if the day of the week is wrong, indexing that document in Solr fails.
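
One way to normalize RSS dates before indexing, written as a sketch: it deliberately drops the weekday instead of trusting it, and falls back to the current time rather than failing the whole document.

    using System;
    using System.Globalization;

    // RSS pubDate values look like "Tue, 01 May 2012 08:30:00 GMT"; Solr wants ISO 8601 UTC.
    class RssDates
    {
        public static string ToSolrDate(string pubDate)
        {
            string s = pubDate.Trim();
            int comma = s.IndexOf(',');
            if (comma >= 0)
                s = s.Substring(comma + 1).Trim();   // drop the (possibly wrong) weekday

            string[] formats = { "dd MMM yyyy HH:mm:ss 'GMT'", "d MMM yyyy HH:mm:ss 'GMT'" };
            DateTime parsed;
            if (!DateTime.TryParseExact(s, formats, CultureInfo.InvariantCulture,
                    DateTimeStyles.AssumeUniversal | DateTimeStyles.AdjustToUniversal, out parsed)
                && !DateTime.TryParse(s, CultureInfo.InvariantCulture,
                    DateTimeStyles.AssumeUniversal | DateTimeStyles.AdjustToUniversal, out parsed))
                parsed = DateTime.UtcNow;            // fall back rather than fail the index

            return parsed.ToString("yyyy-MM-dd'T'HH:mm:ss'Z'");
        }
    }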

 

At the moment the indexed information is fairly complete and searching is convenient, but many problems remain. The main ones are that pages generated by JavaScript cannot be read, and that filtering the noise on a page is hard: extracting the main content of an article while dropping navigation and other clutter is still difficult. So far I have only seen others filter by statistical rules, for example that most navigation entries are separated by delimiters such as commas, or by looking at the spacing intervals within the content. Every site's HTML layout is different and the tags are hard to unify; I do not know how Baidu and Google handle this, or whether they really solve it at all.

Nutch itself, however, is hard to maintain and modify; recompiling its source took a long time last time. Solr queries, on the other hand, are efficient, so I am considering writing my own .NET crawler and search. For now it is only a plan, because there are too many problems involved...

 
