Some tips on crawling large amounts of external data with a BCS connector


To enable the SharePoint search component to retrieve external content sources (external databases, business systems, binary files, and so on), you usually need to create a custom indexing connector. An indexing connector is a component built on Business Connectivity Services (BCS) and the Search Connector Framework in SharePoint 2010. It replaces the earlier protocol handler and is now the primary supported way to crawl external data in SharePoint 2010 (and FAST Search Server 2010 for SharePoint). (SharePoint 2010 still supports custom protocol handlers.)

Once you have created a connector with BCS, one likely challenge is using it to crawl a large amount of data, for example millions or even tens of millions of items. If your connector has to face such a challenge, it needs to be designed carefully.

First, the connector must support incremental crawls. You certainly do not want an incremental crawl to take as long as a full crawl.

A connector can support incremental crawls in two ways: timestamp-based (using the last-modified time) and changelog-based (using a change log). With the timestamp-based approach, you designate a date/time field that the crawler treats as the item's last-modified time; during an incremental crawl, the crawler compares this value against the previous crawl to decide whether an item needs to be reprocessed. With the changelog-based approach, dedicated methods return the added, modified, and deleted items directly to the search engine, so it knows exactly which items have changed since the previous crawl.
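To make the timestamp-based approach concrete, here is a minimal sketch of a SpecificFinder method for a .NET assembly connector. The table and column names (ExternalItems, LastModified), the entity class, and the connection string are all hypothetical; the BDC model is assumed to point the method instance's LastModifiedTimeStampField property at the LastModified field.

    // Minimal sketch of a timestamp-based SpecificFinder for a .NET assembly
    // connector. Table, column, and class names are hypothetical.
    using System;
    using System.Data.SqlClient;

    public class ExternalItem
    {
        public int Id { get; set; }
        public string Title { get; set; }
        // In the BDC model, set the method instance property
        // "LastModifiedTimeStampField" to this field so the crawler can
        // compare it with the time of the previous crawl.
        public DateTime LastModified { get; set; }
    }

    public class ExternalItemService
    {
        const string ConnStr = "Data Source=...;Initial Catalog=...";  // assumed

        // SpecificFinder: return one item, including its last-modified time.
        public static ExternalItem ReadItem(int id)
        {
            using (var conn = new SqlConnection(ConnStr))
            using (var cmd = new SqlCommand(
                "SELECT Id, Title, LastModified FROM ExternalItems WHERE Id = @Id",
                conn))
            {
                cmd.Parameters.AddWithValue("@Id", id);
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    if (!reader.Read()) return null;
                    return new ExternalItem
                    {
                        Id = reader.GetInt32(0),
                        Title = reader.GetString(1),
                        LastModified = reader.GetDateTime(2)
                    };
                }
            }
        }
    }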

If the external content source holds a large amount of data, even the first full crawl may break the crawler, or put excessive pressure on the external content source in a short time.

First, consider whether you really want a Finder method that returns all the required data from the external content source in one call (similar to a SELECT * FROM db_table operation). With small data volumes this is very convenient, but with very large volumes it is probably inappropriate.

It is more prudent to obtain data using only the IdEnumerator and SpecificFinder methods. The IdEnumerator method (similar to SELECT Id FROM db_table) returns the IDs of the items; the crawler then calls the SpecificFinder method (similar to SELECT * FROM db_table WHERE Id = @Id) repeatedly with those IDs to fetch the items one by one. In this design, you tell the connector that the IdEnumerator method is the RootFinder of the entity. In most cases you do not even need to define a Finder method at all, because the crawler should not pull too much data from the external content source at once.
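As a sketch of this pattern (reusing the hypothetical names from the earlier example), the IdEnumerator selects only the IDs, and the crawler feeds each ID back into the SpecificFinder shown above:

    // Sketch of an IdEnumerator to pair with the SpecificFinder above.
    // In the BDC model, mark this method instance as the RootFinder of
    // the entity so the crawler uses it to enumerate the items.
    using System.Collections.Generic;
    using System.Data.SqlClient;

    public static class ExternalItemEnumerator
    {
        public static IEnumerable<int> ReadIds()
        {
            using (var conn = new SqlConnection("Data Source=..."))  // assumed
            using (var cmd = new SqlCommand("SELECT Id FROM ExternalItems", conn))
            {
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                        yield return reader.GetInt32(0);  // one ID per item
                }
            }
        }
    }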

If the data volume is large enough, even the IdEnumerator method may run into trouble; imagine returning the IDs of tens of millions of items from the external data source in a single call. In that case we need to go one step further and have the IdEnumerator method return only a limited number of item IDs (say, 1,000) per call.

To do this, define a filter of type LastId for the IdEnumerator method, together with a corresponding input parameter (that is, a parameter whose direction is In). The crawler will then call the IdEnumerator method repeatedly, each time passing in the last ID returned by the previous call. In the implementation of the IdEnumerator method, you use this parameter to retrieve from the external content source only the item IDs that come after that ID.

The crawler keeps calling the IdEnumerator method in this way until it returns zero results. (How many items each call returns is determined solely by the implementation of the IdEnumerator method.)
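A minimal sketch of such a batched IdEnumerator, assuming monotonically increasing positive integer IDs and the hypothetical table from the earlier sketches; the lastId parameter is the one associated with the LastId filter in the BDC model, and the batch size of 1,000 is an arbitrary choice:

    // Batched IdEnumerator: returns at most BatchSize IDs per call.
    // The crawler passes the last ID of the previous batch via the
    // parameter bound to the LastId filter; an empty result ends the loop.
    using System.Collections.Generic;
    using System.Data.SqlClient;

    public static class BatchedIdEnumerator
    {
        const int BatchSize = 1000;  // chosen by the connector author

        public static IEnumerable<int> ReadIds(int? lastId)
        {
            using (var conn = new SqlConnection("Data Source=..."))  // assumed
            using (var cmd = new SqlCommand(
                "SELECT TOP (@Batch) Id FROM ExternalItems " +
                "WHERE Id > @LastId ORDER BY Id", conn))
            {
                cmd.Parameters.AddWithValue("@Batch", BatchSize);
                // Assumes positive IDs; 0 means "start from the beginning".
                cmd.Parameters.AddWithValue("@LastId", lastId ?? 0);
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                        yield return reader.GetInt32(0);
                }
            }
        }
    }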

The search engine's crawler may call the SpecificFinder method several times for the same item (the exact cause is unclear), so inside the connector you can use some caching to reduce the number of requests sent to the external content source. If the content source is large, holding all of its data in memory is not a good idea. Consider HttpRuntime.Cache instead: although it appears to be intended only for web applications, simply referencing the System.Web assembly lets you use it in your connector. (Note: the Microsoft MSDN documentation warns that it may not work correctly outside ASP.NET programs; in practice it seems to work, but I make no guarantees.) This cache implementation has built-in support for automatic eviction, priorities, and expiration times, which is far more convenient than writing your own. Alternatively, the Enterprise Library (EntLib) also contains a cache implementation that can be used in any type of program.
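As a sketch, reusing the hypothetical ExternalItem and ReadItem names from earlier, the SpecificFinder can consult HttpRuntime.Cache before hitting the external source; the key format and the ten-minute expiration are arbitrary choices, and the outside-ASP.NET caveat above applies:

    // Sketch of caching SpecificFinder results with HttpRuntime.Cache
    // (reference System.Web.dll). As noted above, using this cache outside
    // ASP.NET is not guaranteed by the documentation, so verify it yourself.
    using System;
    using System.Web;
    using System.Web.Caching;

    public static class CachedItemReader
    {
        public static ExternalItem ReadItemCached(int id)
        {
            string key = "ExternalItem:" + id;
            var cached = HttpRuntime.Cache[key] as ExternalItem;
            if (cached != null)
                return cached;  // served from cache, no call to the external source

            ExternalItem item = ExternalItemService.ReadItem(id);
            if (item != null)
            {
                // Evict automatically ten minutes after insertion (arbitrary).
                HttpRuntime.Cache.Insert(key, item, null,
                    DateTime.UtcNow.AddMinutes(10), Cache.NoSlidingExpiration);
            }
            return item;
        }
    }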

Beyond the above, Eric Wang contributed another idea. In his view, fetching a large amount of data in one call is not a problem in itself; if you really face massive data, you can create multiple content sources in search administration, each crawling a portion of the external data. For example, with 2,000,000 external items you could create four content sources, each of which selects and crawls 500,000 items according to some partitioning rule.
