Some tips on crawling large amounts of external data with a BCS connector


To enable the SharePoint search component to retrieve external content sources (external databases, business systems, binary files, and so on), you usually need to create a custom indexing connector. An indexing connector is a component built on Business Connectivity Services (BCS) and the Search Connector Framework in SharePoint 2010. It replaces the earlier protocol handler and is now the primary supported way to crawl external data in SharePoint 2010 (and FAST Search Server 2010 for SharePoint). (SharePoint 2010 still supports custom protocol handlers.)

Once you have created a connector with BCS, one likely challenge is using it to crawl a large amount of data, for example millions or even tens of millions of items. If your connector has to face such a challenge, it needs to be designed carefully.

First, the connector must support incremental crawls. You certainly do not want an incremental crawl to take as long as a full crawl.

A connector can support incremental crawls in two ways: timestamp-based (using the last-modified time) and changelog-based (using a change log). With the timestamp-based approach, you designate a date/time field that the crawler treats as the item's last-modified time; during an incremental crawl, the crawler compares this value against the previous crawl to decide whether an item needs to be reprocessed. With the changelog-based approach, dedicated methods return the added, modified, and deleted items directly to the search engine, so it knows exactly which items have changed since the previous crawl.
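To make the timestamp-based approach concrete, here is a minimal sketch of a SpecificFinder method for a .NET assembly connector. The table and column names (ExternalItems, LastModified), the entity class, and the connection string are all hypothetical; the BDC model is assumed to point the method instance's LastModifiedTimeStampField property at the LastModified field.

    // Minimal sketch of a timestamp-based SpecificFinder for a .NET assembly
    // connector. Table, column, and class names are hypothetical.
    using System;
    using System.Data.SqlClient;

    public class ExternalItem
    {
        public int Id { get; set; }
        public string Title { get; set; }
        // In the BDC model, set the method instance property
        // "LastModifiedTimeStampField" to this field so the crawler can
        // compare it with the time of the previous crawl.
        public DateTime LastModified { get; set; }
    }

    public class ExternalItemService
    {
        const string ConnStr = "Data Source=...;Initial Catalog=...";  // assumed

        // SpecificFinder: return one item, including its last-modified time.
        public static ExternalItem ReadItem(int id)
        {
            using (var conn = new SqlConnection(ConnStr))
            using (var cmd = new SqlCommand(
                "SELECT Id, Title, LastModified FROM ExternalItems WHERE Id = @Id",
                conn))
            {
                cmd.Parameters.AddWithValue("@Id", id);
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    if (!reader.Read()) return null;
                    return new ExternalItem
                    {
                        Id = reader.GetInt32(0),
                        Title = reader.GetString(1),
                        LastModified = reader.GetDateTime(2)
                    };
                }
            }
        }
    }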

If the external content source holds a large amount of data, even the first full crawl may break the crawler, or put excessive pressure on the external content source in a short time.

First, consider whether you really want a Finder method that returns all the required data from the external content source in one call (similar to a SELECT * FROM db_table operation). With small data volumes this is very convenient, but with very large volumes it is probably inappropriate.

It is more prudent to obtain data using only the IdEnumerator and SpecificFinder methods. The IdEnumerator method (similar to SELECT Id FROM db_table) returns the IDs of the items; the crawler then calls the SpecificFinder method (similar to SELECT * FROM db_table WHERE Id = @Id) repeatedly with those IDs to fetch the items one by one. In this design, you tell the connector that the IdEnumerator method is the RootFinder of the entity. In most cases you do not even need to define a Finder method at all, because the crawler should not pull too much data from the external content source at once.
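As a sketch of this pattern (reusing the hypothetical names from the earlier example), the IdEnumerator selects only the IDs, and the crawler feeds each ID back into the SpecificFinder shown above:

    // Sketch of an IdEnumerator to pair with the SpecificFinder above.
    // In the BDC model, mark this method instance as the RootFinder of
    // the entity so the crawler uses it to enumerate the items.
    using System.Collections.Generic;
    using System.Data.SqlClient;

    public static class ExternalItemEnumerator
    {
        public static IEnumerable<int> ReadIds()
        {
            using (var conn = new SqlConnection("Data Source=..."))  // assumed
            using (var cmd = new SqlCommand("SELECT Id FROM ExternalItems", conn))
            {
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                        yield return reader.GetInt32(0);  // one ID per item
                }
            }
        }
    }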

If the data volume is large enough, even the IdEnumerator method may run into trouble; imagine returning the IDs of tens of millions of items from the external data source in a single call. In that case we need to go one step further and have the IdEnumerator method return only a limited number of item IDs (say, 1,000) per call.

To do this, define a filter of type LastId for the IdEnumerator method, together with a corresponding input parameter (that is, a parameter whose direction is In). The crawler will then call the IdEnumerator method repeatedly, each time passing in the last ID returned by the previous call. In the implementation of the IdEnumerator method, you use this parameter to retrieve from the external content source only the item IDs that come after that ID.

The crawler keeps calling the IdEnumerator method in this way until it returns zero results. (How many items each call returns is determined solely by the implementation of the IdEnumerator method.)
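A minimal sketch of such a batched IdEnumerator, assuming monotonically increasing positive integer IDs and the hypothetical table from the earlier sketches; the lastId parameter is the one associated with the LastId filter in the BDC model, and the batch size of 1,000 is an arbitrary choice:

    // Batched IdEnumerator: returns at most BatchSize IDs per call.
    // The crawler passes the last ID of the previous batch via the
    // parameter bound to the LastId filter; an empty result ends the loop.
    using System.Collections.Generic;
    using System.Data.SqlClient;

    public static class BatchedIdEnumerator
    {
        const int BatchSize = 1000;  // chosen by the connector author

        public static IEnumerable<int> ReadIds(int? lastId)
        {
            using (var conn = new SqlConnection("Data Source=..."))  // assumed
            using (var cmd = new SqlCommand(
                "SELECT TOP (@Batch) Id FROM ExternalItems " +
                "WHERE Id > @LastId ORDER BY Id", conn))
            {
                cmd.Parameters.AddWithValue("@Batch", BatchSize);
                // Assumes positive IDs; 0 means "start from the beginning".
                cmd.Parameters.AddWithValue("@LastId", lastId ?? 0);
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                        yield return reader.GetInt32(0);
                }
            }
        }
    }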

The search engine's crawler may call the SpecificFinder method several times for the same item (the exact cause is unclear), so inside the connector you can use some caching to reduce the number of requests sent to the external content source. If the content source is large, holding all of its data in memory is not a good idea. Consider HttpRuntime.Cache instead: although it appears to be intended only for web applications, simply referencing the System.Web assembly lets you use it in your connector. (Note: the Microsoft MSDN documentation warns that it may not work correctly outside ASP.NET programs; in practice it seems to work, but I make no guarantees.) This cache implementation has built-in support for automatic eviction, priorities, and expiration times, which is far more convenient than writing your own. Alternatively, the Enterprise Library (EntLib) also contains a cache implementation that can be used in any type of program.
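As a sketch, reusing the hypothetical ExternalItem and ReadItem names from earlier, the SpecificFinder can consult HttpRuntime.Cache before hitting the external source; the key format and the ten-minute expiration are arbitrary choices, and the outside-ASP.NET caveat above applies:

    // Sketch of caching SpecificFinder results with HttpRuntime.Cache
    // (reference System.Web.dll). As noted above, using this cache outside
    // ASP.NET is not guaranteed by the documentation, so verify it yourself.
    using System;
    using System.Web;
    using System.Web.Caching;

    public static class CachedItemReader
    {
        public static ExternalItem ReadItemCached(int id)
        {
            string key = "ExternalItem:" + id;
            var cached = HttpRuntime.Cache[key] as ExternalItem;
            if (cached != null)
                return cached;  // served from cache, no call to the external source

            ExternalItem item = ExternalItemService.ReadItem(id);
            if (item != null)
            {
                // Evict automatically ten minutes after insertion (arbitrary).
                HttpRuntime.Cache.Insert(key, item, null,
                    DateTime.UtcNow.AddMinutes(10), Cache.NoSlidingExpiration);
            }
            return item;
        }
    }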

Beyond the above, Eric Wang contributed another idea. In his view, fetching a large amount of data in one call is not a problem in itself; if you really face massive data, you can create multiple content sources in search administration, each crawling a portion of the external data. For example, with 2,000,000 external items you could create four content sources, each of which selects and crawls 500,000 items according to some partitioning rule.
