Crawl best practices in SharePoint Server 2013

Learn best practices for crawling in SharePoint Server 2013

The search system crawls content to build a search index on which users can run search queries. This article contains recommendations for how to manage crawls most effectively.

In this article:

    • Crawl most content with the default content access account
    • Use content sources effectively
    • Use continuous crawls to ensure that search results are up to date
    • Use crawl rules to exclude irrelevant content from the crawl
    • Crawl the default zone of SharePoint web applications
    • Reduce the effect of crawling on SharePoint crawl targets
    • Use crawler impact rules to limit the effect of crawling
    • Use Active Directory groups instead of individual user permissions
    • Add a second crawl component to provide fault tolerance
    • Manage environment resources to improve crawl performance
    • Make sure no crawl is active before you change the search topology
    • Remove a crawl component from its host before you remove the host from the farm
    • Test crawl and query functionality after you change a crawl configuration or apply an update
    • Diagnose problems with crawl logs and crawl health reports

Note:

Because SharePoint 2013 runs as websites in Internet Information Services (IIS), administrators and users depend on the accessibility features that browsers provide. SharePoint 2013 supports the accessibility features of supported browsers. For more information, see the following resources:

  • Plan browser support
  • Accessibility for SharePoint 2013
  • Accessibility features in SharePoint 2013 products
  • Keyboard shortcuts
  • Touch

Crawl most content with the default content access account

The default content access account is a domain account that you specify for the SharePoint Server 2013 search service to use by default for crawling. For simplicity, it is best to use this account to crawl as much as possible of the content that your content sources specify. To change the default content access account, see Change the default account for crawling in SharePoint 2013.

If you cannot use the default content access account for crawling a particular URL (for example, for security reasons), you can create a crawl rule to specify one of the following alternative ways for the crawler to authenticate:

    • A different content access account
    • A client certificate
    • Form credentials
    • A cookie for crawling
    • Anonymous access

For more information, see Manage crawl rules in SharePoint Server 2013.
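
For example, the following Windows PowerShell sketch creates a crawl rule that crawls one URL with a different content access account; the URL and the account name are hypothetical placeholders, so verify the parameter values in your environment before you use them:

    $ssa = Get-SPEnterpriseSearchServiceApplication

    # Read the password of the alternative content access account
    # (the account and URL below are placeholders for illustration).
    $password = Read-Host -AsSecureString "Password for CONTOSO\CrawlSecure"

    # Create an inclusion rule that crawls this URL with NTLM credentials
    # instead of the default content access account.
    New-SPEnterpriseSearchCrawlRule -SearchApplication $ssa `
        -Path "https://secure.contoso.com/*" `
        -Type InclusionRule `
        -AuthenticationType NTLMAccountRuleAccess `
        -AccountName "CONTOSO\CrawlSecure" `
        -AccountPassword $password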

Use content sources effectively

A content source is a set of options in a Search service application that you use to specify each of the following:

    • One or more start addresses to crawl.
    • The type of content in the start addresses, such as SharePoint sites, file shares, or line-of-business data. You can specify only one type of content to crawl in a content source. For example, you use one content source to crawl SharePoint sites and a different content source to crawl file shares.
    • Crawl schedules and crawl priorities for full or incremental crawls that apply to all of the content repositories that the content source specifies.

When you create a Search service application, the search system automatically creates and configures one content source, which is named Local SharePoint sites. This preconfigured content source is for crawling user profiles and for crawling all SharePoint sites in the web applications that are associated with the Search service application. You can also use this content source to crawl content in other SharePoint Server farms, including SharePoint Server 2007 farms, SharePoint Server 2010 farms, or other SharePoint Server 2013 farms.

Create additional content sources when you want to do any of the following:

    • Crawl other types of content
    • Limit or increase how much content is crawled
    • Crawl certain content more or less frequently
    • Set different priorities for crawling certain content (this applies to full and incremental crawls, but not to continuous crawls)
    • Crawl certain content on different schedules (this applies to full and incremental crawls, but not to continuous crawls)

However, to simplify administration as much as possible, we recommend that you limit the number of content sources that you create and use.
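
For example, the following minimal Windows PowerShell sketch creates an additional content source for crawling a file share; the name and the UNC path are hypothetical:

    $ssa = Get-SPEnterpriseSearchServiceApplication

    # Create a content source of type File for crawling a file share
    # ("FileShares" and the UNC path are placeholder values).
    New-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa `
        -Name "FileShares" `
        -Type File `
        -StartAddresses "\\fileserver\share"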

Scheduling crawls with content sources

You can edit the preconfigured content source Local SharePoint sites to specify a crawl schedule; it does not have a crawl schedule by default. For any content source, you can start crawls manually, but we recommend that you schedule incremental crawls or enable continuous crawls to make sure that content is crawled regularly.

Consider using different content sources to crawl content on different schedules for the following reasons:

    • To accommodate server down times and periods of peak server usage.
    • To crawl content that is hosted on slower servers separately from content that is hosted on faster servers.
    • To crawl frequently updated content more often.

Crawling content can significantly decrease the performance of the servers that host the content. The effect depends on whether the host servers have sufficient resources (especially CPU and RAM) to handle the load. Therefore, consider the following best practices when you plan crawl schedules:

    • Schedule crawls for each content source during times when the servers that host the content are available and demand on server resources is low.
    • Stagger crawl schedules so that the load on crawl servers and host servers is distributed over time. You can optimize crawl schedules in this way as you become familiar with the typical crawl duration for each content source by checking the crawl logs. For more information, see View search diagnostics in SharePoint Server 2013.
    • Run full crawls only when necessary. For more information, see Reasons to do a full crawl in Plan crawling and federation in SharePoint Server 2013. For administrative changes that require a full crawl to take effect, such as the creation of a crawl rule, make the change shortly before the next scheduled full crawl so that an additional full crawl is not necessary. For more information, see Manage crawl rules in SharePoint Server 2013.
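
For example, the following hedged sketch uses Set-SPEnterpriseSearchCrawlContentSource to schedule a daily incremental crawl at 22:00, a time when demand on the host servers is assumed to be low; "FileShares" is the hypothetical content source from the earlier example:

    $ssa = Get-SPEnterpriseSearchServiceApplication

    # Schedule an incremental crawl that runs every day at 22:00
    # ("FileShares" is a placeholder content source name).
    Set-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa `
        -Identity "FileShares" `
        -ScheduleType Incremental `
        -DailyCrawlSchedule `
        -CrawlScheduleRunEveryInterval 1 `
        -CrawlScheduleStartDateTime "22:00"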

Crawl user profiles before you crawl SharePoint sites

By default, in the first Search service application in a farm, the preconfigured content source Local SharePoint sites contains at least the following two start addresses:

    • http://Web_application_public_URL, for crawling all SharePoint sites in the web application
    • sps3://My_Site_Host_URL, for crawling user profiles

However, if you are deploying people search, we recommend that you create a separate content source for the start address sps3://My_Site_Host_URL and run a crawl for that content source first. The reason is that after the crawl finishes, the search system generates a list to standardize people's names. This is so that when a person's name appears in different forms in one set of search results, all results for that person are displayed in a single group, known as a result block. For example, for the search query "Anne Weiler", all documents authored by Anne Weiler, A. Weiler, or the alias AnneW can be displayed in a result block labeled "Documents by Anne Weiler". Similarly, all documents authored by any of those identities can be displayed under the heading "Anne Weiler" in the refinement panel, if Author is one of its categories.

To crawl user profiles and then crawl SharePoint sites

    1. Verify that the user account that performs this procedure is an administrator of the Search service application that you want to configure.
    2. Follow the instructions in Deploy people search in SharePoint Server 2013. As part of those instructions, you do the following:
      1. Create a content source that is only for crawling the profile store. Give the content source a name such as People. In the new content source, in the Start Addresses section, type sps3://My_Site_Host_URL, where My_Site_Host_URL is the URL of the My Site host.
      2. Start a crawl for the People content source that you just created.
      3. Remove the start address sps3://My_Site_Host_URL from the preconfigured content source Local SharePoint sites.
    3. After the crawl of the People content source finishes, wait about two hours.
    4. Start the first full crawl of the content source Local SharePoint sites.
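
For reference, here is a minimal Windows PowerShell sketch of steps 2 through 4, assuming the default content source names described above; My_Site_Host_URL is a placeholder:

    $ssa = Get-SPEnterpriseSearchServiceApplication

    # Steps 2a-2b: create the People content source and start its crawl
    # (replace My_Site_Host_URL with the URL of your My Site host).
    $people = New-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa `
        -Name "People" -Type SharePoint -StartAddresses "sps3://My_Site_Host_URL"
    $people.StartFullCrawl()

    # Steps 3-4: after the People crawl completes (its CrawlState returns
    # to Idle), wait about two hours, and then start the first full crawl
    # of the preconfigured content source.
    $local = Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa `
        -Identity "Local SharePoint sites"
    $local.StartFullCrawl()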

Use continuous crawls to ensure that search results are up to date

Enable Continuous Crawls is a crawl schedule option that you can select when you add or edit a content source of type SharePoint Sites. A continuous crawl crawls content that was added, changed, or deleted since the last crawl. A continuous crawl starts at predefined time intervals; the default interval is 15 minutes, but you can set continuous crawls to occur at shorter intervals by using Windows PowerShell. Because continuous crawls occur so often, they help keep the search index fresh, even for SharePoint content that is updated frequently. Also, while an incremental or full crawl is delayed by multiple crawl attempts that return errors for a particular item, a continuous crawl can crawl other content and contribute to index freshness, because a continuous crawl does not process or retry items that return errors more than three times. (For content sources that have continuous crawls enabled, a clean-up incremental crawl runs automatically every four hours to re-crawl any items that repeatedly return errors.)

A single continuous crawl includes all content sources in the Search service application for which continuous crawls are enabled. Similarly, the continuous crawl interval applies to all content sources in the Search service application for which continuous crawls are enabled. For more information, see Manage continuous crawls in SharePoint Server 2013.
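
The following is a minimal sketch for enabling continuous crawls and shortening the interval, assuming the EnableContinuousCrawls parameter and the ContinuousCrawlInterval property behave as described in Manage continuous crawls in SharePoint Server 2013:

    $ssa = Get-SPEnterpriseSearchServiceApplication

    # Enable continuous crawls for a content source of type SharePoint.
    Set-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa `
        -Identity "Local SharePoint sites" -EnableContinuousCrawls $true

    # Shorten the continuous crawl interval from the default 15 minutes
    # to 5 minutes; the interval applies to all content sources in the
    # Search service application.
    $ssa.SetProperty("ContinuousCrawlInterval", 5)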

Continuous crawls increase the load on the crawler and on crawl targets. Make sure that you plan and scale out accordingly for this increased resource consumption. For each large content source for which you enable continuous crawls, we recommend that you configure one or more front-end web servers as dedicated targets for crawling. For more information, see Manage crawl load (SharePoint Server 2010).

Use crawl rules to exclude irrelevant content from the crawl

Because crawling consumes resources and bandwidth, during initial deployment it can be better to crawl a small amount of content that you know is relevant than a large amount of content that might include irrelevant content. To limit how much content is crawled, you can create crawl rules for the following reasons:

    • To avoid crawling irrelevant content by excluding one or more URLs.
    • To crawl links on a URL without crawling the URL itself. This is useful for sites whose pages do not contain relevant content but whose links are relevant.

By default, the crawler does not follow complex URLs, which are URLs that contain a question mark followed by additional parameters, for example, http://contoso/page.aspx?x=y. If you enable the crawler to follow complex URLs, the crawler can collect many more URLs than is expected or appropriate. This can cause the crawler to gather unnecessary links, fill the crawl database with redundant links, and produce an index that is unnecessarily large.
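
For example, the following sketch creates an exclusion rule for an archive area and an inclusion rule that follows complex URLs on one specific path; both paths are hypothetical:

    $ssa = Get-SPEnterpriseSearchServiceApplication

    # Exclude content that is not relevant to search (placeholder path).
    New-SPEnterpriseSearchCrawlRule -SearchApplication $ssa `
        -Path "http://contoso/archive/*" -Type ExclusionRule

    # Allow complex URLs (URLs that contain "?") only where a site
    # genuinely requires it (placeholder path).
    New-SPEnterpriseSearchCrawlRule -SearchApplication $ssa `
        -Path "http://contoso/reports/*" -Type InclusionRule -FollowComplexUrls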

These measures help reduce the use of server resources and network traffic and can increase the relevance of search results. After the initial deployment, review the query and crawl logs and adjust content sources and crawl rules to include more content if necessary. For more information, see Manage crawl rules in SharePoint Server 2013.

Crawl the default zone of SharePoint web applications

When you crawl the default zone of a SharePoint web application, the query processor automatically maps and returns search-result URLs that correspond to the alternate access mapping (AAM) zone from which queries are performed. This makes it easy for users to view and open search results.

However, if you crawl a zone of a web application other than the default zone, the query processor does not map search-result URLs to the AAM zone from which queries are performed. Instead, search-result URLs correspond to the non-default zone that was crawled. Because of this, users might not readily be able to view or open search results.

For example, assume that you have the following AAMs for a web application named WebApp1:

    • Default zone: https://contoso
    • Extranet zone: https://fabrikam
    • Intranet zone: http://fabrikam

Now assume that you crawl the default zone, https://contoso. When users perform queries from https://contoso/searchresults.aspx, result URLs from WebApp1 correspond to https://contoso/ and are therefore of the form https://contoso/path/result.aspx.

Similarly, when a query comes from the Extranet zone, in this case https://fabrikam/searchresults.aspx, results from WebApp1 correspond to https://fabrikam and are therefore of the form https://fabrikam/path/result.aspx.

In both cases, because the search-result URLs are consistent with the zone from which the query is performed, users can readily view and open search results without having to switch to a different security context.

However, now assume that you crawl a non-default zone, such as the Intranet zone http://fabrikam. In this case, for queries from any zone, result URLs from WebApp1 always correspond to the non-default zone that was crawled. In other words, a query from https://contoso/searchresults.aspx, https://fabrikam/searchresults.aspx, or http://fabrikam/searchresults.aspx yields search-result URLs that begin with the crawled non-default zone, and are therefore of the form http://fabrikam/path/result.aspx. This can cause unexpected or problematic behavior, such as the following:

    • When users try to open search results, they might be prompted for credentials that they do not have. For example, forms-based authenticated users in the Extranet zone might not have Windows authentication credentials.
    • Results from WebApp1 use HTTP, even though users might be searching from the Extranet zone at https://fabrikam/searchresults.aspx. Because the results do not use Secure Sockets Layer (SSL) encryption, this can be a security risk.
    • Content might not be filtered correctly, because filtering is performed on the public URL of the default zone instead of on the URL that was crawled. This is because URL-based properties in the index correspond to the crawled non-default URL.
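
Before you pick the start address to crawl, you can confirm which URL is in the Default zone. For example, with Windows PowerShell (using the hypothetical web application from this example):

    # List the alternate access mappings for the web application; the
    # output shows the zone of each incoming URL.
    Get-SPAlternateURL -WebApplication "https://contoso"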

Reduce the effect of crawling on SharePoint crawl targets

You can reduce the effect of crawling on SharePoint crawl targets (that is, SharePoint front-end web servers) in the following ways:

    • For a small SharePoint environment, redirect all crawl traffic to a single SharePoint front-end web server. For a large environment, redirect all crawl traffic to a specific group of front-end web servers. This prevents the crawler from using the same resources that render and serve pages and content to active users.
    • Limit search database usage in Microsoft SQL Server to prevent the crawler from consuming shared SQL Server disk and processor resources during a crawl.

For more information, see Manage crawl load (SharePoint Server 2010).

Use crawler impact rules to limit the effect of crawling

To limit crawler impact, you can also create crawler impact rules, which are available from the Search_service_application_name: Search Administration page. A crawler impact rule specifies the rate at which the crawler requests content from a start address or range of start addresses. Specifically, a crawler impact rule either requests a specified number of documents at a time from a URL without waiting between requests, or it requests one document at a time from the URL and waits a specified time between requests. Each crawler impact rule applies to all crawl components.

For servers in your organization, you can set crawler impact rules based on known server performance and capacity. However, this might not be possible for external sites. Therefore, you might unintentionally use too many resources on external servers by requesting too much content or requesting content too frequently. That could lead administrators of those external servers to limit your access so that it becomes difficult or impossible to crawl those repositories. For this reason, set crawler impact rules to have as little effect on external servers as possible while you still crawl enough content frequently enough to ensure that the freshness of the index meets your needs.

Use Active Directory groups instead of individual user permissions

The permission levels that you assign determine which activities users or groups can perform on a site. If you add or remove users individually for site permissions, or if you use a SharePoint group to specify site permissions and you change the membership of that group, the crawler must perform a "security-only crawl", which updates all affected items in the search index to reflect the change. Similarly, adding or updating a web application policy with users or with SharePoint groups triggers a crawl of all content covered by that policy. These crawls increase the crawl load and can decrease the freshness of search results. Therefore, to specify site permissions, it is best to use Active Directory Domain Services (AD DS) groups, because changes to AD DS group membership do not require the crawler to update the affected items in the search index.

Add a second crawl component to provide fault tolerance

When you create a Search service application, the default search topology includes one crawl component. A crawl component retrieves items from content repositories, downloads the items to the server that hosts the crawl component, passes the items and associated metadata to a content processing component, and adds crawl-related information to associated crawl databases. You can add a second crawl component to provide fault tolerance. If one crawl component becomes unavailable, the remaining crawl component takes over all crawl operations. For most SharePoint farms, a total of two crawl components is sufficient.

For more information, see the following TechNet articles:

    • Overview of search in SharePoint Server 2013
    • Change the default search topology in SharePoint Server 2013
    • Manage search components in SharePoint Server 2013
    • New-SPEnterpriseSearchCrawlComponent
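
The following sketch, based on the cmdlets listed above, adds a second crawl component by cloning the active topology; "Server2" is a hypothetical server name:

    $ssa = Get-SPEnterpriseSearchServiceApplication

    # Topology changes are made on a clone of the active topology and
    # then activated.
    $active = Get-SPEnterpriseSearchTopology -SearchApplication $ssa -Active
    $clone  = New-SPEnterpriseSearchTopology -SearchApplication $ssa -Clone -SearchTopology $active

    # Add a crawl component on a second server; the search service
    # instance on that server must be running.
    $instance = Get-SPEnterpriseSearchServiceInstance -Identity "Server2"
    Start-SPEnterpriseSearchServiceInstance -Identity $instance
    New-SPEnterpriseSearchCrawlComponent -SearchTopology $clone -SearchServiceInstance $instance

    # Activate the modified topology.
    Set-SPEnterpriseSearchTopology -Identity $clone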

Manage environment resources to improve crawl performance

After a crawl component crawls content, downloads the content to the crawl server (the server that hosts the crawl component), and feeds the content to a content processing component, several environmental factors can adversely affect performance. For guidance on managing environment resources to improve crawl performance, see the following resources:

    • Scale search for Internet sites in SharePoint Server 2013
    • SharePoint 2013: Crawl scaling recommendations

Make sure no crawl is active before you change the search topology

We recommend that you confirm that no crawls are in progress before you initiate a change to the search topology. Otherwise, the topology change might not proceed smoothly.

If necessary, you can manually pause or stop full and incremental crawls, and you can disable continuous crawls. For more information, see the following articles:

    • Start, pause, resume, or stop a crawl in SharePoint Server 2013
    • Manage continuous crawls in SharePoint Server 2013

Note:

A disadvantage of pausing a crawl is that references to the crawl components remain in the MSSCrawlComponentsState table in the search administration database. This can cause problems if you want to remove crawl components, for example because you want to remove the servers that host those components from the farm. Stopping a crawl, in contrast, deletes the references to the crawl components from the MSSCrawlComponentsState table. Therefore, if you want to remove a crawl component, it is best to stop crawls instead of pausing them.

To confirm that no crawls are in progress, on the Search_service_application_name: Manage Content Sources page, make sure that the value in the Status field for each content source is Idle or Paused. (The value in the Status field for a content source changes to Idle when a crawl of that content source completes or is stopped.)
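
The following Windows PowerShell sketch performs the same check, assuming that the CrawlState property of each content source reports the status shown on the Manage Content Sources page:

    $ssa = Get-SPEnterpriseSearchServiceApplication

    # Every content source should report Idle (or Paused) before you
    # change the search topology.
    Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa |
        Select-Object Name, CrawlState

    # If you plan to remove crawl components, stop an active crawl rather
    # than pause it (see the note above), for example:
    # $cs = Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa -Identity "Local SharePoint sites"
    # $cs.StopCrawl()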

Remove a crawl component from its host before you remove the host from the farm

If a server hosts a crawl component, removing the server from the farm can make the search system unable to crawl content. Therefore, before you remove a crawl host from a server farm, we strongly recommend that you do the following:

    1. Make sure that no crawl is active.
      For more information, see the previous section, Make sure no crawl is active before you change the search topology.
    2. Remove or relocate the crawl component that is on the host.

For more information, see the following resources:

    • Manage the search topology in SharePoint Server 2013
    • Change the default search topology in SharePoint Server 2013
    • Remove a search component or Move a search component in Manage search components in SharePoint Server 2013
    • Remove a server from a farm in SharePoint 2013
    • SP2010: Removing a server or re-joining a server to a farm can disrupt search

Test crawl and query functionality after you change a crawl configuration or apply an update

We recommend that you test the crawl and query functionality of the farm after you change a configuration or apply an update. The following procedure is an example of an easy way to perform such a test.

To test crawl and query functionality

    1. Verify that the user account that performs this procedure is an administrator of the Search service application that you want to configure.
    2. Create a content source that you will use temporarily, just for this test.
      In the Start Addresses section of the new content source, in the Type start addresses below (one per line) box, specify the start addresses of several items that are not already in the index, for example, several TXT files on a file share. For more information, see Add, edit, or delete a content source in SharePoint Server 2013.
    3. Start a full crawl of that content source.
      For more information, see Start, pause, resume, or stop a crawl in SharePoint Server 2013. When the crawl finishes, on the Search_service_application_name: Manage Content Sources page, the value in the Status column for the content source is Idle. (To update the Status column, refresh the Manage Content Sources page by clicking Refresh.)
    4. When the crawl is complete, go to the Search Center and perform search queries to find those files.
      If your deployment does not have a Search Center, see Create a Search Center site in SharePoint Server 2013.
    5. After the test is complete, delete the temporary content source.
      This deletes from the search index the items that the content source specified, so those items do not appear in search results after the test.
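
If you want to script the query in step 4, one option is the Search REST service; the site URL and file name here are hypothetical placeholders:

    # A result for the test file confirms that both crawl and query work.
    $url = "https://contoso/_api/search/query?querytext='testfile1.txt'"
    $response = Invoke-WebRequest -Uri $url -UseDefaultCredentials
    $response.StatusCode   # 200 indicates that the query succeeded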

Diagnose problems with crawl logs and crawl health reports

The crawl log tracks information about the status of crawled content. The log includes views for content sources, hosts, errors, databases, URLs, and history. For example, you can use this log to determine when a content source was last crawled successfully, whether crawled content was added to the index successfully, whether it was excluded because of a crawl rule, or whether the crawl failed because of an error.

Crawl health reports provide detailed information about crawl rate, crawl latency, crawl freshness, content processing, CPU and memory load, continuous crawls, and the crawl queue.

You can use the crawl log and crawl health reports to diagnose problems with the search experience. The diagnostic information can help you determine whether it would be useful to adjust elements such as content sources, crawl rules, crawler impact rules, crawl components, and crawl databases.
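
For a quick overview from Windows PowerShell, the content source objects expose their most recent crawl times; the detailed per-URL views remain in the crawl log and crawl health reports. A minimal sketch, assuming the CrawlStarted and CrawlCompleted properties:

    $ssa = Get-SPEnterpriseSearchServiceApplication

    # Show when each content source last started and completed a crawl,
    # and its current crawl state.
    Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa |
        Select-Object Name, CrawlState, CrawlStarted, CrawlCompleted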

For more information, see View search diagnostics in SharePoint Server 2013.
