How to let search engines crawl Ajax content: a solution
More and more websites are adopting the "single-page structure" (single-page application).
The entire site has only one page; it uses Ajax to load different content according to the user's input.
The advantage of this approach is a good user experience and reduced traffic; the drawback is that the Ajax content cannot be crawled by search engines. For example, suppose you have a website:
http://example.com
Users see different content through URLs built with the hash (#) structure:
http://example.com#1
http://example.com#2
http://example.com#3
However, search engines only crawl example.com; they ignore everything after the hash, and therefore cannot index the content.
To solve this problem, Google proposed the "hashbang" (#!) structure:
http://example.com#!1
When Google finds a URL like this, it automatically crawls another URL:
http://example.com/?_escaped_fragment_=1
As long as you put the Ajax content at this URL, Google will index it. The problem is that hashbang URLs are ugly and cumbersome. Twitter once used this structure, turning
http://twitter.com/ruanyf
into
http://twitter.com/#!/ruanyf
Users complained so much that Twitter abolished it after only half a year.
So, is there a way to keep a more intuitive URL while still letting search engines crawl the Ajax content?
I always thought it couldn't be done, until two days ago I saw the solution of Robin Ward, one of the founders of Discourse, and was simply astounded.
Discourse is a forum program that relies heavily on Ajax, yet it must let Google index its content. Its solution: abandon the hash structure and adopt the History API.
The so-called History API means changing the URL displayed in the browser's address bar without refreshing the page (strictly speaking, it changes the current state of the page). Here's an example: you click a button to start playing music, then click a link, and watch what happens.
The URL in the address bar has changed, but the music keeps playing without interruption!
A detailed description of the History API is beyond the scope of this article. Simply put, its function is to add a record to the browser's history object:
window.history.pushState(stateObject, title, url);
This command makes a new URL appear in the address bar. The pushState method of the history object accepts three parameters: the new URL is the third parameter, and the first two can be null:
window.history.pushState(null, null, newURL);
Currently, major browsers support this approach: Chrome (26.0+), Firefox (20.0+), IE (10.0+), Safari (5.1+), Opera (12.1+).
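As an illustration, here is a minimal sketch of such a demo (the file name music.mp3 and the element ids are made up for this example):

<audio id="player" src="music.mp3" controls></audio>
<a id="next" href="/track/2">Next track</a>
<script>
// Clicking the link rewrites the address bar without reloading the page,
// so the audio element above keeps playing.
document.getElementById('next').addEventListener('click', function (e) {
  e.preventDefault();                              // cancel the normal navigation
  window.history.pushState(null, null, this.href); // only the URL changes
});
</script>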
Here's how Robin Ward does it.
First, replace the hash structure with the History API, so that each hash fragment becomes a normal-path URL and the search engine will crawl every page:
example.com/1
example.com/2
example.com/3
Then, define a JavaScript function that handles the Ajax part, fetching content based on the URL (jQuery is assumed here):
function anchorClick(link) {
  // Take the last segment of the path, e.g. "1" from "example.com/1"
  var pageId = link.split('/').pop();
  // Fetch the matching content from the server and insert it into the page
  $.get('api/' + pageId, function (data) {
    $('#content').html(data);
  });
}
Then, define the mouse click event:
$('#container').on('click', 'a', function (e) {
  // Put the new URL in the address bar without reloading the page
  window.history.pushState(null, null, $(this).attr('href'));
  // Load the corresponding content via Ajax
  anchorClick($(this).attr('href'));
  e.preventDefault();
});
You must also handle the user clicking the browser's forward/back buttons, which triggers the popstate event of the history object:
window.addEventListener('popstate', function (e) {
  // Reload the Ajax content for the URL now shown in the address bar
  anchorClick(location.pathname);
});
With the three pieces of code above, you get normal-path URLs and Ajax content, all without refreshing the page.
Finally, set up the server side.
Because the hash structure is no longer used, each URL is a different request. Therefore, the server must return a page with the following structure for all of these requests, to prevent 404 errors:
<html>
  <body>
    <section id='container'></section>
    <noscript>
      ... ...
    </noscript>
  </body>
</html>
Look closely at the code above and you will see a noscript tag; that is the secret.
We put all the content we want search engines to index inside the noscript tag. This way, users can still perform Ajax operations without refreshing the page, while search engines will index the main content of every page!
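To make the idea concrete, here is a rough server-side sketch. This is not Discourse's actual code; Node.js with Express, and the renderStaticContent helper, are assumptions for illustration:

var express = require('express');
var app = express();

// Hypothetical helper: returns a page's main content as plain HTML,
// so it can be served both to the Ajax endpoint and inside noscript.
function renderStaticContent(id) {
  return '<p>Content of page ' + id + '</p>';
}

// The Ajax endpoint that anchorClick() requests.
app.get('/api/:id', function (req, res) {
  res.send(renderStaticContent(req.params.id));
});

// Every content path returns the same page skeleton, so no URL returns 404.
app.get('/:id', function (req, res) {
  res.send(
    '<html><body>' +
      "<section id='container'></section>" +
      '<noscript>' + renderStaticContent(req.params.id) + '</noscript>' +
    '</body></html>'
  );
});

app.listen(3000);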
How do I get the Baidu search engine to crawl my site's content?
If your site is new, Baidu will index it relatively slowly. You can also promote the site elsewhere and place links on other websites that point directly to yours; in other words, build backlinks.
And then it's a matter of waiting...
Google usually indexes sites fairly quickly, and once Google has indexed you, Baidu should follow soon after.
How to keep search engines from crawling your site's information (excerpt)
The first step is to create a robots.txt file in your website's root directory. What is robots? Search engines use spider programs to automatically access web pages on the Internet and obtain their information. When a spider visits a website, it first checks whether the root of the site contains a plain-text file called robots.txt; this file specifies the spider's crawl range on your site. You can create a robots.txt for your website and state in it which parts of the site should not be indexed by search engines, or that only specified search engines may index specified sections.

You need a robots.txt file only if your site contains content that you do not want indexed by search engines. If you want search engines to index everything on your site, do not create a robots.txt file. Note that after you set up robots.txt, your site may still appear in search results, but the content of your pages will not be crawled, indexed, or displayed; Baidu's results will show only descriptions of your pages taken from other websites.

You can also prevent search engines from displaying a snapshot of a page in their results while still indexing the page. To keep all search engines from showing snapshots of your site, place this meta tag in the page's <head> section:

<meta name="robots" content="noarchive">

To allow other search engines to display snapshots but prevent only Baidu from doing so, use the following tag:

<meta name="Baiduspider" content="noarchive">

Format of the robots.txt file: the file contains one or more records separated by blank lines (terminated by CR, CR/LF, or LF). Each record has the form "<field>:<optional space><value>". You can use # for comments in this file, following the same convention as in UNIX. Records typically start with one or more User-agent lines, followed by several Disallow and Allow lines, as follows:

User-agent: the value of this field is the name of a search engine robot. If the "robots.txt" file contains multiple User-agent records, then multiple robots are bound by its rules; the file must contain at least one User-agent record. If the value is set to *, the record applies to every robot, and there can be only one "User-agent: *" record in the file. If the file contains "User-agent: SomeBot" along with several Disallow and Allow lines, then the robot named "SomeBot" is bound only by the Disallow and Allow lines that follow that "User-agent: SomeBot" line.

Disallow: the value of this field describes a set of URLs that you do not want visited. It can be a full path or a non-empty prefix of a path; any URL that begins with the value of the Disallow field will not be visited by the robot. For example, "Disallow: /help" prohibits the robot from accessing /help.html, /helpabc.html, and /help/index.html, and "Disallow: /help ...
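For illustration, a small robots.txt following the format described above might look like this (the paths are hypothetical):

# Keep all robots out of the admin and tmp directories
User-agent: *
Disallow: /admin/
Disallow: /tmp/

# Block Baiduspider from the entire site
User-agent: Baiduspider
Disallow: /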