How to enable search engines to crawl AJAX content
More and more websites are adopting the single-page application (SPA) model.
The entire site consists of a single web page, and Ajax is used to load different content in response to user input.
The advantage of this approach is a better user experience and less bandwidth; the disadvantage is that the AJAX content cannot be crawled by search engines. For example, suppose you have a website:
http://example.com
Users can view different content through hash-based URLs:
http://example.com#1 http://example.com#2 http://example.com#3
However, search engines only crawl example.com; they ignore the hash fragment, so the content behind it never gets indexed.
To solve this problem, Google proposed the "hash + exclamation mark" (hashbang) structure:
http://example.com#!1
When Google encounters a URL like this, it automatically crawls another URL instead:
http://example.com/?_escaped_fragment_=1
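To make the mapping concrete, here is a small illustrative helper (written for this example only, not part of any official library) that performs the same rewrite the crawler applied:
// Illustrative only: 'http://example.com/#!1' -> 'http://example.com/?_escaped_fragment_=1'
function toEscapedFragmentUrl(url) {
  var parts = url.split('#!');
  if (parts.length < 2) return url; // no hashbang, nothing to rewrite
  return parts[0] + '?_escaped_fragment_=' + encodeURIComponent(parts[1]);
}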
Google then indexes your AJAX content under that _escaped_fragment_ URL. The problem is that the "hash + exclamation mark" structure is ugly and cumbersome. Twitter once adopted it, changing
http://twitter.com/ruanyf
to
http://twitter.com/#!/ruanyf
Users complained loudly, and Twitter abandoned the scheme after only half a year.
Is there any way to keep more intuitive URLs while still letting search engines crawl the AJAX content?
I always thought there was no way, until two days ago I saw the solution from Robin Ward, one of the founders of Discourse.
Discourse is a forum program that relies heavily on Ajax, yet it must be indexable by Google. Its solution is to abandon the hash structure and use the History API instead.
The so-called History API changes the URL shown in the browser's address bar without refreshing the page (more precisely, it changes the page's current state). Here is an example: you click a button to start playing music, then click a link on the page, and watch what happens.
The URL in the address bar has changed, but the music is not interrupted!
A detailed explanation of the History API is beyond the scope of this article. Simply put, its job here is to add a record to the browser's history object:
window.history.pushState(state object, title, url);
This command makes a new URL appear in the address bar. The pushState method of the history object accepts three parameters; the new URL is the third, and the first two can be null:
window.history.pushState(null, null, newURL);
Currently, all major browsers support this method: Chrome (26.0 +), Firefox (20.0 +), IE (10.0 +), Safari (5.1 +), and Opera (12.1 + ).
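As a hedged aside (not part of the original article), you can feature-detect the History API and keep older browsers on hash-based URLs:
if (window.history && typeof window.history.pushState === 'function') {
  // History API available: switch to a clean URL
  window.history.pushState(null, null, '/1');
} else {
  // Fallback assumption: older browsers stay on the hash-based URL
  location.hash = '#1';
}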
Here is Robin Ward's method.
First, replace the hash structure with the History API, so that each hash URL becomes a normal URL and the search engine can crawl every page:
example.com/1 example.com/2 example.com/3
Then, define a JavaScript function to handle the Ajax part and fetch content based on the URL (jQuery is assumed here):
function anchorClick(link) {
  // Take the last URL segment, e.g. "example.com/1" -> "1"
  var linkSplit = link.split('/').pop();
  // Fetch the corresponding content from the server and inject it into the page
  $.get('api/' + linkSplit, function(data) {
    $('#content').html(data);
  });
}
Then, define the mouse-click event handler:
$('#container').on('click', 'a', function(e) {
  // Show the link's URL in the address bar without reloading the page
  window.history.pushState(null, null, $(this).attr('href'));
  // Load the matching content via Ajax
  anchorClick($(this).attr('href'));
  // Stop the browser from following the link normally
  e.preventDefault();
});
You must also handle the browser's "back" and "forward" buttons, which trigger the popstate event of the history object:
window.addEventListener('popstate', function(e) {
  // Re-render the content that matches the URL now shown in the address bar
  anchorClick(location.pathname);
});
With these three pieces of code in place, normal URLs and AJAX content can be shown without ever refreshing the page.
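One detail the snippets above leave implicit is the very first page load. As a small addition (not part of the original article), you can render the content for the URL the user arrived at once the document is ready:
$(function() {
  // On initial load, fetch the content that matches the current URL
  anchorClick(location.pathname);
});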
Finally, set up the server side.
Because the hash structure is no longer used, each URL is a distinct HTTP request. The server must therefore return a page with the following structure for all of these requests, so that they do not end in 404 errors.
<body>
  <section id='container'></section>
  <noscript>
    ... ...
  </noscript>
</body>
Look carefully at the code above and you will notice a noscript tag; that is the secret.
We put all the content that should be indexed by search engines inside the noscript tag. Users can still browse with AJAX and never refresh the page, but search engines will still index the main content of every page!
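The article does not specify a server stack. As one hedged sketch, a Node.js/Express server (Express, the api/ prefix, and the placeholder content are assumptions) could serve content fragments for the Ajax calls and the same shell, with the crawlable content inside noscript, for every page URL:
// Illustrative sketch only; the original article names no server stack.
var express = require('express');
var app = express();

// Content fragments requested by anchorClick(), e.g. GET /api/1
app.get('/api/:id', function(req, res) {
  res.send('<p>Content of page ' + req.params.id + '</p>'); // placeholder content
});

// Every page URL returns the same shell, with the indexable content inside <noscript>
app.get('/:id', function(req, res) {
  res.send(
    '<body>' +
    "<section id='container'></section>" +
    '<noscript><p>Content of page ' + req.params.id + '</p></noscript>' +
    '</body>'
  );
});

app.listen(3000);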
How can I get the Baidu search engine to crawl my website content? If it is a new site, Baidu indexing is slow. You can also promote the site on other websites by creating anchor-text links that point directly to your site, in other words, build backlinks.
Then wait ......
Google usually indexes a site faster; Baidu will probably follow soon after Google has indexed it.
How to prevent search engines from crawling your website information (excerpt)
First, create a robots.txt file in the root directory of your website. What is robots? Search engines use spider programs to automatically visit pages on the Internet and collect their information. When a spider visits a website, it first checks whether the site's root contains a plain-text file called robots.txt, which specifies which parts of the site the spider may crawl. You can use a robots.txt file to declare the parts of your site you do not want indexed, or to allow a search engine to index only specific parts.

A robots.txt file is needed only when your site contains content you do not want search engines to index. If you want everything on the site to be indexed, do not create a robots.txt file at all. After you create one, your site can still be found in search, but the content of your pages will no longer be crawled, indexed, or displayed; Baidu's results will only show descriptions of your pages taken from other websites.

To prevent search engines from showing cached snapshots of your pages in their results, place a meta tag in the <HEAD> section of the page. To block all search engines from showing snapshots of your site, use: <meta name="robots" content="noarchive">. To allow other search engines to show snapshots but block only Baidu, use: <meta name="Baiduspider" content="noarchive">.

robots.txt file format: a robots.txt file contains one or more records, separated by blank lines (terminated by CR, CR/LF, or LF). Each record has the form "<field>:<optionalspace><value><optionalspace>". The # character can be used for comments, with the same conventions as in UNIX. A record usually starts with one or more User-agent lines, followed by several Disallow and Allow lines. In detail:

User-agent: the value of this field names a search-engine robot. If robots.txt contains multiple User-agent records, the file constrains multiple robots; at least one User-agent record is required. If the value is *, the record applies to every robot, and only one "User-agent: *" record may appear in the file. If you add "User-agent: SomeBot" together with several Disallow and Allow lines, then the robot named SomeBot is constrained only by the Disallow and Allow lines that follow "User-agent: SomeBot".

Disallow: the value of this field describes a set of URLs that should not be visited. The value can be a complete path or a non-empty path prefix; any URL beginning with the Disallow value will not be visited by the robot. For example, "Disallow: /help" blocks robot access to /help.html, /helpabc.html, and /help/index.html, while "Disallow: /help …
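As one hedged illustration of the format described above (the paths are made up for the example):
# Example robots.txt (hypothetical paths; adjust to your site)
User-agent: *
Disallow: /admin/
Allow: /admin/public/

User-agent: Baiduspider
Disallow: /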