Our company built its web application with the AngularJS framework (hereinafter NG1) and had never dealt with SEO, until one day the operations side handed over a task: do search engine optimization for the site, with R&D support. A quick search showed that SEO for a single-page application is laborious, and little practical experience has been shared domestically, so I was somewhat at a loss at first; it took a fair amount of effort before it was finally done. I am recording it here as a summary, hoping it saves friends doing similar work a few detours. My advice: if a site will need SEO, try not to pick a single-page framework like Angular or React during technology selection. If, like me, you only discovered the SEO requirement after the site was finished, read on. And if you have a better solution, criticism and exchange are welcome.
Where does the difficulty of SEO for single-page applications lie?
To do SEO you must understand the basics of how crawlers work. A search engine can return a web page only because that page has been indexed, and before that a crawler must have fetched the page and stored it as a snapshot; the snapshot holds the page's static content. Roughly speaking, whatever you see when you right-click and view the page source is what the crawler can fetch. The crawler takes a URL, fetches that page, finds the <a> tags in it, extracts the next URLs to jump to, and carries on crawling. SEO work is about getting more of the site indexed and improving page rankings; traditional SEO measures such as optimizing a page's TDK (title, description, keywords), optimizing site URLs, and building inbound links all serve those goals. They share one precondition: the page content must be fetchable by the search engine, and that is exactly where SEO for a single-page application gets stuck.
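To make this concrete, here is a purely illustrative Node.js sketch of the loop described above: fetch the raw HTML of a URL, then pull the href targets out of its <a> tags. Nothing in this loop executes the page's JS, which is exactly the limitation the rest of this article works around (the start URL is just a placeholder).

```js
// Illustrative mini-crawler: real crawlers are far more sophisticated,
// but the key point holds -- they read the raw HTML response and follow
// <a href="..."> links; they do not run the page's JavaScript.
const http = require('http');

function fetchPage(url) {
  return new Promise((resolve, reject) => {
    http.get(url, (res) => {
      let html = '';
      res.on('data', (chunk) => (html += chunk));
      res.on('end', () => resolve(html));
    }).on('error', reject);
  });
}

async function crawl(startUrl) {
  const html = await fetchPage(startUrl);              // static HTML only
  const links = [...html.matchAll(/<a[^>]+href="([^"]+)"/g)]
    .map((m) => m[1]);                                 // next URLs to visit
  console.log('indexable content length:', html.length);
  console.log('outgoing links:', links);
}

crawl('http://www.example.com/');
```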
If your app is built with a single-page framework such as AngularJS, right-click and view the page source and you will find the site has no dynamic data in it. Unfortunately, what the search engine fetches is exactly the same. There are two main reasons for this:
Routing, templates, and AJAX requests
Angular implements a single page through its routing mechanism combined with a template engine. With custom templates, an app has only one main page; it switches between states via routes and nests the template that corresponds to each state. The dynamic data in a template is fetched from the backend via AJAX requests. So the whole path from hitting a route to rendering a complete page, apart from the basic static markup on the main page, is carried out entirely by JS.
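A rough sketch of that pattern, assuming AngularJS 1.x with angular-ui-router loaded and an index.html whose body contains little more than <div ui-view></div> (the state name, template path, and /api/about endpoint are made up for illustration):

```js
// One state of a typical NG1 single-page app: the route decides which
// template to nest into the main page, and the controller fetches the
// dynamic data over AJAX -- none of which a non-JS crawler ever sees.
angular.module('app', ['ui.router'])
  .config(function ($stateProvider) {
    $stateProvider.state('about', {
      url: '/about',
      templateUrl: 'views/about.html',   // template pulled in by JS
      controller: 'AboutCtrl'
    });
  })
  .controller('AboutCtrl', function ($scope, $http) {
    // The dynamic data exists only after this request resolves.
    $http.get('/api/about').then(function (res) {
      $scope.data = res.data;
    });
  });
```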
Crawlers do not execute JS
The first point is clear enough once stated. Unfortunately, crawlers do not execute JS scripts either. This is not hard to understand: a search engine has an enormous number of pages to crawl every day, and executing JS would drastically reduce efficiency; on top of that, executing arbitrary JS scripts is also a serious security risk for the search engine.
So the search engine takes a URL, issues a GET, and that is it: all it obtains is the few lines of static markup on the main page. The routes, the main page templates, the AJAX requests the front end sends to the backend, and everything else the Angular framework takes care of, the search engine will simply not process.
URL optimization
Crawlability can wait a moment; let us first talk about URL optimization. Anyone who has used AngularJS knows that NG marks a state in the URL with a #. URLs containing symbols like # are very unfriendly to SEO, and according to a colleague (I have not verified this), some search engines will not carry the part after the # when they visit the URL. In short, URL optimization is an unavoidable piece of SEO work for a single-page application, and our goal is to turn the URLs into a directory-like structure such as www.xxx.com/111/222/333, the form crawlers like best.
How to remove the # from NG URLs is well covered on both Google and Baidu, for example: http://blog.fens.me/angularjs-url/
Simply put, removing the # only requires configuring $locationProvider.html5Mode(true) in the route configuration. With HTML5 mode turned on, the URL automatically loses the # (and the .html suffix), which completes the optimization. But a new problem appears: pressing F5 gives a 404, because F5 submits the URL to the backend to fetch a resource, and the HTML5-mode optimized URL corresponds to no such resource on the backend; hitting the link directly cannot even find the main page, so naturally it returns 404. The article linked above gives a solution for a Node.js backend; our backend is SpringMVC, but the principle is the same: since the backend does not recognize the link, we redirect the "unknown" link back to the original #-style link, which the backend can serve normally, and on the browser side HTML5 mode strips the # off again.
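The route-side change is tiny; a minimal sketch for AngularJS 1.x (note that in NG 1.3+ HTML5 mode also expects a <base href="/"> tag in index.html unless you explicitly turn that requirement off):

```js
// Switch from #-style URLs to clean HTML5 History API URLs.
angular.module('app').config(function ($locationProvider) {
  $locationProvider.html5Mode(true);
  // Equivalent, if you cannot add <base href="/"> to index.html:
  // $locationProvider.html5Mode({ enabled: true, requireBase: false });
});
```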
The redirect can be handled in a backend SpringMVC filter, or in the container. In our architecture the backends sit behind an nginx load balancer, so I put the redirects in nginx.conf, mapping the URL of every route state back to its original #-style URL, and the problem was solved. Whether you refresh or navigate directly, every page now shows a clean, comfortable directory-style URL.
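Our redirects live in nginx.conf, but the same principle is perhaps clearer as a small Node/Express sketch (the port is made up, and a real deployment would enumerate the site's route states the way we did in nginx rather than using a blanket catch-all):

```js
// Any "pretty" URL the backend does not recognize is sent back to its
// original #-style form; the browser then loads the main page at "/",
// and HTML5 mode strips the # again on the client side.
const express = require('express');
const app = express();

app.use(express.static('public'));        // index.html and real assets

app.get('*', (req, res) => {
  res.redirect('/#' + req.originalUrl);   // e.g. /about -> /#/about
});

app.listen(8080);
```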
Two kinds of solutions for making the content crawlable
With the URLs optimized, let us continue. What remains is making the single-page application crawlable, that is, enabling the search engine to obtain the complete content of each page. I researched the existing solutions, and the ideas are all similar: we cannot change the fact that the search engine will not execute JS, so we do the work for it; we execute the JS ourselves, render the templates and dynamic data into a fully static page, and hand that to the crawler. I looked into two projects on GitHub and share them below; if you have a better solution, you are welcome to share it.
Solution 1: johnhuang-cn/AngularSEO
https://github.com/johnhuang-cn/AngularSEO is a Java server-side solution. It consists of two parts: a server-side filter and a local crawler. The filter has two jobs: capture the site's URLs, and recognize search-engine requests and redirect them to the local snapshots. The local crawler renders pages into local snapshots. The workflow is roughly as follows:
The filter is configured in web.xml. When the site is first visited, the filter captures the URLs and hands them to the local crawler. This crawler can capture dynamic data; it is built mainly on the Selenium + PhantomJS stack (you can Google both yourself), where PhantomJS is a headless WebKit. It can crawl dynamic data because it can obtain DOM elements, fire events, and run the related JS. After obtaining the full page content, it stores a URL of the form http://abc.com/#/about as a local snapshot named like http://abc.com/_23/about. In other words, we have to wait for the local crawler to render every URL into a local snapshot; from then on, when a search engine visits, the filter redirects it to the corresponding snapshot page. How does the filter recognize a crawler? Through the User-Agent in the HTTP header: every search engine has its own User-Agent, and they can be configured inside the filter.
Pros: the solution has a couple of advantages. 1. Deployment is relatively simple; for a Java application the configuration is convenient and straightforward. 2. Search-engine access is fast: since the snapshots are already saved, a crawl directly returns a static page.
Cons: it also has several drawbacks. 1. The local crawler is slow; for modules with a huge amount of dynamic data, such as our news module, saving snapshots is time-consuming work. 2. Freshness: the framework updates snapshots at a configured local crawl frequency, which means how up to date the crawled pages are is limited by that frequency. 3. Stability: I do not know whether these problems still exist, but perhaps because the framework is not very mature, in my trial the local crawler did not start reliably, and PhantomJS processes sometimes failed to exit, so a large number of PhantomJS processes piled up in the background and exhausted memory. 4. Distributed deployment: we use nginx to load-balance a backend cluster, so search-engine requests are routed to different backends according to its rules; using this framework would mean deploying it on every backend, which brings a series of inconveniences and problems.
Because of these drawbacks, I finally abandoned this solution.
Solution 2: Prerender.io
This is the most mature solution I found during my research, and it meets my needs rather well. A schematic diagram can be found here for reference: http://www.cnblogs.com/whitewolf/p/3464555.html
Prerender.io is likewise split into two parts: a client middleware and a Prerender server. The client's job is to recognize search-engine requests and redirect them (similar to Solution 1); besides the User-Agent it also recognizes the _escaped_fragment_ parameter, which comes from Google's AJAX crawling protocol (see "Google's AJAX crawling protocol" for details). If your site mainly targets domestic search engines you can ignore that; the User-Agent alone is enough. The Prerender server receives the request forwarded by the client, issues the same request again to the web backend, and, just like Solution 1, integrates PhantomJS to execute the JS and obtain the dynamic data. Once it has the complete page, the Prerender server returns the full page to the search engine.
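For illustration, this is roughly what the client side looks like in its Node/Express flavour (the prerender-node middleware); our own client logic sits in nginx/Java instead, but the role is identical: spot crawler requests (via User-Agent or _escaped_fragment_) and proxy them to the Prerender service rather than returning the empty SPA shell. The service URL below assumes a self-hosted Prerender instance on its default port.

```js
// Sketch of the Prerender client as Express middleware (prerender-node).
const express = require('express');
const prerender = require('prerender-node')
  .set('prerenderServiceUrl', 'http://localhost:3000/'); // self-hosted service

const app = express();
app.use(prerender);                  // must run before the static handler
app.use(express.static('public'));   // normal users still get the SPA
app.listen(8080);
```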
The Prerender client now has implementations for many stacks, covering most technology choices. For the server there are two options: 1. use the official Prerender.io cloud service; 2. host the Prerender backend service yourself. The official site recommends the cloud service, but reaching it from here requires getting over the wall, and relying on someone else's service always leaves you worrying about stability, so I chose to deploy the Prerender service myself. It is really just a standalone Node.js process, and it has run stably once kept alive with the forever tool.
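Per the prerender package's own documentation, self-hosting it is roughly a few lines of Node (you can equally clone the repository and run node server.js); treat the options here as an assumption to check against the version you actually install:

```js
// Rough sketch of a self-hosted Prerender service (prerender npm package).
// Port 3000 matches what the client sketch above points at.
const prerender = require('prerender');
const server = prerender({ port: 3000 });
server.start();
// In production we keep it alive with the forever tool, e.g.:
//   forever start server.js
```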
Pros: 1. Freshness: as the analysis above shows, the Prerender service renders the page in real time when a search-engine request arrives, which means that once deployed there is little follow-up work unless the site changes substantially, and every response the search engine gets contains the same up-to-date data a user would see. 2. Distributed deployment: the Prerender service is a process completely independent of the web application, so it is unaffected by how large the backend cluster is or how it is deployed. 3. Stability: the framework is already fairly mature; the cache mechanism and black/white lists work very well, and running the Prerender service under the forever daemon I have basically hit no stability problems.
Cons: 1. Crawl efficiency: pages are rendered in real time, so the search engine's crawl is naturally slower, but that is not something we care about.
Comparing the two, I naturally chose the Prerender solution. The deployment also ran into a series of problems in practice; if needed I will write a separate article on deploying Prerender in practice.
Summary
Overall, the biggest obstacle to crawling a single-page application is that search engines do not execute JS; the solution amounts to nothing more than rendering the dynamic data ourselves and then feeding the result to the crawler. Once that is understood, it would not even be hard to build such a framework yourself.