Deploy your own search engine to achieve keyword optimization

Last Update:2014-12-22 Source: Internet

Author: User

Keywords Search


I often see fellow webmasters discussing keyword optimization, and I also see them struggling to "pile up" keywords. Here I'd like to share my experience of using your own search engine to do keyword optimization.

Note: this approach suits sites with some technical capability. The server needs a Java runtime environment, such as a JDK and Tomcat. If you don't have these, you can achieve the same thing by hand, but it takes much more time.

This approach was inspired by the JavaEye forum. When I search for technical material on the Internet, this forum shows up extremely often. Click through, and most of the landing pages are lists of articles. Presenting a list of articles to match a keyword naturally gives a high chance of a match. And because the page consists of real articles rather than keywords stuffed in by force, a search engine that sees so much genuine matching content will naturally favor it. From the user's point of view, you are showing the most relevant items from your own database, which is also the best user experience: far better than making users browse categories page by page.

The reasoning is simple, so I'll focus on the implementation. Mine uses the open source search server Solr and the open source Chinese word segmentation library Paoding (a contribution from Chinese developers).

Solr is built on the open source indexing library Lucene. Both come from the Apache open source organization. Their official sites are:

Lucene: http://lucene.apache.org/

Solr: http://lucene.apache.org/solr/ (Solr is a subproject of Lucene; the name is pronounced like "solar")

Incidentally, Nutch is another subproject of Lucene. It is a search engine crawler that can crawl an intranet or the Internet, which is very cool! If you want to do advanced content collection it is worth learning, provided you have Java programming skills.

On to the main point. First, what does Lucene do? Lucene is an indexing system that indexes content by keyword. It is a bit like a database, except that a database not only indexes but also maintains much else, such as relationships between data. Databases, however, are relatively weak at full-text search. Only the more advanced database systems support it well, and MySQL, which most of us use, has no practical equivalent for Chinese content. Lucene is different. It is not meant for storing and maintaining data; it was born to index, so it splits text into terms and indexes them. For example, if your text contains "Hello World", Lucene parses out the two words "hello" and "world" and adds the document to the index entries for both keywords. This differs from a SQL LIKE query. The database does not analyze a field's content; it only indexes the field as a whole. When you search inside the field with LIKE, it is really doing string matching against every row, which is inefficient and cannot withstand heavy load. That is why very few high-traffic sites use LIKE for search.
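To make the contrast concrete, here is a minimal Python sketch of the idea. It is not Lucene's actual implementation: the function names, the regular-expression tokenizer and the three sample documents are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens (a stand-in for an analyzer)."""
    return re.findall(r"\w+", text.lower())


def build_inverted_index(docs):
    """Map each token to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def like_search(docs, needle):
    """Roughly what LIKE '%needle%' does: scan the text of every document."""
    return {doc_id for doc_id, text in docs.items() if needle.lower() in text.lower()}


# Hypothetical sample data
docs = {1: "Hello World", 2: "Hello Solr", 3: "Goodbye world"}
index = build_inverted_index(docs)

print(sorted(index["hello"]))  # [1, 2]
print(sorted(index["world"]))  # [1, 3]
```

The index lookup touches only the entry for the keyword, while `like_search` has to read every document on every query, which is where the load problem comes from.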

With Lucene in place, what does Solr do? Lucene provides only a programming interface, while Solr works out of the box. Follow the official instructions and you should soon have a Solr search service up and running. The only thing you must configure in Solr is schema.xml, where you declare the fields your CMS uses for its articles, such as author, category, title and content, along with their types. The field type affects how a field is searched. Once Solr is set up, how do you use it? It offers just two interfaces (simple, isn't it?): update and select, for updating and querying respectively (deletion counts as an update). An update is performed by POSTing an XML document to the designated URL. The key formats are roughly as follows:

1. Add/update (yes, adding and updating share one format: if a record with the specified ID already exists it is updated, otherwise it is added).

<?xml version="1.0" encoding="UTF-8"?>
<add><doc>
<field name="id">[your article id]</field>
<field name="title">[article title]</field>
<field name="content">[article content]</field>
</doc></add>
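As a rough sketch of how a CMS might produce and send this message, the Python below builds the add document with the standard library so that special characters are escaped automatically. The function names, the sample field values and the update URL in the comment are assumptions, not part of the original setup.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_xml(fields):
    """Return an <add><doc>...</doc></add> message as UTF-8 bytes."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value  # ElementTree escapes &, < and > when serializing
    return ET.tostring(add, encoding="utf-8")


def post_update(update_url, body):
    """POST an update message to Solr.

    update_url is an assumption; it depends on your deployment,
    for example http://localhost:8983/solr/update.
    """
    request = urllib.request.Request(
        update_url, data=body, headers={"Content-Type": "text/xml; charset=utf-8"}
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


# Hypothetical article data
body = build_add_xml({"id": "42", "title": "Tom & Jerry", "content": "<p>hello</p>"})
print(body.decode("utf-8"))
```

Depending on how Solr is configured, new documents may only become searchable after a commit, so check the update documentation for your version.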

2. Delete. You can delete by ID or by query; the latter deletes every record that matches the query.

<delete><id>[article id]</id></delete>
<delete><query>[query string]</query></delete>
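The two delete messages can be generated the same way. This is a minimal sketch; the helper names and the sample query are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_delete_by_id(doc_id):
    """Build a message that deletes the single record with this id."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = doc_id
    return ET.tostring(delete, encoding="unicode")


def build_delete_by_query(query):
    """Build a message that deletes every record matching the query."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")


print(build_delete_by_id("42"))                  # <delete><id>42</id></delete>
print(build_delete_by_query("title:obsolete"))   # <delete><query>title:obsolete</query></delete>
```

Delete-by-query removes everything the query matches, so it is worth running the same query through select first to see what would be affected.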

Now for select. Querying is relatively simple. In schema.xml you can usually define a default search field, which can even be a combination of several fields, so that a plain keyword query searches those fields. To search a specific field, use the format [field name]:[query keyword]. For more complex needs it also supports logical combinations; see the Solr documentation if you're interested. Select is a GET interface, so you send the query as a GET request. The main parameter is q, the same parameter name the major search engines use for the query keyword. Note that select returns its results as XML, so you need to write a program to parse the XML document and extract the data. After that, treat the data just like rows read from your database and use it however you like. The query results look like this:

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">Little broken Child</str>
      <str name="rows">10</str>
      <str name="version">2.2</str>
    </lst>
  </lst>
  <result name="response" numFound="[total matches]" start="0">
    <doc>
      <str name="CategoryID">a8ea126f3128443fbb2d17e0d5e3c55f</str>
      <str name="CategoryName">Little broken Child</str>
      <str name="content">&amp;lt;p&amp;gt; children in order to find small ya and over King Yang Hillock, before the post also drank more than three bowls of wine. As the saying goes, three bowls but hillock, store strongly advised small broken child don't cross hillock, little broken child no way, gave a little money store, store just don't say what, and sent a little broken child a stick good dozen tigers. Can a broken child cross the hill? Please see the Little Broken Child series animation short film "Jing Yang Gang". &amp;lt;/p&amp;gt;</str>
      <date name="created">2009-08-04T17:18:44Z</date>
      <str name="description">Small broken Child to find small ya and over King Yang Hillock, before the post also drank more than three bowls of wine. As the saying goes, three bowls but hillock, store strongly advised small broken child don't cross hillock, little broken child no way, gave a little money store, store just don't say what, and sent a little broken child a stick good dozen tigers. Can a broken child cross the hill? Please see the Little Broken Child series animation short film "Jing Yang Gang".</str>
      <str name="id">5ed7054bf108454db2b0216fbc006934</str>
      <str name="keywords">Jing Yang Gang three bowls but gang small broken child dozen Tigers</str>
      <date name="flushes">2009-08-27T20:46:09Z</date>
      <int name="status">1</int>
      <date name="timestamp">2009-08-27T15:59:48.821Z</date>
      <str name="title">Three bowls but hillock: small broken child King Yang Gang dozen tiger Remember</str>
    </doc>
    <doc>
      <str name="CategoryID">a8ea126f3128443fbb2d17e0d5e3c55f</str>
      <str name="CategoryName">Little broken Child</str>
      <str name="content">&amp;lt;p&amp;gt; small broken child after shooting nine Suns, was retaliation, ya ya was Crow Diao to the moon above, all day crying. Little broken Child is very anxious, this how to do? Now, you help the child to save the Ya-ya, the operation of small broken child on the moon, see you! &amp;lt;/p&amp;gt;</str>
      <date name="created">2009-08-04T17:18:44Z</date>
      <str name="description">Small broken child after shooting nine Suns, was revenge, Ya Ya was Crow Diao to the moon above, all day crying. Little broken Child is very anxious, this how to do? Now, you help the child to save the Ya-ya, the operation of small broken child on the moon, see you!</str>
      <str name="id">4c0cfeb8990c455da88aeaabd864bca8</str>
      <str name="keywords">Little child Ben Moon games</str>
      <date name="flushes">2009-08-27T16:48:39Z</date>
      <int name="status">1</int>
      <date name="timestamp">2009-08-27T15:59:43.021Z</date>
      <str name="title">Small broken child Ben Moon games, Chang E i come!</str>
    </doc>
  </result>
</response>
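Here is a minimal Python sketch of the client side of select: building the GET URL and turning a response of this shape into plain dictionaries. The function names, the base URL and the shortened sample response are assumptions for illustration. Every value is kept as a string, and multi-valued fields would need extra handling.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def build_select_url(base_url, keyword, start=0, rows=10):
    """Build the GET URL for a keyword query; base_url depends on your deployment."""
    return base_url + "?" + urlencode({"q": keyword, "start": start, "rows": rows})


def parse_select_response(xml_text):
    """Return (total number of matches, list of documents as name -> text dicts)."""
    root = ET.fromstring(xml_text)
    result = root.find("result")
    total = int(result.get("numFound"))
    docs = [{child.get("name"): child.text for child in doc} for doc in result.iter("doc")]
    return total, docs


# Shortened, hypothetical response in the same shape as the one above
SAMPLE = """<response>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="id">42</str>
      <str name="title">Hello Solr</str>
      <int name="status">1</int>
    </doc>
  </result>
</response>"""

total, docs = parse_select_response(SAMPLE)
print(total, docs[0]["title"])
print(build_select_url("http://localhost:8983/solr/select", "hello world"))
```

From here the dictionaries can be passed to the same templates you already use for rows read from the database.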

At this point there is actually a problem: sometimes what matches the keyword is not a word at all. English words are separated by spaces, but Chinese is more complicated, and some character sequences are ambiguous even to human readers. Lucene was built abroad and has no built-in Chinese word segmentation, so when you search in Chinese, any run of adjacent characters that matches the query string counts as a hit. This lowers the relevance of the results and hurts the user experience. You might think that's fine, even good, since no result is missed. But consider that the major search engines are not fools: if your results pages match poorly, the weight of your keywords will suffer.
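The effect can be illustrated with a small Python sketch. It compares character-by-character matching with a dictionary-based segmenter using forward maximum matching. The three-word dictionary and the matching strategy are assumptions chosen for illustration; this is not how Paoding or Lucene is actually implemented.

```python
def unigram_tokens(text):
    """Treat every character as its own token (no word segmentation)."""
    return list(text)


def forward_max_match(text, dictionary):
    """Greedy segmentation: at each position take the longest dictionary word,
    falling back to a single character when nothing matches."""
    max_len = max(len(word) for word in dictionary)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def phrase_match(doc_tokens, query_tokens):
    """True if the query tokens occur consecutively in the document tokens."""
    n = len(query_tokens)
    return any(doc_tokens[i:i + n] == query_tokens for i in range(len(doc_tokens) - n + 1))


dictionary = {"发展中", "国家", "中国"}  # toy dictionary (hypothetical)
document = "发展中国家"                  # "developing countries"
query = "中国"                          # "China", which the document is not about

# Without segmentation the adjacent characters match, producing a false hit
print(phrase_match(unigram_tokens(document), unigram_tokens(query)))  # True

# With segmentation the document becomes ['发展中', '国家'], so the query no longer hits
print(phrase_match(forward_max_match(document, dictionary),
                   forward_max_match(query, dictionary)))  # False
```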

Enough said; enter the Chinese open source word segmentation library Paoding. SourceForge reportedly once blocked Chinese IP addresses because Chinese users only asked for things and never contributed. Seeing Paoding, I am proud as a Chinese. Who says the Chinese don't contribute? Paoding probably still lags behind commercial Chinese word segmentation software, but it is good enough for keyword optimization. Add the relevant configuration to Solr's schema.xml and set the dictionary path in Paoding's configuration file, and your search engine is ready.

Next comes the front-end optimization. You can put tags for popular keywords on your home page, each pointing to your search results page for that keyword. For the effect, see one of the sites I deployed: http://www.kaoly.com/t-%E9%BB%84%E9%87%91%E7%9F%BF%E5%B7%A5.html. A word of explanation: some free CMS systems also have a tag feature, and even a search feature, but their search cannot compare with Lucene, and their tags are mostly maintained manually or semi-automatically, so their relevance is hard to compare with what a search engine finds directly. Think about it: if your search algorithm is good enough to come close to that of a large search engine, then your search results must be, of all your content, what the large search engine likes best. I trust everyone understands this. Then there is the convenience of creating tags: once you find a good keyword, you can add a tag at any time simply by adding a link. I don't believe the common free CMS systems offer anything this convenient. Even if one can create tags automatically from searches, its tags will not be as relevant as a search engine's, because it was not designed as a search engine; it merely provides some handy extra features.
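Generating such tag links takes only a few lines. The Python sketch below follows the `t-<keyword>.html` pattern visible in the example URL above; the function names and the default base address are hypothetical, and it assumes your web server routes these URLs to the search results page.

```python
from html import escape
from urllib.parse import quote


def tag_url(keyword, base="http://www.example.com"):
    """Build a search-results URL in the t-<percent-encoded keyword>.html style."""
    return f"{base}/t-{quote(keyword, safe='')}.html"


def tag_links(keywords, base="http://www.example.com"):
    """Render one HTML link per keyword for the home page tag list."""
    return "\n".join(
        f'<a href="{tag_url(keyword, base)}">{escape(keyword)}</a>' for keyword in keywords
    )


print(tag_url("黄金矿工", "http://www.kaoly.com"))
# http://www.kaoly.com/t-%E9%BB%84%E9%87%91%E7%9F%BF%E5%B7%A5.html
print(tag_links(["solr search"]))
# <a href="http://www.example.com/t-solr%20search.html">solr search</a>
```

Adding a new tag is then just a matter of appending a keyword to the list.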

That concludes my introduction to using a search engine for keyword optimization. Because this is an overview of the approach, and space and my time are limited, many deployment details are left out. If you're interested, consult the relevant documentation or contact me (QQ: 1017273876). Finally, I wish Webmaster Net and all webmasters' sites ever more popularity and ever more income!

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
