How to build a vertical search engine

Source: Internet
Author: User
Tags web database

This article opens with a few quotations:
1. "Understand what the user means, and return to the user's needs."
2. "Portal websites all think about how to save money rather than how to buy technology."
3. "Search engines are not a field that everyone can enter; the entry threshold is relatively high."
4. "Being good is not enough. The best approach is to do one thing extremely well." (from Google's top ten truths)
5. "Search engines require focus. It is hard for a portal to focus on its fourth-ranked business."
6. "Users cannot describe what they are looking for until they see it."
7. "The so-called wedge is actually an inverted triangle. The sharp tip of the inverted triangle represents search technology; the middle is the technology-based product application platform; and the top is the understanding of the user culture of the whole search engine, together with the brand, which is the most critical and least predictable factor in competition between modern companies." The "wedge" has a further implication: when the wedge is driven into a wall, the sharpness of the tip matters a great deal, but how much damage the wedge does and how much space it can force open in the wall depends on the stability and tightness of the middle and back end.

 

Both the technology and the concepts behind a search engine take time and experience to accumulate, and both need long-term refinement. Never assume it can be done overnight: it usually takes about four years for a search engine to go from launch to a relatively mature, leading position. Do not rush. The reason is that a search engine is extremely complex, and "users cannot describe what they are looking for until they see it." Everything has to be explored, tried, and solved one piece at a time, and user needs have to be uncovered bit by bit.

A search engine is a product that provides a service to users. It has to be improved, upgraded, and adjusted continuously over a long period so that the user experience keeps getting better, because both the network environment and the needs of Internet users are constantly changing. Do not treat search as a project. If you think the job is done once the project has been delivered, it will certainly come to nothing. In the search engine field, what counts is user experience and continual renewal. If a competitor leads in user experience by more than a year and holds that lead for two years, the earlier leader's advantage disappears, because the cost for a user to switch search engines is low and word of mouth is the best way to spread. A search engine that cannot keep innovating technically will die. Leads between search engines are usually described in terms of time: for example, the overall gap between some search engine and Baidu is X years, and the gap between Baidu and Google is X years, and so on. If you can stay a year ahead in user experience for two years, you do not need hype, and everything else follows. Next to user experience, any hype counts for very little.
A vertical search engine is like the proverbial sparrow: small, but with all the vital organs. In culture and concepts, product management, applications, and technology, the wedge theory applies to it just as it does to any search engine. A vertical search engine therefore has to solve all of these problems.

Wedge tip: vertical search technology.
Vertical search technology exists at two levels: the template level and the web database level. At the template level, data is extracted by hand-written or automatically generated templates for specific web pages, and page collection is likewise targeted. This suits small applications with few, stable information sources. Its advantages are fast implementation, low cost, and high flexibility; its disadvantages are high maintenance cost and a small number of sources and a small volume of information. The web database level means meeting the requirements of a web-database search engine in the number of information sources, data capacity, retrieval capacity, stability, and reliability. The biggest difference from the template approach is that it does not depend on any specific web page: it can collect information from any normal web page. This gives it a qualitatively larger data capacity than the template approach, but it is less flexible and costs more. The two approaches are not opposed, of course. For a vertical search engine they complement each other, because technology is only a means of meeting user needs. The technology discussed in this article is mainly vertical search technology at the web database level.

Search engines are an application with high technical requirements, and a few years ago there were few people with the relevant skills. There are more search engineers now and the relevant technologies are more mature than before, but competition is also fiercer. Vertical search requires the following technologies:
1. Information collection
2. Web page information extraction
3. Information processing, including duplicate detection, clustering, comparison, analysis, and corpus analysis
4. Semantic correlation analysis
5. Word segmentation
6. Indexing

Information collection technology. A vertical search engine's spider should be more specialized and more customizable than a web database spider. It should collect pages within the scope of the vertical search and ignore irrelevant or unnecessary pages, select content-relevant pages that are suitable for further processing and crawl them in depth, adjust the update frequency per page, and so on. URLs to collect can be set by hand or discovered by analyzing web pages. Vertical search has special requirements for information updates, and three points need to be weighed: 1. the stability of the information source (the website must not feel pressure from the spider); 2. the cost of crawling; 3. the improvement in user experience. Work out a strategy from these points and tune it until it is right. For the strategy, you can evaluate an update coefficient for each website or page, an importance coefficient, a user click (or exposure) coefficient, a website stability coefficient, and so on, and determine the update frequency from these coefficients. Because new and updated information usually appears on list pages or the home page, classifying pages well solves the update problem at low cost. A page with a low coefficient might be updated once a month, a slightly higher one once a week, a medium one every few days to once a day, and a high one every few hours down to every few minutes. This is similar to the large, weekly, daily, and hourly databases of a general search engine.
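The coefficient-based update policy described above can be sketched roughly as follows. This is only an illustration, not the author's implementation: the field names, the equal weighting of the coefficients, the 0.0 to 1.0 scale, and the interval bounds (five minutes to one month) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Per-page coefficients, each assumed to be normalized to 0.0-1.0."""
    update_rate: float  # how often the page content changes
    importance: float   # how important the website/page is
    exposure: float     # how often users click or see the page
    stability: float    # how well the site tolerates crawler load


# Hypothetical bounds: hottest pages every 5 minutes, coldest once a month.
MIN_INTERVAL_MIN = 5
MAX_INTERVAL_MIN = 30 * 24 * 60


def _clamp(value: float) -> float:
    return min(max(value, 0.0), 1.0)


def recrawl_interval_minutes(stats: PageStats) -> float:
    """Map the coefficients to a recrawl interval in minutes.

    The score is a plain average of update rate, importance, and exposure
    (equal weights are an assumption). The interval shrinks geometrically
    from the maximum to the minimum as the score goes from 0 to 1.
    """
    score = _clamp((stats.update_rate + stats.importance + stats.exposure) / 3)
    interval = MAX_INTERVAL_MIN * (MIN_INTERVAL_MIN / MAX_INTERVAL_MIN) ** score
    # A fragile site (stability near 0) is crawled up to 4x less often,
    # so that the website does not feel pressure from the spider.
    politeness = 1 + 3 * (1 - _clamp(stats.stability))
    return min(interval * politeness, MAX_INTERVAL_MIN)
```

A geometric scale is used here because the intervals in the text span several orders of magnitude, from minutes to a month; a linear scale would put almost every page in the slow tiers.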

Vision-based web page block analysis technology. This simulates the way the IE browser renders a page and parses the web page accordingly. Following the principles of human vision, the parsed page is divided into blocks, which are then processed as needed, for example for targeted collection, summary extraction, extraction of the essential content, and so on.

Structured information extraction technology. This turns the unstructured data in web pages into structured data according to given requirements. There are two methods: the template method, and Web structured information extraction that does not depend on the specific web page. The two methods complement each other and should be combined to meet the need in the simplest and most effective way. The biggest difference between a vertical search engine and a general search engine is that, after structured extraction, the vertical engine processes the structured data in depth in order to provide a professional search service. The level of its Web structured information extraction is therefore an important technical indicator of the quality of a vertical search engine. In fact, Web structured information extraction is already widely used at Baidu and Google, for example in MP3 search and image search. Google local search extracts business information from the web database, and Google uses this technology to support its map search. The same technology also appears in Qihoo, Sogou shopping search, and other shopping search applications.
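As a rough illustration of the template method, a template can be thought of as a set of per-field patterns written for one site's page layout. The sketch below is an assumption-laden example: the field names, the HTML class names, and the regular expressions are invented for illustration and do not come from any real site.

```python
import re

# A hypothetical template for one site's product pages: one regex per field.
# Real templates are written per site and must be updated when the layout changes,
# which is the maintenance cost mentioned in the text.
PRODUCT_TEMPLATE = {
    "name": re.compile(r'<h1 class="title">(.*?)</h1>', re.S),
    "price": re.compile(r'<span class="price">\s*([\d.]+)\s*</span>', re.S),
    "city": re.compile(r'<span class="city">(.*?)</span>', re.S),
}


def extract_record(html: str, template: dict) -> dict:
    """Apply each field pattern to the page; fields that do not match map to None."""
    record = {}
    for field, pattern in template.items():
        match = pattern.search(html)
        record[field] = match.group(1).strip() if match else None
    return record


page = '<h1 class="title"> Blue Kettle </h1><span class="price"> 19.90 </span>'
print(extract_record(page, PRODUCT_TEMPLATE))
# {'name': 'Blue Kettle', 'price': '19.90', 'city': None}
```

The page-independent method has no such per-site patterns, which is why it scales to many sources but is harder to build.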

Simple syntax analysis. Simple syntax analysis is very important in a search engine. It can be used to improve data quality, obtain certain types of information at low cost, improve ranking, find the desired content, and so on.

Information processing technology. Information processing covers a wide range, including deduplication, clustering, analysis, and so on. Many related technologies are available, to be chosen as needed.
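Deduplication is one of the more concrete items in that list. The source does not say which method to use; the sketch below uses word shingles with Jaccard similarity, a common near-duplicate technique, purely as an example. The shingle size, the threshold, and the whitespace tokenization are assumptions (unsegmented Chinese text would need character shingles or a segmenter instead).

```python
def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles of a text, lowercased."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs: list, threshold: float = 0.8) -> list:
    """Keep the first document of each group of near-duplicates.

    This compares every document with every kept document, which is
    quadratic: fine for a sketch, far too slow for a web-scale collection.
    """
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept
```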

Data mining. Discovering the relationships within your information is very important and effective for vertical search, because it lets you offer users more detailed services.

Word segmentation technology. Use search-oriented word segmentation and build a dictionary for your industry. Note that this is search-oriented segmentation, not recognition-oriented, precise segmentation. This work does not need many people to maintain.
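The source does not define "search-oriented" segmentation. One common reading, assumed here, is that the segmenter favors recall: it emits every dictionary word found in the text, including overlapping and nested ones, so that a query for a shorter word still matches a document containing a longer compound. The function name, the tiny dictionary, and the maximum word length below are illustrative assumptions.

```python
def segment_for_search(text: str, dictionary: set, max_len: int = 4) -> list:
    """Emit every dictionary word found in the text, longest first per position.

    Characters that are not covered by any dictionary word are emitted as
    single-character terms so that nothing in the text is left unindexed.
    """
    terms = []
    covered = [False] * len(text)
    for i in range(len(text)):
        for n in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + n]
            if word in dictionary:
                terms.append(word)
                for j in range(i, i + n):
                    covered[j] = True
        # Every word that could cover position i starts at or before i,
        # so coverage of position i is final at this point.
        if not covered[i]:
            terms.append(text[i])
    return terms


industry_dictionary = {"垂直", "搜索", "引擎", "搜索引擎"}
print(segment_for_search("垂直搜索引擎", industry_dictionary))
# ['垂直', '搜索引擎', '搜索', '引擎']
```

A recognition-oriented segmenter would instead pick a single best split, such as 垂直 / 搜索引擎, and a query for 引擎 alone would then miss the document.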

Indexing technology. Indexing technology is critical for vertical search. A web-database-level search engine must support distributed indexing, tiered databases, distributed retrieval, flexible updates, flexible adjustment of weights, flexible indexing, flexible upgrade and expansion, and high reliability, stability, and redundancy. It also needs to support extensions of various kinds, such as offset calculation.
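None of the distributed or tiered requirements above fit in a short example, but the core data structure can be sketched. The following is a single-machine positional inverted index, assuming that "offset calculation" refers to storing term positions so that adjacency can be checked; the class and method names are invented for illustration.

```python
from collections import defaultdict


class PositionalIndex:
    """A minimal in-memory inverted index that records term positions."""

    def __init__(self):
        # term -> {doc_id -> [positions of the term in that document]}
        self.postings = defaultdict(dict)

    def add(self, doc_id: int, terms: list) -> None:
        """Index one document given as a list of already segmented terms."""
        for pos, term in enumerate(terms):
            self.postings[term].setdefault(doc_id, []).append(pos)

    def search(self, term: str) -> list:
        """Return the sorted ids of documents containing the term."""
        return sorted(self.postings.get(term, {}))

    def phrase(self, first: str, second: str) -> list:
        """Return ids of documents where `second` directly follows `first`."""
        a = self.postings.get(first, {})
        b = self.postings.get(second, {})
        hits = []
        for doc_id in a.keys() & b.keys():
            following = set(b[doc_id])
            if any(pos + 1 in following for pos in a[doc_id]):
                hits.append(doc_id)
        return sorted(hits)
```

A production index would add sharding across machines, tiers by update frequency, and per-field weights, which are exactly the flexibility requirements listed in the text.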

Other technologies.

The technology of a vertical search engine should be evaluated on the following points:
1. Comprehensiveness
2. Freshness (how quickly updates are picked up)
3. Accuracy
4. Functionality

Wedge middle and back end: the product application platform and the understanding of search engine culture and concepts
For any product, the product model matters most. Technology is only a means and a tool. Users do not care how your technology is implemented or how advanced it is. If users feel that this is what they need, that it is useful, and that it is the best to use, your product will be fine.
There is a great deal to think through when designing a product model. What do users need? What are the requirements? Can we fully meet those needs? What resources are required, and how do we obtain them? What does the competition look like? How do we differentiate? How far can we go given our own situation? How do we keep a leading position? Can we earn money, and how will we charge? How do we promote the product? How long will it take? How do we make sure the work is completed within the time window? How do we prioritize so that users' most important needs are met first, step by step? How do we set up an effective feedback mechanism so that we can follow changes in user requirements and uncover needs that users cannot express themselves? How do we keep improving? How much investment is needed at each stage? How do we reduce both the overall cost and the up-front cost? How should the investment be phased? What is the ROI, and over what cycle? And so on.

1. Understand what the user means
The hardest part of any application is understanding the user's needs, including needs the user is not even aware of.
Establish a sound, fast mechanism for user feedback and for investigating user demand. Everyone should listen to users' complaints and suggestions. Keep analyzing the data and adjusting.

2. Return to the user's needs
Meeting user needs comes before everything else. No hype is required. Spend most of your resources on providing a good user experience.

3. Do not interfere with users' intentions; cultivate users' habits and skills
There is a story: back when Yahoo was still using Google's search results, several Wall Street analysts compared the two search engines in a blind test with the logos removed. They agreed that Yahoo's results were better; Yahoo was serving Google's results with popular keywords adjusted by hand. Yet as soon as the test was over, the analysts went back to their own computers and, when they needed to look something up, opened Google.

4. Details determine success or failure
More information is not automatically better. In an age of massive information, information that cannot be properly organized amounts to no information. The placement of every word, pixel, and image on every page is worth spending time on. Put what users need most in the most conspicuous position, and put it on more pages.

5. Take one thing to the extreme
It is not enough to serve 80% of the needs of 80% of users. The remaining 20% of the needs of 20% of users are what decide your success or failure.

6. Focus
With so many problems to solve, do you have the energy for anything else? A business that ranks fourth in a company's priorities has no chance. The winner in vertical search will therefore certainly not be an industry portal with good resources, nor a large search company. It will be a search engine company that focuses on one industry, because only focus lets you take one thing to the extreme.

7. Innovation
Failure does not matter, but a search engine company that does not innovate will inevitably die.

8. Master the core technologies
A core business cannot solve its technical problems through outsourcing. Outsourced technology may look attractive, fast, or even cheap to a large company, but it destroys your future, because this is a product, not a project. A product has to be improved and adjusted constantly, users' needs have to be explored, and the Internet itself keeps changing. Outsourced technology cannot possibly be flexible enough to respond to all these changes in time. How will you keep your lead against competitors? (As noted above, if an opponent stays ahead of you for a period of time, your earlier advantages are lost.) And that is before considering the competition itself: if you buy technology from another search engine company, will it really sell you its real technology without holding anything back? And even if it does, will you understand what you have been sold? Technical difficulties must be solved by yourself; otherwise you are doomed to fail. The best solution is to purchase a core technology in order to shorten the R&D cycle and reduce cost and risk, and then carry out your own independent R&D on top of it.
This is the technical threshold for vertical search. It does not look high, but it is in fact very high.
Technical problems can be solved in a roundabout way, using the simplest technology to meet users' most urgent needs. Users do not care about the technical implementation.
The template method and Web structured information extraction can complement each other, and adopting template technology in the early stage is a good, practical choice. Take chinabbs, which is doing well. Its users mainly want to browse good posts, so it strengthened its content, hired capable editors to make recommendations, and improved its interface and ease of use. In this it is ahead of Qihoo. Technically, chinabbs collects forum information with automatically generated templates, which is inferior to Qihoo's technology. But that is not what users care about most at present. Moreover, although Qihoo's technical level is high, its technology is not yet mature, so what it presents to users is not necessarily stronger. If chinabbs goes on to solve its technical difficulties and raise its technical level, it can continue to lead. (Then again, it is easy to recruit good editors, while raising the level and maturity of technology is hard and takes time. Cultivating user habits and popularity, of course, also takes a long time.)

9. Meet users' most urgent needs with the simplest technology

Technology is important, but using it properly is more important: technology serves the user experience. Any technology is acceptable as long as it meets users' needs, and simple does not mean ineffective. Use the simplest technology that satisfies users' most urgent needs. I think Baidu's overall technology is at least a year behind Google's Chinese-language search, with many significant gaps, yet Baidu performs better than Google, and the reason is that it uses simple technologies to meet users' urgent needs.

Here is an example of meeting a need with simple technology. I demonstrated the text extraction feature of our vision-based web page block analysis to a friend, who said: "We have implemented that too." I was surprised: they are not a search company, yet they had done it! I was surprised again when he told me how. It showed me that a simple technique can solve the problem well. It does not solve the problem completely, but it satisfies their own needs. Their solution is to parse the page's HTML and take the whole text segment that contains no HTML code; that is the body. (Amazing! So easy! Note: all their information sources are in this format.)
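The friend's approach can be approximated in a few lines. The source does not give the exact rule, so the sketch below makes an assumption: it collects every run of text that sits between two tags, ignores script and style content, and treats the longest run as the body. It only works for pages whose body really is one uninterrupted text segment, as was the case for their sources.

```python
from html.parser import HTMLParser


class _TextRuns(HTMLParser):
    """Collect runs of text that contain no HTML tags."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.runs = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.runs.append(text)


def extract_body(html: str) -> str:
    """Return the longest tag-free text run, assumed here to be the body."""
    parser = _TextRuns()
    parser.feed(html)
    parser.close()
    return max(parser.runs, key=len, default="")
```

Any inline markup inside the article text, such as a link or bold tag, splits the run and breaks this heuristic, which is the sense in which the simple method does not solve the problem completely.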

10. Given the characteristics of China's local Internet, use strong anti-spam measures to clean the information.

11. A common misunderstanding is that vertical search means collecting an industry's web pages, extracting their text, and offering search over a complete body of information. This is not the case. Done that way, you cannot compete with Web search, because a Web search engine can easily partition its web database by industry or region.

Vertical search should process vertical industry information in depth and integrate it effectively, giving users a professionalism and functionality that Web search cannot achieve. It provides users with in-depth services and a complete experience, not just information retrieval. Vertical search is essentially different from information search.

12. Focus on improving the user experience. Publicity and hype mean nothing; the core of a search engine is the user experience. You only need to make the user experience a little better than everyone else's, and all of their hype and publicity ends up working for you.
