One. Industry background
With the growth of the Internet and the explosive increase in the number of websites, search engines play an increasingly important role in people's online lives. From Google and Yahoo Search abroad to Baidu, Tencent Soso, Sogou, and 360 Search in China, the general search engine market has already been divided up. General search engines also have high barriers to entry.
First: A search engine that covers the entire Internet needs a large number of high-performance servers and pays heavily for bandwidth every month. Ordinary companies cannot afford this level of capital investment.
Second: The industry has high technical barriers. Search engine technology is still far from perfect, and its technical level directly affects the user's search experience. Google, as the number-one search engine, brings together some of the best programmers and developers in the world; for graduate students in information retrieval like us, it is the dream place to work. Baidu also spares no effort in recruiting research and development staff and offers high salaries. Tencent's search is noticeably weaker technically (although it is backed by a large QQ user base), and its results are not ideal. At the same time, industrial search engines still differ considerably from the search engines studied in academia. Their main feature is that they usually adopt techniques that have already matured in research and then fine-tune the various parameters. They hold large volumes of user search records and click data, so they can test the effect of each parameter more objectively. (Sogou has published part of its older search records for outside researchers to use.)
Third: People have established usage habits and preconceptions. The dispute between QQ and UC in earlier years proved this point.
Given these three points, it is fair to say that general search is a field that small businesses cannot touch. So is there nothing we can do in this industry? Marketing has the concept of market segmentation: find a small customer base, optimize specifically for it, and give it more comfortable and more focused search results.
Two. Technical implementation
Currently, the most popular segments are (1) vertical search and (2) real-time retrieval. Vertical search is a dedicated search engine for a particular industry. Real-time retrieval serves users who need very fresh results. (A note on "real-time": in the embedded field, a real-time system is one that responds within milliseconds. In the search field, a real-time system is really a weak real-time system, and a system that captures a target site's updates within about 5 minutes can generally be considered real-time.) Real-time retrieval is also usually vertical search. A general search engine cannot realistically be real-time, since that would require assuming its servers have unlimited processing power and bandwidth. Vertical search usually cares only about a handful of representative websites in one industry, so the processing load and data volume are far smaller, and real-time is easy to achieve.
Vertical search has appeared in every corner of our online lives. A few examples:
(1) Tianya, in its early days, built up a large user base by crawling large amounts of data from other websites. Although this practice is no longer feasible today, its value for a website starting with zero data cannot be denied.
(2) Recruitment sites and real estate sites of all kinds basically use vertical-search techniques, which make a site's content richer and make it easier to attract users.
Vertical retrieval also differs from general search in its technical implementation. General retrieval deals with unstructured data and stores it in an index. Vertical retrieval uses site-specific template matching to turn the crawled unstructured data into structured data and stores it in a database; queries are then answered by combining the database with an index. This orderly, structured data is the cornerstone of vertical retrieval's advantage.
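The template-matching step can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the `<li>Title | City | price yuan/month</li>` markup, the `listing` table, and the function names are hypothetical, and a real site would need its own template. The database index on `city` only stands in for the "database plus index" combination; a full-text index is left out.

```python
import re
import sqlite3

# Hypothetical template for one site: each listing is assumed to look like
# "<li>Title | City | 3500 yuan/month</li>". A real site needs its own pattern.
LISTING_TEMPLATE = re.compile(
    r"<li>(?P<title>[^|<]+)\|(?P<city>[^|<]+)\|\s*(?P<price>\d+)\s*yuan/month</li>"
)

# Sample page fragment invented for this sketch.
SAMPLE_HTML = """
<ul>
<li>Two-bedroom near subway | Beijing | 5200 yuan/month</li>
<li>Studio with balcony | Beijing | 3500 yuan/month</li>
<li>Loft downtown | Shanghai | 6100 yuan/month</li>
</ul>
"""


def extract_listings(html):
    """Turn unstructured HTML into structured (title, city, price) tuples."""
    rows = []
    for match in LISTING_TEMPLATE.finditer(html):
        rows.append((
            match.group("title").strip(),
            match.group("city").strip(),
            int(match.group("price")),
        ))
    return rows


def store_listings(conn, rows):
    """Store the structured rows in a table and index the column used for lookup."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS listing (title TEXT, city TEXT, price INTEGER)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_listing_city ON listing (city)")
    conn.executemany("INSERT INTO listing VALUES (?, ?, ?)", rows)
    conn.commit()


def cheapest_in_city(conn, city, limit=10):
    """A structured query that raw, unstructured text could not answer directly."""
    cursor = conn.execute(
        "SELECT title, price FROM listing WHERE city = ? ORDER BY price LIMIT ?",
        (city, limit),
    )
    return cursor.fetchall()
```

Once the data is in this form, a question such as "cheapest listings in one city" becomes a single query, which is the kind of advantage the structured data provides.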
The freshness requirement of real-time retrieval changes the crawling technique. There are generally two kinds of implementation. The first and most basic is to manually locate the latest-updates list on each target industry site, re-crawl that list at very short intervals, and fetch the pages it links to. The second is a machine-learning approach: track a site for a period of time, estimate how often each page is updated, and take the most frequently updated pages as the pages to crawl.
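As a rough sketch of the second approach, the Python below ranks pages by how often their content changed between visits. This is only a simple change-rate estimate standing in for a real learning method, and the observation format (a chronological list of `(url, content_hash)` pairs) and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def estimate_update_rates(observations):
    """Estimate how often each page changes.

    observations: iterable of (url, content_hash) pairs in chronological
    order, one pair per visit. Returns {url: fraction of revisits on which
    the content differed from the previous visit}. Pages visited only once
    are omitted because no change can be measured for them.
    """
    last_hash = {}
    revisits = defaultdict(int)
    changes = defaultdict(int)
    for url, content_hash in observations:
        if url in last_hash:
            revisits[url] += 1
            if content_hash != last_hash[url]:
                changes[url] += 1
        last_hash[url] = content_hash
    return {url: changes[url] / revisits[url] for url in revisits}


def pick_crawl_targets(observations, top_n):
    """Return the top_n URLs with the highest estimated update rate."""
    rates = estimate_update_rates(observations)
    # Sort by rate descending; break ties by URL so the result is deterministic.
    return sorted(rates, key=lambda url: (-rates[url], url))[:top_n]
```

The pages returned by `pick_crawl_targets` would then be polled at short intervals in the same way as the manually located update lists in the first approach.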
The following illustrates the technical points above through a real-time retrieval case. Bean paste Net (http://www.docshare.org) is a real-time retrieval engine for novels and is also a vertical search engine. Its main goal is to provide real-time reminders of novel updates. The system is introduced part by part below:
(1) Crawler: First, obtain a list of the more popular novel websites from webmaster statistics sites such as A5. Then analyze each site by hand to find the address of its latest-updates list. Use an open-source library such as Htmlparser to analyze the links on that page and extract the book name, chapter name, chapter address, and similar information.
(2) Data storage: Store the crawled data in the database, in the book table, the chapter table, and other tables, and build an index over the books.
(3) Web front end: Maintain a bookshelf for each user and show the user which books on the shelf have been updated. Record the time and chapter of the user's latest reading, and show a hint when new chapters are available. For a query from the user, return the matches from the index and let the user add the results to the bookshelf.
(4) Navigation: Provide navigation information by category.
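Parts (1) and (3) can be sketched in Python as follows. This is not the site's actual implementation: Python's standard `html.parser` module stands in for Htmlparser, the link markup (`class="chapter"`, `data-book`) is a hypothetical update-list format, and comparing chapter names for equality is a simplification of "has a newer chapter".

```python
from html.parser import HTMLParser

# Hypothetical update-list page invented for this sketch.
SAMPLE_UPDATE_LIST = """
<ul>
<li><a class="chapter" data-book="Book A" href="/a/12.html">Chapter 12</a></li>
<li><a class="chapter" data-book="Book B" href="/b/7.html">Chapter 7</a></li>
<li><a href="/about.html">About</a></li>
</ul>
"""


class UpdateListParser(HTMLParser):
    """Collect (book, chapter, url) from links shaped like
    <a class="chapter" data-book="..." href="...">chapter name</a>.
    The markup is an assumed example; every real site differs."""

    def __init__(self):
        super().__init__()
        self.updates = []
        self._current = None  # (book, url) while inside a chapter link
        self._text = []       # text fragments of the current link

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("class") == "chapter":
            self._current = (attributes.get("data-book", ""), attributes.get("href", ""))
            self._text = []

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            book, url = self._current
            self.updates.append((book, "".join(self._text).strip(), url))
            self._current = None


def parse_update_list(html):
    """Extract book name, chapter name, and chapter address from an update list."""
    parser = UpdateListParser()
    parser.feed(html)
    parser.close()
    return parser.updates


def bookshelf_hints(bookshelf, updates):
    """bookshelf maps a book name to the last chapter name the user read.
    Return the updates for shelved books whose listed chapter differs from it."""
    return [
        (book, chapter, url)
        for book, chapter, url in updates
        if book in bookshelf and bookshelf[book] != chapter
    ]
```

In a fuller system the parsed tuples would be written to the book and chapter tables, and the bookshelf lookup would be a database query per user instead of an in-memory dictionary.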
For real estate sites the technology is basically the same. The difference is that the entities are not novels and chapters but property listings and requests. These can be subdivided into for rent, wanted to rent, for sale, and wanted to buy, and by property type into second-hand housing, new housing, and so on.
Three. Profit model
For general search engines, the main profit model is content-matched advertising and paid ranking. In this sense, Baidu is actually an advertising company; Baidu Promotion and Baidu Alliance are its main sources of profit. Vertical search charges differently depending on the industry. Real estate sites, for example, earn profit by collecting agency fees, while novel search earns income by selling its outbound traffic.
To sum up, in today's oversaturated general search engine market, small, flexible, and user-friendly vertical search and real-time retrieval offer a way forward.