Summary of vertical search engine R&D experience

Source: Internet
Author: User
What is a vertical search engine?

A vertical search engine is a search engine that targets a specific industry. It is a refinement and extension of the general search engine: it integrates a particular class of information from the web, extracts the required data field by field, processes and indexes it, and returns results in response to the user's request.
The biggest difference from a general web search engine is that a vertical search engine extracts structured information from webpages; that is, it turns the unstructured data of a page into specific structured data. A web search treats the web page as its smallest unit, while a vertical search treats a piece of structured data, which can be called a record, as its smallest unit. The records are then processed further (for example, deduplicated and classified), and finally segmented into words and indexed so that users can search them. This process clearly resembles traditional database retrieval. However, traditional database search is based on string matching and has no relevance sorting. It does have its own advantages, such as support for complex table joins, where vertical search engines are relatively weak. A vertical search engine can therefore be seen as a compromise between general search and database retrieval, shaped by actual needs.
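
The step from an unstructured page to a deduplicated record can be illustrated with a minimal sketch. The page markup, the field names, and the helper functions `extract_record`, `fingerprint`, and `deduplicate` are all assumptions made for illustration; a real extractor would be written against the actual markup of each target site.

```python
import hashlib
import re

# Hypothetical page fragment; real job sites use their own markup.
PAGE = """
<div class="job">
  <h1>Software Engineer</h1>
  <span class="company">Example Software Co.</span>
  <span class="location">Beijing, Haidian</span>
</div>
"""

# One pattern per field; the patterns are assumptions tied to PAGE above.
FIELD_PATTERNS = {
    "title": re.compile(r"<h1>(.*?)</h1>", re.S),
    "company": re.compile(r'<span class="company">(.*?)</span>', re.S),
    "location": re.compile(r'<span class="location">(.*?)</span>', re.S),
}


def extract_record(html):
    """Turn an unstructured page into a dict of named fields (a record)."""
    record = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(html)
        record[field] = match.group(1).strip() if match else ""
    return record


def fingerprint(record):
    """Hash the lowercased field values so identical records collide."""
    canonical = "|".join(record[f].lower() for f in sorted(record))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(records):
    """Keep only the first record seen for each fingerprint."""
    seen, unique = set(), []
    for record in records:
        key = fingerprint(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

Exact-match fingerprints only catch identical records; near-duplicate detection (the same posting with small wording changes) would need a fuzzier comparison, which this sketch does not attempt.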

General vertical search engine process:
Targeted crawling ==> webpage information extraction ==> secondary processing and word segmentation ==> indexing and retrieval ==> relevance sorting
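
The last stage, relevance sorting, is what separates this pipeline from plain string matching. The article does not say which scoring formula its engine uses, so the sketch below substitutes a simple TF-IDF score as an assumption; the function name `rank` and the sample documents are hypothetical, and whitespace splitting stands in for real word segmentation.

```python
import math
from collections import Counter


def rank(query_terms, docs):
    """Return doc ids ordered by a simple TF-IDF score, best first."""
    n = len(docs)
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    # Document frequency: the number of documents each term appears in.
    df = Counter(t for tokens in tokenized.values() for t in set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = sum(tf[t] * math.log(1 + n / df[t]) for t in query_terms if df[t])
        if score > 0:
            scores[doc_id] = score
    # Higher score first; ties are broken by doc id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


docs = {
    1: "software engineer beijing haidian",
    2: "software tester shanghai",
    3: "sales manager beijing",
}
ranked = rank(["software", "engineer"], docs)  # document 1 matches both terms
```

Documents that match none of the query terms are dropped rather than ranked last, which mirrors retrieval followed by sorting.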

Vertical search features:
(1) The data a vertical search engine crawls comes from the industry websites it is concerned with:
For example, a job search engine collects data from sites such as www.deepdo.com, www.51job.com, www.zhaoping.com, and www.chinahr.com, while other verticals draw on sites such as www.jrj.com.cn and www.gutx.com;
(2) The data a vertical search engine crawls tends to be structured data and metadata:
For example, when looking for a job, the fields of interest are: job title: software engineer; company name; industry: software company, outsourcing; location: Beijing, Haidian;
(3) A vertical search engine searches on structured data:
For example, find software engineer jobs in Haidian.
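
A search over structured data such as the Haidian example can be sketched with an inverted index keyed by field and term rather than by term alone. The class name `FieldIndex`, its methods, and the sample records are assumptions made for illustration, and splitting on whitespace and commas stands in for real word segmentation.

```python
from collections import defaultdict


class FieldIndex:
    """Inverted index keyed by (field, term) instead of term alone."""

    def __init__(self):
        self.postings = defaultdict(set)  # (field, term) -> set of record ids
        self.records = {}

    def add(self, record_id, record):
        """Index every term of every field of one record."""
        self.records[record_id] = record
        for field, value in record.items():
            for term in value.lower().replace(",", " ").split():
                self.postings[(field, term)].add(record_id)

    def search(self, **conditions):
        """AND together all field conditions; a value may hold several terms."""
        result = None
        for field, value in conditions.items():
            for term in value.lower().split():
                ids = self.postings.get((field, term), set())
                result = ids if result is None else result & ids
        return sorted(result or [])


index = FieldIndex()
index.add(1, {"title": "Software Engineer", "location": "Beijing, Haidian"})
index.add(2, {"title": "Software Engineer", "location": "Shanghai, Pudong"})
index.add(3, {"title": "Sales Manager", "location": "Beijing, Haidian"})

# "Software engineer jobs in Haidian": each condition is limited to one field.
hits = index.search(title="software engineer", location="haidian")
```

Because the field is part of the key, a term only matches in the field where it was asked for: a company named "Haidian Software" would not match a `location="haidian"` condition.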

Why do we need to develop our own vertical search engine platform?
(1) Lucene's weaknesses: no support for distributed operation, slow speed, and poor performance.
(2) A vertical search engine indexes structured information (records), so it must support indexing and retrieval by field, which general-purpose search engines do not.
(3) Proprietary intellectual property rights.
(4) Complex business logic must be supported.

What does vertical search engine R&D involve? (Key: indexing and retrieval)
(1) Business requirements analysis and abstraction:
(2) Overall architecture design:

Data service platform module: responsible for unified management of engine data; it receives the data to be indexed, tagged with type identifiers, and the formatted retrieval data generated during indexing;
Cacheserver module: responsible for data interaction between the web front end and the engine back end (receiving and parsing requests), and for caching retrieval results;
Middleserver module: forwards retrieval requests and merges the retrieval results returned by each indexsearch to complete relevance sorting;
Indexsearch module: parse the retrieval request ==> sequence of terms ==> merge the retrieval results of each term ==> [other filtering] ==>
Indexbuilder module: indexes the input data according to configuration and generates inverted index data
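
How the retrieval-side modules might cooperate can be sketched in a single process. This is only an illustration under stated assumptions: the shard layout (a dict holding `postings` and a precomputed `weight` per document), the function names, and the in-memory cache are all hypothetical, document ids are assumed to be unique across shards, and the real modules would talk to each other over sockets rather than by function call.

```python
import heapq
from itertools import islice


def intersect(a, b):
    """Walk two sorted doc-id lists in step, keeping ids present in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def index_search(shard, terms):
    """One indexsearch node: per-term lookups, AND the lists, score the hits."""
    lists = sorted((shard["postings"].get(t, []) for t in terms), key=len)
    if not lists:
        return []
    docs = lists[0]  # start from the shortest list to keep the work small
    for other in lists[1:]:
        docs = intersect(docs, other)
    # The score is a stand-in: a precomputed per-document weight.
    return sorted(((shard["weight"][d], d) for d in docs), reverse=True)


def middle_server(shards, terms, top_k=3):
    """Fan the request out, then merge the per-shard lists by score."""
    partials = [index_search(shard, terms) for shard in shards]
    merged = heapq.merge(*partials, reverse=True)
    return [doc for _, doc in islice(merged, top_k)]


class CacheServer:
    """Parses the request and caches results per normalized query."""

    def __init__(self, shards):
        self.shards = shards
        self.cache = {}

    def query(self, text):
        terms = tuple(text.lower().split())
        if terms not in self.cache:
            self.cache[terms] = middle_server(self.shards, terms)
        return self.cache[terms]


shard1 = {"postings": {"software": [1, 2, 5], "engineer": [1, 5]},
          "weight": {1: 0.9, 2: 0.4, 5: 0.3}}
shard2 = {"postings": {"software": [7, 8], "engineer": [8]},
          "weight": {7: 0.5, 8: 0.7}}
front = CacheServer([shard1, shard2])
results = front.query("software engineer")
```

Each shard returns its hits already sorted by score, so the middle layer only has to merge sorted lists and stop after `top_k` items instead of re-sorting everything.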

(3) Detailed design and coding:
(4) Knowledge required: linked list, stack, queue and priority queue, hash table, B+ tree, quicksort, heapsort, file-based external sorting, inverted index, multithreading and mutexes, socket programming (select and epoll), system programming (directory and file operations and management), and so on.
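
Of the techniques listed, file-based external sorting is one that index construction tends to rely on, since the list of (term, document) pairs can outgrow memory. The sketch below shows the usual two phases, sorted runs on disk followed by a k-way merge. The function name `external_sort`, the `chunk_size` parameter, and the tab-separated pair format are assumptions for illustration, not the article's implementation.

```python
import heapq
import os
import tempfile


def external_sort(lines, chunk_size=1000):
    """Yield text lines in sorted order while holding one chunk in memory.

    Input lines must not contain newline characters.
    """
    run_paths = []
    chunk = []

    def flush():
        # Phase 1: sort one chunk in memory and write it out as a run file.
        chunk.sort()
        fd, path = tempfile.mkstemp(text=True)
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.writelines(line + "\n" for line in chunk)
        run_paths.append(path)
        chunk.clear()

    for line in lines:
        chunk.append(line)
        if len(chunk) >= chunk_size:
            flush()
    if chunk:
        flush()

    # Phase 2: k-way merge of the sorted runs using a heap.
    files = [open(p, encoding="utf-8") for p in run_paths]
    try:
        runs = [(line.rstrip("\n") for line in f) for f in files]
        yield from heapq.merge(*runs)
    finally:
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)


# Hypothetical "term<TAB>doc id" pairs, as an index builder might emit them.
pairs = ["software\t2", "engineer\t1", "software\t1", "beijing\t3", "engineer\t5"]
ordered = list(external_sort(pairs, chunk_size=2))
```

After sorting, all pairs for the same term sit next to each other, so the posting list of each term can be written in one sequential pass.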

Vertical search development tutorial video

Link: http://pan.baidu.com/share/link?shareid=3520653814&uk=3611155194 Password: 3fgz



