eTao (www.etao.com) officially went online at 10:39 on October 9, 2010. Many students and industry peers are very interested in eTao's system architecture and key technical problems, so this post gives a brief introduction.
System Architecture
eTao's system architecture is shown in the figure above. As can be seen, eTao has three data sources: the Internet, external partners, and the Taobao main site. Internet data is obtained by crawling, while data from the other two sources is provided through feeds.
The crawl system's functions include web page crawling, crawl scheduling, domain name resolution, dead-link detection, and JavaScript execution. At present, most of the data behind eTao's news, topic, and Q&A combos is obtained from the Internet through the crawl system. It is an important "raw material factory".
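To make crawl scheduling more concrete, here is a minimal sketch of one common approach: a priority queue of URLs with a per-host politeness delay and duplicate suppression. The class name `CrawlScheduler`, its methods, and the delay policy are assumptions for illustration; the post does not describe how eTao's scheduler actually works.

```python
import heapq
from urllib.parse import urlparse


class CrawlScheduler:
    """Hypothetical scheduler: one fetch per host every `per_host_delay` seconds."""

    def __init__(self, per_host_delay: float = 1.0):
        self.delay = per_host_delay
        self._heap = []        # entries are (ready_time, sequence_number, url)
        self._next_ok = {}     # host -> earliest time the next fetch may start
        self._seen = set()     # URLs already scheduled (duplicate suppression)
        self._seq = 0          # tie-breaker so equal ready times stay FIFO

    def add(self, url: str, now: float) -> bool:
        """Schedule a URL; return False if it was already scheduled."""
        if url in self._seen:
            return False
        self._seen.add(url)
        host = urlparse(url).netloc
        ready = max(now, self._next_ok.get(host, now))
        self._next_ok[host] = ready + self.delay
        heapq.heappush(self._heap, (ready, self._seq, url))
        self._seq += 1
        return True

    def pop_ready(self, now: float):
        """Return the next URL whose ready time has passed, or None."""
        if self._heap and self._heap[0][0] <= now:
            return heapq.heappop(self._heap)[2]
        return None
```

Passing the clock in as `now` keeps the sketch deterministic; a real crawler would also need retries, robots.txt handling, and dead-link bookkeeping, which are omitted here.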
The offline processing system has many functions, and its pipeline can be flexibly customized. Its main functions are: page encoding detection and conversion, page parsing and content extraction, discovery of shopping-related sites, list-page recognition, page classification and weighting, link extraction and merging, keyword extraction, and extraction of many static page features. It is eTao's "processing plant".
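One way to picture a flexibly customizable pipeline is as an ordered list of small stages, each taking a page record and returning an enriched copy. The sketch below is an illustration under that assumption; the stage names, the dict-based page record, and the toy extraction logic are all hypothetical and far simpler than a production system.

```python
from typing import Callable, Dict, List

# A stage takes a page record (a plain dict) and returns an updated record.
Stage = Callable[[Dict], Dict]


def normalize_encoding(page: Dict) -> Dict:
    # Decode raw bytes with the declared charset, defaulting to UTF-8.
    charset = page.get("charset", "utf-8")
    text = page["raw"].decode(charset, errors="replace")
    return {**page, "text": text}


def extract_links(page: Dict) -> Dict:
    # Naive link extraction: whitespace-separated tokens that look like URLs,
    # merged so that each link appears once.
    links = [tok for tok in page["text"].split() if tok.startswith("http")]
    return {**page, "links": sorted(set(links))}


def extract_keywords(page: Dict) -> Dict:
    # Toy keyword extraction: the three most frequent non-URL tokens.
    counts: Dict[str, int] = {}
    for tok in page["text"].lower().split():
        if not tok.startswith("http"):
            counts[tok] = counts.get(tok, 0) + 1
    top = sorted(counts, key=lambda t: (-counts[t], t))[:3]
    return {**page, "keywords": top}


def run_pipeline(page: Dict, stages: List[Stage]) -> Dict:
    # Customizing the pipeline means choosing and ordering the stages.
    for stage in stages:
        page = stage(page)
    return page
```

For example, `run_pipeline({"raw": b"..."}, [normalize_encoding, extract_links])` skips keyword extraction entirely, which is the kind of per-task customization the paragraph describes.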
The storage system stores the output of the crawl system and the offline processing system, and provides high-performance, high-capacity access services to both "plants". We currently use a Hadoop + HBase architecture, with web pages, links, and images stored separately by type. The storage system is the "core warehouse" for raw materials and semi-finished products.
The online engine is responsible for returning query results for search requests from the eTao front end; it builds its index data from the storage system. The online engine is the user-facing "finished goods workshop". Worth mentioning is HA2, Alibaba Group's new-generation engine technology. HA2 combines the design strengths of open-source engines and of Alibaba's previous-generation engine; while supporting full-text search, it also covers the various functions of product search. The main features it currently offers are:
Data size: supports data scales from one machine (partition) to hundreds of machines.
Update speed: supports full data updates as well as incremental updates, the fastest at minute-level latency.
Data types: lets users define a variety of data schemas, from a single field to dozens of fields. Field types can be text, string, number, and so on.
Query syntax: supports simple single-condition queries, complex combinations of conditions, and filters.
Relevance computation: supports up to three stages of relevance computation and provides rich information so that users can customize the calculation in each stage.
Statistical navigation: supports flexible grouped statistics and intelligent navigation over the retrieved results.
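The idea of multi-stage relevance computation can be sketched as a funnel: a cheap scorer runs over all matches, and each later stage applies a more selective scorer to a smaller set of survivors. The code below is only an illustration of that general pattern; it is not HA2's API, and the function names, document fields, and scoring formulas are assumptions.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Doc = Dict[str, object]
Scorer = Callable[[str, Doc], float]


def term_overlap(query: str, doc: Doc) -> float:
    # Stage 1 (cheap): number of query terms present in the title.
    title_terms = set(str(doc["title"]).lower().split())
    return float(sum(1 for t in query.lower().split() if t in title_terms))


def overlap_plus_sales(query: str, doc: Doc) -> float:
    # Stage 2: text match dominates, sales volume breaks ties.
    return term_overlap(query, doc) * 1000 + float(doc["sales"])


def by_rating(query: str, doc: Doc) -> float:
    # Stage 3: final ordering of the few survivors by rating.
    return float(doc["rating"])


def staged_rank(query: str, docs: Sequence[Doc],
                stages: Sequence[Tuple[Scorer, int]]) -> List[Doc]:
    """Apply each (scorer, keep) stage in turn, keeping the top `keep` docs."""
    candidates = list(docs)
    for scorer, keep in stages:
        candidates.sort(key=lambda d: scorer(query, d), reverse=True)
        candidates = candidates[:keep]
    return candidates
```

A call such as `staged_rank("nokia n97", docs, [(term_overlap, 3), (overlap_plus_sales, 2), (by_rating, 2)])` lets each stage be swapped independently, which mirrors the customization described in the feature list.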
The eTao front end is responsible for presenting the search results page to end users. It is eTao's "storefront", with a variety of display windows: products, Tao Bar, news, forums, Q&A, images, and web pages. The mechanisms that keep the store running properly include:
Bootstrap: responsible for validating query terms, detecting and converting encodings, and filtering stop words and forbidden words.
Query Planner: responsible for query rewriting (Rewrite), main-term recognition, product category prediction, combo sorting, case conversion, handling of synonyms and polysemy, and more.
Rmod: responsible for issuing concurrent requests to the various backend service interfaces and merging the returned results for page presentation.
Cache: responsible for distributed caching of search result data, to shorten response time and improve the throughput of the front-end system.
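The concurrent fan-out and caching roles described for Rmod and Cache can be sketched as follows. This is a simplified illustration, assuming each backend is a plain callable and using an in-process dict as a stand-in for a distributed cache; `fan_out`, `CachedSearch`, and the backend names are hypothetical and not taken from eTao's code.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Backend = Callable[[str], List[str]]


def fan_out(query: str, backends: Dict[str, Backend]) -> Dict[str, List[str]]:
    """Query every backend concurrently and merge results by backend name."""
    with ThreadPoolExecutor(max_workers=max(1, len(backends))) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in backends.items()}
        return {name: fut.result() for name, fut in futures.items()}


class CachedSearch:
    """Hypothetical front end: normalize the query, then cache merged results."""

    def __init__(self, backends: Dict[str, Backend]):
        self.backends = backends
        self.cache: Dict[str, Dict[str, List[str]]] = {}  # stand-in for a distributed cache
        self.backend_calls = 0  # counts fan-outs, to show the effect of caching

    def search(self, query: str) -> Dict[str, List[str]]:
        key = query.strip().lower()  # simple normalization before cache lookup
        if key not in self.cache:
            self.backend_calls += 1
            self.cache[key] = fan_out(key, self.backends)
        return self.cache[key]
```

Normalizing the query before the cache lookup means that " Nokia N97 " and "nokia n97" share one cache entry, so the second request never reaches the backends.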
In addition, to improve the eTao team's operational efficiency, we are building a set of query-centric operations tools that covers the whole process, starting from the collection of query and click logs and continuing through data statistics, correlation analysis, anomaly alerting, and manual adjustment.
Everyone on the eTao staff knows that making these display windows show results with ever greater and more accurate "shopping-guide relevance" is eTao's core value to users. How do we enter a virtuous cycle of sustained progress in this direction? Our current thinking is to build a multi-level ranking model that combines "query analysis" and "web page analysis", so that, with relevance guaranteed, the model structure can be adjusted flexibly and quickly to adapt to changing business needs.
The purpose of query analysis is to understand the user's query intent and to translate that intent into information that is available at ranking time and can influence the final ranking. Intent types include, for example:
Browse type: no clear shopping target or intention; the user buys while looking around, and is more casual and impulsive. Example queries: "Top 10 perfumes of 2010", "2010 popular sweater", "How many kinds of Zippo are there?"
Query type: a definite shopping intention, expressed as requirements on attributes. Example queries: "mobile phone for the elderly", "500-dollar watch".
Comparison type: the shopping intention has narrowed to a few specific products. Example queries: "Nokia E71 E63", "AKG k450 Px200".
Definite type: the decision has essentially been made and the user is focused on one object. Example queries: "Nokia N97", "IBM T60".
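To illustrate how the four intent types might be told apart, here is a deliberately naive rule-based sketch. The rules (counting model-number-like tokens, looking for price or audience cues) and the function name `classify_intent` are assumptions made for this example; the post does not say how eTao classifies intent, and a real system would likely rely on learned models and query logs.

```python
import re

# Hypothetical cue: tokens such as "n97" or "px200" (letters followed by digits).
MODEL_TOKEN = re.compile(r"^[a-z]+\d+[a-z0-9]*$")

# Hypothetical attribute cues: a price ("500-dollar") or an audience ("for the ...").
ATTRIBUTE_CUES = (
    re.compile(r"\d+\s*-?\s*(dollar|yuan)"),
    re.compile(r"\bfor the\b"),
)


def classify_intent(query: str) -> str:
    """Return one of "browse", "query", "comparison", or "definite"."""
    text = query.lower()
    models = [tok for tok in text.split() if MODEL_TOKEN.match(tok)]
    if len(models) >= 2:
        return "comparison"   # several specific products named
    if len(models) == 1:
        return "definite"     # one specific product named
    if any(cue.search(text) for cue in ATTRIBUTE_CUES):
        return "query"        # attribute requirements, no specific product
    return "browse"           # no clear target
```

These rules happen to separate the example queries listed above, but they are brittle by design; the point is only to show how an intent label can be derived and then handed to the ranking stage.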
As the number of users grows, we will explore users' query needs further and expand the set of intent types we analyze.
Web page analysis is expected to produce: the quality of the page, the authority of the site, the keywords of the content, and whether the page is a shopping article. This information is combined with the output of query analysis and takes part in the relevance ranking of search results at different levels.
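As a rough illustration of combining query analysis with page signals at different ranking levels, the sketch below sorts pages by a tuple key: keyword matches first, then an intent-dependent preference, then a blended quality and authority score. The field names, the 0.6/0.4 weights, and the rule that browse-type queries prefer shopping articles are all assumptions for this example, not eTao's actual model.

```python
from typing import Dict, List

Page = Dict[str, object]


def rank_pages(pages: List[Page], intent: str) -> List[Page]:
    """Order pages by a multi-level key; higher levels always dominate lower ones."""

    def key(page: Page):
        # Level 1: how many query keywords the page content matches.
        level1 = int(page["keyword_hits"])
        # Level 2 (assumed rule): browse-type queries prefer shopping articles.
        level2 = 1 if (intent == "browse" and page["is_shopping_article"]) else 0
        # Level 3: blend of page quality and site authority (weights are arbitrary).
        level3 = 0.6 * float(page["quality"]) + 0.4 * float(page["authority"])
        return (level1, level2, level3)

    return sorted(pages, key=key, reverse=True)
```

Because the levels are separate tuple entries, changing how one level is computed does not disturb the others, which is one simple way to get the flexible, quickly adjustable model structure the text aims for.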
eTao is building a self-reinforcing "user behavior / model improvement" loop, driven mainly by user behavior and supported by a better model-improvement process and rich platform tools. We expect this loop to become increasingly automated, to keep improving relevance, and to satisfy users' search intent more intelligently.
The relationship between eTao and Taobao
Taobao currently holds an absolutely leading market share. Being able to make full use of Taobao's site data is undoubtedly very important, and very fortunate, for eTao.
From the system architecture perspective, many of eTao's large-scale offline computing tasks run on Taobao's distributed computing platform, which is built on thousands of Hadoop machines. From that platform, access to Taobao's product, transaction, and user data is very convenient, and its powerful computing and storage capabilities have further stimulated the engineers' imagination and creativity. For example, eTao was the first to associate Taobao users' search queries with the items they went on to purchase directly, and to deliver this through minute-level engine updates, giving users the most timely shopping trend indicators. In addition, eTao directly calls many of Taobao's online service interfaces, such as item search, product search, and same-style merging.
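The association between search queries and purchased items can be sketched as a simple incrementally updated co-occurrence count. This is an illustration only: the class `QueryPurchaseIndex`, the (query, item) event format, and the in-memory counters are assumptions, whereas the system described above runs on a Hadoop-based platform and feeds engine updates.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


class QueryPurchaseIndex:
    """Hypothetical index mapping a normalized query to purchased-item counts."""

    def __init__(self) -> None:
        self._counts: Dict[str, Counter] = defaultdict(Counter)

    @staticmethod
    def _normalize(query: str) -> str:
        return query.strip().lower()

    def update(self, events: Iterable[Tuple[str, str]]) -> None:
        # Each event is (search query, id of the item bought right after it).
        # Calling update() repeatedly with small batches models incremental updates.
        for query, item_id in events:
            self._counts[self._normalize(query)][item_id] += 1

    def top_items(self, query: str, n: int = 3) -> List[str]:
        # Most frequently purchased items for this query; ties broken by item id.
        counter = self._counts.get(self._normalize(query), Counter())
        ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
        return [item for item, _ in ranked[:n]]
```

Feeding each new batch of log events through `update` keeps the counts current without recomputing from scratch, which is the property that makes frequent refreshes practical.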
From the product and service perspective, eTao is an important link between the Taobao main site and e-commerce sites across the whole web. Put simply: Taobao's site data (such as the product library and the category system) plays a very positive role in ensuring the relevance of eTao's shopping-guide search. Through open search and the crawling of off-site product information, eTao also brings more high-quality commercial traffic to other e-commerce sites. Products, news, forums, and other information from the Internet help make eTao's search results more comprehensive and its information more authoritative. Improved search quality can in turn improve the Taobao user experience (for example, on no-results pages and in pre-purchase research), and eTao's user behavior analysis and trend prediction can serve as an important channel through which Taobao operations collect feedback.
Concluding remarks
From the introduction above, it is not hard to see that eTao places industry-leading demands on the practicality, efficiency, and scalability of the technology it uses. The main areas of focus include:
Massive web page crawling and extraction
Distributed storage and computing platform
Large-scale data (web page/product) processing and analysis
Shopping search relevance system
High-performance, customizable full-text search engine
Front-end architecture that responds quickly to business requirements
We will expand on these technical directions in more depth in future blog posts.
Source: http://www.searchtb.com/2010/11/etao-tech-overview.html