http://blog.sina.com.cn/s/blog_4d58e3c00100m47l.html
Essential Interview Questions for Search Engine Job Candidates
Compiled by "Spain". Discussion welcome, QQ: 28484800
Guide
The questions asked in search engine interviews revolve around the technology and application characteristics of search engines themselves, and fall into the following types: 1 URLs, 2 word segmentation, 3 ranking, 4 storage and systems, 5 open source systems, 6 data mining. Seen from another angle, they revolve around storage, computation and service deployment. They mainly test the interviewee's overall basic knowledge, understanding of index systems, ability in system design and analysis, project experience and hands-on ability.
This article has two parts. The first part gives some representative questions for each of the six types above; the second part separately covers mass data processing problems and strategies for solving them.
Part 1: The six types of questions
1 URL Issues
1.1 Deduplication
Given two files, A and B, each containing 5 billion URLs, how do you find the URLs that appear in both files without using a distributed system?
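One commonly discussed single-machine approach to this kind of question is hash partitioning: split both files into buckets using the same hash function, so that identical URLs always land in the same bucket number, then intersect each pair of buckets in memory. The sketch below is only an illustration under assumptions: the function names and the bucket count are hypothetical, and a real run over 5 billion URLs would write matches to an output file instead of collecting them in a set.

```python
import hashlib
import os
import tempfile


def bucket_of(url: str, num_buckets: int) -> int:
    """Stable hash, so the same URL maps to the same bucket in both files."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def partition(path: str, out_dir: str, prefix: str, num_buckets: int) -> None:
    """Split one large URL file into num_buckets smaller files on disk."""
    outs = [
        open(os.path.join(out_dir, f"{prefix}_{i}.txt"), "w", encoding="utf-8")
        for i in range(num_buckets)
    ]
    try:
        with open(path, encoding="utf-8") as src:
            for line in src:
                url = line.strip()
                if url:
                    outs[bucket_of(url, num_buckets)].write(url + "\n")
    finally:
        for f in outs:
            f.close()


def common_urls(path_a: str, path_b: str, num_buckets: int = 16) -> set:
    """Return URLs present in both files, loading one bucket pair at a time."""
    result = set()
    with tempfile.TemporaryDirectory() as tmp:
        partition(path_a, tmp, "a", num_buckets)
        partition(path_b, tmp, "b", num_buckets)
        for i in range(num_buckets):
            with open(os.path.join(tmp, f"a_{i}.txt"), encoding="utf-8") as fa:
                seen = {line.strip() for line in fa}
            with open(os.path.join(tmp, f"b_{i}.txt"), encoding="utf-8") as fb:
                for line in fb:
                    url = line.strip()
                    if url in seen:
                        result.add(url)
    return result
```

The bucket count would be chosen so that one bucket of file A fits comfortably in memory; only one bucket pair is resident at any time.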
1.2 Crawler Design
How would you implement a distributed crawler system that can crawl 500 million videos and refresh the latest videos at intervals as short as 30 minutes? Draw a schematic diagram, and describe the points where crawler efficiency can be optimized and the strategy for updating data in real time.
1.3 Crawler Maintenance
Design an easy-to-maintain web page extraction strategy, together with an approach for its ongoing maintenance.
1.4 Crawler Design
Design a strategy and method for deploying to dozens of crawler machines with minimal effort, and for continuously upgrading and maintaining them.
2 Word Segmentation Issues
2.1 Dictionary Design
Design a dictionary that stores user-defined fixed-length structures indexed by string. It must support add, delete, lookup and update operations. A function is already provided that maps a string to a signature, where each signature consists of two unsigned int values. Assume that each string corresponds to a unique signature with no collisions (or that the collision probability is negligible), and that signatures are distributed uniformly enough.
Describe your data structure, how memory is allocated, and how the add, delete, lookup and update operations are implemented. If operations are frequent, how would you optimize?
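As one possible starting point for discussion (not the expected answer), the sketch below uses an open-addressing hash table with preallocated slot arrays. Because the signatures are assumed to be uniform, the combined 64-bit signature is used directly as the hash value. The class name, the capacity and the use of `bytes` for the fixed-length record are assumptions made for illustration.

```python
from typing import Optional

EMPTY, USED, DELETED = 0, 1, 2


class SignatureDict:
    """Open-addressing table keyed by a (hi, lo) pair of unsigned 32-bit ints."""

    def __init__(self, capacity: int = 1024):
        self.capacity = capacity
        self.state = bytearray(capacity)   # EMPTY / USED / DELETED per slot
        self.keys = [0] * capacity         # combined 64-bit signatures
        self.values = [None] * capacity    # fixed-length records (bytes)
        self.size = 0

    @staticmethod
    def _key(hi: int, lo: int) -> int:
        return ((hi & 0xFFFFFFFF) << 32) | (lo & 0xFFFFFFFF)

    def _find(self, key: int):
        """Return (slot holding key or -1, first reusable slot or -1)."""
        idx = key % self.capacity
        free = -1
        for _ in range(self.capacity):
            st = self.state[idx]
            if st == EMPTY:
                return -1, (free if free >= 0 else idx)
            if st == DELETED:
                if free < 0:
                    free = idx          # remember the first tombstone
            elif self.keys[idx] == key:
                return idx, free
            idx = (idx + 1) % self.capacity   # linear probing
        return -1, free

    def put(self, hi: int, lo: int, record: bytes) -> None:
        """Add a new record, or update it if the signature already exists."""
        key = self._key(hi, lo)
        found, free = self._find(key)
        if found >= 0:
            self.values[found] = record
            return
        if free < 0:
            raise MemoryError("table is full")
        self.state[free] = USED
        self.keys[free] = key
        self.values[free] = record
        self.size += 1

    def get(self, hi: int, lo: int) -> Optional[bytes]:
        found, _ = self._find(self._key(hi, lo))
        return self.values[found] if found >= 0 else None

    def delete(self, hi: int, lo: int) -> bool:
        found, _ = self._find(self._key(hi, lo))
        if found < 0:
            return False
        self.state[found] = DELETED   # tombstone keeps probe chains intact
        self.values[found] = None
        self.size -= 1
        return True
```

Preallocating the slots avoids per-operation memory allocation; under frequent deletes, tombstones accumulate, so a periodic rebuild of the table is one optimization worth mentioning.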
2.2 Segmentation Strategy
Describe a commonly used word segmentation strategy, and the word segmenters commonly used in industry together with their characteristics.
3 Ranking Issues
3.1 Retrieval Evaluation
Describe the key indicators of a retrieval system.
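Precision and recall are two quality indicators that often come up in answers to this question, alongside system indicators such as latency and throughput. The sketch below shows how they, and their harmonic mean F1, could be computed for a single query; the function name and the sample document IDs are hypothetical.

```python
def precision_recall_f1(retrieved, relevant):
    """Compute precision, recall and F1 for one query.

    retrieved: document IDs returned by the system
    relevant:  document IDs judged relevant (ground truth)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Two of the four returned documents are relevant, and two of the
# three relevant documents were found.
p, r, f1 = precision_recall_f1([1, 2, 3, 4], [2, 4, 5])
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```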
3.2 Ranking Parameters
Describe the factors that affect how the results returned for a query are ranked. If accuracy needs to be improved, which modules or parameters should be optimized? Describe which data about a web page has a large impact on retrieval ranking.
3.3 Dynamic Ranking
Design a strategy and method for continuously optimizing ranking accuracy, so that ranking keeps improving on the premise that the index is the only thing that may need to be rebuilt. Explain the basic idea.
4 Storage and system problems
4.1 Storage and Access
How would you implement a retrieval system that hosts 500 million videos, returns 90% of requests within 200 ms, and supports a peak load of 1000 queries per second? Describe the basic structure and the strategy for each module.
4.2 Storage and Access
How would you implement a storage system that can hold 500 million records? What strategies would you use under heavy write load and under heavy read load?
4.3 System Maintenance
How do you keep a service running without interruption during maintenance and upgrades (for example during machine maintenance), including upgrades, switchovers and index rebuilds?
4.4 Automated Processing
Describe the basic components of a self-learning, self-tuning software system, and give relevant examples.
4.5 Storage and Access
There are 20 billion records, each between 1 KB and 1 MB in size, and each with a unique u_int64 ID.
Design a data reading system that can retrieve a record by its ID. Requirements:
A. Memory is limited to 16 GB
B. Use memory resources as fully as possible
C. Retrieve the data as efficiently as possible
D. Disk may be used, and disk capacity is unlimited
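A quick calculation frames the problem: 20 billion IDs at 8 bytes each already take about 160 GB, so a full ID index cannot fit in 16 GB of memory. One possible direction, sketched below under assumptions, is a sparse index: keep a sorted (id, offset, length) index on disk and hold only the first ID of each index block in memory. With a hypothetical 4096 entries per block, the in-memory part is about 4.9 million IDs, roughly 39 MB, which leaves most of the memory for caching. The file layout, names and block size here are all illustrative.

```python
import bisect
import struct

ENTRY = struct.Struct("<QQI")  # record id, data offset, data length


def build(records, index_path, data_path):
    """Write payloads to a data file and a sorted-by-id index file."""
    with open(index_path, "wb") as idx, open(data_path, "wb") as dat:
        offset = 0
        for rec_id, payload in sorted(records):
            idx.write(ENTRY.pack(rec_id, offset, len(payload)))
            dat.write(payload)
            offset += len(payload)


class SparseIndexReader:
    def __init__(self, index_path, data_path, block_entries=4096):
        self.index_path = index_path
        self.data_path = data_path
        self.block_bytes = block_entries * ENTRY.size
        self.first_ids = []  # first id of each index block, kept in memory
        with open(index_path, "rb") as idx:
            while True:
                block = idx.read(self.block_bytes)
                if not block:
                    break
                self.first_ids.append(ENTRY.unpack_from(block, 0)[0])

    def get(self, rec_id):
        """Return the payload for rec_id, or None if it does not exist."""
        # Locate the only block that can contain rec_id.
        block_no = bisect.bisect_right(self.first_ids, rec_id) - 1
        if block_no < 0:
            return None
        with open(self.index_path, "rb") as idx:
            idx.seek(block_no * self.block_bytes)
            block = idx.read(self.block_bytes)
        # Scan the block; a binary search would also work here.
        for pos in range(0, len(block), ENTRY.size):
            eid, offset, length = ENTRY.unpack_from(block, pos)
            if eid == rec_id:
                with open(self.data_path, "rb") as dat:
                    dat.seek(offset)
                    return dat.read(length)
        return None
```

Each lookup costs one index block read plus one data read; a cache of hot index blocks or hot records would be a natural use of the remaining memory.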
5 Open Source Case Questions
5.1 Describe pathways and methods for learning basic search technology.
5.2 Describe one method and technology for data flow business processing.
5.3 Basic modules, workflow, and pros and cons of Lucene and Solr
5.4 Which points of Lucene and Solr can be extended and optimized
5.5 Nutch: basic modules, and pros and cons
5.6 Hadoop: basic modules, workflow, and pros and cons
5.7 HBase: basic modules, workflow, and pros and cons
5.8 ZooKeeper: basic modules, workflow, and pros and cons
5.9 MapReduce: fundamentals, and pros and cons
5.10 Analysis of Google's "three treasures" (commonly understood as GFS, MapReduce and Bigtable)
5.11 Hands-on question:
Implement a simple retrieval and query system with Lucene (other open source components are also acceptable). Documents should support title and content fields.
5.12 Hands-on question:
Implement a simple crawler that fetches pages whose URLs follow a rule, and saves each page as a file in a folder. URL pattern: http://www.youku.com/playlist_show/id_[1-10000].html
Name each file according to the pattern Title-id.html.
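A minimal sketch of one way to approach this exercise is shown below, using only the Python standard library. It assumes that the title can be taken from the page's `<title>` tag and that characters unsafe in file names should be replaced; the helper names are hypothetical, and the URL pattern is taken from the question as written and may no longer resolve. The `fetcher` parameter exists so that the download step can be swapped out, for example for a test double.

```python
import os
import re
import urllib.request

URL_TEMPLATE = "http://www.youku.com/playlist_show/id_{}.html"


def extract_title(html: str) -> str:
    """Take the <title> text and make it safe to use in a file name."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html,
                      re.IGNORECASE | re.DOTALL)
    title = match.group(1).strip() if match else "untitled"
    return re.sub(r'[\\/:*?"<>|\s]+', "_", title) or "untitled"


def fetch(url: str, timeout: float = 10.0) -> str:
    """Download one page; decoding as UTF-8 is an assumption."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(ids, out_dir: str, fetcher=fetch) -> list:
    """Fetch each id's page and save it as Title-id.html in out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for page_id in ids:
        try:
            html = fetcher(URL_TEMPLATE.format(page_id))
        except OSError:
            continue  # skip pages that fail to download
        name = f"{extract_title(html)}-{page_id}.html"
        with open(os.path.join(out_dir, name), "w", encoding="utf-8") as f:
            f.write(html)
        saved.append(name)
    return saved
```

A call such as `crawl(range(1, 10001), "pages")` would cover the full ID range; a real crawler would also add rate limiting and retries.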
6 Data Mining Issues
6.1 Describe several data analysis and mining methods commonly used in search, and their applications.
Part 2: Mass data processing
When dealing with a massive data problem, first analyze the problem carefully: understand the key issues that need solving and the storage and performance requirements to be met. Before that, you should be familiar with the real data and the business: the distribution of the business data, data granularity, quality-of-service requirements, how dynamic the data is, how the data is correlated, and so on. I generally keep the following basic ideas in mind when dealing with massive data problems:
1. Know which excellent open source tools already exist for handling massive data;
2. Because the data volume is large, consider partitioning the data;
3. To speed up access to massive data, indexes are indispensable;
4. Memory is always limited, yet memory is the fastest, so a caching mechanism is essential;
5. Massive data comes from diverse sources in different formats; it is best to handle it uniformly as strings and leave the logic to the upper-layer application;
6. Massive data cannot do without clusters and distribution; there must be workable mechanisms for distributed error handling and load balancing;
7. Solve all low-level and storage problems in the lower layers; to make later upper-layer applications, and services that span the underlying support, easier to build, the system should present a clear logical view externally;
8. Because programming languages differ, the feasibility and difficulty of implementing a given system design and architecture also differ, and this needs to be considered as well;
9. One application of massive data is data mining services; you should also know something about unified management of multi-domain data sources, data warehousing and the related computation;
10. Although storage is not a problem, if the data can be compressed with acceptable performance, why not?
Drawing on earlier blog posts, excerpts and my own understanding, I have summarized the following basic concepts, in the hope that they will help fellow students who are job hunting, as well as companies' future interview assessments. Of course, for people with practical experience the following questions are no problem at all; they are experts on particular issues. Guidance from experts is welcome.
Generally applicable data structures and algorithmic ideas are summarized as follows:
1. Bloom Filter
2. Hashing
3. Bit-map
4. Heap
5. Two-level bucket partitioning, which can be understood as a multi-level index
6. Database indexing
7. Inverted index
8. External sorting
9. Trie
10. Distributed processing
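To make the first item on this list concrete, here is a minimal Bloom filter sketch. A Bloom filter answers "possibly seen" or "definitely not seen" using a fixed-size bit array, which is why it is often mentioned for URL deduplication at scale. The bit-array size, the number of hash functions and the double-hashing scheme built on SHA-256 are choices made for this illustration only.

```python
import hashlib


class BloomFilter:
    """Set membership with no false negatives and a small false-positive rate."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions from two base hashes (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(item))


seen = BloomFilter()
seen.add("http://example.com/a")
print("http://example.com/a" in seen)  # True: added items are always found
```

An item that was never added is usually reported as absent, but it can occasionally be reported as present; the false-positive rate depends on the number of bits, the number of hashes and the number of items inserted.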
References:
1 Summary of methods for processing large data volumes and massive data http://g.51cto.com/880824/85208
2 Summary of classification algorithms in data mining http://www.hadoopor.com/thread-270-1-1.html
3 HTTP://DATABASE.CTOCIO.COM.CN/TIPS/273/8248273.S
4 http://www.google.com.hk