After more than half a month of waiting, I finally had the honor of visiting the Baidu headquarters building again today. It brings to mind the famous line behind Baidu's name: "Hundreds and thousands of times I searched for him in the crowd; suddenly I turned my head, and there he was, where the lights were dim." It stands for faith, persistent pursuit, and never giving up!
This time the interviewer was a GG: very handsome, capable, and confident, with wisdom held within. Three pieces of paper, two people, one pen, and no notebook: that was the theme of today's interview.
After carefully analyzing and summarizing the lessons of my earlier failure, when I had to ask interviewer JJ to repeat the questions because I could not hear them clearly, I sat as close to this GG as I could, so that I could catch each question clearly, accurately, and completely the first time. Sitting close also makes full face-to-face communication with the interviewer easier and keeps me from falling twice in the same place. In this interview, however, my worries turned out to be unnecessary: GG spoke at a comfortable volume and with confidence, his questions and requirements were clear and concise, and he stressed the important information in each question as he asked it. So I never had to ask him to repeat a question, and I tried to respond as quickly as possible once I understood what he meant. Overall the interview was relaxed and the communication went well. GG treats people kindly, and the atmosphere was very pleasant.
OK. With the "prelude" to the interview out of the way, let's go straight to today's interview questions.
First, two sets A and B are given, where set A = {name} and set B = {age, sex, scholarship, address, ...}. Requirements: Question 1: given a name in set A, look up the corresponding attribute information in set B. Question 2: given a condition on a single attribute in set B (for example, age < 20), look up the corresponding names in set A.
Second, a file is provided that contains two fields, {URL, size}, where URL is the web address and size is the number of visits to that URL. Requirements: Question 1: use Linux shell commands, or an algorithm of your own design, to find the size values of all URLs whose string contains the substring "Baidu". Question 2: sort the results of Question 1 by size, from largest to smallest. (Note: the URL data volume is very large, on the order of 10 billion records or more.)
Third, test a mobile phone (an ordinary mobile phone; besides the system software, some applications may already be installed).
Fourth, from the project experience on my resume, GG picked one project, a "gloss search recommendation system", and had me introduce its architecture and my responsibilities.
Fifth, he asked which programming languages I am familiar with (C/C++/C#), about VS2008, and how well I know Linux shell and Python. This part was mostly casual conversation, so it was easy.
Finally, I got to ask some questions of my own. I mainly asked three: 1. What are the differences between box computing and cloud computing, and what progress has box computing made so far? 2. What strategic adjustments is Baidu making in the face of emerging markets such as group-buying sites and the mobile Internet? 3. What work does the quality testing department do, and what should I prepare or study after the interview? (In order, these three questions concern: 1. box computing, focusing on Baidu's development strategy and core technological innovation; 2. how Baidu will position itself and adjust its strategy in the face of the ever-changing, innovative Internet market; 3. the knowledge I may need to learn and prepare in advance if I am lucky enough to reach the third-round interview or eventually be hired.)
Question 1: In the lab I have recently been responsible for processing massive text data (GB-scale and larger log text) for data mining and analysis, so I directly offered the text-processing methods I am currently using.
Method 1: Use a hashtable's {key, value} mapping: the key records the name attribute from set A and the value records the attribute vector from set B, so the value can then be looked up by key.
Method 2: Use the C++ container class map for a one-to-many mapping (in fact it should be multimap; I forgot that map is a one-to-one relationship, so I am correcting it here).
Method 3: Define a custom mapping. The general idea is an index + vector scheme: the vector can hold structs, and the index is built on the name to improve query efficiency. In essence the problem is how to establish a one-to-many mapping and how to query and match against it efficiently.
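To make the three methods above a little more concrete, here is a minimal Python sketch of a name-to-attributes mapping that allows several records per name (the role multimap plays in C++). The sample records, field order, and the function name build_name_index are all my own assumptions for illustration; they are not from the interview.

```python
from collections import defaultdict

# Hypothetical sample rows: (name, age, sex, scholarship, address).
RECORDS = [
    ("Li Lei", 19, "M", 2000, "Beijing"),
    ("Han Meimei", 21, "F", 3000, "Shanghai"),
    ("Li Lei", 23, "M", 0, "Shenzhen"),  # same name, different person
]


def build_name_index(records):
    """Map each name to the list of attribute tuples that share that name."""
    index = defaultdict(list)
    for name, *attributes in records:
        index[name].append(tuple(attributes))
    return index


name_index = build_name_index(RECORDS)
print(name_index["Han Meimei"])  # one record for this name
print(name_index["Li Lei"])      # two records: the mapping is one-to-many
```

A lookup by name is then a single hash probe on average, at the cost of holding the whole index in memory.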
Question 2: I did not have a complete and efficient solution on the spot, but I gave interviewer GG my general ideas:
Solution 1: Read the records of set A and set B row by row and test the attribute field. For example, if age < 20, print the name; otherwise go straight to the next row. However, GG pointed out that the drawback is that the whole file has to be traversed, which is inefficient.
Solution 2: To improve efficiency, I borrowed the inverted-index idea from full-text search and built an index on the key field, with the index entries pointing to the names in set A. However, GG pointed out that building an index occupies additional storage space, so it is not the optimal solution.
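As a rough illustration of the inverted-index idea in solution 2, the Python sketch below indexes the age field so that a query such as age < 20 only walks the distinct ages instead of every row. The sample data and the names build_age_index and names_younger_than are hypothetical, and the extra dictionary is exactly the additional storage GG objected to.

```python
from collections import defaultdict

# Hypothetical sample rows: (name, age, sex, scholarship, address).
PEOPLE = [
    ("Li Lei", 19, "M", 2000, "Beijing"),
    ("Han Meimei", 21, "F", 3000, "Shanghai"),
    ("Lin Tao", 18, "M", 0, "Guangzhou"),
    ("Lucy", 19, "F", 1000, "Beijing"),
]


def build_age_index(records):
    """Inverted index: age value -> set of names that have that age."""
    index = defaultdict(set)
    for name, age, *_rest in records:
        index[age].add(name)
    return index


def names_younger_than(index, limit):
    """Collect names for every indexed age below the limit, sorted by name."""
    return sorted(
        name
        for age, names in index.items()
        if age < limit
        for name in names
    )


age_index = build_age_index(PEOPLE)
print(names_younger_than(age_index, 20))
```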
Solution 3: To avoid occupying additional storage space, I proposed the group-and-sort (group + order) approach from databases: first group and sort set B by the key field, so that no additional storage is needed and queries are still more efficient than in solution 1. But GG said this was not the best solution either, and told me that the answer is actually very simple, not as complicated as I was making it, and asked me to keep thinking...
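One way to read solution 3 is "sort the records by the key field once, then binary-search for the cut-off". The Python sketch below follows that reading under my own assumptions (sample data, field order, function names); it is only an illustration of the idea, not the simple answer GG had in mind, which the interview never revealed.

```python
# Hypothetical sample rows: (name, age, sex, scholarship, address).
ROWS = [
    ("Li Lei", 19, "M", 2000, "Beijing"),
    ("Han Meimei", 21, "F", 3000, "Shanghai"),
    ("Lin Tao", 18, "M", 0, "Guangzhou"),
    ("Lucy", 19, "F", 1000, "Beijing"),
]


def first_index_with_age_at_least(sorted_rows, limit):
    """Binary search: index of the first row whose age is >= limit."""
    low, high = 0, len(sorted_rows)
    while low < high:
        mid = (low + high) // 2
        if sorted_rows[mid][1] < limit:
            low = mid + 1
        else:
            high = mid
    return low


def names_with_age_below(rows, limit):
    """Sort rows by age in place (no separate index), then slice the prefix."""
    rows.sort(key=lambda row: row[1])
    cut = first_index_with_age_at_least(rows, limit)
    return [row[0] for row in rows[:cut]]


print(names_with_age_below(ROWS, 20))
```

After the one-time sort, each range query costs O(log n) to locate the boundary plus the size of the answer.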
So I kept thinking for a while, quickly searching my mind for a method that would occupy no additional storage space yet still improve efficiency. Then an idea flashed: box computing and cloud computing have been very popular recently, and perhaps their design ideas could be used. So I went on to distributed storage and computation, improving efficiency with a distributed approach. But GG said no, this was still not the best solution to Question 2. Fortunately, GG did not keep wearing down my brain cells and self-confidence on this question and instead moved on to the next one (when a candidate hits a weak spot, GG does not dwell on it but moves on in good time; his questioning skill and philosophy give the candidate enough confidence and courage. Thumbs up.)
Well, out of respect for the intellectual property of the interview questions of Baidu, the technology-innovation company I admire, and for the work of interviewer GG (a very senior figure), I will not go into every specific detail of this interview ~. I have just explained my general thinking on the first question and analyzed the solutions, hoping that everyone will support and encourage this blogger ^_^
However, to finish what I started, I will briefly describe the solutions and details I could think of for questions 2 to 5. I also hope you can help me by proposing better and more optimized methods.
Second, to find a substring in a file with Linux shell you need the grep command; to separate the URL and size fields it seems you need awk (which I have never used and am not familiar with); the two are connected through a pipe (|). At the time, however, I did not use shell commands (I did not dare show off in front of an expert like GG, since it might not have gone well. After the interview I went home and checked the detailed parameters in man sort; for the shell-based solution, see Baidu interview and summary. 4). Instead I solved the problem with an algorithm of my own design, split into three steps: 1. Substring matching: read the data line by line as a byte stream and use the KMP substring-matching algorithm to find all URLs containing "Baidu". 2. Print the size values of the rows that satisfy step 1. 3. Taking the result of step 2, first convert the size field value from a string to an integer type, and then sort with a sorting algorithm. At the time I presented three sorting algorithms: binary insertion sort, quicksort, and heap sort.
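The three steps above can be sketched in Python roughly as follows. The input format (one "URL size" pair per line, separated by whitespace) is my assumption, since the question does not specify it, and the function names are made up for illustration. The sketch also sorts in memory with the built-in sort, which would not be enough for the 10-billion-record scale mentioned in the question; there an external, disk-based sort would be needed.

```python
def build_failure_table(pattern):
    """KMP failure table: failure[i] is the length of the longest proper
    prefix of pattern[:i + 1] that is also a suffix of it."""
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure


def kmp_contains(text, pattern, failure):
    """Return True if pattern occurs anywhere in text (KMP scan)."""
    if not pattern:
        return True
    k = 0
    for ch in text:
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
            if k == len(pattern):
                return True
    return False


def sizes_for_matching_urls(lines, needle="Baidu"):
    """Steps 1-3: filter URLs by substring, convert size to int, sort descending."""
    failure = build_failure_table(needle)
    sizes = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip malformed lines (assumed format: "URL size")
        url, size = parts
        if kmp_contains(url, needle, failure):
            sizes.append(int(size))
    sizes.sort(reverse=True)
    return sizes


# Hypothetical sample lines in the assumed "URL size" format.
SAMPLE_LINES = [
    "http://www.Baidu.com/a 120",
    "http://example.com/b 999",
    "http://tieba.Baidu.com/c 4500",
    "http://Baidu.com/d 7",
]
print(sizes_for_matching_urls(SAMPLE_LINES))  # [4500, 120, 7]
```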
There was also an episode here: GG asked me to write out the code for the sorting algorithm. I started with the binary insertion method, which needs O(n log2 n) comparisons (although the element moves still make it O(n^2) in the worst case). Before this, JJ had given me some advice about my code not being well-formed, so I started from the most basic form: void quicksort(int sizearray[], int low, int high), int i, j, tmp; ... Writing it this way is tedious and time-consuming, and time is precious in an interview, but attention to detail and strictness with ourselves are exactly what an interview tests. Halfway through, I suddenly realized something was wrong: in the binary method I could not remember whether to take the upper or the lower bound at the boundary. So I asked GG quietly whether I could switch to a different sorting algorithm and use quicksort instead of the binary method. GG did not simply answer yes or no. Instead he smiled and asked why, halfway through the code, I suddenly wanted to switch. The philosophy of interview questioning may be reflected right here. GG is subtle and has the art of asking questions: he was asking about the whole invisible thinking process in my head while I wrote the algorithm, the anticipation, reasoning, judgment, and choice (GG's insight into problems and his questioning skill are well worth learning from). So I told GG frankly about my weakness and my worry: I was afraid that, when inserting and shifting in binary insertion sort, I had forgotten the detail of whether the new value goes at the upper or the lower index. I therefore wanted to use quicksort, which I know well, so that I would not have to worry about choosing between the upper and lower bound; and since quicksort uses recursion, the code is concise, clear, and easy to understand. GG smiled and agreed to my request.
Haha, so I quickly wrote out the code and explained to GG in detail how the algorithm traverses and recurses. (For the concrete implementation code of quicksort, the binary method (here meaning binary insertion sort), and heap sort, see my original blog post, Implementation of various basic algorithms (V): sorting algorithms.)
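For readers who do not want to look up the other post, here is a small Python sketch of the two algorithms discussed in this episode, both sorting from largest to smallest as the question requires. It is not the code I wrote in the interview (that was C-style); the parameter names simply mirror the signature mentioned above, and the comments in the binary insertion sort spell out the upper/lower bound detail I was worried about.

```python
def quicksort(size_array, low, high):
    """Sort size_array[low..high] in place, largest first (recursive quicksort)."""
    if low >= high:
        return
    i, j = low, high
    pivot = size_array[low]
    while i < j:
        # Scan from the right for a value larger than the pivot.
        while i < j and size_array[j] <= pivot:
            j -= 1
        size_array[i] = size_array[j]
        # Scan from the left for a value smaller than the pivot.
        while i < j and size_array[i] >= pivot:
            i += 1
        size_array[j] = size_array[i]
    size_array[i] = pivot  # pivot lands in its final position
    quicksort(size_array, low, i - 1)
    quicksort(size_array, i + 1, high)


def binary_insertion_sort(size_array):
    """Sort size_array in place, largest first, using binary insertion."""
    for i in range(1, len(size_array)):
        current = size_array[i]
        low, high = 0, i - 1
        while low <= high:
            mid = (low + high) // 2
            if size_array[mid] >= current:
                low = mid + 1   # equal values stay in front, so the sort is stable
            else:
                high = mid - 1
        # The loop ends with low == high + 1; low is the insertion slot.
        for k in range(i, low, -1):
            size_array[k] = size_array[k - 1]
        size_array[low] = current


first = [120, 4500, 7, 999]
quicksort(first, 0, len(first) - 1)
second = [120, 4500, 7, 999]
binary_insertion_sort(second)
print(first, second)
```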
Third, I was also asked a testing question in the first-round interview, where JJ asked how to test an outdoor vending machine. At that time I thought about how to test the hardware part but did not say it, and jumped straight to the software part. This time I learned from that incomplete, inadequately expressed answer and stated clearly that the hardware and other physical characteristics need to be tested. In fact, testing only needs to follow three basic principles: 1. equivalence class partitioning; 2. boundary value testing; 3. error guessing based on experience (managers, of course, also need to consider economic factors such as benefit and schedule). Then follow these principles to analyze, think up test cases, and summarize. For example, testing a mobile phone divides into hardware and software testing. Hardware testing covers physical characteristics such as temperature, buttons, booting, drop resistance, shock resistance, and touch resistance. The software must be divided into system software and application software, because their functions and usage differ. You also need to consider test priority: from key software (such as the phone book) to secondary software (such as e-books) to non-key software (such as games), and from frequently used to rarely used, so that testing can be planned and carried out in order. At the time I walked through hardware, software, system software, application software, Baidu software versus other application software, and finally a detailed test of the Baidu mobile input method on Android phones. Core idea: equivalence class partitioning, boundary value testing, and experience-based testing of frequent or likely bugs (if you want to learn more about the core ideas and methods of software testing, I recommend The Art of Software Testing by the master Myers).
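To show how the first two principles turn into concrete test cases, here is a small Python sketch. It assumes a purely hypothetical requirement that a phone-book contact name must be 1 to 20 characters long; that limit, and all function names, are invented for illustration and were not part of the interview.

```python
def boundary_values(minimum, maximum):
    """Boundary value testing: values just outside, on, and just inside each limit."""
    return [minimum - 1, minimum, minimum + 1, maximum - 1, maximum, maximum + 1]


def equivalence_class(length, minimum, maximum):
    """Equivalence partitioning: every length falls into one of three classes."""
    if length < minimum:
        return "too_short"
    if length > maximum:
        return "too_long"
    return "valid"


def build_test_cases(minimum, maximum):
    """Pair each boundary length with the class it is expected to fall into."""
    return [
        (length, equivalence_class(length, minimum, maximum))
        for length in boundary_values(minimum, maximum)
    ]


# Hypothetical rule: a contact name must be 1 to 20 characters long.
for length, expected in build_test_cases(1, 20):
    print(f"name length {length:2d} -> expected {expected}")
```

The third principle, error guessing, has no formula; it is the list of cases a tester adds from experience, such as names made only of spaces or emoji.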
Fourth, I gave a detailed account of the project's design ideas, implementation process, some functional implementation details, and technical methods. GG asked me about a search recommendation project on my resume (my advisor has since had me hand it over to a senior, as his graduation thesis project). I explained in detail the basic principle of using SOLR for full-text search and the functional modules I was responsible for: collecting data sources (graduation theses of the Chinese Academy of Sciences, IEEE international conferences, various academic journals, etc.), building the index and database storage with SOLR (SOLR is an improved system wrapped around the open-source Lucene full-text search system), and ASP.NET (C#). I introduced these functional modules in turn. GG seemed very interested in search and ranking, so he went deeper and asked about the details of building and sorting with SOLR. So I described in depth the parts I was in charge of, such as the analyzer's word segmentation and semantic analysis, the indexing and storage of full document text by field, and the segmentation, extraction, and sorting of the user's search string by the parser. This system design question mainly examines the candidate's overall approach to system design, the practical ability to analyze and solve problems, and the level of mastery of the technical details of each functional module. The questions GG asked were all targeted, covering the system as a whole, its core, and its details. He also raised the drawback of our initial design choice of storing data in a MySQL database (the problem GG raised is indeed one our advisor has recently been thinking about, wanting to use text files for massive data storage). That was very far-sighted and is something I should learn from in future system design.
Fifth was mainly a chat, easy and pleasant. I talked about the programming languages I know (C/C++/C# and Python), the development environments I know (VS2008 and GCC on Linux), C# interfaces (client and web development), and database design (I am familiar with MySQL and SQL Server database design and have independently designed database systems with complex logic).
Finally, in the free question session, this time I did not ask naive questions like the one about sleeping in the "space capsule", but focused on Baidu's product strategy and technological innovation (as described above, so I will not repeat it here).
Well, the interview was relaxed and pleasant, and the communication was smooth and friendly. But what impressed me most was GG's interview style: three pieces of paper (two of them my resume), two people (GG and me), one pen (the language of our communication), and no notebook (how would he record my interview?). Simple, direct, and capable, it reflected the wisdom and talent behind GG's confidence. I had finally met the kind of technical expert I look up to, and I will have to ask him more and learn from him in the future ^_^. At the same time, GG's interview style reflects Baidu's core value: simple and reliable.