I put together two simple questions about search engines this evening. Both come from the iveely search engine. I am sharing them to draw on everyone's ideas. They are not difficult, but I hope we can find the best solutions.
Question 1:
Background:
During a search, we split the user's query into keywords and then match each one. For example, if the user enters "Program Life", word segmentation yields "program" and "life". We can then fetch the webpage set for "program" (9.00235, 123.00691, 96.00035 ...) and the webpage set for "life" (6.00025, 123.00128, 95.00245 ...). In each entry, the integer part is the webpage number and the fractional part is the weight of the keyword on that webpage. Next, we merge the webpage sets of "program" and "life" and return the result to the user.
Problem: During the merge, the same webpage may appear in both sets. When it does, we add the fractional parts and leave the integer part unchanged. If the summed fractional part would be greater than 1, we first multiply the fractional parts of the entire set by 0.1 and then add them.
Problem to solve: Please design a data structure that handles the merge above with the lowest possible time and space complexity.
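As a starting point for discussion, here is a minimal Python sketch of one possible approach, not iveely's actual implementation. It assumes a hash map keyed by webpage number, which gives a merge in time linear in the total number of entries. The names split_entry and merge_postings are made up for this illustration. The overflow rule is read here as "if any summed weight reaches 1, scale every weight by 0.1 until all of them fit again"; other readings of the rule are possible.

```python
from collections import defaultdict


def split_entry(entry):
    """Split an encoded entry such as 123.00691 into (page_id, weight)."""
    page_id = int(entry)
    return page_id, entry - page_id


def merge_postings(*posting_lists):
    """Merge encoded webpage sets and return {page_id: combined_weight}."""
    totals = defaultdict(float)
    for postings in posting_lists:
        for entry in postings:
            page_id, weight = split_entry(entry)
            totals[page_id] += weight  # same webpage: add the fractional parts

    # Assumed reading of the overflow rule: a sum that reaches 1 would spill
    # into the integer part, so scale every weight by 0.1 until all sums fit.
    scale = 1.0
    largest = max(totals.values(), default=0.0)
    while largest * scale >= 1.0:
        scale *= 0.1
    return {page_id: weight * scale for page_id, weight in totals.items()}


program = [9.00235, 123.00691, 96.00035]
life = [6.00025, 123.00128, 95.00245]
merged = merge_postings(program, life)

# Rank webpages by combined weight, highest first.
ranked = sorted(merged.items(), key=lambda item: item[1], reverse=True)
```

With the sample sets from the background, webpage 123 appears in both sets, so its weights are added and it ranks first. Because the scaling is applied to every weight at once, the relative order of the webpages does not change. Floating-point entries carry small rounding errors, so a real implementation might store the webpage number and the weight separately.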
Question 2:
Background:
In a search engine, each keyword corresponds to a huge number of webpages, and each webpage corresponds to several keywords. Once the search engine receives a keyword, it must fetch the set of webpages containing that keyword as quickly as possible. Currently, the most common approach is the inverted index (inverted file). However, although an inverted file lets us quickly extract the webpages for a keyword, those webpages do not all carry the same weight. That is, the objects to be sorted are stored in no particular order.
Next, we abstract the problem using Beijing subway information. Every station is a keyword and every line is a webpage. Each station lies on multiple lines (each keyword is contained in several webpages), and each line contains multiple stations (each webpage contains multiple keywords).
How the problem arises:
The inverted file lets us quickly extract the lines that pass through a station. For example, a search for Xizhimen returns Metro Line 2, Metro Line 4 and Metro Line 13. However, Metro Line 4 and Metro Line 2 also intersect at another station: Xuanwumen. Why do we need to know about Xuanwumen? In the iveely design, the author believes that when a shared station appears often enough across the search results, it is probably also a station the user is interested in (mathematical proof omitted). For example, a user searching for Xizhimen may be looking for a place to transfer. If many of the returned lines also contain Xuanwumen, we assume that Xuanwumen may be a good transfer option as well.
Problems to be Solved:
Please design a data structure that computes, with the lowest possible time and space complexity, the set of other stations shared by the lines in the search result, sorted by the number of result lines that contain each station. For example, a search for Xizhimen should return Xuanwumen as a recommendation. If there are other such stations, list them in order of their number of occurrences.
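To make the task concrete, here is a rough Python sketch of one straightforward baseline, not the iveely design. It assumes two hash maps: a forward index (line to stations) and an inverted index (station to lines). The function names and the toy data are invented for this illustration; "Station A", "Station B" and "Station C" are placeholders, not real stations.

```python
from collections import Counter, defaultdict


def build_inverted_index(lines):
    """Build station -> set of lines from the forward index line -> stations."""
    station_to_lines = defaultdict(set)
    for line, stations in lines.items():
        for station in stations:
            station_to_lines[station].add(line)
    return station_to_lines


def related_stations(query, lines, station_to_lines):
    """Return (station, count) pairs for the other stations on the lines
    through `query`, most frequent first."""
    counts = Counter()
    for line in station_to_lines.get(query, ()):
        for station in lines[line]:
            if station != query:
                counts[station] += 1
    return counts.most_common()


# Toy forward index; the placeholder stations are hypothetical.
lines = {
    "Line 2": ["Xizhimen", "Xuanwumen", "Station A"],
    "Line 4": ["Xizhimen", "Xuanwumen", "Station B"],
    "Line 13": ["Xizhimen", "Station C"],
}
station_to_lines = build_inverted_index(lines)
recommendations = related_stations("Xizhimen", lines, station_to_lines)
```

In this toy data Xuanwumen lies on two of the three lines returned for Xizhimen, so it comes first in the recommendations. The running time is proportional to the total number of stations on the returned lines, plus the final sort; whether something better is possible is exactly the open question.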
I wrote the questions above myself; they are problems I ran into while developing the open-source iveely. I think they are meaningful because they test not only our thinking and our coding skills but, most importantly, our mathematics. I will post more questions like these so that we can discuss and learn together. You are welcome to follow the iveely search engine. If you have any comments or suggestions, you can email liufanping@iveely.com or contact me on Weibo.