How to Implement a Small Web Search Engine (C# + SQL Server Full-Text Search + ASP.NET)


1 Introduction

In the 21st century, China's Internet search market can only be described as crowded: Baidu, Yahoo, Sogou, and others all compete for users' attention. Each of these big sites has its own strengths, and their search capability is strong overall, but their results are largely the same: information redundancy is high, and users either wade through page after page or rack their brains for exactly the right keyword. A targeted search engine, one that is also controllable, would be a welcome alternative.

2 The Origin of Soso

I have had my share of painful experiences searching the web. I enjoy programming and often go online for technical material, but the results frequently come from obscure small sites that mainly republish documents from major technical sites such as CSDN and Saidi Net. Worse, these small sites often reproduce articles incompletely, pile on advertising, pop up windows at random, and sometimes carry viruses and trojans. I thought an "IT technical document search engine" would be useful; none existed, so I built one myself.

I stored CSDN, IT168, Saidi Net, and other IT technology sites in a "searched-sites library" and ran the Spider program on a fixed schedule (the spider's principles are described below). The Spider writes its results to the hard disk in a fixed temporary format, and the Carrier program then moves them asynchronously into the database. A query interface built with ASP.NET on top of SQL Server's powerful full-text search (no LIKE statements!) completed the Soso prototype. Because Soso searches only a specific, small set of sites, its data updates quickly; and because the sites are screened in advance, the result quality is relatively high, giving a better user experience in this niche than a general-purpose search engine.

Later, a teacher at the school's network center suggested building a search engine covering all the information on the school's websites, so I wrote the Scanner program, whose job is to find all the web sites within a given IP range and store their main information in the searched-sites library. The result was "Huazhong Normal University's own web search engine: Mysoso", at http://it.ccnu.edu.cn/mysoso. After launch the site was praised by students, and school leaders commended it by name at a conference on campus network construction.
One classmate said: "I once wanted to look up information about the president of the student union, but the Google and Baidu results were poor because too many people share that name. Mysoso works much better: the pages it finds come from the school's own major websites, so they are genuine and reliable."

3 Technical Description of Soso

3.1 Working Environment of Soso

Software environment: Windows (Windows 2000 or Windows Server 2003 recommended) + .NET Framework 1.1 + SQL Server 2000. Hardware environment: one server; the higher the specification, the better. Multiple servers are better still, since the Spider can then run on several machines in parallel.

3.2 The Basic Principles of Soso

Soso consists of five main parts: the database, WebScanner, WebSpider, Carrier, and the ASP.NET website.

The database has three main tables: the site table, the web-page table, and the keyword table. The site table stores the web sites the Spider should visit along with their basic information; the web-page table stores the basic information of the pages the Spider finds; and the keyword table records the keywords users have searched for and their frequencies. The database also holds stored procedures for the other modules to invoke, and, because SQL Server's full-text retrieval is used, full-text index files are created on it.
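To make the setup concrete, here is a minimal sketch of what such a schema and its full-text index might look like on SQL Server 2000. The table and column names (Site, Webpage, Keyword, PageText, and so on) are assumptions for illustration, not the author's actual schema; SQL Server 2000 configures full-text search through the sp_fulltext_* system stored procedures rather than the CREATE FULLTEXT syntax of later versions.

```sql
-- Hypothetical minimal schema; the real column lists will differ.
CREATE TABLE Site    (SiteId INT PRIMARY KEY, Url VARCHAR(255), Title VARCHAR(255))
CREATE TABLE Webpage (WebpageId INT CONSTRAINT PK_Webpage PRIMARY KEY,
                      Url VARCHAR(255), Title VARCHAR(255), PageText TEXT)
CREATE TABLE Keyword (KeywordId INT PRIMARY KEY, Word VARCHAR(100), Frequency INT)

-- SQL Server 2000 full-text setup: enable the database, create a catalog,
-- register the table and column, then start a full population.
EXEC sp_fulltext_database 'enable'
EXEC sp_fulltext_catalog 'SosoCatalog', 'create'
EXEC sp_fulltext_table   'Webpage', 'create', 'SosoCatalog', 'PK_Webpage'
EXEC sp_fulltext_column  'Webpage', 'PageText', 'add'
EXEC sp_fulltext_table   'Webpage', 'start_full'
```

Once the population finishes, CONTAINS and FREETEXT predicates against Webpage.PageText become available to the other modules.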

WebScanner is a console application written in C# that scans all the web sites within a given IP range and stores their basic information in the database. Because it is multithreaded, scanning is fast: in testing, scanning our school's IP range 202.114.32.1~202.114.47.255 found 89 sites in only 45 seconds.
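The article does not show WebScanner's code, but the core of such a scanner is simple: enumerate every address in the range and probe each one for a listening web server. The sketch below (class and method names are my own, not the author's) shows the two building blocks; the real program would dispatch HasWebServer probes across worker threads for speed.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Sockets;

class WebScanner
{
    // Pack a dotted-quad IPv4 address into an unsigned 32-bit integer.
    static uint ToUInt(string ip)
    {
        byte[] b = IPAddress.Parse(ip).GetAddressBytes();
        return ((uint)b[0] << 24) | ((uint)b[1] << 16) | ((uint)b[2] << 8) | b[3];
    }

    // Enumerate every IPv4 address from start to end, inclusive.
    public static List<string> ExpandRange(string start, string end)
    {
        var result = new List<string>();
        for (uint a = ToUInt(start); a <= ToUInt(end); a++)
            result.Add(string.Format("{0}.{1}.{2}.{3}",
                (a >> 24) & 255, (a >> 16) & 255, (a >> 8) & 255, a & 255));
        return result;
    }

    // A host "has a web site" if something answers on TCP port 80
    // within the timeout; BeginConnect lets us enforce that timeout.
    public static bool HasWebServer(string ip, int timeoutMs)
    {
        using (var client = new TcpClient())
        {
            IAsyncResult ar = client.BeginConnect(ip, 80, null, null);
            return ar.AsyncWaitHandle.WaitOne(timeoutMs) && client.Connected;
        }
    }
}
```

For each responsive host, the scanner would then fetch the home page, pull out the title, and insert a row into the site table.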

WebSpider is also a console application written in C#. It reads the sites to visit from the database and crawls their pages. Page parsing is based on regular expressions (which adapt to all kinds of pages); a WebPage class extracts, for a given page, all of its links, inbound-link text, plain text, page size, title, and so on. The extracted page data is placed in a global queue structure in memory, which is periodically serialized to a file on the hard disk and then emptied. Internally, WebSpider uses multiple threads, each maintaining its own breadth-first traversal queue, so it is very fast: in tests on the Huazhong Normal campus network it crawled 1,050 pages per minute on average. Parameters such as the maximum number of concurrent threads, thread lifetime, search depth, data serialization period, and site-specific filters can be set in the spider's configuration file.
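The regex-based extraction that the WebPage class performs might look like the sketch below. This is my reconstruction, not the author's code; the patterns are deliberately simple, and a production spider would also need to resolve relative URLs and handle malformed HTML.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class WebPage
{
    // Matches <a ... href="..."> (double-, single-, or un-quoted) and captures
    // the URL (group 1) and the anchor's inner HTML (group 2).
    static readonly Regex LinkRegex = new Regex(
        "<a[^>]+href\\s*=\\s*[\"']?([^\"'\\s>]+)[\"']?[^>]*>(.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    static readonly Regex TagRegex = new Regex("<[^>]+>");

    static readonly Regex TitleRegex = new Regex(
        "<title[^>]*>(.*?)</title>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Return (url, link text) pairs for every anchor on the page.
    public static List<KeyValuePair<string, string>> ExtractLinks(string html)
    {
        var links = new List<KeyValuePair<string, string>>();
        foreach (Match m in LinkRegex.Matches(html))
            links.Add(new KeyValuePair<string, string>(
                m.Groups[1].Value,
                TagRegex.Replace(m.Groups[2].Value, "").Trim()));
        return links;
    }

    public static string ExtractTitle(string html)
    {
        Match m = TitleRegex.Match(html);
        return m.Success ? m.Groups[1].Value.Trim() : "";
    }

    // Strip all tags to get the page's plain text for full-text indexing.
    public static string PlainText(string html)
    {
        return TagRegex.Replace(html, " ");
    }
}
```

Each thread would pop a URL from its breadth-first queue, download the page, run these extractors, push the discovered links back onto the queue, and append the page record to the shared in-memory queue awaiting serialization.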

Carrier is a batch file whose job is to "move" WebSpider's serialized output from the hard disk into the database. Why doesn't WebSpider insert the data into the database directly? Because when SQL Server receives a large volume of insert requests its efficiency drops, which in turn slows the queries issued by the front-end ASP.NET site. I therefore adopted an asynchronous design: WebSpider is responsible only for collecting data, while Carrier is responsible for inserting it into the database, so with sensible scheduling the bottleneck can be avoided. The benefit of this asynchronous arrangement is even more apparent when WebSpider runs on several machines.
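A carrier batch file in this style might be as short as the sketch below, which bulk-loads a character-format data file with SQL Server's bcp utility and then archives it so it is never loaded twice. The database name, paths, and credentials here are placeholders, not the author's actual values.

```bat
REM Hypothetical carrier.bat: bulk-load the spider's serialized output.
bcp SosoDB..Webpage in D:\spiderout\pages.dat -c -S localhost -U soso -P secret
move D:\spiderout\pages.dat D:\spiderout\archive\
```

Scheduling this file (for example with the Windows Task Scheduler) during quiet periods is what keeps the bulk inserts from competing with front-end queries.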

The search site is built with ASP.NET; anyone who has built a web site knows the basics, so I will mention only Soso's three highlights.

The first is word-segmentation handling. Since I am no expert in segmentation algorithms, I use Split() plus SQL Server's FREETEXT function to implement fuzzy querying. The idea is this: when a user searches for, say, "Andy Lau MP3", the site first issues a CONTAINS query with AND logic ("Andy Lau" AND "MP3"). If that returns no records, it retries with OR logic ("Andy Lau" OR "MP3"). If there are still no records, it falls back to SQL Server's FREETEXT, which performs its own word breaking and may return records matching any of the words, ranked by relevance.

Second, the site's pagination follows an "on-demand" principle: each request reads only rows m through m+PageSize-1 from the database, so query speed remains very good.

Keyword highlighting also uses a trick: the regular expressions that colour previously searched keywords are precompiled and cached in application-level global variables, so repeat searches for the same word are very fast.

Finally, the column on the right of the page currently shows campus news. It simply reads an RSS feed and displays it; the feed is produced by another system, which collects news from the five major portals and publishes it as XML.
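The AND-then-OR-then-FREETEXT fallback described above can be sketched as follows. This is my reconstruction, not the author's code: the table and column names (Webpage, PageText) are assumptions, and in practice each query would be executed in turn via ADO.NET until one returns rows.

```csharp
using System;

class SearchService
{
    // Wrap each word in double quotes for CONTAINS; strip embedded quotes
    // so user input cannot break out of the full-text expression.
    static string Quote(string term)
    {
        return "\"" + term.Replace("\"", "") + "\"";
    }

    // Build the three fallback queries: CONTAINS with AND, CONTAINS with OR,
    // then FREETEXT, which does its own word breaking and relevance ranking.
    public static string[] BuildQueries(string userInput)
    {
        string[] words = userInput.Split(
            new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        string andExpr = string.Join(" AND ", Array.ConvertAll(words, Quote));
        string orExpr  = string.Join(" OR ",  Array.ConvertAll(words, Quote));
        return new[]
        {
            "SELECT * FROM Webpage WHERE CONTAINS(PageText, '" + andExpr.Replace("'", "''") + "')",
            "SELECT * FROM Webpage WHERE CONTAINS(PageText, '" + orExpr.Replace("'", "''") + "')",
            "SELECT * FROM Webpage WHERE FREETEXT(PageText, '" + userInput.Replace("'", "''") + "')"
        };
    }
}
```

The caller runs the queries in order and stops at the first one that returns records, so exact multi-word matches always outrank looser ones.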
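The "rows m through m+PageSize-1" paging mentioned above has to be done without ROW_NUMBER() on SQL Server 2000; the usual pattern is nested TOP clauses, as in this sketch (table and key names again hypothetical):

```sql
-- Fetch rows 21..30 (page 3 with PageSize = 10): skip the first 20 keys,
-- then take the next 10. Assumes WebpageId is the paging key.
SELECT TOP 10 WebpageId, Title, Url
FROM Webpage
WHERE WebpageId NOT IN (SELECT TOP 20 WebpageId FROM Webpage ORDER BY WebpageId)
ORDER BY WebpageId
```

Only one page of rows ever crosses the wire, which is why the site's query time stays flat as the result set grows.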

Sunjoy (ccnusjy@gmail.com), Department of Information Technology, Huazhong Normal University


