Project objective: a simple wrapper framework for full-text search, from OSChina
License: Public Domain
Content included:
Index rebuild tool: Indexrebuilder.java
Incremental index build tool: Indexupdater.java
Full-Text Search framework
http://git.oschina.net/oschina/search-framework
TngouDB background
TngouDB is a Chinese search engine database developed by Tengu Network (tngou.net) for the agricultural search engine of its agricultural site. Tengu hopes to draw on the open source community to build TngouDB into a dedicated NoSQL database for Chinese-language indexing.
Brief introduction
TngouDB is a Java-based, cross-platform network database built on Lucene (storage engine), IK (word segmenter), Netty (communication), and other components.
TngouDB simplifies calls to the Lucene API: CRUD operations on data are performed with SQL statements.
Characteristics
TngouDB removes Lucene's single-machine limitation: it can be deployed over the network on a separate server that handles storage and query workloads on its own.
TngouDB also avoids the complexity of Solr: users perform data operations through simple SQL statements. No Lucene or Solr knowledge is needed; ordinary SQL statements are enough.
Documentation
Documentation address: http://www.tngou.net/doc/tndb provides complete installation, configuration, and usage documentation.
Use case
TngouDB is currently an internal test version; please do not use it in production projects. We will keep developing and updating it and will release an official version later.
TngouDB is currently used by the search feature of Tengu Network (http://www.tngou.net/search)
http://git.oschina.net/397713572/TngouDB
This project is the complete source code of TSE, the Peking University search engine (two separate projects: the indexer and the crawler). TSE is the prototype implementation described in the book "Search Engine: Principle, Technology and System"; interested readers can study TSE alongside the book.
The source code download provided by the book, http://sewm.pku.edu.cn/book/, is often inaccessible, so I am publishing here the copy I downloaded earlier for study, with detailed comments added. Besides the comments in the source, there are detailed study notes in my CSDN blog column: http://blog.csdn.net/column/details/Inside-tse.html. I hope this is of some help to beginners.
Catalogue Description:
Tse081227: TSE's web collection subsystem (crawler).
Index: TSE's preprocessing and query service subsystem. This directory is very large, not because there is much source code, but because index/data/tianwang.raw.2559638448, the file holding the raw crawled web page data, is very large.
The original index/data/tianwang.raw.2559638448 file was over 300 MB, and uploading it exceeded the maximum file size allowed by git.oschina.net (100 MB), so I deleted much of its content to get a smaller file. This does not affect the operation of the system, since the file contains only raw crawled web page data and can be much smaller.
http://git.oschina.net/lewsn2008/LBTSE
GSO (Google so)
This is a Google search service written in Node.js. It works as a proxy: it forwards the user's keywords to Google's servers for searching, then returns the results to the user.
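The forwarding step described above can be sketched roughly as follows. This is a minimal illustration and not GSO's actual code: the function name buildSearchRequest, the default host www.google.com, and the /search?q= path are all assumptions made for the example.

```javascript
'use strict';

// Hypothetical helper: turn a user's keyword into options for an upstream
// request. The default host and the path are assumptions, not taken from GSO.
function buildSearchRequest(keyword, host) {
  // URLSearchParams percent-encodes the keyword (spaces become "+").
  const query = new URLSearchParams({ q: keyword.trim() });
  return {
    host: host || 'www.google.com',
    path: '/search?' + query.toString(),
    method: 'GET',
  };
}

// The options could then be passed to https.request(), and the upstream
// response parsed and rendered for the user.
const options = buildSearchRequest('  full text search ');
console.log(options.host, options.path);
```

Keeping the request-building step separate from the network call makes it easy to swap in a different upstream address without touching the rest of the handler.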
Certificate note: the certificates provided in the file list are for testing only; replace them with your own certificates in a production environment.
Deployment installation:
git clone https://git.oschina.net/lenbo/gso.git
cd gso
npm install --production
Run commands:
Test/debug:
npm start
or
node ./bin/run
Production environment
Custom settings
Site name
After the site name is set, it is displayed in the browser title bar and under the homepage logo. Edit the Conf/config.js file, locate the name node, and change it to your own site name:
name: 'Valley Search'
Statistics script
Paste the script into the Views/partials/statistics.ejs file
Homepage random text
Paste the text into Data/words.txt, with sentences separated from one another by a blank line; HTML code is supported.
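One way such a file could be read is sketched below. It is only an illustration of the blank-line-separated format described above; the function names parseWords and pickRandom are hypothetical and are not taken from the GSO source.

```javascript
'use strict';

// Split the contents of a words file into entries separated by blank lines.
function parseWords(text) {
  return text
    .split(/\r?\n\s*\r?\n/) // one or more blank lines end an entry
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0);
}

// Pick one entry; `random` is injectable so the choice can be made predictable.
function pickRandom(entries, random = Math.random) {
  if (entries.length === 0) return '';
  return entries[Math.floor(random() * entries.length)];
}

const sample = 'First <b>sentence</b>\n\nSecond sentence\n\n\nThird sentence\n';
const entries = parseWords(sample);
console.log(entries.length);
console.log(pickRandom(entries));
```

Because entries are inserted as-is, any HTML they contain would be rendered by the page, so the file should only hold trusted text.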
Set up multiple Google IPs to prevent blocking
Put the available IPs in the Conf/ip.txt file, one IP per line.
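A rough sketch of how a one-per-line IP file might be consumed is shown below. The round-robin rotation is an assumption chosen for illustration (the source does not say how GSO picks among the IPs), and the names parseIpList, createIpRotator, and loadIpList are hypothetical.

```javascript
'use strict';

const fs = require('fs');

// Parse the text of an IP list file: one address per line, blank lines ignored.
function parseIpList(text) {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}

// Return a function that cycles through the addresses in order (round-robin),
// so that no single IP receives every request.
function createIpRotator(ips) {
  let next = 0;
  return function nextIp() {
    if (ips.length === 0) return null;
    const ip = ips[next];
    next = (next + 1) % ips.length;
    return ip;
  };
}

// Hypothetical loader; the file path is supplied by the caller.
function loadIpList(path) {
  return parseIpList(fs.readFileSync(path, 'utf8'));
}

// Example with documentation-only addresses (RFC 5737), not real Google IPs.
const nextIp = createIpRotator(parseIpList('192.0.2.1\r\n192.0.2.2\r\n'));
console.log(nextIp(), nextIp(), nextIp());
```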
Setting up an HTTP proxy server
Sometimes a proxy server is needed, for example when a Google IP has stopped working, is temporarily unavailable, or has been blocked by Google. Edit the Conf/config.js file and locate the proxy node:
proxy: {
    enable: false, // whether the proxy is enabled
    timeout: 5000, // timeout; takes effect when enable is true
    host: '', // proxy server address
    port: 80 // proxy server port
}
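To make the role of each setting concrete, here is a small sketch of how such a proxy node could influence where a request connects. The helper resolveConnection and its return shape are hypothetical; GSO's real handling may differ.

```javascript
'use strict';

// Example settings mirroring the proxy node described above.
const config = {
  proxy: { enable: false, timeout: 5000, host: '', port: 80 },
};

// Hypothetical helper: decide where an outgoing request should connect.
// With the proxy disabled (or no host set) the request goes straight to the
// target; with it enabled the request goes to the proxy and the configured
// timeout applies.
function resolveConnection(proxy, targetHost, targetPort) {
  if (proxy.enable && proxy.host) {
    return {
      host: proxy.host,
      port: proxy.port,
      timeout: proxy.timeout,
      viaProxy: true,
    };
  }
  return { host: targetHost, port: targetPort, viaProxy: false };
}

console.log(resolveConnection(config.proxy, 'www.google.com', 443));
```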
Static file compression
The code you get after cloning is uncompressed; it can be compressed with the Grunt tool.
Compress JS and CSS files
Install the Grunt tool: npm install -g grunt-cli
Run in the project root directory: grunt static
Change the R_prefix value in Conf/config.js to /public
Note: before running the grunt command, dependencies must be installed with npm install, not npm install --production
HTML code compression
Set NODE_ENV to production before you start the service, for example: NODE_ENV=production forever start bin/run
Completed features
Added a "related searches" function;
OpenSearch support: IE, Firefox, and Chrome can set the site as the default search engine;
Simple sensitive-word detection; without it the connection is blocked by the firewall or reset;
HTML code compression: the rendered HTML is compressed with the html-minifier module;
Headroom behavior (the search area disappears when the page scrolls down and reappears when the page scrolls up; personally I find this works better on small-screen notebooks and tablets, and especially on mobile phones);
HTTPS support (keyword encryption);
Use cheerio instead of jQuery for parsing;
Autocompletion in the input box;
Search language switching;
Filter results by time period;
When searching with the filetype directive, the result item prefix displays the file type;
Support for setting up multiple Google IPs (2014-12-25);
Added HTTP proxy support (2014-12-28);
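The sensitive-word check in the list above could look something like the following sketch. The word list is a placeholder invented for illustration, and containsSensitiveWord is a hypothetical name; neither comes from the GSO source.

```javascript
'use strict';

// Placeholder list for illustration only; the real list is not given here.
const SENSITIVE_WORDS = ['blockedword', 'another phrase'];

// Return true when the keyword contains any listed word (case-insensitive),
// so the caller can refuse the search instead of forwarding it upstream.
function containsSensitiveWord(keyword, words = SENSITIVE_WORDS) {
  const normalized = keyword.toLowerCase();
  return words.some((word) => normalized.includes(word.toLowerCase()));
}

console.log(containsSensitiveWord('a harmless query'));
console.log(containsSensitiveWord('some BlockedWord here'));
```

A plain substring match like this is deliberately simple; it can produce false positives when a listed word appears inside a longer, unrelated word.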
Todo
[] Optimize tablet display and fonts;
[] Improve the experience on mobile phones;
[] Support keyboard shortcuts;
[] Support Wikipedia search;
[] Improve error logging;
[] Support video metadata retrieval (retrieving playable sources at the same time);
[] Add an online proxy function (for blocked websites that appear in search results);
http://git.oschina.net/lenbo/gso
The code was written a year ago, so the crawlers may no longer work, but they should be easy to fix up from this base.
K:\git\dianying\scripts>tree /f
Folder PATH listing
Volume serial number is EE77-EC45
K:.
│  iqiyi_movie_test.py
│  letv_movie_test.py
│  m1905_movie_test.py
│  pps_movie_test.py
│  pptv_movie_test.py
│  qq_movie_test.py
│  sohu_movie_test.py
│  tudou_movie_test.py
│  xunlei_movie_test.py
│  youku_movie_test.py
│
└─douban
        doubanapi_1.py
        doubanapi_2.py
        doubanapi_3.py
        doubanapi_xj.py
        douban_movie_test.py
Search site
Dianying_web.py presents, as web pages, the hundreds of thousands of records that the crawlers saved to MongoDB, and supports keyword queries.
http://git.oschina.net/awakenjoys/dianying
OSC search engine frameworks: search-framework, TngouDB, GSO