I recently got hold of nearly 1.2 billion global NS (reverse DNS) records. The original plan was to do nationwide reverse DNS lookups by domain, then nationwide site collection and scanning, but the site counts turned out not to be very accurate, and with one person's time and money such a huge task was simply impossible to finish. So I didn't go on; what's left are these build notes.
The data came as plain text, and a simple text search was far too slow: a single lookup took 5-10 minutes. I decided to load it into a database to speed things up. My machine only had an AMP (Apache/MySQL/PHP) environment, so MySQL it was.
I first tried importing with Navicat. The data is just in ip,ns format, yet after nearly 5 hours less than 1% had been imported, and the plain-text file is 54 GB!
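For reference, the dump is just one comma-separated ip,ns pair per line; a minimal Python sketch of reading that format (the sample lines are taken from the query results shown later in this post):

```python
# Each line of the dump is "ip,ns"; parsing is just a first-comma split.
sample = """123.125.104.176,w-176.service.weibo.com
202.106.169.235,staff.weibo.com"""

records = []
for line in sample.splitlines():
    ip, ns = line.split(",", 1)  # split on the first comma only
    records.append((ip, ns))

print(records[0])  # → ('123.125.104.176', 'w-176.service.weibo.com')
```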
I later found that MySQL's own LOAD DATA LOCAL INFILE took only about 30 minutes. The first time I imported I forgot to create a key, so I had to do the import all over again:
    mysql> LOAD DATA LOCAL INFILE 'e:\\dns\\rite\\20141217-rdns.txt' INTO TABLE dns FIELDS TERMINATED BY ',';
    Query OK, 1194674130 rows affected, 1700 warnings (min 26.65 sec)
    Records: 1194674130  Deleted: 0  Skipped: 0  Warnings: 1700
Because an id field had been added, the import speed dropped noticeably, but it still finished importing the 55 GB of data in roughly 1.5 hours.
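As a quick sanity check on those figures (1,194,674,130 rows and 55 GB in about 1.5 hours), the sustained throughput works out to roughly:

```python
# Back-of-the-envelope throughput from the numbers reported above.
rows = 1_194_674_130
size_mb = 55 * 1024       # 55 GB expressed in MB
seconds = 1.5 * 3600      # 1.5 hours

print(f"{rows / seconds:,.0f} rows/s")   # roughly 221,000 rows per second
print(f"{size_mb / seconds:.1f} MB/s")   # roughly 10 MB per second
```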
Next came the indexes. Since I need fuzzy queries, I built a FULLTEXT index plus a BTREE index, which took almost 3 days. At one point I accidentally closed the MySQL window and thought it was all over, but it turned out MySQL was still quietly building the index in the background.
With the indexes in place, queries are clearly faster than without them:
    SELECT * FROM ns WHERE ns LIKE '%weibo.com';
That still took 210 seconds; too slow. The leading wildcard is the problem: a suffix match cannot use the BTREE index, so MySQL scans the whole table.
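The effect can be illustrated with SQLite's query planner (SQLite here is only a stand-in for MySQL, and the table and index names are made up for the demo):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA case_sensitive_like = ON")  # lets SQLite use a btree index for prefix LIKE
con.execute("CREATE TABLE ns (id INTEGER PRIMARY KEY, ip TEXT, ns TEXT)")
con.execute("CREATE INDEX idx_ns ON ns(ns)")

# Prefix match: the planner can turn it into an index range search.
prefix = str(con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM ns WHERE ns LIKE 'weibo.%'").fetchall())
# Suffix match (leading wildcard): no usable index, full table scan.
suffix = str(con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM ns WHERE ns LIKE '%weibo.com'").fetchall())

print("idx_ns" in prefix)   # index is used for the prefix pattern
print("idx_ns" in suffix)   # but not for the leading-wildcard pattern
```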
So I turned to Sphinx to speed up the lookups.
I downloaded the 64-bit Sphinx build with MySQL support from the official site,
then edited the configuration file, putting the MySQL account and password into the source section:
    source src1
    {
        sql_host      = localhost
        sql_user      = root
        sql_pass      = root
        sql_db        = ns
        sql_port      = 3306
        sql_query     = SELECT id, ip, ns FROM ns   # the query that feeds the index goes here
        sql_attr_uint = id
    }
searchd also needs configuring; setting the listen ports plus the log and PID file paths is enough:
    searchd
    {
        listen    = 9312
        listen    = 9306:mysql41
        log       = e:/phpstudy/splinx/file/log.log
        query_log = e:/phpstudy/splinx/file/query.log
        pid_file  = e:/phpstudy/splinx/file/searchd.pid
    }
Then switch to Sphinx's bin directory, build the index, and start the search daemon:

    indexer test1    # test1 is the name of your source
    searchd
Building the index took me somewhere under 2 hours, and it was done.
Then switch to the api directory and run:
    E:\phpstudy\splinx\api>test.py asd
    DEPRECATED: Do not call this method or, even better, use SphinxQL instead of an API query

    Query 'asd' retrieved 1000 of 209273 matches in 0.007 sec
    Query stats:
        'asd' found 209291 times in 209273 documents

    Matches:
    1. doc_id=20830, weight=1
    2. doc_id=63547, weight=1
    3. doc_id=96147, weight=1
    4. doc_id=1717000, weight=1
    5. doc_id=2213385, weight=1
    6. doc_id=3916825, weight=1
    7. doc_id=3981791, weight=1
    8. doc_id=5489598, weight=1
    9. doc_id=9348383, weight=1
    10. doc_id=18194414, weight=1
    11. doc_id=18194415, weight=1
    12. doc_id=18195126, weight=1
    13. doc_id=18195517, weight=1
    14. doc_id=18195518, weight=1
    15. doc_id=18195519, weight=1
    16. doc_id=18195520, weight=1
    17. doc_id=18195781, weight=1
    18. doc_id=18195782, weight=1
    19. doc_id=18200301, weight=1
    20. doc_id=18200303, weight=1
Testing showed the speed really was fast, so I wrote a PHP script to call it:
    <?php
    include 'sphinxapi.php';

    $conn = mysql_connect('127.0.0.1', 'root', 'root');
    mysql_select_db('ns', $conn);

    $sphinx = new SphinxClient();
    $now = time();
    $sphinx->SetServer('127.0.0.1', 9312);

    // Sphinx returns matching document ids; fetch the full rows from MySQL.
    $result = $sphinx->Query('weibo.com', 'test1');
    foreach ($result['matches'] as $key => $val) {
        $sql = "SELECT * FROM ns WHERE id = '{$key}'";
        $res = mysql_fetch_array(mysql_query($sql));
        echo "{$res['IP']}:{$res['NS']}\n";
    }
    echo time() - $now;
    ?>
That basically achieves sub-second lookups! The elapsed time printed at the end was 0 seconds:
    123.125.104.176:w-176.service.weibo.com
    123.125.104.178:w-178.service.weibo.com
    123.125.104.179:w-179.service.weibo.com
    123.125.104.207:w-207.service.weibo.com
    123.125.104.208:w-208.service.weibo.com
    123.125.104.209:w-209.service.weibo.com
    123.125.104.210:w-210.service.weibo.com
    202.106.169.235:staff.weibo.com
    210.242.10.56:weibo.com.tw
    218.30.114.174:w114-174.service.weibo.com
    219.142.118.228:staff.weibo.com
    60.28.2.221:w-221.hao.weibo.com
    60.28.2.222:w-222.hao.weibo.com
    60.28.2.250:w-222.hao.weibo.com
    61.135.152.194:sina152-194.staff.weibo.com
    61.135.152.212:sina152-212.staff.weibo.com
    65.111.180.3:pr1.cn-weibo.com
    160.34.0.155:srm-weibo.us2.cloud.oracle.com
    202.126.57.40:w1.weibo.vip.hk3.tvb.com
    202.126.57.41:w1.weibo.hk3.tvb.com
    0
Sphinx + MySQL + PHP: sub-second queries over 1.2 billion DNS records.