Background
I need to look up the geographic attribution of IP addresses, and the query volume is large. So I decided to collect most of the address allocation data from the Internet: write a spider that crawls the data and stores it in a database, then keep maintaining, updating, and refining it while the program is in use.
Some key points
- Goroutines are used to run the crawl concurrently.
- Regular expression capture groups are used to extract exactly the information we care about.
- Bulk INSERT into the database.
- Bulk UPDATE of the database.
Code walkthrough
The core code is described below, one functional module at a time.
ip.go
The main function, which launches the goroutines.
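func main() {
	// Fetch the page with a helper that wraps Go's standard library (explained later)
	ctx := common.HttpGet("http://ips.chacuo.net/")
	// The regular expression has two capture groups (two pairs of parentheses),
	// which capture the URL and the city name; it is analyzed in detail below
	reg := regexp.MustCompile(`<li><a title="[\S]+" href='([^']+?)'>([^<]+?)</a></li>`)
	// Collect every city on the page together with its URL
	ips := reg.FindAllStringSubmatch(string(ctx), -1)
	ch := make(chan string) // create an unbuffered string channel
	for _, el := range ips {
		// one goroutine handles one page
		go ipSpider.SpiderOnPage(el[1], el[2], ch)
	}
	for range ips {
		// block until every crawl job has finished
		fmt.Println(<-ch)
	}
}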
Regular Expression Description
- Every province has its own entry page. The main function extracts each province's entry link and hands it to a goroutine. Each entry looks like this:
<a title="北京最新IP地址段" href="http://ips.chacuo.net/view/s_BJ">北京</a>
- Note that only three parts vary from entry to entry (the title text, the href value, and the link text), and we need two of them
- The title text is matched by
[\S]+ , one or more non-whitespace characters
- The href value is matched by
([^']+?) , which stops at the first single quote; the question mark makes the match non-greedy, and the parentheses form a capture group so the matched text is easy to pull out
- The link text is matched by
([^<]+?) , which stops at the first <; this is the second capture group
- The FindAllStringSubmatch function returns every match together with its capture groups: index 0 holds the text of the whole match, and the groups start at index 1
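To make the group indexing concrete, here is a minimal sketch that runs the entry pattern from ip.go over two hand-written entries. The sample markup is an assumption written to fit the pattern (note the single-quoted href), the second URL is invented for illustration, and extractEntries is a hypothetical helper, not part of the project.

```go
package main

import (
	"fmt"
	"regexp"
)

// entryPattern is the pattern used in ip.go: group 1 is the URL, group 2 the link text.
var entryPattern = regexp.MustCompile(`<li><a title="[\S]+" href='([^']+?)'>([^<]+?)</a></li>`)

// extractEntries (hypothetical helper) returns one [url, name] pair per matched entry.
func extractEntries(html string) [][2]string {
	var out [][2]string
	for _, m := range entryPattern.FindAllStringSubmatch(html, -1) {
		// m[0] is the whole match; m[1] and m[2] are the two capture groups.
		out = append(out, [2]string{m[1], m[2]})
	}
	return out
}

func main() {
	// Assumed sample markup; the second entry's URL is made up for the example.
	sample := `<li><a title="北京最新IP地址段" href='http://ips.chacuo.net/view/s_BJ'>北京</a></li>` +
		`<li><a title="上海最新IP地址段" href='http://ips.chacuo.net/view/s_SH'>上海</a></li>`
	for _, e := range extractEntries(sample) {
		fmt.Println(e[1], e[0]) // name, then URL
	}
}
```

Passing -1 as the second argument asks for all matches rather than a fixed maximum.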
Goroutine workflow
- Create an unbuffered string channel that carries messages from all the goroutines back to the main function
- Loop over the regular expression matches and start one goroutine for each province's page
- If a goroutine fetches its data successfully, it writes the data to the database in bulk and sends a success message to the channel
- If a goroutine fails, it sends the failure message to the channel
- The main function blocks until every goroutine has reported success or failure, printing each message as it arrives
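The steps above can be sketched without any network or database access. In this sketch fetch is a hypothetical stand-in for "crawl one page and store it", and worker and runAll are illustrative names, not the project's functions. The point it shows is that every goroutine sends exactly one message, on success and on failure alike, so the collecting loop can receive exactly one message per page.

```go
package main

import (
	"errors"
	"fmt"
)

// fetch is a hypothetical stand-in for crawling and storing one page.
func fetch(name string) (int, error) {
	if name == "" {
		return 0, errors.New("empty page name")
	}
	return len(name), nil
}

// worker sends exactly one message to ch, whether fetch succeeds or fails.
func worker(name string, ch chan<- string) {
	n, err := fetch(name)
	if err != nil {
		ch <- "failed: " + err.Error()
		return
	}
	ch <- fmt.Sprintf("ok: %s (%d)", name, n)
}

// runAll starts one goroutine per page and collects one message per goroutine.
func runAll(pages []string) []string {
	ch := make(chan string) // unbuffered: each send waits for a receive
	for _, p := range pages {
		go worker(p, ch)
	}
	results := make([]string, 0, len(pages))
	for range pages {
		results = append(results, <-ch) // blocks until some goroutine reports
	}
	return results
}

func main() {
	for _, r := range runAll([]string{"BJ", "", "SH"}) {
		fmt.Println(r)
	}
}
```

If a worker returned without sending, the receive loop would wait forever, which is why the early-return path in the spider also writes to the channel. The order of the messages is not deterministic.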
Getting the IP address information
This works much like the main function; pay attention to how the case where nothing is matched is handled.
Ipspider.go
// Fetch the page data
ctx := common.HttpGet(url)
//reg := regexp.MustCompile(`<li><a title="[\S]+" href='([^']+?)'>([^<]+?)</a></li>`)
// The two capture groups correspond to the start and the end of an IP range
reg := regexp.MustCompile(`<dd><span class="v_l">([^<]+?)</span><span class="v_r">([^<]+?)</span><div class="clearfix"></div></dd>`)
// Sample: <dd><span class="v_l">49.64.0.0</span><span class="v_r">49.95.255.255</span><div class="clearfix"></div></dd>
// Collect the capture groups of every match
ip := reg.FindAllStringSubmatch(string(ctx), -1)
// Nothing was matched: return early. This matters, because otherwise
// the main function would keep waiting and never finish
if len(ip) == 0 {
	ch <- "There are no data exist."
	return nil
}
Database table creation statement
CREATE TABLE `ip_addr_info` (
  `id` int(11) NOT NULL AUTO_INCREMENT COMMENT 'index, auto-increment',
  `ip_addr_begin` varchar(32) NOT NULL DEFAULT '' COMMENT 'start of IP address range',
  `ip_addr_end` varchar(32) DEFAULT '' COMMENT 'end of IP address range',
  `province` varchar(32) DEFAULT '' COMMENT 'province',
  `ip_comp` varchar(32) DEFAULT '' COMMENT 'carrier',
  PRIMARY KEY (`id`),
  UNIQUE KEY `ip_addr` (`ip_addr_begin`,`ip_addr_end`)
) ENGINE=InnoDB AUTO_INCREMENT=7268 DEFAULT CHARSET=utf8 COMMENT='table';
Bulk write to the database
Loop over the fetched data to build the batch INSERT statement and its input parameters. To see this version of the code, check out commit afc9ebd.
// Needs database/sql, encoding/json, os, strconv, and a MySQL driver registered with database/sql
var vs []interface{} // slice that stores the input parameters
var vss string       // placeholder string, built up piece by piece
for _, el := range ip { // handle all the data
	vs = append(vs, el[1], el[2], province) // each row: start address, end address, province
	vss += "(?,?,?),"                       // placeholders for one row
}
vss = vss[0 : len(vss)-1] // remove the trailing comma

var configs interface{} // database settings come from the configuration file
// For the configuration file contents, see the previous article "Golang Implement MySQL database backup"
fr, err := os.Open("./configs.json")
if err != nil {
	ch <- err.Error()
	return err
}
defer fr.Close()
decoder := json.NewDecoder(fr)
if err = decoder.Decode(&configs); err != nil {
	ch <- err.Error()
	return err
}
confs := configs.(map[string]interface{})
dialect := confs["database_dialect"].(string)
dbConf := confs["db_"+dialect+"_config"].(map[string]interface{})
dbHost := dbConf["db_host"].(string)
dbPort := strconv.FormatFloat(dbConf["db_port"].(float64), 'f', -1, 64)
dbUser := dbConf["db_user"].(string)
dbPass := dbConf["db_pass"].(string)
dbName := dbConf["db_name"].(string)
dbCharset := dbConf["db_charset"].(string)

dao, err := sql.Open(dialect, dbUser+":"+dbPass+"@tcp("+dbHost+":"+dbPort+")/"+dbName+"?charset="+dbCharset)
if err != nil {
	ch <- err.Error()
	return err
}
defer dao.Close() // defer only after the error check

// Assemble the bulk INSERT statement
sqlstr := "insert into ip_addr_info (ip_addr_begin,ip_addr_end,province) values " + vss
stmt, err := dao.Prepare(sqlstr) // prepare the parameterized SQL statement
if err != nil {
	ch <- err.Error()
	return err
}
defer stmt.Close()
rs, err := stmt.Exec(vs...) // execute the statement with its parameters
if err != nil { // error: send the error message
	ch <- err.Error()
	return err
} else { // success: send the success message
	affect, _ := rs.RowsAffected()
	ch <- "Province: " + province + ", affect: " + strconv.FormatInt(affect, 10)
	return nil
}
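The statement-building step can be separated from the database call, which makes it easy to check on its own. The sketch below is one possible way to do that, not the project's code: ipRange and buildBulkInsert are hypothetical names, and the addresses are documentation placeholders. It joins the row placeholders with strings.Join, so there is no trailing comma to trim.

```go
package main

import (
	"fmt"
	"strings"
)

// ipRange is a hypothetical row type used only for this example.
type ipRange struct {
	Begin, End, Province string
}

// buildBulkInsert returns one multi-row INSERT statement and its flattened arguments.
// It returns an empty statement when there are no rows.
func buildBulkInsert(rows []ipRange) (string, []interface{}) {
	if len(rows) == 0 {
		return "", nil
	}
	placeholders := make([]string, len(rows))
	args := make([]interface{}, 0, len(rows)*3)
	for i, r := range rows {
		placeholders[i] = "(?,?,?)" // one group of placeholders per row
		args = append(args, r.Begin, r.End, r.Province)
	}
	query := "insert into ip_addr_info (ip_addr_begin,ip_addr_end,province) values " +
		strings.Join(placeholders, ",")
	return query, args
}

func main() {
	// Placeholder data from the documentation address ranges.
	query, args := buildBulkInsert([]ipRange{
		{"192.0.2.0", "192.0.2.255", "example-a"},
		{"198.51.100.0", "198.51.100.255", "example-b"},
	})
	fmt.Println(query)
	fmt.Println(len(args), "arguments")
}
```

The returned pair can be passed straight to Prepare and Exec as in the code above. Keeping the values in the argument slice, rather than concatenating them into the SQL text, leaves quoting to the driver.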
Bulk update of the database
The ip_comp field in the database holds the carrier information, which has to be collected from the carrier pages. Changing the entry URL and re-running the program crawls those pages correctly, but this time the rows must be updated rather than inserted. To see this version of the code, check out commit 4729e66.
// Prerequisite: the table definition must have a unique index,
// either the primary key or another index declared as unique
...
sqlstr := "insert into ip_addr_info (ip_addr_begin,ip_addr_end,ip_comp) values " + vss +
	// When the unique index conflicts, the update must map each column to its new value
	" ON DUPLICATE KEY UPDATE ip_addr_begin = values(ip_addr_begin), ip_addr_end = values(ip_addr_end), ip_comp = values(ip_comp)"
stmt, err := dao.Prepare(sqlstr)
if err != nil {
	ch <- err.Error()
	return err
}
rs, err := stmt.Exec(vs...)
if err != nil {
	ch <- err.Error()
	return err
} else {
	affect, _ := rs.RowsAffected()
	ch <- "Province: " + province + ", affect: " + strconv.FormatInt(affect, 10)
	return nil
}
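The update clause follows a fixed "column = values(column)" shape, so it can be generated from a list of column names. The following is a small sketch under that assumption; onDuplicateClause is a hypothetical helper and not something the project defines.

```go
package main

import (
	"fmt"
	"strings"
)

// onDuplicateClause builds the " ON DUPLICATE KEY UPDATE ..." suffix for the given columns.
func onDuplicateClause(cols []string) string {
	parts := make([]string, len(cols))
	for i, c := range cols {
		parts[i] = c + " = values(" + c + ")" // take the value from the row being inserted
	}
	return " ON DUPLICATE KEY UPDATE " + strings.Join(parts, ", ")
}

func main() {
	sqlstr := "insert into ip_addr_info (ip_addr_begin,ip_addr_end,ip_comp) values (?,?,?)" +
		onDuplicateClause([]string{"ip_comp"})
	fmt.Println(sqlstr)
}
```

The example lists only ip_comp. When the conflict is on the (ip_addr_begin, ip_addr_end) unique key, the existing row already holds the same begin and end values, so reassigning them should change nothing. Also note that MySQL has deprecated the values() function in this clause since 8.0.20 in favor of row aliases, so newer servers may warn about this form.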
Areas to be improved
Move the entry URL into the configuration file and apply the strategy pattern, so that the matching rules are abstracted into strategies. The goal is to crawl different pages by adjusting the configuration file alone, without changing the program.
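One way this improvement might look is sketched below. Everything in it is an assumption made for illustration: the crawlRule type, the JSON field names, and the loadRules and apply functions do not exist in the project. Each rule pairs an entry URL with a pattern, and the program applies whichever rules the configuration lists.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// crawlRule is a hypothetical strategy: where to start and how to match.
type crawlRule struct {
	Name     string `json:"name"`
	EntryURL string `json:"entry_url"`
	Pattern  string `json:"pattern"`
}

// loadRules parses the rules and rejects any rule whose pattern does not compile.
func loadRules(data []byte) ([]crawlRule, error) {
	var rules []crawlRule
	if err := json.Unmarshal(data, &rules); err != nil {
		return nil, err
	}
	for _, r := range rules {
		if _, err := regexp.Compile(r.Pattern); err != nil {
			return nil, fmt.Errorf("rule %q: %v", r.Name, err)
		}
	}
	return rules, nil
}

// apply returns the capture groups of every match of the rule's pattern in page.
func (r crawlRule) apply(page string) [][]string {
	re := regexp.MustCompile(r.Pattern) // safe: loadRules already validated it
	var out [][]string
	for _, m := range re.FindAllStringSubmatch(page, -1) {
		out = append(out, m[1:]) // drop index 0, the whole match
	}
	return out
}

func main() {
	// Assumed configuration content; in practice this would be read from a file.
	cfg := []byte(`[{"name":"ranges","entry_url":"http://ips.chacuo.net/",
		"pattern":"<span class=\"v_l\">([^<]+?)</span><span class=\"v_r\">([^<]+?)</span>"}]`)
	rules, err := loadRules(cfg)
	if err != nil {
		fmt.Println(err)
		return
	}
	page := `<dd><span class="v_l">49.64.0.0</span><span class="v_r">49.95.255.255</span><div class="clearfix"></div></dd>`
	fmt.Println(rules[0].apply(page))
}
```

Validating the patterns at load time means a typo in the configuration is reported once at startup instead of inside a goroutine.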
Project Address
https://github.com/zhoutk/goTools
How to use
git clone https://github.com/zhoutk/goTools
cd goTools
go get
go run ip.go
go build ip.go
./ip
Summary
This project was a way to get familiar with the Go language, learn a new concurrent programming model, practice concrete database operations, and build a handy tool for myself.