A General Forum Body Crawler
This was the problem our team took on in this year's fifth Teddy Cup. Although we only won a third prize in the final, I learned a lot working through it with my teammates, so I am recording it here.
1. Simple Introduction
The goal of the competition is to design an algorithm that, given the HTML text of any BBS-style web page, intelligently extracts the page's main post and all of its replies.
The problem statement is at http://www.tipdm.org/jingsa/1030.jhtml?cName=ral_100#sHref.
2. Preliminary preparation
Having never touched crawlers before, my teammates and I first surveyed the mainstream languages and frameworks for web crawling and eventually chose Python with BeautifulSoup (bs4), which is friendly to beginners. We then picked up the basics a crawler needs: simple Python, regular expressions, the urllib modules, and so on.
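To give a feel for the tooling, here is a minimal sketch of the kind of bs4 usage we started from. The URL and the CSS selector are made-up placeholders; every real forum needs its own rule, which is exactly the pain point the rest of this post tries to remove.

```python
import requests
from bs4 import BeautifulSoup

def fetch_posts(url):
    # Download the raw HTML of a forum thread page.
    resp = requests.get(url, timeout=10)
    resp.encoding = resp.apparent_encoding  # many Chinese forums use GBK
    soup = BeautifulSoup(resp.text, "html.parser")
    # Hypothetical selector: each real forum needs its own rule,
    # which is the problem the template approach below tries to solve.
    return [div.get_text(strip=True) for div in soup.select("div.post-content")]

if __name__ == "__main__":
    for text in fetch_posts("http://example.com/thread-1.html"):
        print(text)
```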
Studying the problem, we learned that crawl targets fall into static pages, dynamic pages, and web services. We only studied static pages: dynamic pages are more complex, and with time so tight we did not dig into them, nor into the anti-crawling measures some sites deploy. (I think this is part of why we lost points.) So what follows mainly describes how we designed a generic crawler framework for static web pages.
Ideas:
For any one web site we could crawl what we want with regular expressions, but that clearly does not generalize. So instead we treat the whole page as a DOM tree, analyze the DOM to obtain the features of the main post node and the reply nodes, and cluster pages with similar features. We chose DBSCAN as the clustering algorithm because it finds the number of clusters automatically, with no need to set it by hand. Each cluster then yields a unified template, which greatly reduces our workload.
3. The whole process
Starting from the 177 officially provided URLs, we crawled 736 forum URLs of our own. We then clustered the 736 pages to form the templates and used the 177 official URLs for testing.
Analyzing the 736 crawled URLs, we found that most forum sites are built on open-source frameworks, with Discuz the most common. However, different versions of the same open-source framework have different structures, so one template cannot cover them all.
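As an illustration of how a page's framework can be recognized, checking the generator meta tag is often enough (Discuz installs, for instance, normally emit a `<meta name="generator" content="Discuz! ...">` tag). This is a sketch under that assumption, not our exact classification code:

```python
from bs4 import BeautifulSoup

def guess_framework(html_text):
    # Guess the forum framework from the generator meta tag, if present.
    soup = BeautifulSoup(html_text, "html.parser")
    meta = soup.find("meta", attrs={"name": "generator"})
    content = (meta.get("content", "") if meta else "").lower()
    for name in ("discuz", "phpwind", "dvbbs"):
        if name in content:
            return name
    return "non-open-source"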
Structure Similarity calculation:
First, we parse the page structure and obtain the XPath of the main post node and of each reply node. The XPath features of a single web page $i$ can then be expressed as the set of these paths, $X_i = \{x_1, x_2, \dots, x_{N_i}\}$.
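A rough sketch of how such features can be collected with lxml (`getpath()` is a real lxml API; the text-length heuristic for deciding what counts as a candidate post node is only an assumption for illustration):

```python
from lxml import html

def xpath_features(page_html, text_min_len=50):
    # Collect the XPaths of text-heavy nodes as the page's feature set.
    tree = html.fromstring(page_html)
    root = tree.getroottree()
    feats = set()
    for el in tree.iter():
        if not isinstance(el.tag, str):
            continue  # skip comments and processing instructions
        text = (el.text or "").strip()
        if len(text) >= text_min_len:
            feats.add(root.getpath(el))  # e.g. /html/body/div[2]/div[5]/td[1]
    return feats
```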
Then the DBSCAN clustering algorithm is applied, where the distance between two web pages $i$ and $j$ is defined as

$$d(i, j) = \frac{N_i \cdot N_j + 1}{\mathrm{overlap}^2 + 1} - 1 \qquad (2)$$

where $N_i$ is the number of features on page $i$, $N_j$ the number on page $j$, and $\mathrm{overlap}$ the number of identical features the two pages share. The more identical features two pages have, the closer the value of formula (2) gets to 0.
Note: before clustering, each XPath is preprocessed to strip irrelevant parts such as digits, array indices, and symbols (e.g. div[3] becomes div).
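Here is a small worked version of formula (2) including that preprocessing, consistent with the `dist()` function in the full code at the end of the post; the sample XPaths in the comment are made up:

```python
import re

def normalize(step):
    # Strip digits and array indices, e.g. 'div[3]' -> 'div', per the note above.
    return re.sub(r"\[+\]", "", re.sub(r"\d+", "", step))

def xpath_distance(xpath1, xpath2):
    steps1 = xpath1.strip("/").split("/")
    steps2 = xpath2.strip("/").split("/")
    # Count matching steps up to the length of the shorter path.
    overlap = sum(1 for a, b in zip(steps1, steps2) if normalize(a) == normalize(b))
    # Formula (2): identical structures give 0, disjoint ones a large distance.
    return (len(steps1) * len(steps2) + 1) / (overlap ** 2 + 1) - 1

# e.g. xpath_distance('/html/body/div[2]/p', '/html/body/div[5]/p') == 0.0
```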
Content Similarity calculation:
This is mainly the similarity of the two URLs: the second half of each URL (the part after the domain) is parsed and compared.
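The exact URL-similarity formula did not survive here, so the sketch below assumes a simple Jaccard-style overlap on the path segments after the domain, purely for illustration:

```python
from urllib.parse import urlparse

def url_similarity(url1, url2):
    # Assumed measure: same domain required, then Jaccard overlap of
    # the path segments (the "second half" of the URL).
    p1, p2 = urlparse(url1), urlparse(url2)
    if p1.netloc != p2.netloc:
        return 0.0
    seg1 = set(p1.path.strip("/").split("/"))
    seg2 = set(p2.path.strip("/").split("/"))
    return len(seg1 & seg2) / len(seg1 | seg2)
```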
Overall Web page similarity calculation:
The overall similarity is a weighted sum over features,

$$sim(S_1, S_2) = \sum_i w_i \cdot sim_i(S_1, S_2),$$

where $S_1$ and $S_2$ are each a web page or a cluster center, $w_i$ is the weight of feature $i$, and $sim_i$ is the similarity on feature $i$. After the initial clusters are obtained with DBSCAN, the feature library is updated continuously from the test data, which lets the weights adjust dynamically and improves the clustering.
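A sketch of that weighted combination, reusing `xpath_distance()` and `url_similarity()` from the sketches above; the distance-to-similarity mapping and the initial weights are assumptions, since in the real system the weights were re-estimated from test data:

```python
def structure_similarity(xpath1, xpath2):
    # Assumed mapping of the formula (2) distance into [0, 1].
    return 1.0 / (1.0 + xpath_distance(xpath1, xpath2))

def overall_similarity(page, center, weights=None):
    # page / center: dicts like {"url": ..., "xpath": ...}; the initial
    # weights are placeholders to be re-estimated as test data accumulates.
    weights = weights or {"structure": 0.7, "url": 0.3}
    sims = {
        "structure": structure_similarity(page["xpath"], center["xpath"]),
        "url": url_similarity(page["url"], center["url"]),
    }
    return sum(weights[f] * sims[f] for f in weights)
```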
Body Extraction Process
By matching a page's URL and XPath against the templates, we can identify and filter forum pages and then identify and extract the body information. And as more and more different sites are tested, the XPath library and template library keep growing richer; it is a continuous learning process.
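Once a template matches, extraction itself is just applying the stored XPaths. The template record below is hypothetical (loosely Discuz-like) and only shows the shape of the idea:

```python
from lxml import html

# Hypothetical template record; real templates came from the cluster centers.
template = {
    "main_post": "//div[@id='postlist']/div[1]//td[@class='t_f']",
    "replies": "//div[@id='postlist']/div[position()>1]//td[@class='t_f']",
}

def extract_with_template(page_html, template):
    # Apply the template's XPaths to pull out the main post and the replies.
    tree = html.fromstring(page_html)
    main = [el.text_content().strip() for el in tree.xpath(template["main_post"])]
    replies = [el.text_content().strip() for el in tree.xpath(template["replies"])]
    return main, replies
```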
Clustering results for different parameters:

e = 0, minpts = 4

| Cluster category | Proportion | Web page category |
| --- | --- | --- |
| 1 | 0.667 | Discuz |
| 8 | 0.089 | Non-open source |
| 5 | 0.0278 | Phpwind |
| 2 | 0.0222 | Dvbbs |
| 10 | 0.0222 | Non-open source |

e = 0, minpts = 8

| Cluster category | Proportion | Web page category |
| --- | --- | --- |
| 1 | 0.705 | Discuz |
| 5 | 0.092 | Phpwind |
| 2 | 0.041 | Dvbbs |
| 6 | 0.023 | Non-open source |
| 10 | 0.023 | Non-open source |

e = 1, minpts = 4

| Cluster category | Proportion | Web page category |
| --- | --- | --- |
| 1 | 0.630 | Discuz |
| 3 | 0.205 | Non-open source |
| 9 | 0.123 | Non-open source |
| 4 | 0.0871 | Phpwind |
| 2 | 0.051 | Dvbbs |

e = 1, minpts = 8

| Cluster category | Proportion | Web page category |
| --- | --- | --- |
| 1 | 0.628 | Discuz |
| 3 | 0.129 | Non-open source |
| 2 | 0.087 | Dvbbs |
| 4 | 0.051 | Phpwind |
| 9 | 0.021 | Non-open source |
Number of clusters obtained with different parameters:

| Parameters | e=0, minpts=4 | e=0, minpts=8 | e=1, minpts=4 | e=1, minpts=8 |
| --- | --- | --- | --- | --- |
| Number of clusters | A | A | - | - |
| Total number of forums in clusters | 173 | 173 | 194 | 194 |
| Off-group (noise) points | A | A | Ten | Ten |
Test results:

| Forum website | Test posts | Successfully extracted |
| --- | --- | --- |
| guba.sina.com.cn | of | of |
| club.autohome.com.cn | One | One |
| club.qingdaonews.com | 9 | 9 |
| bbs.tianya.cn | 8 | 8 |
| bbs.360.cn | 5 | 5 |
| bbs1.people.com.cn | 5 | 0 |
| bbs.pcauto.com.cn | 5 | 5 |
| bbs.dospy.com | 4 | 5 |
| bbs.hsw.cn | 4 | 4 |
| itbbs.pconline.com.cn | 4 | 4 |
| www.dddzs.com | 4 | 4 |
| bbs.hupu.com | 4 | 4 |
| bbs.ent.qq.com | 3 | 0 |
| bbs.e23.cn | 3 | 3 |
| bbs.lady.163.com | 1 | 0 |
| www.099t.com | 1 | 0 |
Partial extraction results:
Conclusion: the method we used is fairly traditional and only manages extraction for most forums, but the results improve as more sites accumulate. We did not try the NLP approaches that are popular now (they would probably help), and we did not filter the results much. We only subdivided the output into body text, posting time, and main post versus replies; identifying the poster of each reply never got an effective solution. There is still a lot to learn here. If anything is wrong, please correct me.
DBSCAN Code:

```python
# encoding: utf-8
# Created on April 12, 2017
# DBSCAN over the XPath features of forum pages.
# Input file format, one sample per line: url \t xpath \t feanum
import re

def dist(url1, url2):
    """Distance of formula (2): (N1*N2 + 1) / (overlap^2 + 1) - 1."""
    values1 = url1.split('\t')
    values2 = url2.split('\t')
    # Get the two XPaths (dropping the leading two characters).
    xpath_val1 = values1[1][2:].split('/')
    xpath_val2 = values2[1][2:].split('/')
    # Compare up to the length of the shorter feature list.
    size = min(len(xpath_val1), len(xpath_val2))
    # Count overlapping steps, ignoring digits and array indices.
    overlap = 0
    for i in range(size):
        x1 = re.sub(r'\[+\]', '', re.sub(r'\d+', '', xpath_val1[i]))
        x2 = re.sub(r'\[+\]', '', re.sub(r'\d+', '', xpath_val2[i]))
        if x1 == x2:
            overlap += 1
    return (len(xpath_val1) * len(xpath_val2) + 1) / (overlap ** 2 + 1) - 1

def init_sample(path):
    """Load every sample line into all_points as [line, label] lists."""
    all_points = []
    with open(path) as lines:
        for line in lines:
            all_points.append([line])
    return all_points

all_points = init_sample('../train_bbs_urls.txt')

e = 0        # radius
minpts = 8   # minimum number of neighbours for a core point

# Find the core points.
other_points = []
core_points = []
plotted_points = []
for point in all_points:
    point.append(0)  # initial cluster label 0 = unassigned
    total = 0
    for otherpoint in all_points:
        if dist(otherpoint[0], point[0]) <= e:
            total += 1
    if total > minpts:
        core_points.append(point)
        plotted_points.append(point)
    else:
        other_points.append(point)

# Find the border points (iterate over a copy: we remove while looping).
border_points = []
for core in core_points:
    for other in list(other_points):
        if dist(core[0], other[0]) <= e:
            border_points.append(other)
            plotted_points.append(other)
            other_points.remove(other)

# Expand clusters: give each unlabeled core point a new label and
# propagate it to every plotted point within radius e.
cluster_label = 0
print(len(core_points))
for point in core_points:
    if point[1] == 0:
        cluster_label += 1
        point[1] = cluster_label
    for point2 in plotted_points:
        if point2[1] == 0 and dist(point2[0], point[0]) <= e:
            point2[1] = point[1]

# Write out every clustered sample with its label.
with open('dbscan.txt', 'a+') as out:
    for i in plotted_points:
        print(i[0], i[1])
        output = i[0].replace('\n', '') + '\t' + str(i[1]).strip()
        out.write('\n' + output)

def site_name(sample):
    """Extract the host part of the sample's URL, e.g. bbs.tianya.cn."""
    url = sample.split('\t')[0]
    start = url.find('//')
    stop = url.find('/', start + 2)
    return url[start + 2:stop]

# Group clustered and noise points by site name.
cluster_list = {}
for point in plotted_points:
    name = site_name(point[0])
    if name not in cluster_list:
        cluster_list[name] = point[1]

other_list = {}
for point in other_points:
    name = site_name(point[0])
    if name not in other_list:
        other_list[name] = point[1]
```