DBSCAN clustering for general forum body-text extraction


General Forum Body Crawl

This was our problem in this year's fifth Teddy Cup competition, which my teammates and I entered together. Although we only won a third prize in the end, I learned a lot working with them along the way, so I am recording it here.

1. Brief Introduction

The goal of the competition is for contestants to take the HTML content of any BBS-style (forum) web page and design an algorithm that intelligently extracts the main post and all of its replies.

The problem statement is at http://www.tipdm.org/jingsa/1030.jhtml?cName=ral_100#sHref.

2. Preliminary preparation

Having never worked with crawlers before, my teammates and I first surveyed the mainstream languages and frameworks for web scraping, and eventually chose Python with BeautifulSoup (bs4), which is friendly to beginners. We then learned the Python basics needed for crawling: regular expressions, the urllib package, and so on.
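As a first exercise, this is the kind of regex-based extraction we practiced; the HTML snippet here is invented purely for illustration:

```python
import re

# A made-up forum snippet; real pages are far messier, which is
# exactly why regex alone does not generalize across sites.
html = '<div class="post"><p>Hello, forum!</p></div>'

# Grab the text inside the <p> tag (non-greedy match)
match = re.search(r'<p>(.*?)</p>', html)
print(match.group(1))  # Hello, forum!
```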

In preparing for the competition, we learned that crawl targets fall into static pages, dynamic pages, and web services. We studied only static pages: dynamic pages are more complex, and since time was tight we did not research them in depth, nor did we look closely at sites' anti-scraping measures. (I think this is one place where we lost points.) So what follows mainly describes how we designed a generic crawling framework for static web pages.

Ideas:

For a single known site we could use regular expressions to grab what we want, but that obviously does not generalize. Instead, we observe that every web page's structure is a DOM tree. We parse the DOM, extract the characteristic features of the main-post node and the reply nodes, and cluster pages with similar features. We chose DBSCAN as the clustering algorithm because it discovers the number of clusters automatically rather than requiring it to be set in advance. Each cluster then yields a unified extraction template, which greatly reduces the workload.
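The idea of a DOM-derived feature can be sketched with the standard-library parser: walk the tree while maintaining a stack of open tags, and record the tag path to every text node. The HTML string below is invented for illustration; our actual pipeline used bs4 and real forum pages.

```python
from html.parser import HTMLParser

class PathCollector(HTMLParser):
    """Records the tag path from the root to every non-empty text node."""
    def __init__(self):
        super().__init__()
        self.stack = []  # open tags from the root down to the current position
        self.paths = []  # collected (path, text) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # pop back to the matching open tag (tolerates unclosed tags)
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.paths.append(('/' + '/'.join(self.stack), text))

html = '<html><body><div class="post"><p>main post</p></div></body></html>'
p = PathCollector()
p.feed(html)
print(p.paths)  # [('/html/body/div/p', 'main post')]
```

Paths collected this way play the role of the XPath features that the clustering below compares.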

3. The Whole Process

Starting from the 177 URLs provided officially, we crawled 736 forum URLs of our own. We used the 736 pages for clustering and template construction, and the 177 official URLs for testing.

Analyzing the 736 crawled URLs gives the distribution below (chart omitted).

As the chart showed, most forum sites are built on open-source frameworks, with Discuz the most common. However, different versions of the same open-source framework produce different page structures, so a single template cannot be shared across them.

Structure Similarity calculation:

First, we parse the page's structure to obtain the XPath values of the main-post node and the reply nodes.

The XPath features of a single web page can be expressed as the set of such paths. DBSCAN then needs a distance between two web pages, which we define as

    dist(i, j) = (len(i) * len(j) + 1) / (overlap(i, j)^2 + 1) - 1        (2)

Here len(i) is the number of XPath features of page i, len(j) is the number of features of page j, and overlap(i, j) is the number of features the two pages share. The more features two pages have in common, the closer formula (2) gets to 0; for identical pages it is exactly 0.

Note: before clustering, each XPath is preprocessed to remove irrelevant parts such as numeric indices and symbols.
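Formula (2) translates directly into code. This is a trimmed-down sketch of the dist function from the full listing at the end of the post:

```python
import re

def xpath_features(xpath):
    """Split an XPath like '//div[2]/p' into components, stripping the
    numeric indices that are irrelevant to the page structure."""
    return [re.sub(r'\[\d*\]|\d+', '', part)
            for part in xpath.lstrip('/').split('/')]

def dist(xpath1, xpath2):
    """Distance of formula (2): 0 for identical pages, large when disjoint."""
    f1, f2 = xpath_features(xpath1), xpath_features(xpath2)
    overlap = sum(1 for a, b in zip(f1, f2) if a == b)
    return (len(f1) * len(f2) + 1) / (overlap ** 2 + 1) - 1

print(dist('//div[1]/p', '//div[2]/p'))   # identical after preprocessing -> 0.0
print(dist('//div/p', '//table/tr/td'))   # no overlap -> (2*3+1)/1 - 1 = 6.0
```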


Content Similarity calculation:

This is mainly the similarity of the two URLs: the part of each URL after the host name is parsed and compared.

Overall Web page similarity calculation:

    sim(S1, S2) = sum over i of ( w_i * sim_i )

Here S1 and S2 are web pages or cluster centers, w_i is the weight of feature i, and sim_i is the similarity on feature i. After DBSCAN produces the initial clusters, the feature library is continually updated from the test data, so the weights can be adjusted dynamically and the clustering keeps improving.
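The overall similarity is just a weighted sum over per-feature similarities. The feature names and weight values below are made up for illustration; the post does not report the actual weights used.

```python
def overall_similarity(feature_sims, weights):
    """Weighted sum: sim(S1, S2) = sum_i w_i * sim_i."""
    assert set(feature_sims) == set(weights)
    return sum(weights[f] * feature_sims[f] for f in feature_sims)

sims = {'structure': 0.9, 'url': 0.6}     # per-feature similarities sim_i
weights = {'structure': 0.7, 'url': 0.3}  # feature weights w_i, summing to 1
print(overall_similarity(sims, weights))  # 0.7*0.9 + 0.3*0.6, about 0.81
```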


Body Extraction Process


By matching a page's URL and XPath against the template library, we can identify and filter forum pages and then locate and extract the body text. As more and more sites are tested, the XPath library and template library grow richer and richer, so this is a continuous learning process.
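A hypothetical sketch of the template-matching step: look up a URL's host in a stored template table and, on a hit, reuse its XPaths. The table contents and field names here are invented for illustration; they are not the actual templates we built.

```python
from urllib.parse import urlparse

# Invented template table: host -> XPaths for the main post and replies.
templates = {
    'bbs.example.com': {'main_post': '//div[@id="postmessage"]',
                        'replies':   '//div[@class="reply"]'},
}

def lookup_template(url):
    """Return the stored template for this host, or None if the page
    is from an unknown site and must go through clustering first."""
    host = urlparse(url).hostname
    return templates.get(host)

print(lookup_template('http://bbs.example.com/thread-1.html'))
print(lookup_template('http://unknown.example.org/x'))  # None
```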

Clustering results with different parameters:

e=0, minpts=4

Cluster category    Proportion    Web page category
1                   0.667         Discuz
8                   0.089         Non-open-source
5                   0.0278        Phpwind
2                   0.0222        Dvbbs
10                  0.0222        Non-open-source

e=0, minpts=8

Cluster category    Proportion    Web page category
1                   0.705         Discuz
5                   0.092         Phpwind
2                   0.041         Dvbbs
6                   0.023         Non-open-source
10                  0.023         Non-open-source

e=1, minpts=4

Cluster category    Proportion    Web page category
1                   0.630         Discuz
3                   0.205         Non-open-source
9                   0.123         Non-open-source
4                   0.0871        Phpwind
2                   0.051         Dvbbs

e=1, minpts=8

Cluster category    Proportion    Web page category
1                   0.628         Discuz
3                   0.129         Non-open-source
2                   0.087         Dvbbs
4                   0.051         Phpwind
9                   0.021         Non-open-source

Number of clusters obtained with different parameters:

Parameters                  e=0,minpts=4    e=0,minpts=8    e=1,minpts=4    e=1,minpts=8
Number of clusters          A               A               -               -
Total forums in clusters    173             173             194             194
Outliers (noise points)     A               A               10              10

Test results:

Forum website            Test posts    Successfully extracted
guba.sina.com.cn         of            of
club.autohome.com.cn     One           One
club.qingdaonews.com     9             9
bbs.tianya.cn            8             8
bbs.360.cn               5             5
bbs1.people.com.cn       5             0
bbs.pcauto.com.cn        5             5
bbs.dospy.com            4             5
bbs.hsw.cn               4             4
itbbs.pconline.com.cn    4             4
www.dddzs.com            4             4
bbs.hupu.com             4             4
bbs.ent.qq.com           3             0
bbs.e23.cn               3             3
bbs.lady.163.com         1             0
www.099t.com             1             0

Partial extraction results:

Conclusion: the method we used is fairly traditional and can only handle the majority of forums, but its accuracy improves as more sites accumulate in the libraries. We did not use the NLP approaches that are currently popular (which would likely perform better), and we did not filter the results aggressively. We only subdivided the body text, the post time, and the main post versus its replies; we never found an effective way to extract the poster's identity. There is still a lot to learn here, and corrections are welcome.

DBSCAN code:

# encoding: utf-8
# Created on April 12, 2017
#
# Distance between two pages (formula (2)):
#     (len(i) * len(j) + 1) / (overlap**2 + 1) - 1
# Each sample line is tab separated: url \t xpath \t feanum

import re


def dist(url1, url2):
    """Distance between two samples, based on their XPath features."""
    values1 = url1.split('\t')
    values2 = url2.split('\t')
    # extract the xpath (strip the leading "//")
    xpath_val1 = values1[1][2:].split('/')
    xpath_val2 = values2[1][2:].split('/')
    # compare up to the length of the shorter xpath
    size = min(len(xpath_val1), len(xpath_val2))
    # count overlapping components, ignoring digits and bracket indices
    overlap = 0
    for i in range(size):
        x1 = re.sub(r'\[\d*\]', '', re.sub(r'\d+', '', xpath_val1[i]))
        x2 = re.sub(r'\[\d*\]', '', re.sub(r'\d+', '', xpath_val2[i]))
        if x1 == x2:
            overlap += 1
    return (len(xpath_val1) * len(xpath_val2) + 1) / (overlap ** 2 + 1) - 1


def init_sample(path):
    """Load all samples, one per line, each wrapped in a mutable list."""
    with open(path) as lines:
        return [[line] for line in lines]


all_points = init_sample('../train_bbs_urls.txt')

# radius e = 0 and minpts = 8
e = 0
minpts = 8

# find the core points
other_points = []
core_points = []
plotted_points = []
for point in all_points:
    point.append(0)  # initial cluster label 0 (unassigned)
    total = 0
    for otherpoint in all_points:
        if dist(otherpoint[0], point[0]) <= e:
            total += 1
    if total > minpts:
        core_points.append(point)
        plotted_points.append(point)
    else:
        other_points.append(point)

# find the border points
border_points = []
for core in core_points:
    for other in other_points[:]:
        if dist(core[0], other[0]) <= e:
            border_points.append(other)
            plotted_points.append(other)
            other_points.remove(other)

# assign cluster labels by expanding from each core point
cluster_label = 0
print(len(core_points))
for point in core_points:
    if point[1] == 0:
        cluster_label += 1
        point[1] = cluster_label
    for point2 in plotted_points:
        if point2[1] == 0 and dist(point2[0], point[0]) <= e:
            point2[1] = point[1]

# write every point out with its cluster label
with open('dbscan.txt', 'a') as out:
    for i in plotted_points:
        output = i[0].replace('\n', '') + '\t' + str(i[1]).strip()
        out.write('\n' + output)

# group clustered pages by forum host (the part of the URL after "//")
cluster_list = {}
for point in plotted_points:
    va = point[0].split('\t')
    start = va[0].find('//')
    stop = va[0].find('/', start + 2)
    name = va[0][start + 2:stop]
    if name not in cluster_list:
        cluster_list[name] = point[1]

# hosts that were left as noise
other_list = {}
for point in other_points:
    va = point[0].split('\t')
    start = va[0].find('//')
    stop = va[0].find('/', start + 2)
    name = va[0][start + 2:stop]
    if name not in other_list:
        other_list[name] = point[1]







