DBSCAN clustering for general forum body-text extraction


General Forum Body Crawl

This was our problem in this year's fifth Teddy Cup competition, which my teammates and I entered together. Although we only won a third prize in the end, I learned a lot working with them along the way, so I am recording it here.

1. Brief Introduction

The goal of the competition is for contestants to take the HTML content of any BBS-style (forum) web page and design an algorithm that intelligently extracts the main post and all of its replies.

The problem statement is at http://www.tipdm.org/jingsa/1030.jhtml?cName=ral_100#sHref.

2. Preliminary preparation

Having never worked with crawlers before, my teammates and I first surveyed the mainstream languages and frameworks for web scraping, and eventually chose Python with BeautifulSoup (bs4), which is friendly to beginners. We then learned the Python basics needed for crawling: regular expressions, the urllib package, and so on.
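As a first exercise, this is the kind of regex-based extraction we practiced; the HTML snippet here is invented purely for illustration:

```python
import re

# A made-up forum snippet; real pages are far messier, which is
# exactly why regex alone does not generalize across sites.
html = '<div class="post"><p>Hello, forum!</p></div>'

# Grab the text inside the <p> tag (non-greedy match)
match = re.search(r'<p>(.*?)</p>', html)
print(match.group(1))  # Hello, forum!
```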

In preparing for the competition, we learned that crawl targets fall into static pages, dynamic pages, and web services. We studied only static pages: dynamic pages are more complex, and since time was tight we did not research them in depth, nor did we look closely at sites' anti-scraping measures. (I think this is one place where we lost points.) So what follows mainly describes how we designed a generic crawling framework for static web pages.

Ideas:

For a single known site we could use regular expressions to grab what we want, but that obviously does not generalize. Instead, we observe that every web page's structure is a DOM tree. We parse the DOM, extract the characteristic features of the main-post node and the reply nodes, and cluster pages with similar features. We chose DBSCAN as the clustering algorithm because it discovers the number of clusters automatically rather than requiring it to be set in advance. Each cluster then yields a unified extraction template, which greatly reduces the workload.
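The idea of a DOM-derived feature can be sketched with the standard-library parser: walk the tree while maintaining a stack of open tags, and record the tag path to every text node. The HTML string below is invented for illustration; our actual pipeline used bs4 and real forum pages.

```python
from html.parser import HTMLParser

class PathCollector(HTMLParser):
    """Records the tag path from the root to every non-empty text node."""
    def __init__(self):
        super().__init__()
        self.stack = []  # open tags from the root down to the current position
        self.paths = []  # collected (path, text) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # pop back to the matching open tag (tolerates unclosed tags)
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.paths.append(('/' + '/'.join(self.stack), text))

html = '<html><body><div class="post"><p>main post</p></div></body></html>'
p = PathCollector()
p.feed(html)
print(p.paths)  # [('/html/body/div/p', 'main post')]
```

Paths collected this way play the role of the XPath features that the clustering below compares.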

3. The Whole Process

Starting from the 177 URLs provided officially, we crawled 736 forum URLs of our own. We used the 736 pages for clustering and template construction, and the 177 official URLs for testing.

Analyzing the 736 crawled URLs gives the distribution below (chart omitted).

As the chart showed, most forum sites are built on open-source frameworks, with Discuz the most common. However, different versions of the same open-source framework produce different page structures, so a single template cannot be shared across them.

Structure Similarity calculation:

First, we parse the page's structure to obtain the XPath values of the main-post node and the reply nodes.

The XPath features of a single web page can be expressed as the set of such paths. DBSCAN then needs a distance between two web pages, which we define as

    dist(i, j) = (len(i) * len(j) + 1) / (overlap(i, j)^2 + 1) - 1        (2)

Here len(i) is the number of XPath features of page i, len(j) is the number of features of page j, and overlap(i, j) is the number of features the two pages share. The more features two pages have in common, the closer formula (2) gets to 0; for identical pages it is exactly 0.

Note: before clustering, each XPath is preprocessed to remove irrelevant parts such as numeric indices and symbols.
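Formula (2) translates directly into code. This is a trimmed-down sketch of the dist function from the full listing at the end of the post:

```python
import re

def xpath_features(xpath):
    """Split an XPath like '//div[2]/p' into components, stripping the
    numeric indices that are irrelevant to the page structure."""
    return [re.sub(r'\[\d*\]|\d+', '', part)
            for part in xpath.lstrip('/').split('/')]

def dist(xpath1, xpath2):
    """Distance of formula (2): 0 for identical pages, large when disjoint."""
    f1, f2 = xpath_features(xpath1), xpath_features(xpath2)
    overlap = sum(1 for a, b in zip(f1, f2) if a == b)
    return (len(f1) * len(f2) + 1) / (overlap ** 2 + 1) - 1

print(dist('//div[1]/p', '//div[2]/p'))   # identical after preprocessing -> 0.0
print(dist('//div/p', '//table/tr/td'))   # no overlap -> (2*3+1)/1 - 1 = 6.0
```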


Content Similarity calculation:

This is mainly the similarity of the two URLs: the part of each URL after the host name is parsed and compared.

Overall Web page similarity calculation:

    sim(S1, S2) = sum over i of ( w_i * sim_i )

Here S1 and S2 are web pages or cluster centers, w_i is the weight of feature i, and sim_i is the similarity on feature i. After DBSCAN produces the initial clusters, the feature library is continually updated from the test data, so the weights can be adjusted dynamically and the clustering keeps improving.
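The overall similarity is just a weighted sum over per-feature similarities. The feature names and weight values below are made up for illustration; the post does not report the actual weights used.

```python
def overall_similarity(feature_sims, weights):
    """Weighted sum: sim(S1, S2) = sum_i w_i * sim_i."""
    assert set(feature_sims) == set(weights)
    return sum(weights[f] * feature_sims[f] for f in feature_sims)

sims = {'structure': 0.9, 'url': 0.6}     # per-feature similarities sim_i
weights = {'structure': 0.7, 'url': 0.3}  # feature weights w_i, summing to 1
print(overall_similarity(sims, weights))  # 0.7*0.9 + 0.3*0.6, about 0.81
```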


Body Extraction Process


By matching a page's URL and XPath against the template library, we can identify and filter forum pages and then locate and extract the body text. As more and more sites are tested, the XPath library and template library grow richer and richer, so this is a continuous learning process.
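A hypothetical sketch of the template-matching step: look up a URL's host in a stored template table and, on a hit, reuse its XPaths. The table contents and field names here are invented for illustration; they are not the actual templates we built.

```python
from urllib.parse import urlparse

# Invented template table: host -> XPaths for the main post and replies.
templates = {
    'bbs.example.com': {'main_post': '//div[@id="postmessage"]',
                        'replies':   '//div[@class="reply"]'},
}

def lookup_template(url):
    """Return the stored template for this host, or None if the page
    is from an unknown site and must go through clustering first."""
    host = urlparse(url).hostname
    return templates.get(host)

print(lookup_template('http://bbs.example.com/thread-1.html'))
print(lookup_template('http://unknown.example.org/x'))  # None
```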

Clustering results with different parameters:

e=0, minpts=4

Cluster category    Proportion    Web page category
1                   0.667         Discuz
8                   0.089         Non-open-source
5                   0.0278        Phpwind
2                   0.0222        Dvbbs
10                  0.0222        Non-open-source

e=0, minpts=8

Cluster category    Proportion    Web page category
1                   0.705         Discuz
5                   0.092         Phpwind
2                   0.041         Dvbbs
6                   0.023         Non-open-source
10                  0.023         Non-open-source

e=1, minpts=4

Cluster category    Proportion    Web page category
1                   0.630         Discuz
3                   0.205         Non-open-source
9                   0.123         Non-open-source
4                   0.0871        Phpwind
2                   0.051         Dvbbs

e=1, minpts=8

Cluster category    Proportion    Web page category
1                   0.628         Discuz
3                   0.129         Non-open-source
2                   0.087         Dvbbs
4                   0.051         Phpwind
9                   0.021         Non-open-source

Number of clusters obtained with different parameters:

Parameters                  e=0,minpts=4    e=0,minpts=8    e=1,minpts=4    e=1,minpts=8
Number of clusters          A               A               -               -
Total forums in clusters    173             173             194             194
Outliers (noise points)     A               A               10              10

Test results:

Forum website            Test posts    Successfully extracted
guba.sina.com.cn         of            of
club.autohome.com.cn     One           One
club.qingdaonews.com     9             9
bbs.tianya.cn            8             8
bbs.360.cn               5             5
bbs1.people.com.cn       5             0
bbs.pcauto.com.cn        5             5
bbs.dospy.com            4             5
bbs.hsw.cn               4             4
itbbs.pconline.com.cn    4             4
www.dddzs.com            4             4
bbs.hupu.com             4             4
bbs.ent.qq.com           3             0
bbs.e23.cn               3             3
bbs.lady.163.com         1             0
www.099t.com             1             0

Partial extraction results:

Conclusion: the method we used is fairly traditional and can only handle the majority of forums, but its accuracy improves as more sites accumulate in the libraries. We did not use the NLP approaches that are currently popular (which would likely perform better), and we did not filter the results aggressively. We only subdivided the body text, the post time, and the main post versus its replies; we never found an effective way to extract the poster's identity. There is still a lot to learn here, and corrections are welcome.

DBSCAN code:

# encoding: utf-8
# Created on April 12, 2017
#
# Distance between two pages (formula (2)):
#     (len(i) * len(j) + 1) / (overlap**2 + 1) - 1
# Each sample line is tab separated: url \t xpath \t feanum

import re


def dist(url1, url2):
    """Distance between two samples, based on their XPath features."""
    values1 = url1.split('\t')
    values2 = url2.split('\t')
    # extract the xpath (strip the leading "//")
    xpath_val1 = values1[1][2:].split('/')
    xpath_val2 = values2[1][2:].split('/')
    # compare up to the length of the shorter xpath
    size = min(len(xpath_val1), len(xpath_val2))
    # count overlapping components, ignoring digits and bracket indices
    overlap = 0
    for i in range(size):
        x1 = re.sub(r'\[\d*\]', '', re.sub(r'\d+', '', xpath_val1[i]))
        x2 = re.sub(r'\[\d*\]', '', re.sub(r'\d+', '', xpath_val2[i]))
        if x1 == x2:
            overlap += 1
    return (len(xpath_val1) * len(xpath_val2) + 1) / (overlap ** 2 + 1) - 1


def init_sample(path):
    """Load all samples, one per line, each wrapped in a mutable list."""
    with open(path) as lines:
        return [[line] for line in lines]


all_points = init_sample('../train_bbs_urls.txt')

# radius e = 0 and minpts = 8
e = 0
minpts = 8

# find the core points
other_points = []
core_points = []
plotted_points = []
for point in all_points:
    point.append(0)  # initial cluster label 0 (unassigned)
    total = 0
    for otherpoint in all_points:
        if dist(otherpoint[0], point[0]) <= e:
            total += 1
    if total > minpts:
        core_points.append(point)
        plotted_points.append(point)
    else:
        other_points.append(point)

# find the border points
border_points = []
for core in core_points:
    for other in other_points[:]:
        if dist(core[0], other[0]) <= e:
            border_points.append(other)
            plotted_points.append(other)
            other_points.remove(other)

# assign cluster labels by expanding from each core point
cluster_label = 0
print(len(core_points))
for point in core_points:
    if point[1] == 0:
        cluster_label += 1
        point[1] = cluster_label
    for point2 in plotted_points:
        if point2[1] == 0 and dist(point2[0], point[0]) <= e:
            point2[1] = point[1]

# write every point out with its cluster label
with open('dbscan.txt', 'a') as out:
    for i in plotted_points:
        output = i[0].replace('\n', '') + '\t' + str(i[1]).strip()
        out.write('\n' + output)

# group clustered pages by forum host (the part of the URL after "//")
cluster_list = {}
for point in plotted_points:
    va = point[0].split('\t')
    start = va[0].find('//')
    stop = va[0].find('/', start + 2)
    name = va[0][start + 2:stop]
    if name not in cluster_list:
        cluster_list[name] = point[1]

# hosts that were left as noise
other_list = {}
for point in other_points:
    va = point[0].split('\t')
    start = va[0].find('//')
    stop = va[0].find('/', start + 2)
    name = va[0][start + 2:stop]
    if name not in other_list:
        other_list[name] = point[1]







