Big Data Learning Notes (2): Hierarchical Clustering & Column Clustering

Source: Internet
Author: User

The data and code used below: click here.
1. Getting the Data
The dataset consists of word counts from the articles of 100 blogs. The article data is fetched from each blog's RSS feed; RSS is an XML format, so you can install feedparser to parse the documents. For the details of how the word counts for each blog are collected, see the generatefeedvector.py file yourself; feel free to ask about anything that is unclear. Running it produces the file blogdata.txt.
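The core idea inside generatefeedvector.py is to strip the HTML out of each feed entry and tally the words. Here is a minimal standalone sketch of that idea (the function names and the sample text are illustrative, not the actual file contents):

```python
import re

# Strip HTML tags from an entry and split it into lowercase words
def getwords(html):
    # Remove anything that looks like an HTML tag
    txt = re.sub(r'<[^>]+>', '', html)
    # Split on runs of non-alphabetic characters
    words = re.split(r'[^A-Za-z]+', txt)
    return [w.lower() for w in words if w != '']

# Tally how often each word appears in one entry
def countwords(html):
    counts = {}
    for word in getwords(html):
        counts[word] = counts.get(word, 0) + 1
    return counts

entry = '<p>Big data, big <b>clusters</b>: data everywhere.</p>'
print(countwords(entry))  # {'big': 2, 'data': 2, 'clusters': 1, 'everywhere': 1}
```

Applying this to every entry of every feed, one feed per row, is what yields the blogdata.txt table used below.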
2. Hierarchical Clustering
Hierarchical clustering builds a hierarchy of groups by repeatedly merging the two most similar groups, pair by pair. A dendrogram is one way to visualize the result of hierarchical clustering.
Here is how to cluster the blog data we generated above.
① Load Data

# Load the dataset: the blog titles go into rownames, the word counts into
# the two-dimensional list data, and the word names into colnames
def readfile(filename):
    lines = [line for line in open(filename)]

    # The first line holds the column titles
    colnames = lines[0].strip().split('\t')[1:]
    rownames = []
    data = []
    for line in lines[1:]:
        p = line.strip().split('\t')
        # The first column of each row is the row name
        rownames.append(p[0])
        # The rest of the row is that row's data
        data.append([float(x) for x in p[1:]])
    return rownames, colnames, data
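To make the expected layout of blogdata.txt concrete, here is a tiny stand-in with made-up column words and counts, parsed the same way readfile parses the real file:

```python
# A tab-separated stand-in for blogdata.txt (the words and numbers here
# are invented for illustration)
sample = (
    "Blog\tchina\tkids\tmusic\n"
    "Gothamist\t0\t3\t3\n"
    "GigaOM\t6\t0\t1\n"
)

# Same parsing logic as readfile, but working on in-memory lines
def readlines(lines):
    colnames = lines[0].strip().split('\t')[1:]
    rownames, data = [], []
    for line in lines[1:]:
        p = line.strip().split('\t')
        rownames.append(p[0])
        data.append([float(x) for x in p[1:]])
    return rownames, colnames, data

rownames, colnames, data = readlines(sample.splitlines())
print(rownames)  # ['Gothamist', 'GigaOM']
print(colnames)  # ['china', 'kids', 'music']
print(data[0])   # [0.0, 3.0, 3.0]
```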

②. Next we need to measure how closely related two blogs are. An earlier post covered three ways of measuring similarity; because some blogs contain more articles than others, we use the Pearson correlation here. Since each blog's data is one row of the dataset, we make a small change to the usual Pearson routine.

from math import sqrt

# v1 and v2 are two different rows of the dataset
def pearson(v1, v2):
    sum1 = sum(v1)
    sum2 = sum(v2)

    sum1Sq = sum([pow(v, 2) for v in v1])
    sum2Sq = sum([pow(v, 2) for v in v2])

    pSum = sum([v1[i] * v2[i] for i in range(len(v1))])

    num = pSum - (sum1 * sum2 / len(v1))
    den = sqrt((sum1Sq - pow(sum1, 2) / len(v1)) *
               (sum2Sq - pow(sum2, 2) / len(v1)))
    if den == 0: return 0

    # The Pearson correlation is a floating-point number, and the more
    # similar the two rows are, the larger it is.  We need a distance,
    # which should be smaller for similar rows, so we return 1.0 - num/den
    return 1.0 - num / den
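A quick sanity check of this distance on two toy vectors (the numbers are made up): perfectly correlated rows give distance 0.0 and perfectly anti-correlated rows give 2.0.

```python
from math import sqrt

# Self-contained copy of the Pearson distance above, called with float
# inputs so Python 2 integer division does not get in the way
def pearson(v1, v2):
    n = len(v1)
    sum1, sum2 = sum(v1), sum(v2)
    sum1Sq = sum([v * v for v in v1])
    sum2Sq = sum([v * v for v in v2])
    pSum = sum([v1[i] * v2[i] for i in range(n)])
    num = pSum - (sum1 * sum2 / n)
    den = sqrt((sum1Sq - sum1 ** 2 / n) * (sum2Sq - sum2 ** 2 / n))
    if den == 0: return 0
    return 1.0 - num / den

print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # 0.0: perfectly correlated
print(pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))  # 2.0: perfectly anti-correlated
```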

③. Build the cluster node model.
Each cluster node holds a row of data, vec; a merged node is built from two child nodes, left and right; distance records the distance between those two children; and each node has an id. So we create the following class:

class bicluster:
    # vec is the vector for this cluster (i.e. a row of the dataset);
    # left and right are the two clusters this one was merged from;
    # distance is the distance between those two clusters; id is the
    # cluster's number
    def __init__(self, vec, left=None, right=None, distance=0.0, id=None):
        self.left = left
        self.right = right
        self.vec = vec
        self.id = id
        self.distance = distance

④. Recursively merge the clusters.
This keeps merging until only one big cluster is left.
Work through the details yourself; feel free to ask about anything that is unclear.

def hcluster(rows, distance=pearson):
    distances = {}
    currentclustid = -1

    # The initial clusters are just the rows of the dataset
    clust = [bicluster(rows[i], id=i) for i in range(len(rows))]

    while len(clust) > 1:
        lowestpair = (0, 1)
        closest = distance(clust[0].vec, clust[1].vec)

        # Loop over every pair looking for the smallest distance
        for i in range(len(clust)):
            for j in range(i + 1, len(clust)):
                # Use distances to cache the distance calculations
                if (clust[i].id, clust[j].id) not in distances:
                    distances[(clust[i].id, clust[j].id)] = distance(clust[i].vec, clust[j].vec)
                d = distances[(clust[i].id, clust[j].id)]
                if d < closest:
                    closest = d
                    lowestpair = (i, j)

        # Compute the average of the two clusters
        mergevec = [(clust[lowestpair[0]].vec[i] + clust[lowestpair[1]].vec[i]) / 2.0
                    for i in range(len(clust[0].vec))]

        # Create the new cluster
        newcluster = bicluster(mergevec, left=clust[lowestpair[0]],
                               right=clust[lowestpair[1]],
                               distance=closest, id=currentclustid)

        # Clusters that were not in the original set get negative ids
        currentclustid -= 1
        del clust[lowestpair[1]]
        del clust[lowestpair[0]]
        clust.append(newcluster)

    return clust[0]
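To see the merging loop in action without the full blog dataset, here is a standalone miniature run on three made-up 2-D rows. Euclidean distance is swapped in for Pearson only so the demo works on such short vectors; the merging logic is the same as above.

```python
from math import sqrt

class bicluster:
    def __init__(self, vec, left=None, right=None, distance=0.0, id=None):
        self.vec, self.left, self.right = vec, left, right
        self.distance, self.id = distance, id

# Plain Euclidean distance, standing in for pearson in this tiny demo
def euclidean(v1, v2):
    return sqrt(sum([(v1[i] - v2[i]) ** 2 for i in range(len(v1))]))

def hcluster(rows, distance=euclidean):
    currentclustid = -1
    clust = [bicluster(rows[i], id=i) for i in range(len(rows))]
    while len(clust) > 1:
        lowestpair, closest = (0, 1), distance(clust[0].vec, clust[1].vec)
        # Find the closest pair of clusters
        for i in range(len(clust)):
            for j in range(i + 1, len(clust)):
                d = distance(clust[i].vec, clust[j].vec)
                if d < closest:
                    closest, lowestpair = d, (i, j)
        a, b = clust[lowestpair[0]], clust[lowestpair[1]]
        # Merge the pair into a new cluster at their average
        mergevec = [(a.vec[i] + b.vec[i]) / 2.0 for i in range(len(a.vec))]
        newcluster = bicluster(mergevec, left=a, right=b,
                               distance=closest, id=currentclustid)
        currentclustid -= 1
        del clust[lowestpair[1]]
        del clust[lowestpair[0]]
        clust.append(newcluster)
    return clust[0]

rows = [[1.0, 1.0], [3.0, 3.0], [10.0, 10.0]]
root = hcluster(rows)
print(root.id)   # -2: the root comes from the second (last) merge
print(root.vec)  # [6.0, 6.0]: average of [2.0, 2.0] and [10.0, 10.0]
```

The two nearest rows, [1, 1] and [3, 3], are merged first into a cluster with id -1; that cluster is then merged with [10, 10] to form the root.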

⑤. A text view of the cluster tree

# Recursively traverse the cluster tree and print it like a
# filesystem hierarchy
def printclust(clust, labels=None, n=0):
    # Indent to build the hierarchical layout
    for i in range(n):
        print ' ',
    if clust.id < 0:
        # Negative ids mean this is a branch
        print '-'
    else:
        # Positive ids mean this is a leaf node
        if labels == None:
            print clust.id
        else:
            print labels[clust.id]
    # Now print the right and left branches
    if clust.left != None:
        printclust(clust.left, labels=labels, n=n + 1)
    if clust.right != None:
        printclust(clust.right, labels=labels, n=n + 1)

⑥. Draw a dendrogram to view the cluster tree more visually.
This requires the Python PIL module. Download the Pillow-3.4.2-cp27-cp27m-win_amd64.whl file from the Baidu network disk link above, then open cmd in the directory where the file is stored and run:

pip install Pillow-3.4.2-cp27-cp27m-win_amd64.whl

The drawing code is as follows:

from PIL import Image, ImageDraw

# Get the height of a cluster in the drawing
def getheight(clust):
    # A leaf node has height 1
    if clust.left == None and clust.right == None: return 1
    # Otherwise the height is the sum of the heights of its branches
    return getheight(clust.left) + getheight(clust.right)

def getdepth(clust):
    # The distance of a leaf node is 0.0
    if clust.left == None and clust.right == None: return 0
    # The distance of a branch node is the greater of its two sides
    # plus the node's own distance
    return max(getdepth(clust.left), getdepth(clust.right)) + clust.distance

# Generate the image
def drawdendrogram(clust, labels, jpeg='clusters.jpg'):
    # Height and width
    h = getheight(clust) * 20
    w = 1200
    depth = getdepth(clust)

    # The width is fixed, so scale the distance values accordingly
    scaling = float(w - 150) / depth

    # Create a new image with a white background
    img = Image.new('RGB', (w, h), (255, 255, 255))
    draw = ImageDraw.Draw(img)

    draw.line((0, h / 2, 10, h / 2), fill=(255, 0, 0))

    # Draw the first node
    drawnode(draw, clust, 10, h / 2, scaling, labels)
    img.save(jpeg, 'JPEG')

# Draw one node (and, for branches, its children)
def drawnode(draw, clust, x, y, scaling, labels):
    if clust.id < 0:
        h1 = getheight(clust.left) * 20
        h2 = getheight(clust.right) * 20
        top = y - (h1 + h2) / 2
        bottom = y + (h1 + h2) / 2
        # The length of the line
        ll = clust.distance * scaling
        # Vertical line from this cluster to its children
        draw.line((x, top + h1 / 2, x, bottom - h2 / 2), fill=(255, 0, 0))
        # Horizontal line to the left node
        draw.line((x, top + h1 / 2, x + ll, top + h1 / 2), fill=(255, 0, 0))
        # Horizontal line to the right node
        draw.line((x, bottom - h2 / 2, x + ll, bottom - h2 / 2), fill=(255, 0, 0))
        # Recursively draw the left and right nodes
        drawnode(draw, clust.left, x + ll, top + h1 / 2, scaling, labels)
        drawnode(draw, clust.right, x + ll, bottom - h2 / 2, scaling, labels)
    else:
        # Draw the label of a leaf node
        draw.text((x + 5, y - 7), labels[clust.id], (0, 0, 0))

3. Column Clustering
A previous post mentioned swapping people and items to make recommendations; in the same way, we can swap the rows and columns here to cluster the words instead of the blogs. The code to transpose the dataset follows:

# Swap the rows and columns of the dataset
def rotatematrix(data):
    newdata = []
    for i in range(len(data[0])):
        newrow = [data[j][i] for j in range(len(data))]
        newdata.append(newrow)
    return newdata
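A quick illustration of the transpose on a made-up 2x3 matrix:

```python
# Swap the rows and columns of the dataset (same logic as above)
def rotatematrix(data):
    newdata = []
    for i in range(len(data[0])):
        newrow = [data[j][i] for j in range(len(data))]
        newdata.append(newrow)
    return newdata

m = [[1, 2, 3],
     [4, 5, 6]]
print(rotatematrix(m))  # [[1, 4], [2, 5], [3, 6]]
```

After this transpose, each row represents one word's counts across all the blogs, so feeding the result to hcluster groups similar words instead of similar blogs.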

Finally, call these functions from the main section.
By running cluster.py we get two pictures, one for the blog clustering and one for the word clustering. From them we can see which blogs, and which words, end up clustered together, and analyze the groups.
