A Python implementation of character-relationship discovery based on co-occurrence

Source: Internet
Author: User

Reference links:
Extracting the character relationships in "Train to Busan"
Drawing a beautiful network diagram with Python's networkx

1. Co-occurrence relationships

In bibliometrics, co-word analysis of keywords is used to determine how the topics in a body of literature relate to one another. Here we want to analyze a novel or a script to work out how its characters relate to one another. The two problems are very similar.

In general, we assume that two characters who appear in the same paragraph must be related in some way, and that fixes the rough outline of the program. First segment the text into words and extract the character names in each paragraph; then, paragraph by paragraph, count how many times each pair of characters appears together, and store the results in a two-dimensional matrix. This matrix also serves as the adjacency matrix of a graph, with each element (the co-occurrence count) as the weight of the corresponding edge.

For example, suppose segmentation of three paragraphs yields A/B/C, B/A/F, and A/D/C. Then A and B co-occur 2 times, A and C co-occur 2 times, and so on.
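The counting in this example can be sketched with the standard library alone. This is only an illustration of the idea, not the code used later in the article: the function name count_cooccurrences is made up here, and unlike the later code it counts each pair at most once per paragraph.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(paragraphs):
    """Count, for every unordered pair of names, how many paragraphs contain both."""
    counts = Counter()
    for names_in_paragraph in paragraphs:
        # set() so a name repeated inside one paragraph is counted once;
        # sorted() so ('A', 'B') and ('B', 'A') end up under the same key.
        for pair in combinations(sorted(set(names_in_paragraph)), 2):
            counts[pair] += 1
    return counts


# The three segmented paragraphs from the example above.
paragraphs = [["A", "B", "C"], ["B", "A", "F"], ["A", "D", "C"]]
pair_counts = count_cooccurrences(paragraphs)
print(pair_counts[("A", "B")], pair_counts[("A", "C")])  # 2 2
```

Keys are sorted tuples, so always look a pair up in alphabetical order, for example pair_counts[("A", "B")].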

Also, for convenience, we record the characters and the relationships in files. The relationships analyzed here come from the novel "In the Name of the People".

2. Word segmentation with jieba

For the principles and usage of jieba word segmentation, see the article "Basic principles of Chinese word segmentation and the use of jieba".

jieba can segment the text, but the result is still not very accurate. For example, one character is named "易学习" (Yi Xuexi): "易" ("easy") is tagged as an adverb and "学习" ("to study") as a verb, so it is hard to extract this name. However, jieba supports custom dictionaries, so we can correct our own dictionary bit by bit based on earlier segmentation results. When building the custom dictionary, I suggest copying the cast list of "In the Name of the People" directly and tagging every entry with the part of speech nr (person name).

That way we can extract the names by segmenting first and then filtering by part of speech. The names that survive the filter are recorded in one list per paragraph, for use when building the matrix later.

This step works paragraph by paragraph, so a global dictionary can keep track of each character's weight (that is, the word frequency). The code is as follows:
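As a rough sketch of the custom-dictionary step: jieba user dictionaries are plain text files with one "word frequency tag" entry per line, loaded with jieba.load_userdict. The helper write_user_dict, the file name names_dict.txt, and the frequency 1000 are assumptions made for this illustration; the only name listed is the "easy to learn" character (易学习) discussed above, where a real run would list the whole cast. The jieba part runs only if jieba is installed.

```python
import tempfile
from pathlib import Path


def write_user_dict(names, path, freq=1000, tag="nr"):
    """Write one 'word freq tag' line per name (hypothetical helper)."""
    entries = [f"{name} {freq} {tag}" for name in names]
    Path(path).write_text("\n".join(entries) + "\n", encoding="utf-8")
    return entries


cast = ["易学习"]  # in practice, the full character list
dict_path = Path(tempfile.gettempdir()) / "names_dict.txt"  # assumed location
dict_lines = write_user_dict(cast, dict_path)

try:
    import jieba
    import jieba.posseg as pseg

    jieba.load_userdict(str(dict_path))
    # With the custom entry loaded, the name should come out as one token tagged nr.
    print([(w.word, w.flag) for w in pseg.cut("易学习来了")])
except ImportError:
    print("jieba is not installed; only the dictionary file was written")
```

The frequency is optional in jieba's format; giving a fairly large value simply makes it more likely that the name wins over a split into shorter words.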

    import jieba.posseg as pseg

    names = {}          # name -> frequency over the whole text
    lines = []          # one list of names per paragraph
    relationships = {}  # nested dict used as the co-occurrence matrix
    TOP_N = 30          # assumed cutoff for how many names to keep; adjust as needed

    # Segment the script and keep only the words tagged as person names;
    # stop words and punctuation are dropped.
    # Every name found is also recorded in the names dictionary, which later
    # provides the rows and columns of the matrix.
    def cut_word(text):
        words = pseg.cut(text)
        l_name = []
        for x in words:
            if x.flag != 'nr' or len(x.word) < 2:
                continue
            if not names.get(x.word):
                names[x.word] = 1
            else:
                names[x.word] = names[x.word] + 1
            l_name.append(x.word)
        return l_name

    # Build the word-frequency dictionary and the list of names in each paragraph
    def namedict_built():
        global names
        with open('E:/py/relationship_find/test.txt', 'r') as f:
            for l in f.readlines():
                n = cut_word(l)
                # Empty and single-name lists are skipped: there is no pair to count
                if len(n) >= 2:
                    lines.append(n)
        # Keep only the TOP_N most frequent names
        names = dict(sorted(names.items(), key=lambda x: x[1], reverse=True)[:TOP_N])
        # print(lines)

3. Building the matrix

Although we speak of a matrix, the code actually uses a two-dimensional (nested) dictionary, because it is quicker to access. The counting is also very simple, not to say brute-force: we just traverse each paragraph's name list from the previous step once more.

Segmentation always yields some odd words, so when constructing the matrix we rely directly on the characters in names from the code above and filter out any word that is not in names; otherwise unrelated junk ends up in the matrix. The code is as follows:

    # Build the co-occurrence matrix by traversing lines
    def relation_built():
        for key in names:
            relationships[key] = {}
        for line in lines:
            for name1 in line:
                if not names.get(name1):
                    continue
                for name2 in line:
                    if name1 == name2 or (not names.get(name2)):
                        continue
                    if not relationships[name1].get(name2):
                        relationships[name1][name2] = 1
                    else:
                        relationships[name1][name2] = relationships[name1][name2] + 1
        # print(relationships)

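The article does not show how the matrix becomes the edge file that the drawing step reads. The following is a minimal sketch of one way to do it on toy data, assuming each line of the file is "name1 name2 weight" separated by whitespace (which is what the drawing code's split() and int(line[2]) suggest). The helpers build_relationships and edge_lines are hypothetical names, and the nested defaultdict is a compact variant of the dictionary above, not the original code.

```python
from collections import defaultdict


def build_relationships(lines, names):
    """Nested-dict co-occurrence matrix, restricted to the names we keep."""
    relationships = defaultdict(lambda: defaultdict(int))
    for line in lines:
        kept = [n for n in line if n in names]  # drop words not in names
        for name1 in kept:
            for name2 in kept:
                if name1 != name2:
                    relationships[name1][name2] += 1
    return relationships


def edge_lines(relationships):
    """Yield 'name1 name2 weight' once per unordered pair (assumed file format)."""
    for name1 in sorted(relationships):
        for name2, weight in sorted(relationships[name1].items()):
            if name1 < name2:  # the matrix is symmetric, so emit each pair once
                yield f"{name1} {name2} {weight}"


lines = [["A", "B", "C"], ["B", "A", "F"], ["A", "D", "C"]]
names = {"A": 3, "B": 2, "C": 2}  # pretend D and F fell below the frequency cutoff
relationships = build_relationships(lines, names)
edges = list(edge_lines(relationships))
print(edges)  # ['A B 2', 'A C 2', 'B C 1']
```

Writing "\n".join(edges) to a text file would then give an input in the shape the drawing code expects.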
4. Drawing with networkx and matplotlib

With the relationships matrix from the previous step, we can draw a network graph with weighted edges. There are countless tutorials online for this kind of drawing, so I will not go into detail. The code looks roughly like this:

    import matplotlib as mpl
    import matplotlib.pyplot as plt
    import networkx as nx

    # Assumed values; adjust them to suit your data
    MIN_WEIGHT = 10   # edges with a weight at or below this are not drawn
    NODE_SIZE = 800
    FONT_SIZE = 12

    def graph_show():
        mpl.rcParams['font.sans-serif'] = ['FangSong']  # Specify the default font
        mpl.rcParams['axes.unicode_minus'] = False  # Fix the minus sign '-' showing as a box in saved images
        G = nx.Graph()
        # In networkx, a node can be any hashable object: a text string, an image,
        # an XML object, even another graph or a custom node object
        with open('E:/py/relationship_find/edge.txt', 'r') as f:
            for i in f.readlines():
                line = str(i).split()
                if line == []:
                    continue
                if int(line[2]) <= MIN_WEIGHT:
                    continue
                G.add_weighted_edges_from([(line[0], line[1], int(line[2]))])
        nx.draw(G, pos=nx.shell_layout(G), node_size=NODE_SIZE, node_color='#A0CBE2',
                edge_color='#A0CBE1', with_labels=True, font_size=FONT_SIZE)
        plt.show()

That produces the figure. It is pretty ugly, to be honest, but at least it is a picture.
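The edge-loading half of the drawing function can also be separated from the plotting, which makes the weight threshold easy to experiment with without opening a window. This is a small sketch under the same assumption as the drawing code, namely whitespace-separated "name1 name2 weight" lines; parse_edges and its min_weight parameter are hypothetical names.

```python
def parse_edges(rows, min_weight=0):
    """Turn 'name1 name2 weight' rows into (name1, name2, weight) tuples."""
    edges = []
    for row in rows:
        fields = row.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        source, target, weight = fields[0], fields[1], int(fields[2])
        if weight <= min_weight:
            continue  # too weak to be worth drawing
        edges.append((source, target, weight))
    return edges


sample = ["A B 2", "", "A C 2", "B C 1"]
strong_edges = parse_edges(sample, min_weight=1)
print(strong_edges)  # [('A', 'B', 2), ('A', 'C', 2)]
```

The returned list has the (u, v, weight) shape that networkx's add_weighted_edges_from accepts, so it can be added to the graph in a single call.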
