Discovering character relationships from co-occurrence: a Python implementation

Last Update:2018-04-22 Source: Internet

Author: User


Reference links:
Extracting the character relationships in "Train to Busan"
Drawing a good-looking network diagram with Python's networkx

1. Co-occurrence relationships

In bibliometrics, co-word analysis of keywords is used to determine how the topics covered by a set of documents relate to one another. Here we want to analyze a novel or a script to find the relationships among its characters. The two problems are very similar.

In general, we assume that two characters who appear in the same paragraph of a text are related in some way, which fixes the rough flow of the program. First we segment the text into words and extract the character names in each paragraph. Then, paragraph by paragraph, we count how many times each pair of characters appears together and store the result in a two-dimensional matrix. This matrix doubles as the adjacency matrix of a graph, whose elements (the co-occurrence counts) are the edge weights.

For example, suppose segmenting three paragraphs yields A/B/C, B/A/F, and A/D/C. Then the pair AB co-occurs 2 times, the pair AC 2 times, and so on.
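
The counting in this example can be sketched in a few lines. This is a minimal illustration rather than the article's own code: `paragraphs` holds the toy data from the example, and `count_pairs` is a hypothetical helper name. It assumes pairs are unordered and that a pair counts once per paragraph.

```python
from collections import Counter
from itertools import combinations

def count_pairs(paragraphs):
    """Count how many paragraphs each unordered pair of names shares."""
    counts = Counter()
    for names_in_paragraph in paragraphs:
        # set() drops repeats so a pair is counted once per paragraph;
        # sorted() gives every pair a canonical (alphabetical) order.
        for pair in combinations(sorted(set(names_in_paragraph)), 2):
            counts[pair] += 1
    return counts

paragraphs = [["A", "B", "C"], ["B", "A", "F"], ["A", "D", "C"]]
pair_counts = count_pairs(paragraphs)
print(pair_counts[("A", "B")], pair_counts[("A", "C")])  # 2 2
```

A `Counter` keyed by sorted tuples stores each pair once, which is more compact than a full matrix when only the edge list is needed.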

For convenience, we also record the characters and their relationships in files. The text we analyze here is the novel "In the Name of the People".

2. Word segmentation with jieba

For the principles and usage of jieba, see the article "The Basic Principles of Chinese Word Segmentation and the Use of jieba".

jieba can segment the text, but its results are still not very accurate. For example, one character is named Yi Xuexi (易学习, which reads literally as "easy to learn"): "易" is tagged as an adverb and "学习" as a verb, so the name is hard to extract as a unit. Fortunately, jieba supports a custom dictionary, which we can correct bit by bit based on earlier segmentation results. When building the custom dictionary, I suggest copying the character list of "In the Name of the People" directly and tagging every entry with the part of speech nr (person name).
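
As a rough sketch of that suggestion, the snippet below writes a cast list into the plain-text format jieba's user dictionaries use: one entry per line, the word followed by an optional frequency and an optional part-of-speech tag. The file name `userdict.txt`, the helper `write_userdict`, and the one-name sample list are assumptions made for illustration, and "易学习" is assumed to be the original form of the name discussed above.

```python
import os
import tempfile

def write_userdict(cast, path, tag="nr"):
    """Write one 'word tag' line per name, tagging every name as a person."""
    with open(path, "w", encoding="utf-8") as f:
        for name in cast:
            f.write(f"{name} {tag}\n")

# Sample list with a single name; in practice, paste the full character list.
cast = ["易学习"]
dict_path = os.path.join(tempfile.gettempdir(), "userdict.txt")
write_userdict(cast, dict_path)

with open(dict_path, encoding="utf-8") as f:
    entries = f.read().splitlines()
print(entries)

# With jieba installed, load the dictionary before segmenting:
#   import jieba
#   jieba.load_userdict(dict_path)
```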

We can therefore pick out the names by segmenting first and then filtering on the part-of-speech tag. The names that pass the filter are recorded in one list per paragraph, which is later used to build the matrix.

Since this step runs paragraph by paragraph, a global dictionary can record the weight of each character (that is, its word frequency) along the way. The code is as follows:

    import jieba.posseg as pseg

    names = {}   # name -> word frequency
    lines = []   # one list of names per paragraph

    # Segment a paragraph and keep only the words tagged as person names;
    # stop words and punctuation are dropped. Every name found is also
    # counted in names, whose keys later become the rows and columns of
    # the matrix.
    def cut_word(text):
        words = pseg.cut(text)
        l_name = []
        for x in words:
            if x.flag != 'nr' or len(x.word) < 2:
                continue
            if not names.get(x.word):
                names[x.word] = 1
            else:
                names[x.word] = names[x.word] + 1
            l_name.append(x.word)
        return l_name

    # Build the word-frequency dictionary and the per-paragraph name lists.
    # top_n is how many of the most frequent names to keep; the value used
    # in the original listing is unreadable, so it is passed in here.
    def namedict_built(top_n):
        global names
        with open('E:/py/relationship_find/test.txt', 'r') as f:
            for l in f.readlines():
                n = cut_word(l)
                # Empty and single-name lists carry no relationship
                if len(n) >= 2:
                    lines.append(n)
        names = dict(sorted(names.items(), key=lambda x: x[1], reverse=True)[:top_n])
        # print(lines)

3. Build the matrix

Although we speak of a matrix, in code it is actually a two-dimensional dictionary (a dictionary of dictionaries), because that is quicker to access. The counting is simple, even brute force: we just traverse the per-paragraph name lists built above once more.

Because the segmentation results always contain some odd words, we build the matrix only from the names kept in the names dictionary in the code above and filter out every word that is not in it; otherwise stray entries would creep in. The code is as follows:

    relationships = {}   # name1 -> {name2: co-occurrence count}

    # Build the co-occurrence matrix by traversing lines
    def relation_built():
        for key in names:
            relationships[key] = {}
        for line in lines:
            for name1 in line:
                if not names.get(name1):
                    continue
                for name2 in line:
                    if name1 == name2 or (not names.get(name2)):
                        continue
                    if not relationships[name1].get(name2):
                        relationships[name1][name2] = 1
                    else:
                        relationships[name1][name2] = relationships[name1][name2] + 1
        # print(relationships)
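
The drawing step that follows reads an `edge.txt` file, but the article does not show the code that writes it. One plausible way to produce such a file from the nested dictionary is sketched below, assuming each line holds two names and a weight separated by whitespace (the layout the reading code splits on). The helper `write_edges`, the file name `edge_demo.txt`, and the sample data are hypothetical.

```python
import os
import tempfile

def write_edges(relationships, path):
    """Write each undirected edge once as 'name1 name2 weight'."""
    seen = set()
    with open(path, "w", encoding="utf-8") as f:
        for name1, neighbours in relationships.items():
            for name2, weight in neighbours.items():
                # The matrix is symmetric, so skip the mirrored entry.
                if (name2, name1) in seen:
                    continue
                seen.add((name1, name2))
                f.write(f"{name1} {name2} {weight}\n")

# Toy symmetric matrix in the same nested-dict shape as relationships.
sample = {
    "A": {"B": 2, "C": 2},
    "B": {"A": 2, "C": 1},
    "C": {"A": 2, "B": 1},
}
edge_path = os.path.join(tempfile.gettempdir(), "edge_demo.txt")
write_edges(sample, edge_path)

with open(edge_path, encoding="utf-8") as f:
    edge_lines = f.read().splitlines()
print(edge_lines)
```

Writing each pair only once keeps the file half the size of the full matrix; an undirected graph needs nothing more.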

4. Drawing with networkx and matplotlib

With the relationships matrix in hand, we can draw a network diagram with weighted edges from it. There are countless tutorials online for this kind of drawing, so the details are not recorded here; the code looks roughly like this:

    import matplotlib as mpl
    import matplotlib.pyplot as plt
    import networkx as nx

    # min_weight, node_size and font_size are parameters because the values
    # used in the original listing are unreadable.
    def graph_show(min_weight, node_size, font_size):
        mpl.rcParams['font.sans-serif'] = ['FangSong']  # Specify the default font
        mpl.rcParams['axes.unicode_minus'] = False  # Keep the minus sign '-' from showing as a box in saved images
        G = nx.Graph()
        # In networkx a node can be any hashable object: a text string, an
        # image, an XML object, another graph, or any custom node object
        with open('E:/py/relationship_find/edge.txt', 'r') as f:
            for i in f.readlines():
                line = str(i).split()
                if line == []:
                    continue
                # Skip edges whose weight does not exceed the threshold
                if int(line[2]) <= min_weight:
                    continue
                G.add_weighted_edges_from([(line[0], line[1], int(line[2]))])
        nx.draw(G, pos=nx.shell_layout(G), node_size=node_size,
                node_color='#A0CBE2', edge_color='#A0CBE1',
                with_labels=True, font_size=font_size)
        plt.show()
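
The filtering inside the loop can also be separated from the drawing, which makes the weight cutoff easy to tune. The sketch below is an illustration under stated assumptions: `parse_edges` and `min_weight` are hypothetical names, and the threshold of 5 is arbitrary because the original value did not survive in the listing.

```python
def parse_edges(rows, min_weight):
    """Turn 'name1 name2 weight' rows into tuples, dropping weak edges."""
    edges = []
    for row in rows:
        parts = row.split()
        if not parts:
            continue  # skip blank lines
        weight = int(parts[2])
        if weight <= min_weight:
            continue  # same '<=' comparison as the drawing code
        edges.append((parts[0], parts[1], weight))
    return edges

rows = ["A B 12", "", "A C 3", "B C 7"]
strong_edges = parse_edges(rows, min_weight=5)
print(strong_edges)  # [('A', 'B', 12), ('B', 'C', 7)]

# The result can be passed straight to networkx:
#   G.add_weighted_edges_from(strong_edges)
```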

That produces the figure. To be honest it is rather ugly, but at least it is a working graph.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
