11.4 Working with XML
Using XML for Linguistic Structures
(2) <entry>
The Role of XML
(For background on the basics of XML, please consult other references.)
The ElementTree Interface
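>>> import nltk
>>> from xml.etree.ElementTree import ElementTree
>>> merchant_file = nltk.data.find('corpora/shakespeare/merchant.xml')
>>> merchant = ElementTree().parse(merchant_file)
>>> merchant
<Element PLAY at 22fa800>
>>> merchant[0]
<Element TITLE at 22fa828>
>>> merchant[0].text
'The Merchant of Venice'
>>> list(merchant)
[<Element TITLE at 22fa828>, <Element PERSONAE at 22fa7b0>, <Element ... at 2300170>,
<Element PLAYSUBT at 2300198>, <Element ACT at 23001e8>, <Element ACT at 2...>,
<Element ACT at 23c87d8>, <Element ACT at 2439198>, <...>]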
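We can try the same interface without the corpus installed by parsing a small inline document. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the miniature play is invented for illustration and only borrows the element names shown above.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up document that reuses the element names of the play above.
SAMPLE = """<PLAY>
  <TITLE>A Sample Play</TITLE>
  <PERSONAE><PERSONA>FIRST SPEAKER</PERSONA></PERSONAE>
  <ACT><TITLE>ACT I</TITLE></ACT>
  <ACT><TITLE>ACT II</TITLE></ACT>
</PLAY>"""

play = ET.fromstring(SAMPLE)                  # the root element, here <PLAY>

root_tag = play.tag                           # name of the root element
title_text = play[0].text                     # children are indexed like a list
child_tags = [child.tag for child in play]    # iterating yields the child elements

print(root_tag, title_text, child_tags)
```

Iterating over an element (or calling list() on it) gives its children, which is the replacement for the older getchildren() method in current Python versions.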
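ElementTree provides further methods for working with XML. For example, nested findall() calls let us walk through the acts, scenes, speeches, and lines of the play and report where a given word (here, music) occurs:
>>> for i, act in enumerate(merchant.findall('ACT')):
...     for j, scene in enumerate(act.findall('SCENE')):
...         for k, speech in enumerate(scene.findall('SPEECH')):
...             for line in speech.findall('LINE'):
...                 if 'music' in str(line.text):
...                     print("Act %d Scene %d Speech %d: %s" % (i+1, j+1, k+1, line.text))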
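The nested traversal can be tried on a small inline document. This is a sketch only: the sample play, the helper name find_lines, and the search word are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample with the ACT / SCENE / SPEECH / LINE nesting used above.
SAMPLE = """<PLAY>
  <ACT>
    <SCENE>
      <SPEECH><SPEAKER>A</SPEAKER><LINE>Good morrow.</LINE></SPEECH>
      <SPEECH><SPEAKER>B</SPEAKER><LINE>Let music sound.</LINE><LINE>Farewell.</LINE></SPEECH>
    </SCENE>
  </ACT>
  <ACT>
    <SCENE>
      <SPEECH><SPEAKER>A</SPEAKER><LINE>No more.</LINE></SPEECH>
    </SCENE>
    <SCENE>
      <SPEECH><SPEAKER>B</SPEAKER><LINE>Sweet music!</LINE></SPEECH>
    </SCENE>
  </ACT>
</PLAY>"""

def find_lines(play, word):
    """Return (act, scene, speech, text) for every LINE whose text contains word."""
    hits = []
    for i, act in enumerate(play.findall('ACT'), start=1):
        for j, scene in enumerate(act.findall('SCENE'), start=1):
            for k, speech in enumerate(scene.findall('SPEECH'), start=1):
                for line in speech.findall('LINE'):
                    # line.text is None for an empty element, hence `or ''`
                    if word in (line.text or ''):
                        hits.append((i, j, k, line.text))
    return hits

hits = find_lines(ET.fromstring(SAMPLE), 'music')
for act, scene, speech, text in hits:
    print("Act %d Scene %d Speech %d: %s" % (act, scene, speech, text))
```

Note that findall() with a bare tag name only searches the direct children of an element, which is why one loop per nesting level is needed.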
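We can also look at the sequence of speakers, and use a frequency distribution to see who has the most to say. Here top5 is a list of the five most frequent speaker names:
>>> speaker_seq = [s.text for s in merchant.findall('ACT/SCENE/SPEECH/SPEAKER')]
>>> speaker_freq = nltk.FreqDist(speaker_seq)
>>> top5 = [speaker for speaker, count in speaker_freq.most_common(5)]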
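The same count can be sketched with the standard library alone, using collections.Counter in place of FreqDist. The speaker names A, B, and C below are placeholders in an invented document.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample: six speeches by three placeholder speakers.
SAMPLE = """<PLAY><ACT><SCENE>
  <SPEECH><SPEAKER>A</SPEAKER></SPEECH>
  <SPEECH><SPEAKER>B</SPEAKER></SPEECH>
  <SPEECH><SPEAKER>A</SPEAKER></SPEECH>
  <SPEECH><SPEAKER>C</SPEAKER></SPEECH>
  <SPEECH><SPEAKER>A</SPEAKER></SPEECH>
  <SPEECH><SPEAKER>B</SPEAKER></SPEECH>
</SCENE></ACT></PLAY>"""

play = ET.fromstring(SAMPLE)

# A path with slashes descends several levels in a single findall() call.
speaker_seq = [s.text for s in play.findall('ACT/SCENE/SPEECH/SPEAKER')]

speaker_freq = Counter(speaker_seq)
top2 = [name for name, count in speaker_freq.most_common(2)]

print(speaker_seq, top2)
```

most_common(n) returns (name, count) pairs sorted by decreasing count, so the list comprehension keeps only the names.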
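We can also look for patterns in who follows whom in the dialogue. To keep the table small, the top five speakers are abbreviated to their first four letters and all other speakers are mapped to a single label, OTH:
>>> from collections import defaultdict
>>> mapping = defaultdict(lambda: 'OTH')
>>> for s in top5:
...     mapping[s] = s[:4]
...
>>> speaker_seq2 = [mapping[s] for s in speaker_seq]
>>> cfd = nltk.ConditionalFreqDist(nltk.bigrams(speaker_seq2))
>>> cfd.tabulate()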
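The idea of counting who follows whom can be sketched without NLTK by pairing each speaker with the next one. The names below are invented, and the label OTH for all remaining speakers is an assumed placeholder.

```python
from collections import Counter, defaultdict

# Invented speaker sequence; `top` plays the role of the most frequent speakers.
speaker_seq = ['ALICE', 'BRUNO', 'ALICE', 'CARLA', 'ALICE', 'BRUNO', 'DAVID', 'ALICE']
top = ['ALICE', 'BRUNO']

# Frequent speakers get a four-letter abbreviation; everyone else maps to 'OTH'.
abbrev = defaultdict(lambda: 'OTH')
for name in top:
    abbrev[name] = name[:4]

seq2 = [abbrev[name] for name in speaker_seq]

# zip(seq2, seq2[1:]) yields consecutive (speaker, next speaker) pairs.
follows = Counter(zip(seq2, seq2[1:]))

for (first, second), count in follows.most_common():
    print("%s -> %s: %d" % (first, second, count))
```

A ConditionalFreqDist over bigrams holds the same counts, with the first speaker as the condition, and adds the tabulate() display.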
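Using ElementTree for Accessing Toolbox Data
We can use the toolbox.xml() method to read a Toolbox file and load it into an ElementTree object:
>>> from nltk.corpus import toolbox
>>> lexicon = toolbox.xml('rotokas.dic')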
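The contents of the lexicon can be accessed by index. For example, lexicon[3] is entry number 3 (counting from zero) and lexicon[3][0] is its first field:
>>> lexicon[3][0]
<Element lx at 77bd28>
>>> lexicon[3][0].tag
'lx'
>>> lexicon[3][0].text
'kaa'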
The contents can also be accessed with a path. For example, record/lx selects the lx field of every record:
>>> [lexeme.text.lower() for lexeme in lexicon.findall('record/lx')]
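To view an entry in XML form, we can write it to standard output (the output is shown here with one field per line for readability):
>>> import sys
>>> from xml.etree.ElementTree import ElementTree
>>> tree = ElementTree(lexicon[3])
>>> tree.write(sys.stdout, encoding='unicode')
<record>
<lx>kaa</lx>
<ps>N</ps>
<pt>MASC</pt>
<cl>isi</cl>
<ge>cooking banana</ge>
<tkp>banana bilong kukim</tkp>
<pt>itoo</pt>
<sf>FLORA</sf>
<dt>12/Aug/2005</dt>
<ex>Taeavi iria kaa isi kovopaueva kaparapasia.</ex>
<xp>Taeavi i bin planim gaden banana bilong kukim tasol long paia.</xp>
<xe>Taeavi planted banana in order to cook it.</xe>
</record>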
Formatting Entries
We can use the same approach to generate output in whatever format we need. For example, the following builds an HTML table from ten entries of the lexicon, with one row per entry holding its lx, ps, and ge fields:
>>> html = "<table>\n"
>>> for entry in lexicon[70:80]:
...     lx = entry.findtext('lx')
...     ps = entry.findtext('ps')
...     ge = entry.findtext('ge')
...     html += "  <tr><td>%s</td><td>%s</td><td>%s</td></tr>\n" % (lx, ps, ge)
...
>>> html += "</table>"
>>> print(html)
<table>
  <tr><td>kakae</td><td>???</td><td>small</td></tr>
  <tr><td>kakae</td><td>CLASS</td><td>child</td></tr>
  <tr><td>kakaevira</td><td>ADV</td><td>small-like</td></tr>
  <tr><td>kakapikoa</td><td>???</td><td>small</td></tr>
  <tr><td>kakapikoto</td><td>N</td><td>newborn baby</td></tr>
  <tr><td>kakapu</td><td>V</td><td>place in sling for purpose of carrying</td></tr>
  <tr><td>kakapua</td><td>N</td><td>sling for lifting</td></tr>
  <tr><td>kakara</td><td>N</td><td>arm band</td></tr>
  <tr><td>Kakarapaia</td><td>N</td><td>village name</td></tr>
  <tr><td>kakarau</td><td>N</td><td>frog</td></tr>
</table>
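The table-building step can be wrapped in a small function and tried on inline records. This is a sketch under stated assumptions: the helper name to_html_table, the lexicon root element, and the choice of ??? as a default for a missing field are illustrative, and the three records are a hand-made sample. It also escapes the cell text, which matters if a gloss ever contains a character such as < or &.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hand-made sample records; the third has no ps field, to exercise the default.
SAMPLE = """<lexicon>
  <record><lx>kaa</lx><ps>N</ps><ge>cooking banana</ge></record>
  <record><lx>kakarau</lx><ps>N</ps><ge>frog</ge></record>
  <record><lx>kakae</lx><ge>small</ge></record>
</lexicon>"""

def to_html_table(records, missing='???'):
    """Build an HTML table with one row (lx, ps, ge) per record."""
    rows = []
    for entry in records:
        # findtext() returns `default` when the field is absent from the entry.
        cells = [entry.findtext(tag, default=missing) for tag in ('lx', 'ps', 'ge')]
        row = "".join("<td>%s</td>" % escape(cell) for cell in cells)
        rows.append("  <tr>%s</tr>" % row)
    return "<table>\n" + "\n".join(rows) + "\n</table>"

lexicon = ET.fromstring(SAMPLE)
table_html = to_html_table(lexicon)
print(table_html)
```

Collecting the rows in a list and joining them once avoids repeated string concatenation, which is a matter of style for a table of this size.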
11.5 Working with Toolbox Data
Adding a Field to Each Entry
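Example 11-2. Adding a new cv field to a lexical entry. The cv() function maps a string to its consonant-vowel pattern: it lowercases the string, marks non-letters as _, vowels as V, and everything else as C. The add_cv_field() function appends a cv field to an entry, computed from its lx field.
import re
from xml.etree.ElementTree import SubElement
def cv(s):
    s = s.lower()
    s = re.sub(r'[^a-z]', r'_', s)
    s = re.sub(r'[aeiou]', r'V', s)
    s = re.sub(r'[^V_]', r'C', s)
    return s
def add_cv_field(entry):
    for field in entry:
        if field.tag == 'lx':
            cv_field = SubElement(entry, 'cv')
            cv_field.text = cv(field.text)
>>> from nltk.corpus import toolbox
>>> from nltk.toolbox import to_sfm_string
>>> lexicon = toolbox.xml('rotokas.dic')
>>> add_cv_field(lexicon[53])
>>> print(to_sfm_string(lexicon[53]))
The printed entry, which is dated 03/Jun/2005, now ends with the new \cv field.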
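The field-adding step can be checked on a single hand-made record without loading the lexicon. In this sketch the three regular expressions are one possible way to compute a consonant-vowel pattern and should be read as an assumption, not as the only definition; the record itself is a small sample.

```python
import re
import xml.etree.ElementTree as ET

def cv(s):
    """Map a word to a consonant-vowel pattern, e.g. 'kakarau' -> 'CVCVCVV'."""
    s = s.lower()
    s = re.sub(r'[^a-z]', '_', s)    # anything that is not a letter becomes _
    s = re.sub(r'[aeiou]', 'V', s)   # vowels become V
    s = re.sub(r'[^V_]', 'C', s)     # every remaining letter is a consonant
    return s

def add_cv_field(entry):
    """Append a <cv> field derived from the entry's <lx> field."""
    # Iterate over a copy, because SubElement() appends to `entry` in the loop.
    for field in list(entry):
        if field.tag == 'lx':
            cv_field = ET.SubElement(entry, 'cv')
            cv_field.text = cv(field.text)

entry = ET.fromstring("<record><lx>kakarau</lx><ps>N</ps><ge>frog</ge></record>")
add_cv_field(entry)

tags = [field.tag for field in entry]
cv_text = entry.findtext('cv')
print(tags, cv_text)
```

SubElement() always appends the new element as the last child, so the cv field comes after the existing fields of the entry.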
Validating a Toolbox Lexicon
Many lexicons in Toolbox format do not conform to any particular schema. Some entries may include extra fields, or may order existing fields in a new way.
For example, with the help of FreqDist, we can easily identify the field sequences that are frequent and those that are exceptional:
>>> fd = nltk.FreqDist(':'.join(field.tag for field in entry) for entry in lexicon)
>>> fd.most_common()
[('...', 41), ('...', 37), ('...', 27), ('...', 20), ..., ('...', 1)]
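The check can be sketched with collections.Counter on a few inline records. The sample lexicon below is invented: three records share one field order and a fourth swaps two fields, so it stands out with a count of 1.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample: the third record lists ge before ps.
SAMPLE = """<lexicon>
  <record><lx>a</lx><ps>N</ps><ge>x</ge></record>
  <record><lx>b</lx><ps>V</ps><ge>y</ge></record>
  <record><lx>c</lx><ge>z</ge><ps>N</ps></record>
  <record><lx>d</lx><ps>N</ps><ge>w</ge></record>
</lexicon>"""

lexicon = ET.fromstring(SAMPLE)

# Summarise each entry as the sequence of its field names, e.g. 'lx:ps:ge'.
field_sequences = Counter(':'.join(field.tag for field in entry) for entry in lexicon)

common = field_sequences.most_common()             # most frequent sequence first
rare = [seq for seq, count in common if count == 1]

print(common)
print(rare)
```

Sequences that occur only once are good candidates for manual inspection, since they often point to a misplaced or missing field.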