Constructing the Parsing Results of Stanford CoreNLP as JSON


Working with an English corpus starts with some basic NLP processing, and the tool of choice is of course Stanford CoreNLP. Since the parsing results of the official Stanford CoreNLP example are not meant to be consumed directly by other programs, I modified the example so that it converts the parsing results to JSON, with the dependency-parsing results added to the JSON following the output format of HIT's LTP (Language Technology Platform).

1. Installing Stanford CoreNLP

The latest version of Stanford CoreNLP requires JDK 1.8, which is rather awkward, because the JDK on most machines is still 1.6 or 1.7, so I downloaded the last version that supports JDK 1.6: http://nlp.stanford.edu/software/stanford-corenlp-full-2014-08-27.zip. Once the download is complete, put all of the extracted contents into the root directory of your (Eclipse) project and add all of the jar packages to the project libraries via the build path; that completes the installation and configuration. The extracted directory contains a sample file named StanfordCoreNlpDemo.java, which succinctly shows how to use the tool, but its output is pretty-printed, which is easy for people to read and hard for a machine to consume. So I rewrote the code, drawing on the example at http://www.cnblogs.com/tec-vegetables/p/4153144.html.
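Before moving on, a quick way to confirm that the jars are correctly on the build path is to load a minimal pipeline and print the tokens of a test sentence. The following is my own sketch (the class name SetupCheck and the test sentence are arbitrary), using the annotation classes from the 3.4.1 release linked above:

import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class SetupCheck {
    public static void main(String[] args) {
        // Only tokenize and ssplit are needed to confirm that the jars load.
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation document = new Annotation("Beijing is the capital of China.");
        pipeline.annotate(document);

        // Print one token per line; any output at all means the setup works.
        for (CoreMap sentence : document.get(SentencesAnnotation.class)) {
            for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                System.out.println(token.get(TextAnnotation.class));
            }
        }
    }
}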
2. The code, with a detailed explanation in the comments

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import net.sf.json.JSONArray;

import edu.stanford.nlp.dcoref.CorefChain;
import edu.stanford.nlp.ling.CoreAnnotations.CharacterOffsetBeginAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.CharacterOffsetEndAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.PartOfSpeechAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations.TreeAnnotation;
import edu.stanford.nlp.util.CoreMap;

public class TestCoreNLP {

    // Parameter text is the sentence to be processed.
    public static void run(String text) {
        // Create a CoreNLP object and set the tasks to perform.
        // tokenize: tokenization; ssplit: sentence splitting; pos: part-of-speech
        // tagging; lemma: lemmatization; ner: named entity recognition;
        // parse: syntactic parsing (including dependency parsing);
        // dcoref: coreference resolution.
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // Create an annotation object from the input text and run the
        // pipeline on it.
        Annotation document = new Annotation(text);
        pipeline.annotate(document);

        // Get the processing result: the list of sentences.
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);

        // Traverse all sentences and output the processing result for each one.
        for (CoreMap sentence : sentences) {
            // A JSON array to hold the final parse result of the current sentence.
            JSONArray jsonSent = new JSONArray();
            // The id of the current word in the sentence, starting from 1,
            // because the original dependency result numbers words from 1.
            int id = 1;
            // Get the dependency parse of the current sentence.
            SemanticGraph dependencies = sentence.get(CollapsedCCProcessedDependenciesAnnotation.class);
            // Traverse every word in the sentence, get its analysis results,
            // and build the JSON data.
            for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                // A Map object to hold the analysis results of the current word.
                Map mapWord = new HashMap();
                mapWord.put("id", id);                                                     // word id
                mapWord.put("cont", token.get(TextAnnotation.class));                      // word content
                mapWord.put("pos", token.get(PartOfSpeechAnnotation.class));               // part-of-speech tag
                mapWord.put("ner", token.get(NamedEntityTagAnnotation.class));             // named entity tag
                mapWord.put("lemma", token.get(LemmaAnnotation.class));                    // lemma
                mapWord.put("charbegin", token.get(CharacterOffsetBeginAnnotation.class)); // start offset in the sentence
                mapWord.put("charend", token.get(CharacterOffsetEndAnnotation.class));     // end offset in the sentence

                // Find the dependency relation of the current word. In the original
                // parsing results, the dependencies are gathered separately in another
                // string variable, one per line, in the form:
                //   relation(governor-governorId, dependent-dependentId)
                // So the approach here is to split on \n and then use regular
                // expressions to extract each word's relation name and governor id.
                int flag = 0; // 0 = current word's relation not yet found, 1 = found
                // Split on \n and keep the result as a string array.
                String[] dArray = dependencies.toString(SemanticGraph.OutputFormat.LIST).split("\n");
                for (int i = 0; i < dArray.length; i++) {
                    if (flag == 1) // already handled: stop traversing immediately
                        break;
                    // Parse one dependency line into [relation, governorId, dependentId].
                    ArrayList dc = getDependencyContent(dArray[i]);
                    // If the current word id equals the dependent id in this
                    // dependency, we have found the word's relation.
                    if (Integer.parseInt(String.valueOf(dc.get(2))) == id) {
                        mapWord.put("relation", dc.get(0)); // relation name
                        mapWord.put("parent", dc.get(1));   // governor (parent) word id
                        flag = 1;                           // mark this word as handled
                        break;                              // exit the traversal
                    }
                }
                jsonSent.add(mapWord); // add this word's results to the current sentence
                id++;                  // word id self-increments
            }
            System.out.println(jsonSent);

            // // Get and print the parse tree:
            // Tree tree = sentence.get(TreeAnnotation.class);
            // System.out.println("\n" + tree.toString());
            // // Get and print the dependency parse:
            // System.out.println("\nDependency Graph:\n" + dependencies.toString(SemanticGraph.OutputFormat.LIST));
            // // Get and print the coreference result:
            // Map<Integer, CorefChain> graph = document.get(CorefChainAnnotation.class);
            // System.out.println(graph);
        }
    }

    // Parses one dependency line. For example, from "root(abc-1, efg-3)" it
    // returns an ArrayList with the value [root, 1, 3].
    public static ArrayList getDependencyContent(String sent) {
        String str = sent;
        ArrayList result = new ArrayList();
        String patternName = "(.*)\\(";          // the relation name before "("
        String patternGid = "\\(.*-([0-9]*)\\,"; // the governor id before ","
        String patternDid = ".*-([0-9]*)\\)";    // the dependent id before ")"
        Pattern r = Pattern.compile(patternName);
        Matcher m = r.matcher(str);
        if (m.find()) {
            result.add(m.group(1));
        }
        r = Pattern.compile(patternGid);
        m = r.matcher(str);
        if (m.find()) {
            result.add(m.group(1));
        }
        r = Pattern.compile(patternDid);
        m = r.matcher(str);
        if (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }
}
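For completeness, here is a minimal sketch of how the class might be driven (this main method is my addition, not part of the original code):

public static void main(String[] args) {
    // Run the full pipeline on the example sentence analyzed in the next section.
    TestCoreNLP.run("Beijing is the capital of China.");

    // The regex helper can also be exercised on its own; this prints [root, 1, 3].
    System.out.println(TestCoreNLP.getDependencyContent("root(abc-1, efg-3)"));
}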

To "Beijing is the capital of the PRC." As an example, the result is:[{"id": 1, "lemma": "Beijing", "Relation": "NSUBJ", "Parent": "4", "NER": "Location", "Charend": 7, "cont": "Beijing", " Charbegin ": 0," pos ":" NNP "},{" id ": 2," lemma ":" Be "," relation ":" Cop "," Parent ":" 4 "," ner ":" O "," Charend ": Ten," cont ":" Is "," Charbegin ": 8," pos ":" VBZ "},{" id ": 3," lemma ":" The "," Relation ":" Det "," Parent ":" 4 "," ner ":" O "," Charend ":" Cont " : "The", "Charbegin": One, "pos": "DT"},{"id": 4, "lemma": "Capital", "Relation": "Root", "parent": "0", "ner": "O", "Charend" : "cont": "Capital", "Charbegin": "POS": "NN"},{"id": 5, "lemma": "of", "ner": "O", "Charend": "Cont": "of", " Charbegin ":", "POS": "In"},{"id": 6, "lemma": "China", "Relation": "Prep_of", "Parent": "4", "NER": "Location", "Charend" : "Cont": "China", "charbegin": +, "pos": "NNP"},{"id": 7, "lemma": ".", "ner": "O", "charend": +, "cont": ".", " Charbegin ":", "pos": "."}]
