Constructing the Parsing Results of Stanford CoreNLP as JSON


Working with an English corpus starts with some basic NLP processing, and the tool of choice is of course Stanford CoreNLP. Since the parsing results of the official Stanford CoreNLP example are not meant to be consumed directly by other programs, I modified the example so that it converts the parsing results to JSON, with the dependency-parsing results added to the JSON following the output format of HIT's LTP (Language Technology Platform).

1. Installing Stanford CoreNLP

The latest version of Stanford CoreNLP requires JDK 1.8, which is rather awkward, because the JDK on most machines is still 1.6 or 1.7, so I downloaded the last version that supports JDK 1.6: http://nlp.stanford.edu/software/stanford-corenlp-full-2014-08-27.zip. Once the download is complete, put all of the extracted contents into the root directory of your (Eclipse) project and add all of the jar packages to the project libraries via the build path; that completes the installation and configuration. The extracted directory contains a sample file named StanfordCoreNlpDemo.java, which succinctly shows how to use the tool, but its output is pretty-printed, which is easy for people to read and hard for a machine to consume. So I rewrote the code, drawing on the example at http://www.cnblogs.com/tec-vegetables/p/4153144.html.
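Before moving on, a quick way to confirm that the jars are correctly on the build path is to load a minimal pipeline and print the tokens of a test sentence. The following is my own sketch (the class name SetupCheck and the test sentence are arbitrary), using the annotation classes from the 3.4.1 release linked above:

import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class SetupCheck {
    public static void main(String[] args) {
        // Only tokenize and ssplit are needed to confirm that the jars load.
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation document = new Annotation("Beijing is the capital of China.");
        pipeline.annotate(document);

        // Print one token per line; any output at all means the setup works.
        for (CoreMap sentence : document.get(SentencesAnnotation.class)) {
            for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                System.out.println(token.get(TextAnnotation.class));
            }
        }
    }
}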
2. The code, with a detailed explanation in the comments

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import net.sf.json.JSONArray;

import edu.stanford.nlp.dcoref.CorefChain;
import edu.stanford.nlp.ling.CoreAnnotations.CharacterOffsetBeginAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.CharacterOffsetEndAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.PartOfSpeechAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations.TreeAnnotation;
import edu.stanford.nlp.util.CoreMap;

public class TestCoreNLP {

    // Parameter text is the sentence to be processed.
    public static void run(String text) {
        // Create a CoreNLP object and set the tasks to perform.
        // tokenize: tokenization; ssplit: sentence splitting; pos: part-of-speech
        // tagging; lemma: lemmatization; ner: named entity recognition;
        // parse: syntactic parsing (including dependency parsing);
        // dcoref: coreference resolution.
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // Create an annotation object from the input text and run the
        // pipeline on it.
        Annotation document = new Annotation(text);
        pipeline.annotate(document);

        // Get the processing result: the list of sentences.
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);

        // Traverse all sentences and output the processing result for each one.
        for (CoreMap sentence : sentences) {
            // A JSON array to hold the final parse result of the current sentence.
            JSONArray jsonSent = new JSONArray();
            // The id of the current word in the sentence, starting from 1,
            // because the original dependency result numbers words from 1.
            int id = 1;
            // Get the dependency parse of the current sentence.
            SemanticGraph dependencies = sentence.get(CollapsedCCProcessedDependenciesAnnotation.class);
            // Traverse every word in the sentence, get its analysis results,
            // and build the JSON data.
            for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                // A Map object to hold the analysis results of the current word.
                Map mapWord = new HashMap();
                mapWord.put("id", id);                                                     // word id
                mapWord.put("cont", token.get(TextAnnotation.class));                      // word content
                mapWord.put("pos", token.get(PartOfSpeechAnnotation.class));               // part-of-speech tag
                mapWord.put("ner", token.get(NamedEntityTagAnnotation.class));             // named entity tag
                mapWord.put("lemma", token.get(LemmaAnnotation.class));                    // lemma
                mapWord.put("charbegin", token.get(CharacterOffsetBeginAnnotation.class)); // start offset in the sentence
                mapWord.put("charend", token.get(CharacterOffsetEndAnnotation.class));     // end offset in the sentence

                // Find the dependency relation of the current word. In the original
                // parsing results, the dependencies are gathered separately in another
                // string variable, one per line, in the form:
                //   relation(governor-governorId, dependent-dependentId)
                // So the approach here is to split on \n and then use regular
                // expressions to extract each word's relation name and governor id.
                int flag = 0; // 0 = current word's relation not yet found, 1 = found
                // Split on \n and keep the result as a string array.
                String[] dArray = dependencies.toString(SemanticGraph.OutputFormat.LIST).split("\n");
                for (int i = 0; i < dArray.length; i++) {
                    if (flag == 1) // already handled: stop traversing immediately
                        break;
                    // Parse one dependency line into [relation, governorId, dependentId].
                    ArrayList dc = getDependencyContent(dArray[i]);
                    // If the current word id equals the dependent id in this
                    // dependency, we have found the word's relation.
                    if (Integer.parseInt(String.valueOf(dc.get(2))) == id) {
                        mapWord.put("relation", dc.get(0)); // relation name
                        mapWord.put("parent", dc.get(1));   // governor (parent) word id
                        flag = 1;                           // mark this word as handled
                        break;                              // exit the traversal
                    }
                }
                jsonSent.add(mapWord); // add this word's results to the current sentence
                id++;                  // word id self-increments
            }
            System.out.println(jsonSent);

            // // Get and print the parse tree:
            // Tree tree = sentence.get(TreeAnnotation.class);
            // System.out.println("\n" + tree.toString());
            // // Get and print the dependency parse:
            // System.out.println("\nDependency Graph:\n" + dependencies.toString(SemanticGraph.OutputFormat.LIST));
            // // Get and print the coreference result:
            // Map<Integer, CorefChain> graph = document.get(CorefChainAnnotation.class);
            // System.out.println(graph);
        }
    }

    // Parses one dependency line. For example, from "root(abc-1, efg-3)" it
    // returns an ArrayList with the value [root, 1, 3].
    public static ArrayList getDependencyContent(String sent) {
        String str = sent;
        ArrayList result = new ArrayList();
        String patternName = "(.*)\\(";          // the relation name before "("
        String patternGid = "\\(.*-([0-9]*)\\,"; // the governor id before ","
        String patternDid = ".*-([0-9]*)\\)";    // the dependent id before ")"
        Pattern r = Pattern.compile(patternName);
        Matcher m = r.matcher(str);
        if (m.find()) {
            result.add(m.group(1));
        }
        r = Pattern.compile(patternGid);
        m = r.matcher(str);
        if (m.find()) {
            result.add(m.group(1));
        }
        r = Pattern.compile(patternDid);
        m = r.matcher(str);
        if (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }
}
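For completeness, here is a minimal sketch of how the class might be driven (this main method is my addition, not part of the original code):

public static void main(String[] args) {
    // Run the full pipeline on the example sentence analyzed in the next section.
    TestCoreNLP.run("Beijing is the capital of China.");

    // The regex helper can also be exercised on its own; this prints [root, 1, 3].
    System.out.println(TestCoreNLP.getDependencyContent("root(abc-1, efg-3)"));
}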

To "Beijing is the capital of the PRC." As an example, the result is:[{"id": 1, "lemma": "Beijing", "Relation": "NSUBJ", "Parent": "4", "NER": "Location", "Charend": 7, "cont": "Beijing", " Charbegin ": 0," pos ":" NNP "},{" id ": 2," lemma ":" Be "," relation ":" Cop "," Parent ":" 4 "," ner ":" O "," Charend ": Ten," cont ":" Is "," Charbegin ": 8," pos ":" VBZ "},{" id ": 3," lemma ":" The "," Relation ":" Det "," Parent ":" 4 "," ner ":" O "," Charend ":" Cont " : "The", "Charbegin": One, "pos": "DT"},{"id": 4, "lemma": "Capital", "Relation": "Root", "parent": "0", "ner": "O", "Charend" : "cont": "Capital", "Charbegin": "POS": "NN"},{"id": 5, "lemma": "of", "ner": "O", "Charend": "Cont": "of", " Charbegin ":", "POS": "In"},{"id": 6, "lemma": "China", "Relation": "Prep_of", "Parent": "4", "NER": "Location", "Charend" : "Cont": "China", "charbegin": +, "pos": "NNP"},{"id": 7, "lemma": ".", "ner": "O", "charend": +, "cont": ".", " Charbegin ":", "pos": "."}]
