Introduction to Apache Gora in Nutch 2.0


-----------------

1. What is Apache Gora?

Apache Gora is an open-source ORM framework that provides an in-memory data model and data persistence for big data. Gora currently supports storing data in column stores, key-value stores, document stores, and RDBMSs, and it supports analyzing that data with Apache Hadoop.

2. Why use Apache Gora?

Although there are many good ORM frameworks for relational databases, data-model-based frameworks such as JDO still fall short in some areas, for example the storage and persistence of column-oriented data. Gora fills this gap: it makes it easy to model big data in memory, persist it, and analyze it with Hadoop.

To put it bluntly, Gora is a big data representation and persistence framework, which has the following features:

  • Data persistence: persistence to column stores such as HBase, Cassandra, and Hypertable; key-value stores such as Voldemort and Redis; SQL databases such as MySQL and HSQLDB; and flat files in HDFS.
  • Data access: a simple Java API for reading and writing data.
  • Indexing: objects can be persisted into Lucene or Solr indexes and queried through the Gora API.
  • Analysis: data can be analyzed with Apache Pig, Hive, and Cascading.
  • MapReduce support: native support for Hadoop's MapReduce framework, already used in Nutch 2.0.

3. Gora source code structure

The Gora source code is organized into modules, with gora-core as the core module; all other modules depend on it, and you can add modules of your own. The current modules are as follows (a minimal configuration sketch showing how a backend module is selected follows the list):

  • gora-core: core module
  • gora-cassandra: Apache Cassandra module
  • gora-hbase: Apache HBase module
  • gora-sql: SQL database module
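
Picking which backend module to use is normally done in gora.properties. The snippet below is a minimal sketch assuming the HBase module is on the classpath; the property names follow the Gora tutorial and may differ between versions:

  # select the default DataStore implementation (HBase in this sketch)
  gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
  # create the backing schema (HBase table) automatically if it does not exist
  gora.datastore.autocreateschema=true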

4. A simple example

The following example is based on the gora-tutorial module, which is included in the Gora source distribution.
The example provides two functions: one loads the test data from the tutorial directory into HBase, and the other runs a MapReduce analysis over the data in HBase.

Next, let's look at the first feature:
Currently, Gora 0.2 only supports HBase 0.90. HBase can be downloaded from hbase.apache.org. After downloading, start a simple standalone instance, which runs the HBase master, a region server, and ZooKeeper in a single JVM process. The command is as follows:

  $ bin/start-hbase.sh

Run the following command to test that HBase is working:

  $ bin/hbase shell
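
Inside the shell, a quick sanity check is to list the existing tables; the session below is illustrative output from a fresh standalone install with no tables yet:

  hbase(main):001:0> list
  TABLE
  0 row(s) in 0.1000 seconds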

Next, unpack the test data included in the source code. The command is as follows:

  $ tar zxvf src/main/resources/access.log.tar.gz -C src/main/resources/

The data format is as follows:

  88.254.190.73 - - [10/Mar/2009:20:40:26 +0200] "GET / HTTP/1.1" 200 43 "http://www.buldinle.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; GTB5; .NET CLR 2.0.50727; InfoPath.2)"
  78.179.56.27 - - [11/Mar/2009:00:07:40 +0200] "GET /index.php?i=3&a=1__6x39kovbji8&k=3750105 HTTP/1.1" 200 43 "http://www.buldinle.com/index.php?i=3&a=1__6X39Kovbji8&k=3750105" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; OfficeLiveConnector.1.3; OfficeLivePatch.0.0)"
  78.163.99.14 - - [12/Mar/2009:18:18:25 +0200] "GET /index.php?a=3__x7l72c&k=4476881 HTTP/1.1" 200 43 "http://www.buldinle.com/index.php?a=3__x7l72c&k=4476881" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; InfoPath.1)"

Next, define the data model. Gora uses Apache Avro to define the object model of the data; this makes it easy to add persistence state and object persistence to the objects being tracked. Defining a data model is straightforward: it is written in JSON. The Pageview model used here (src/main/avro/pageview.json) looks like this:

   {  "type": "record",  "name": "Pageview",  "namespace": "org.apache.gora.tutorial.log.generated",  "fields" : [    {"name": "url", "type": "string"},    {"name": "timestamp", "type": "long"},    {"name": "ip", "type": "string"},    {"name": "httpMethod", "type": "string"},    {"name": "httpStatusCode", "type": "int"},    {"name": "responseSize", "type": "int"},    {"name": "referrer", "type": "string"},    {"name": "userAgent", "type": "string"}  ]}

We can see that the type is record, name is the name of the generated class, namespace is the Java package name, and fields lists the name and type of each field.

Compile the data model to automatically generate the corresponding code. The commands are as follows:

   $ bin/gora compile
   Usage: SpecificCompiler <schema file> <output dir>
   $ bin/gora compile gora-tutorial/src/main/avro/pageview.json gora-tutorial/src/main/java/

Compilation generates the file gora-tutorial/src/main/java/org/apache/gora/tutorial/log/generated/Pageview.java.

Gora's compiler extends Avro's SpecificCompiler, because the generated object model extends Gora's own Persistent interface, which defines methods for object persistence, object state tracking, and so on. Below is an excerpt from Pageview.java:

   public class Pageview extends PersistentBase {
     private Utf8 url;
     private long timestamp;
     private Utf8 ip;
     private Utf8 httpMethod;
     private int httpStatusCode;
     private int responseSize;
     private Utf8 referrer;
     private Utf8 userAgent;
     ...
     public static final Schema _SCHEMA = Schema.parse("{\"type\":\"record\", ... ");

     public static enum Field {
       URL(0, "url"),
       TIMESTAMP(1, "timestamp"),
       IP(2, "ip"),
       HTTP_METHOD(3, "httpMethod"),
       HTTP_STATUS_CODE(4, "httpStatusCode"),
       RESPONSE_SIZE(5, "responseSize"),
       REFERRER(6, "referrer"),
       USER_AGENT(7, "userAgent"),
       ;
       private int index;
       private String name;
       Field(int index, String name) { this.index = index; this.name = name; }
       public int getIndex() { return index; }
       public String getName() { return name; }
       public String toString() { return name; }
     };

     public static final String[] _ALL_FIELDS = {
       "url", "timestamp", "ip", "httpMethod",
       "httpStatusCode", "responseSize", "referrer", "userAgent",
     };
     ...
   }

We can see the field declarations. Note that Avro uses the Utf8 class to represent strings. We can also see the Avro schema declaration and the embedded Field enumeration.
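
To make the Utf8 point concrete, here is a small, hypothetical snippet that builds a Pageview by hand using the generated setters; it is not part of the tutorial code:

   import org.apache.avro.util.Utf8;
   import org.apache.gora.tutorial.log.generated.Pageview;

   // Hypothetical helper: constructs a Pageview record by hand.
   // String-typed fields from the Avro schema are wrapped in Avro's Utf8.
   public static Pageview samplePageview() {
     Pageview pageview = new Pageview();
     pageview.setUrl(new Utf8("/index.php?a=1"));        // string field -> Utf8
     pageview.setIp(new Utf8("127.0.0.1"));              // string field -> Utf8
     pageview.setTimestamp(System.currentTimeMillis());  // long field
     pageview.setHttpStatusCode(200);                    // int field
     return pageview;
   }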


Gora can work with different kinds of data stores, such as column stores (HBase, Cassandra), SQL databases, file formats such as JSON and XML, and key-value stores. The mapping between the data model and the data store is defined in an XML file. Each data store abstraction has its own mapping format; the mapping file declares how the fields of the Avro-defined class map onto the data store. Below is the HBase mapping file for the example above, gora-hbase-mapping.xml:

    <gora-orm>
      <table name="Pageview"> <!-- optional descriptors for tables -->
        <family name="common"/> <!-- This can also have params like compression, bloom filters -->
        <family name="http"/>
        <family name="misc"/>
      </table>
      <class name="org.apache.gora.tutorial.log.generated.Pageview" keyClass="java.lang.Long" table="AccessLog">
        <field name="url" family="common" qualifier="url"/>
        <field name="timestamp" family="common" qualifier="timestamp"/>
        <field name="ip" family="common" qualifier="ip"/>
        <field name="httpMethod" family="http" qualifier="httpMethod"/>
        <field name="httpStatusCode" family="http" qualifier="httpStatusCode"/>
        <field name="responseSize" family="http" qualifier="responseSize"/>
        <field name="referrer" family="misc" qualifier="referrer"/>
        <field name="userAgent" family="misc" qualifier="userAgent"/>
      </class>
      ...
    </gora-orm>

We can see that the root element of the mapping file is <gora-orm>. The HBase mapping has two kinds of elements: table and class. The table element is optional; it is generally used to declare table-level attributes such as compression or bloom filters. The class element defines the mapping between the Avro-defined class and the data store: the name attribute is the class name, keyClass is the K type of the <K, V> pair, and table is the name of the corresponding HBase table. Each field element has three attributes: name, the field name in the class; family, the HBase column family; and qualifier, the HBase column qualifier.

The following is an example of running logmanager.

  $ bin/gora logmanager

which lists the usage as:

  LogManager -parse <input_log_file>
             -get <lineNum>
             -query <lineNum>
             -query <startLineNum> <endLineNum>
             -delete <lineNum>
             -deleteByQuery <startLineNum> <endLineNum>

Run the following command to run the parsing program:

  $ bin/gora logmanager -parse gora-tutorial/src/main/resources/access.log

You can view the results in HBase with the following command:

  hbase(main):004:0> scan 'AccessLog', {LIMIT=>1}
  ROW                               COLUMN+CELL
   \x00\x00\x00\x00\x00\x00\x00\x00 column=common:ip, timestamp=1342791952462, value=88.240.129.183
   \x00\x00\x00\x00\x00\x00\x00\x00 column=common:timestamp, timestamp=1342791952462, value=\x00\x00\x01\x1F\xF1\xAElP
   \x00\x00\x00\x00\x00\x00\x00\x00 column=common:url, timestamp=1342791952462, value=/index.php?a=1__wwv40pdxdpo&k=218978
   \x00\x00\x00\x00\x00\x00\x00\x00 column=http:httpMethod, timestamp=1342791952462, value=GET
   \x00\x00\x00\x00\x00\x00\x00\x00 column=http:httpStatusCode, timestamp=1342791952462, value=\x00\x00\x00\xC8
   \x00\x00\x00\x00\x00\x00\x00\x00 column=http:responseSize, timestamp=1342791952462, value=\x00\x00\x00+
   \x00\x00\x00\x00\x00\x00\x00\x00 column=misc:referrer, timestamp=1342791952462, value=http://www.buldinle.com/index.php?a=1__WWV40pdxdpo&k=218978
   \x00\x00\x00\x00\x00\x00\x00\x00 column=misc:userAgent, timestamp=1342791952462, value=Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
  1 row(s) in 0.0180 seconds
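
Besides scanning the table in the HBase shell, you can also read a record back through the LogManager tool itself using the -get option from the usage listing above; the line number here is just an example:

  $ bin/gora logmanager -get 5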

Next, let's walk through the program. The source file is gora-tutorial/src/main/java/org/apache/gora/tutorial/log/LogManager.java.

  • Initialization

   public LogManager() {
     try {
       init();
     } catch (IOException ex) {
       throw new RuntimeException(ex);
     }
   }

   private void init() throws IOException {
     dataStore = DataStoreFactory.getDataStore(Long.class, Pageview.class);
   }

This initializes the DataStore using the static factory method of DataStoreFactory; the two parameters are the <K, V> types. DataStore is a key abstraction in Gora: it is the interface through which objects are persisted, and each backend module (such as HBase or SQL) provides its own implementation of it.

The following code parses the log file and builds Pageview objects.

    private void parse(String input) throws IOException, ParseException {
      BufferedReader reader = new BufferedReader(new FileReader(input));
      long lineCount = 0;
      try {
        String line = reader.readLine();
        do {
          Pageview pageview = parseLine(line);
          if (pageview != null) {
            // store the pageview
            storePageview(lineCount++, pageview);
          }
          line = reader.readLine();
        } while (line != null);
      } finally {
        reader.close();
      }
    }

    private Pageview parseLine(String line) throws ParseException {
      StringTokenizer matcher = new StringTokenizer(line);
      // parse the log line
      String ip = matcher.nextToken();
      ...
      // construct and return pageview object
      Pageview pageview = new Pageview();
      pageview.setIp(new Utf8(ip));
      pageview.setTimestamp(timestamp);
      ...
      return pageview;
    }

The last step is to store the objects in the backing data store. Don't forget to close the store when you are done.

    /** Stores the pageview object with the given key */
    private void storePageview(long key, Pageview pageview) throws IOException {
      dataStore.put(key, pageview);
    }

    private void close() throws IOException {
      // It is very important to close the datastore properly, otherwise
      // some data loss might occur.
      if (dataStore != null)
        dataStore.close();
    }

Retrieve data from the database

    /** Fetches a single pageview object and prints it */
    private void get(long key) throws IOException {
      Pageview pageview = dataStore.get(key);
      printPageview(pageview);
    }

Query object

    /** Queries and prints pageview objects that have keys between startKey and endKey */
    private void query(long startKey, long endKey) throws IOException {
      Query<Long, Pageview> query = dataStore.newQuery();
      // set the properties of query
      query.setStartKey(startKey);
      query.setEndKey(endKey);

      Result<Long, Pageview> result = query.execute();

      printResult(result);
    }
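
printResult is not shown in the excerpt above. As a rough sketch, and assuming Gora's Result exposes next(), getKey(), and get(), iterating over the query result could look like this:

    // Sketch only: walks the result set and prints each key/object pair.
    private void printResult(Result<Long, Pageview> result) throws IOException {
      while (result.next()) {               // advance to the next matching row
        long key = result.getKey();         // the row key (line number)
        Pageview pageview = result.get();   // the persistent object for that key
        System.out.println(key + " -> " + pageview);
      }
    }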

Delete object

    /** Deletes the pageview with the given line number */
    private void delete(long lineNum) throws Exception {
      dataStore.delete(lineNum);
      dataStore.flush(); // write changes may need to be flushed before
                         // they are committed
    }

    /** This method illustrates a delete by query call */
    private void deleteByQuery(long startKey, long endKey) throws IOException {
      // Constructs a query from the dataStore. The rows matching this query will be deleted
      Query<Long, Pageview> query = dataStore.newQuery();
      // set the properties of query
      query.setStartKey(startKey);
      query.setEndKey(endKey);

      dataStore.deleteByQuery(query);
    }

5. Reference

http://gora.apache.org/docs/current/quickstart.html

http://gora.apache.org/docs/current/tutorial.html
