Storing RDF data into HBase?


What would be the best way for storing and querying RDF data (triples or quads) using HBase?

Update:

I've just found this:

  • http://cs.utdallas.edu/semanticweb/HBase-Extension/

Unfortunately, the .tar.gz has been removed from the utdallas.edu website.

Tags: rdf, triplestore, hbase, query, storage
Edited Dec 14 at 16:37 · Asked Apr 12 at 20:19 · castagna
Storing is easy; isn't the complex part the querying, especially joins? – Ian Davis Apr 12 at 22:08
Changed the question to: "for storing and querying", but storing would be the first step and it would have the advantage of being able to easily use MapReduce over it. For querying I would look at Pig and how to use HBase as an input/output source/destination for Pig Latin scripts (see: semanticoverflow.com/questions/715/...). – castagna Apr 15 at 6:27

3 Answers


I can't comment specifically on HBase, but I have implemented RDF storage for Cassandra, which has a very similar Bigtable-derived data model.

You basically have two options for storing RDF data in wide-column databases like HBase and Cassandra: the resource-centric approach and the statement-centric approach.

In the statement-oriented approach, each RDF statement corresponds to a row key (for instance, a UUID) and contains subject, predicate and object columns. In Cassandra, each of these would be supercolumns that would then contain subcolumns such as type and value, to differentiate between RDF literals, blank nodes and URIs. If you needed to support named graphs, each row could also have a context column that would contain a list of the named graphs that the statement is part of.
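A minimal sketch of this statement-centric layout, with plain Python dicts standing in for rows and supercolumns (all function and column names here are hypothetical, not from any real adapter):

```python
import uuid

# One row per RDF statement; the row key is an arbitrary UUID.
# "subject"/"predicate"/"object" stand in for supercolumns, each with
# "type" and "value" subcolumns to distinguish URIs, blank nodes and literals.
def statement_row(s, p, o, graphs=()):
    row_key = str(uuid.uuid4())
    row = {
        "subject":   {"type": "uri", "value": s},
        "predicate": {"type": "uri", "value": p},
        "object":    o,                 # e.g. {"type": "literal", "value": "..."}
        "context":   list(graphs),     # named graphs this statement belongs to
    }
    return row_key, row

key, row = statement_row(
    "http://example.org/alice",
    "http://xmlns.com/foaf/0.1/name",
    {"type": "literal", "value": "Alice"},
    graphs=["http://example.org/g1"],
)
```

Because the row key is random, two inserts of the same triple produce two distinct rows, which is exactly the duplicate problem discussed below.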

The above is a relatively simple mapping to implement but suffers from some problems, notably the fact that preventing the creation of duplicate statements (an important RDF semantic) means having to do a read before doing a write, which at least in Cassandra quickly becomes a performance bottleneck as writes are much faster than reads.

There are ways to work around this problem, in particular by using content-addressable statement identifiers (e.g. the SHA-1 of the canonicalized N-Triples representation of each statement) as the row keys, but this in turn introduces other trade-offs, such as no longer being able to version statement data: every time statement data changes, the old row needs to be deleted and a new one inserted with the new row key.
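The content-addressable keying can be sketched like this (a simplification: the object term is assumed to arrive already serialized, and real N-Triples canonicalization is more involved):

```python
import hashlib

# Derive a content-addressable row key from the N-Triples serialization
# of a statement, so identical statements always map to the same row
# and re-inserting a duplicate simply overwrites it in place.
def statement_key(s, p, o):
    ntriple = "<%s> <%s> %s ." % (s, p, o)  # o is already serialized
    return hashlib.sha1(ntriple.encode("utf-8")).hexdigest()

k1 = statement_key("http://example.org/alice",
                   "http://xmlns.com/foaf/0.1/name", '"Alice"')
k2 = statement_key("http://example.org/alice",
                   "http://xmlns.com/foaf/0.1/name", '"Alice"')
assert k1 == k2  # same statement, same row key: writes are idempotent
```

This removes the read-before-write, at the cost described above: changing any part of the statement changes its key.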

In view of the previous considerations, the resource-oriented approach is generally a more natural fit for storing RDF data in wide-column databases. In this approach, each RDF subject/resource corresponds to a row key, and each RDF predicate/property corresponds to a column or supercolumn. Keyspaces can be used to represent RDF repositories, and column families can be used to represent named graphs.
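A rough sketch of this resource-centric layout, with nested dicts standing in for keyspace → column family → row (the URIs and nesting are illustrative only):

```python
# Keyspace ~ RDF repository, column family ~ named graph,
# row key ~ subject, (super)column ~ predicate, values ~ object terms.
repository = {                        # keyspace
    "http://example.org/g1": {        # column family (named graph)
        "http://example.org/alice": {             # row key = subject
            "http://xmlns.com/foaf/0.1/name": ['"Alice"'],
            "http://xmlns.com/foaf/0.1/knows": [
                "<http://example.org/bob>",
                "<http://example.org/carol>",
            ],
        },
    },
}

# Fetching every value of a predicate for a subject is a single row read:
names = repository["http://example.org/g1"]["http://example.org/alice"][
    "http://xmlns.com/foaf/0.1/name"]
```

Note that multiple object values for one predicate fall out naturally here, since a predicate column simply holds a list of terms.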

The main trade-off with the resource-based approach is that some statement-oriented operations become more complex and/or slower, notably counting the total number of statements or querying for predicate or object terms without specifying a subject. To support basic graph pattern matching, additional POS, OPS, etc. indices may need to be created and maintained.
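One way to sketch such auxiliary indices, with in-memory dicts standing in for the extra column families a real store would maintain (the index names follow the term orderings mentioned above; everything else is hypothetical):

```python
from collections import defaultdict

# Keep POS and OPS permutations alongside the primary SPO layout so
# patterns like (?s, p, o) or (?s, ?p, o) don't require a full scan.
spo = defaultdict(lambda: defaultdict(set))
pos = defaultdict(lambda: defaultdict(set))
ops = defaultdict(lambda: defaultdict(set))

def insert(s, p, o):
    spo[s][p].add(o)
    pos[p][o].add(s)   # answers (?s, p, o): who has this predicate/object?
    ops[o][p].add(s)   # answers (?s, ?p, o): who points at this object at all?

insert("ex:alice", "foaf:knows", "ex:bob")
insert("ex:carol", "foaf:knows", "ex:bob")

# Basic graph pattern (?s, foaf:knows, ex:bob):
subjects = pos["foaf:knows"]["ex:bob"]
```

The cost is the usual one: every write now touches several indices, and they must be kept consistent with the primary rows.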

See RDF::Cassandra, my Cassandra storage adapter for RDF.rb, for a more detailed example of a resource-centric mapping from the RDF data model to a wide-column data model.

Edited Apr 25 at 19:21 · Answered Apr 25 at 19:13 · Arto Bendiken
Thank you for your detailed answer. In a "resource-oriented" approach, what do you do when you have multiple objects with the same property? – castagna Apr 25 at 20:20
I take it you mean multiple object values for a given predicate on a particular subject/resource? With Cassandra this is relatively easy, since predicates are represented by supercolumns, and each predicate value can be stored in a subcolumn of the supercolumn. With HBase, I suspect you'd have to look at making use of the column's timestamp value (which can be any arbitrary integer, apparently) to differentiate between multiple object term values. This means that the resource-centric storage approach may be less appropriate for HBase than it is for Cassandra... – Arto Bendiken Apr 25 at 21:43
Are you using this just for storage of RDF, or are you actively querying RDF with this? My concern with Cassandra is the need to create my own indexes if I then want to be able to do real querying over the data. – Rob Vesse Oct 7 at 13:04


Given that HBase is designed for row-based storage of data, the simplest thing would be to have a GSPO layout, i.e.

Graph | Subject | Predicate | Object
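A GSPO layout like the one above can be sketched as a single composite HBase row key (the separator choice is a hypothetical simplification; real term serializations would need proper escaping):

```python
# Concatenate the four terms so rows sort by graph, then subject, then
# predicate, then object; HBase keeps rows sorted by key, so graph- and
# subject-scoped reads become contiguous prefix scans.
SEP = "\x00"  # assumed not to occur inside any serialized term

def gspo_key(g, s, p, o):
    return SEP.join((g, s, p, o))

key = gspo_key("ex:g1", "ex:alice", "foaf:name", '"Alice"')

# A prefix scan on "ex:g1\x00ex:alice\x00" would return every statement
# about ex:alice in graph ex:g1.
assert key.startswith("ex:g1" + SEP + "ex:alice" + SEP)
```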

In terms of actually reading and writing data, you'll most likely need to look at whatever API you want to use to manipulate the RDF and implement the necessary interfaces/classes that allow your chosen API to get data in and out of HBase.

As for querying, you'll need to either implement a SPARQL engine for your database, which could be rather complex, or use an API that does SPARQL in-memory, but this will have the overhead of needing to read some/all of the data out of HBase first.

Personally I don't know enough about HBase to say whether such an approach is viable or sensible; if you have hardware on which you can install and run HBase, then you would most likely be better off installing and running a triple store for your RDF data.

Answered Apr 13 at 9:41 · Rob Vesse


Hey Arto or Vesse, why not use Cassandra/HBase as a secondary storage mechanism and the Jena GraphMemFaster as a write-through cache? In the Cassandra case you have to use twice as much memory, but it would solve your performance problems. Alternatively you could implement a hexastore in Cassandra if you can afford the memory costs.

Edited Aug 26 at 7:23 · Answered Aug 26 at 7:08 · Alexi
Such solutions don't scale to datasets larger than memory, nor beyond a single node, negating much of the point of using Cassandra or HBase. And, as you say, it would use at least twice the memory even for use cases where you could fit it all on a single node. As I outlined above, the resource-centric approach is simpler in several ways, at least for Cassandra. – Arto Bendiken Aug 27 at 0:51
You're making the assumption here that everyone uses Jena. The main issue with wide-table NoSQL stores is that you have to do value indexing yourself, which means they may be very good for just storing/reading RDF, but they aren't so good for querying. – Rob Vesse Oct 7 at 13:09
