Accessing HBase from Python


With Thrift, we can access HBase from Python.

 

About Thrift

Thrift is a software framework for scalable cross-language services development.

Its official website is: http://incubator.apache.org/thrift/

 

Downloading Thrift

    svn co http://svn.apache.org/repos/asf/incubator/thrift/trunk thrift

 

Installing Thrift (Linux)

    cd thrift
    ./bootstrap.sh
    ./configure
    make
    make install

 

Generating the HBase client code

    cd $HBASE_HOME/src/java/org/apache/hadoop/hbase/thrift
    thrift --gen py Hbase.thrift

Then copy the hbase folder from the generated gen-py folder to:

    /usr/lib/python2.5/site-packages/
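As a sketch of an alternative to copying into site-packages, the gen-py directory can be put on sys.path at run time, and a quick check can confirm that the generated package is importable. The helper names add_to_path and module_available are hypothetical, and the code assumes Python 3 rather than the Python 2.5 used in this article.

```python
import importlib
import importlib.util
import os
import sys


def add_to_path(directory):
    """Put a directory (for example the generated gen-py folder) on sys.path."""
    directory = os.path.abspath(directory)
    if directory not in sys.path:
        sys.path.insert(0, directory)
    # Make sure the import system notices the newly added directory
    importlib.invalidate_caches()
    return directory


def module_available(name):
    """Return True if a top-level module or package called `name` can be found."""
    return importlib.util.find_spec(name) is not None
```

For example, calling add_to_path("gen-py") and then module_available("hbase") should report whether the generated client package can be imported from that directory.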

 

Preparing HBase

First confirm that HBase is working properly, then start the HBase Thrift service:

    $HBASE_HOME/bin/hbase-daemon.sh start thrift
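Before writing the client, it can help to check that something is actually listening on the Thrift port. The sketch below assumes the service listens on port 9090 (the port used by the unit test later in this article) and uses a hypothetical helper name, thrift_port_open. It only verifies that a TCP connection can be made, not that the listener is really the HBase Thrift server.

```python
import socket


def thrift_port_open(host, port=9090, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection raises OSError on refusal or timeout
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A call such as thrift_port_open("192.168.1.103") returning False suggests the Thrift service has not started or is blocked by a firewall.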

 

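OK, that completes the preparation. Now we can start writing the Python client program.

Suppose we need a table to store web pages crawled from the Internet.

The table is named "webpages".

It uses the page URL with its domain components reversed as the row key (for example, www.a.com becomes com.a.www), and the column family "contents:" (note the trailing colon) to store the page content.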
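The row key scheme can be sketched in a few lines. HBase keeps rows sorted by row key, so reversing the domain components places pages from the same domain next to each other. This is a minimal Python 3 sketch with a hypothetical function name, reverse_url; it is not the exact method used in the complete code later on, and it does not handle an empty URL.

```python
def reverse_url(url):
    """Turn 'http://www.a.com/foo' into 'com.a.www/foo'."""
    # Drop the scheme ("http://"), if there is one
    rest = url.split("//", 1)[-1]
    # Split into host and path segments, ignoring empty segments
    parts = [p for p in rest.split("/") if p]
    # Reverse the dot-separated labels of the host name
    parts[0] = ".".join(reversed(parts[0].split(".")))
    return "/".join(parts)


urls = ["http://www.b.com/foo", "http://www.a.com", "http://mail.a.com"]
keys = sorted(reverse_url(u) for u in urls)
# keys == ['com.a.mail', 'com.a.www', 'com.b.www/foo']
```

After sorting, both a.com hosts are adjacent, which is the ordering a scan over the table would see.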

 

Import the required modules:

    from thrift import Thrift
    from thrift.transport import TSocket
    from thrift.transport import TTransport
    from thrift.protocol import TBinaryProtocol

    from hbase import Hbase
    from hbase.ttypes import ColumnDescriptor, Mutation, BatchMutation, NotFound

 

Open a connection to HBase:

    # netloc and port are the host and port of the HBase Thrift server
    transport = TTransport.TBufferedTransport(
        TSocket.TSocket(netloc, port))
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    client = Hbase.Client(protocol)
    transport.open()
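The transport should be closed again when the work is done, even if an error occurs in between. One possible way to arrange this, sketched here under the assumption that the transport object exposes open() and close() as the Thrift transports do, is a small context manager. The name opened is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def opened(transport):
    """Open a transport-like object and always close it afterwards."""
    transport.open()
    try:
        yield transport
    finally:
        # Runs on normal exit and when the body raises
        transport.close()


# Intended use with the objects created above (not executed here):
# with opened(transport):
#     print(client.getTableNames())
```

With this wrapper the explicit transport.open() call is replaced by the with statement, and no separate cleanup code is needed.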

Create the table:

    # Keep only one version and use BLOCK compression
    # For other parameters, see the HBase API
    contents = ColumnDescriptor(name="contents:", maxVersions=1, compression="BLOCK")
    client.createTable("webpages", [contents])

Write data:

    def write(url, content):
        # reverseUrl is the URL-reversing helper (a method of HbaseWriter in the complete code below)
        row = reverseUrl(url)
        mutations = [Mutation(column="contents:", value=content)]
        client.mutateRow("webpages", row, mutations)

 

 

The complete code and unit tests are as follows:

    from unittest import TestCase, main
    from thrift import Thrift
    from thrift.transport import TSocket
    from thrift.transport import TTransport
    from thrift.protocol import TBinaryProtocol

    from hbase import Hbase
    from hbase.ttypes import ColumnDescriptor, Mutation, BatchMutation, NotFound


    class HbaseWriter:

        def __init__(self, netloc, port, table="webpages"):
            self.tableName = table

            self.transport = TTransport.TBufferedTransport(
                TSocket.TSocket(netloc, port))
            self.protocol = TBinaryProtocol.TBinaryProtocol(self.transport)
            self.client = Hbase.Client(self.protocol)
            self.transport.open()

            # Create the table on first use
            tables = self.client.getTableNames()
            if self.tableName not in tables:
                self.__createTable()

        def __del__(self):
            self.transport.close()

        def __createTable(self):
            self.client.createTable(self.tableName,
                [ColumnDescriptor(name="contents:", maxVersions=1, compression="BLOCK")])

        def reverseUrl(self, url):
            # List comprehensions instead of filter() so indexing works on any Python version
            link = [part for part in url.split("//") if part][-1]
            hops = [part for part in link.split("/") if part]
            domain = hops[0].split(".")
            domain.reverse()
            domain = '.'.join(domain)
            hops[0] = domain
            return '/'.join(hops)

        def write(self, url, content):
            row = self.reverseUrl(url)
            mutations = [Mutation(column="contents:", value=content)]
            self.client.mutateRow(self.tableName, row, mutations)


    class TestHbaseWriter(TestCase):

        def setUp(self):
            self.writer = HbaseWriter("192.168.1.103", 9090, "test")

        def tearDown(self):
            # A table must be disabled before it can be deleted
            name = self.writer.tableName
            client = self.writer.client
            client.disableTable(name)
            client.deleteTable(name)

        def testReverseUrl(self):
            self.assertEqual(self.writer.reverseUrl("http://www.a.com"), "com.a.www")
            self.assertEqual(self.writer.reverseUrl("http://www.a.com/"), "com.a.www")
            self.assertEqual(self.writer.reverseUrl("http://a.com"), "com.a")
            self.assertEqual(self.writer.reverseUrl("http://www.b.com/foo"), "com.b.www/foo")
            self.assertEqual(self.writer.reverseUrl("aaa.bbb.ccc.com.cn/foo1/foo2"), "cn.com.ccc.bbb.aaa/foo1/foo2")

        def testCreate(self):
            tableName = self.writer.tableName
            client = self.writer.client
            self.assertTrue(self.writer.tableName in client.getTableNames())
            columns = dict()
            columns["contents"] = ColumnDescriptor(name="contents", maxVersions=1, compression="BLOCK")
            cds = client.getColumnDescriptors(tableName)
            for name, column in cds.items():
                self.assertTrue(column.name in columns)

        def testWrite(self):
            tableName = self.writer.tableName
            client = self.writer.client
            data = {"http://www.a.com": "com.a.www",
                    "http://www.a.com/bbb": "com.a.www/bbb",
                    "http://www.foo.com/foo": "foo"}
            for url, content in data.items():
                self.writer.write(url, content)

            # Scan the whole table and compare each row with what was written
            scannerId = client.scannerOpen(tableName, "", ["contents:"])
            while True:
                try:
                    result = client.scannerGet(scannerId)
                except NotFound:
                    break
                row = result.row
                contents = result.columns["contents:"].value
                url = "http://" + self.writer.reverseUrl(row)
                self.assertTrue(url in data)
                self.assertEqual(data[url], contents)
            client.scannerClose(scannerId)


    if __name__ == "__main__":
        main()


 
