Going through old notes over the Qingming holiday, I found some scattered NoSQL material, notes from reading "Big Data Glossary". I've tidied them up a little and am recording them here.
Horizontal or Vertical Scaling
A database can scale in two directions:
- Vertical scaling: move to a beefier machine
- Horizontal scaling: add more machines of the same kind
If you choose horizontal scaling, one problem is unavoidable: how do you decide which machine a piece of data lives on? In other words, the sharding strategy.
Sharding
To spread data fairly evenly across the nodes, you can shard by key suffix or by a modulo operation, but the moment you add a machine, a large-scale reshuffle of the data is required.
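To see why the naive modulo scheme hurts, here is a small illustrative Python sketch: going from four machines to five remaps the vast majority of keys.

    def shard(key, n_nodes):
        # naive placement: hash the key and take the remainder
        # (real systems use a stable hash; Python's hash() is randomized per run)
        return hash(key) % n_nodes

    keys = ["user:%d" % i for i in range(10000)]
    before = {k: shard(k, 4) for k in keys}
    after = {k: shard(k, 5) for k in keys}  # one machine added
    moved = sum(1 for k in keys if before[k] != after[k])
    print("%.0f%% of keys changed nodes" % (100.0 * moved / len(keys)))  # roughly 80%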
To take the pain out of redistribution, you need more sophisticated schemes for splitting the data.
Some rely on a central directory that records which node holds each key. When a shard grows too large, this level of indirection lets data be moved between machines; the cost is that every operation has to consult the central directory first. Because the directory information is usually tiny and mostly static, it is generally held in memory and changes only occasionally.
Another scheme is consistent hashing. This technique uses a small table that divides the space of possible hash values into ranges, with one shard responsible for each range.
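A minimal consistent-hash ring in Python, to make the idea concrete. Real systems add virtual nodes so load stays even when a node joins or leaves; this sketch omits that.

    import bisect
    import hashlib

    class HashRing:
        # minimal consistent-hash ring; node and key names are made up
        def __init__(self, nodes):
            self.ring = sorted((self._hash(n), n) for n in nodes)
            self.hashes = [h for h, _ in self.ring]

        @staticmethod
        def _hash(key):
            return int(hashlib.md5(key.encode()).hexdigest(), 16)

        def node_for(self, key):
            # walk clockwise to the first node hash >= the key hash, wrapping around
            i = bisect.bisect(self.hashes, self._hash(key)) % len(self.ring)
            return self.ring[i][1]

    ring = HashRing(["node-a", "node-b", "node-c"])
    print(ring.node_for("user:42"))

When a node is added or removed, only the keys in the affected range move, instead of almost everything as in the modulo scheme.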
What the sharding model means for us: big data processing is built on this horizontally scaled model, and distributed processing of huge datasets forces compromises. Writing distributed data handling code is tricky and involves tradeoffs between speed, scalability, fault tolerance, and traditional database goals like atomicity and consistency. Beyond that, the way data gets used changes too: the data is no longer guaranteed to sit on one physical machine, so fetching it and computing over it become problems in their own right.
NoSQL
Does NoSQL really have no schema? In theory, each record could contain a completely different set of named values, though in practice, the application layer often relies on an informal schema, with the client code expecting certain named values to be present. Traditional key/value caches offer no way to query on complex conditions; NoSQL strengthens the pure key/value model by shifting responsibility for these common operations from the application developer to the database.
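A toy illustration of that "informal schema": two records in the same collection carry different fields, and it is the client code that quietly assumes certain names are present.

    # no enforced schema: each record can carry a different set of named values
    records = [
        {"name": "alice", "email": "alice@example.com"},
        {"name": "bob", "tags": ["beta"], "signup_source": "mobile"},
    ]
    for r in records:
        print(r["name"])            # the informal schema: client code expects "name"
        print(r.get("email", "-"))  # optional fields need defensive access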
Hadoop is the best-known public system for running MapReduce algorithms, but many modern databases, such as MongoDB, also support it as an option. It’s worthwhile even in a fairly traditional system, since if you can write your query in a MapReduce form, you’ll be able to run it efficiently on as many machines as you have available.
MongoDB
Features: JSON-like document structure; JavaScript for queries.
Strengths: backed by a commercial company; supports automatic sharding and MapReduce operations.
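A minimal pymongo sketch (the database and collection names are made up) showing the schema-free document model and a query that the database, not the application code, evaluates:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    db = client.demo  # hypothetical database name

    # insert a JSON-like document; no schema has to be declared first
    db.users.insert_one({"name": "alice", "visits": 3, "tags": ["beta"]})

    # a conditional query evaluated by the database, beyond plain key/value GET
    for doc in db.users.find({"visits": {"$gt": 1}}):
        print(doc["name"])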
CouchDB
Features: queries are written as JavaScript MapReduce functions.
Uses a multi-version concurrency control strategy (clients have to resolve write conflicts themselves, and periodic compaction is needed to garbage-collect old versions of the data).
Drawbacks: no built-in horizontal scaling solution, though external ones exist.
Cassandra
Cassandra grew out of an internal Facebook project into a standard distributed database option. It is a complex system, but worth the time to learn for the power and flexibility it delivers. Traditionally, it was a long struggle just to set up a working cluster, but as the project matures, that has become a lot easier.
It uses consistent hashing to solve the sharding problem, and its data structures are optimized for consistent write performance, at the cost of occasionally slow reads.
Features: you specify how many nodes must agree before a read or write counts as successful; tuning this consistency level is a trade between consistency and speed.
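With the DataStax Python driver, that per-request trade-off looks roughly like this (the keyspace and table are hypothetical):

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(["127.0.0.1"]).connect("demo_keyspace")  # hypothetical keyspace

    # QUORUM: a majority of replicas must agree before the read succeeds;
    # drop to ONE for speed, raise to ALL for the strongest consistency
    stmt = SimpleStatement("SELECT * FROM users WHERE id = %s",
                           consistency_level=ConsistencyLevel.QUORUM)
    rows = session.execute(stmt, (42,))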
Redis
Two features make Redis stand out: it keeps the entire database in RAM, and its values can be complex data structures.
Strengths: the ability to handle complex data structures (a client sketch follows the HBase entry below). You can handle large datasets by clustering multiple machines, but for now sharding has to be implemented on the client side.
BigTable
BigTable is only available to developers outside Google as the foundation of the App Engine datastore. Despite that, as one of the pioneering alternative databases, it's worth looking at.
HBase
HBase was designed as an open source clone of Google's BigTable, so unsurprisingly it has a very similar interface, and it relies on a clone of the Google File System called HDFS.
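To make Redis's headline feature concrete, a small redis-py sketch with its richer value types (the key names are made up):

    import redis

    r = redis.Redis(host="localhost", port=6379)

    # a list as a value: most recent pages viewed by a user
    r.lpush("recent:user:42", "page_a", "page_b")

    # a sorted set as a value: a leaderboard kept ordered by score
    r.zadd("leaderboard", {"alice": 3200, "bob": 2100})
    print(r.zrevrange("leaderboard", 0, 9, withscores=True))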
Hypertable
Hypertable is another open source clone of BigTable.
Voldemort
An open source clone of Amazon’s Dynamo database created by LinkedIn, Voldemort has a classic three-operation key/value interface, but with a sophisticated backend architecture to handle running on large distributed clusters.
It uses consistent hashing to allow fast lookups of the storage locations for particular keys, and it has versioning control to handle inconsistent values.
Riak
Riak was inspired by Amazon’s Dynamo database, and it offers a key/value interface and is designed to run on large distributed clusters.
It also uses consistent hashing and a gossip protocol to avoid the need for the kind of centralized index server that BigTable requires, along with versioning to handle update conflicts. Querying is handled using MapReduce functions written in either
Erlang or JavaScript. It’s open source under an Apache license, but there’s also a closed source commercial version with some special features designed for enterprise customers.
ZooKeeper
The ZooKeeper framework was originally built at Yahoo! to make it easy for the company's applications to access configuration information in a robust and easy-to-understand way, but it has since grown to offer a lot of features that help coordinate work across distributed clusters. One way to think of it is as a very specialized key/value store, with an interface that looks a lot like a filesystem and supports operations like watching callbacks, write consensus, and transaction IDs that are often needed for coordinating distributed algorithms.
This has allowed it to act as a foundation layer for services like LinkedIn’s Norbert, a flexible framework for managing clusters of machines. ZooKeeper itself is built to run in a distributed way across a number of machines, and it’s designed to offer very fast reads, at the expense of writes that get slower the more servers are used to host the service.
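Through the kazoo Python client, the filesystem-like interface and watch callbacks look roughly like this (the znode path and value are made up):

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()

    zk.ensure_path("/app/config")           # znodes form a filesystem-like tree
    zk.set("/app/config", b"max_workers=8")

    @zk.DataWatch("/app/config")
    def on_change(data, stat):
        # invoked whenever the node changes; stat.version counts updates to it
        print("config now:", data, "version:", stat.version)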
Storage
S3
Amazon's S3 service lets you store large chunks of data on an online service, with an interface that makes it easy to retrieve the data over the standard web protocol, HTTP. One way of looking at it is as a file system that's missing some features like appending, rewriting or renaming files, and true directory trees. You can also see it as a key/value database available as a web service and optimized for storing large amounts of data in each value.
http://www.ibm.com/developerworks/cn/java/j-s3/
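Viewed as that key/value store, the basic operations with boto3 come down to whole-object puts and gets (the bucket and key names are made up):

    import boto3

    s3 = boto3.client("s3")

    # whole-object write: no appending or in-place rewriting, just PUT the value
    s3.put_object(Bucket="my-bucket", Key="logs/2012-08-18.json", Body=b'{"events": []}')

    # whole-object read back by key
    body = s3.get_object(Bucket="my-bucket", Key="logs/2012-08-18.json")["Body"].read()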
HDFS
An introductory article on HDFS: http://baike.baidu.com/view/3061630.htm
NoSQLfan's material on HDFS: http://blog.nosqlfan.com/tags/hdfs
Computing on Big Data
Getting the concise, valuable information you want from a sea of data can be challenging, but there's been a lot of progress around systems that help you turn your datasets into something that makes sense. Because there are so many different barriers, the tools range from rapid statistical analysis systems to enlisting human helpers: R, Yahoo! Pipes, Mechanical Turk, Solr/Lucene, Elasticsearch, BigSheets, Tinkerpop.
NLP
Natural language processing (NLP) is a subset of data processing that's so crucial, it earned its own section. Its focus is taking messy, human-created text and extracting meaningful information. As you can imagine, this chaotic problem domain has spawned a large variety of approaches, with each tool most useful for particular kinds of text. There's no magic bullet that will understand written information as well as a human, but if you're prepared to adapt your use of the results to handle some errors and don't expect miracles, you can pull out some powerful insights.
- Natural Language Toolkit
- OpenNLP
- Boilerpipe
- OpenCalais
MapReduce
The approach pioneered by Google, and adopted by many other web companies, is to instead create a pipeline that reads and writes to arbitrary file formats, with intermediate results being passed between stages as files, and with the computation spread across many machines.
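The shape of such a pipeline, as a self-contained word-count sketch in plain Python; a real framework would run the map and reduce stages on many machines and pass the intermediate pairs between them as files:

    from collections import defaultdict

    def mapper(line):
        # map stage: emit (word, 1) for every word in an input line
        for word in line.split():
            yield word.lower(), 1

    def reducer(word, counts):
        # reduce stage: combine all values emitted for one key
        return word, sum(counts)

    def map_reduce(lines):
        groups = defaultdict(list)
        for line in lines:                   # the "shuffle" between stages:
            for key, value in mapper(line):  # group intermediate pairs by key
                groups[key].append(value)
        return [reducer(k, v) for k, v in sorted(groups.items())]

    print(map_reduce(["big data", "big deal"]))  # [('big', 2), ('data', 1), ('deal', 1)]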
Hadoop, Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka, Azkaban, Oozie, Greenplum
Machine Learning
WEKA
WEKA is a Java-based framework and GUI for machine learning algorithms. It provides a plug-in architecture for researchers to add their own techniques, with a command-line and window interface that makes it easy to apply them to your own data.
Mahout
Mahout is an open source framework that can run common machine learning algorithms on massive datasets.
scikits.learn
It's hard to find good off-the-shelf tools for practical machine learning, but scikits.learn is one: a beautifully documented and easy-to-use Python package offering a high-level interface to many standard machine learning techniques. This makes it a very fruitful sandbox for experimentation and rapid prototyping, with a very easy path to using the same code in production once it's working well.
The book on Amazon.cn: http://www.amazon.cn/Big-Data-Glossary-Warden-Pete/dp/1449314597/qid=1333609610&sr=8-1#
Update, 2012-08-18: below is my reply to a colleague's email, answering a few common questions about NoSQL.
Over the weekend I went through the NoSQL material I had on hand and tried to answer a few questions people have been asking. Since each point could be expanded at length, I give a short answer first, followed by a more detailed write-up.
Short answers:
- Q: Why NoSQL? If we already have relational databases, why use NoSQL?
A: When storing many-to-many relationships, the data balloons very quickly; with a relational database you end up splitting databases and tables just to cope with storage. If you then want to do further computation over those relationships, it becomes a hugely expensive job; in NoSQL you can use a MapReduce approach for it.
- Q: We already use Redis; why introduce a MongoDB solution as well?
A: Redis is essentially an in-memory key/data-structure database; it doesn't support queries on complex conditions, and it doesn't support MapReduce. Moreover, Redis is positioned as an in-memory database, so if we put all user behavior data in memory we have an obvious problem: paying for cold data. Memory is cheap these days, but it is still better to keep hot data in memory and cold data on disk.
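The hot/cold split above is the classic cache-aside pattern. A sketch under assumed names (the collection, key prefix, and TTL are all made up), with Redis in front of MongoDB:

    import json

    import redis
    from pymongo import MongoClient

    r = redis.Redis()
    db = MongoClient().demo  # hypothetical database

    def get_user(user_id, ttl=300):
        # hot path: serve from the in-memory cache when possible
        cached = r.get("user:%s" % user_id)
        if cached is not None:
            return json.loads(cached)
        # cold path: read from the on-disk store, then warm the cache
        doc = db.users.find_one({"_id": user_id}, {"_id": 0})
        if doc is not None:
            r.set("user:%s" % user_id, json.dumps(doc), ex=ttl)
        return doc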
- Q: MongoDB is said to have amazing performance; how does it manage that?
A: That impression usually comes from comparisons against relational databases, and the fundamental reason is that the two rest on different theory. A relational database has to support complex SQL, a strict relational model, and ACID-level strong transactions. NoSQL takes a knife to all of that: SQL support is merely optional, strong transactions are dropped, and the design is guided by the CAP theorem for distributed systems. Shedding the complex machinery and strong transactional guarantees buys a very considerable performance boost. You could say it's an unfair race: a swimmer in a shark-skin racing suit competing against one in a cotton coat.
- Q: Lizhi is agonizing over sharding; how should we apply it in our project?
A: Sharding greatly increases system complexity, and I don't recommend sharding at launch. If you want read/write splitting, use a master-slave replication cluster; if you just want data redundancy to avoid a single point of failure, configure a replica set; if you merely want to take pressure off MongoDB, put an in-memory cache in front of it.
Signals that it's time to introduce sharding:
(1) the machines' disks are running out of space; (2) a single node can no longer keep up with the write load.
MongoDB's automatic sharding is, for now, a "looks great on paper" feature; I strongly advise against using it in production.
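For the replica-set option, the client side is mostly a connection-string concern. A pymongo sketch with made-up host and set names, sending reads to secondaries to spread load:

    from pymongo import MongoClient

    # hypothetical three-member replica set: writes always go to the primary,
    # reads may be served by secondaries under this read preference
    client = MongoClient(
        "mongodb://db1:27017,db2:27017,db3:27017/?replicaSet=rs0",
        readPreference="secondaryPreferred",
    )
    print(client.demo.users.count_documents({}))  # hypothetical collection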
- Q: How do you think Pengfu's architecture should be designed?
A: Pengfu is going the UGC route and will keep strengthening the relationships between users. Storing and computing over all those complex relationships is a poor fit for Redis and SQL Server. So my suggestion: at the bottom, a relational database such as SQL Server to store the raw data; in the middle, MongoDB to store relationship data and denormalized resource data; on top, Redis as an in-memory cache.
Detailed analysis:
The relational databases we use every day actually consist of two important parts: SQL, built on set theory, and the relational data model. Their characteristics:
- Supporting SQL requires a complex system, even if you only ever use the simplest features. The cost is like buying bandwidth: most of the time you use far less than you paid for, but you still have to provision for the peak. Even if we only ever access data by primary key, the relational database still has to carry the machinery for full SQL support.
- The relational data model is very strict. With OOP prevailing, this strictness does offer one convenience: developers can map business entities directly onto database tables.
- When a relational database hits the capacity ceiling of a single machine, scaling out is very hard; you usually end up splitting tables by primary key. Note that once you split tables, you have already started violating the relational normal forms, because "data belonging to one set has been split across multiple tables".
- Relational databases generally support strong ACID transactions. A, Atomicity: either the whole transaction executes or none of it does. C, Consistency: the data stays consistent throughout the transaction. I, Isolation: two transactions do not affect each other. D, Durability: once a transaction completes, its results are persisted to disk.
The turning point from relational databases to NoSQL is precisely the third point: once data is stored in a distributed fashion, the relational database gradually degenerates into a lookup system keyed by primary key, and that is a trait almost all NoSQL products share. To summarize what most NoSQL products have in common:
- SQL support is no longer mandatory; a simple key-value access model takes its place.
- They aggressively subtract features from the relational model, for instance dropping transactions. NoSQL products care about performance far more than about ACID, and often provide only row-level atomic operations, i.e. operations on the same key execute serially, which guarantees the data cannot be corrupted. This strategy covers most scenarios, and crucially it brings a very considerable gain in execution efficiency.
- NoSQL products are deliberately restrained in design, resisting the temptation to pile on new features and drift back down the relational database road.
NoSQL product design is grounded in the CAP theorem for distributed systems:
- C, Consistency: at any moment, are the data replicas on all nodes of the distributed system identical?
- A, Availability: when one node in the cluster fails, can the system still serve requests normally?
- P, Partition tolerance: when some node in the cluster becomes unreachable, can the system still serve requests normally?
A distributed system can support at most two of the three. Because of network latency and similar realities, P is effectively mandatory, so the real choice is between consistency and availability. Clearly, to keep all nodes consistent you can only declare an operation successful after confirming that every node agrees, and then as soon as one node goes down, availability is lost.
NoSQL products fall roughly into the following categories:
- Key-value: the value can be of any type, e.g. Voldemort
- Key-data structure: the value can be a richer data structure, e.g. Redis
- Key-document: the value is a document, usually stored in a JSON-like structure, e.g. MongoDB, CouchDB
What problem does sharding solve?
Sharding is the technique of spreading the data, and the read/write requests against it, across multiple machines (nodes).
The same piece of data is stored on more than one node, so redundancy is built in; what a sharding strategy strives for is to minimize the cost of refilling and migrating data when nodes are added or removed.
Common sharding strategies: (1) consistent hashing; (2) when you want explicit control over data placement, a control module plus a routing table.
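A toy version of option (2), the routing table that gives explicit control over placement (the node names and key ranges are invented):

    import bisect

    # routing table: sorted upper bounds of key ranges -> owning node;
    # the control module edits this table to split or move a hot range
    BOUNDS = ["g", "p", "~"]
    NODES = ["node-a", "node-b", "node-c"]

    def route(key):
        return NODES[bisect.bisect(BOUNDS, key[0])]

    print(route("alice"), route("harry"), route("zoe"))  # node-a node-b node-c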
Sharding greatly increases system complexity; while data volumes are moderate, adding an in-memory cache layer or simple read/write splitting is enough to cope.