While going through my old notes over the holiday, I found some NoSQL-related material: my reading notes on the book Big Data Glossary.
Horizontal or vertical scaling
There are two directions for database scaling:
- Vertical scaling: move to a better machine
- Horizontal scaling: add more machines of the same kind
How do we decide which machine a piece of data lives on? That is the sharding policy.
Sharding
Data is distributed roughly evenly across the nodes. A simple policy is to use the trailing digits of a key, or to take the remainder after dividing by the number of machines. However, once you add machines, data has to be reshuffled on a large scale.
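A tiny sketch of the remainder (modulo) policy, and of why adding a machine forces most keys to move; the key format and node counts are invented for illustration:

```python
def shard_for(key, num_nodes):
    """Modulo sharding: the node is picked from the remainder of the key's hash."""
    return hash(key) % num_nodes

keys = [f"user:{i}" for i in range(10_000)]

before = {k: shard_for(k, 4) for k in keys}   # data spread over 4 machines
after = {k: shard_for(k, 5) for k in keys}    # a 5th machine is added

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys have to move to a different node")
```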
To eliminate this redistribution pain, more sophisticated schemes are needed to split the data.
Some schemes rely on a central directory that records which shard holds each key. This level of indirection lets data be moved between machines when a shard grows too large, at the cost of an extra directory lookup on every operation.
The directory information is usually small and fairly static, so it is normally held in memory and only changes occasionally. Another solution is consistent hashing, a technique that uses a small table to divide up the range of possible hash values, with each shard owning a segment of that range.
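A minimal consistent-hashing sketch (no virtual nodes, invented node names), showing that adding a node only moves the keys that fall into its segment of the ring:

```python
import bisect
import hashlib

def ring_hash(value):
    """Map any string onto a fixed integer ring using a stable hash."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes):
        self.ring = sorted((ring_hash(n), n) for n in nodes)

    def node_for(self, key):
        """Walk clockwise from the key's position to the first node on the ring."""
        points = [p for p, _ in self.ring]
        idx = bisect.bisect(points, ring_hash(key)) % len(self.ring)
        return self.ring[idx][1]

keys = [f"user:{i}" for i in range(10_000)]
small = ConsistentHashRing(["node-a", "node-b", "node-c"])
large = ConsistentHashRing(["node-a", "node-b", "node-c", "node-d"])

moved = sum(1 for k in keys if small.node_for(k) != large.node_for(k))
# On average roughly a quarter of the keys move, though the split is uneven
# without virtual nodes.
print(f"{moved / len(keys):.0%} of keys move when node-d joins")
```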
Impact of the sharding model on us
Big data processing is built on the horizontal scaling model, which brings distributed processing of massive data sets. This involves compromises: writing distributed data-handling code is tricky and involves tradeoffs between speed, scalability, fault tolerance, and traditional database goals like atomicity and consistency. Beyond that, the way data is used also changes: data is no longer guaranteed to sit on the same physical machine, so retrieving it and computing over it become new problems.
NoSQL
Does NoSQL really have no schema?
In theory, each record could contain a completely different set of named values, though in practice the application layer often relies on an informal schema, with the client code expecting certain named values to be present. A traditional key/value cache does not support complex queries; NoSQL products build on the pure key/value model and shift responsibility for such common operations from developers to the database.
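As a small sketch of what an "informal schema" looks like in practice, the snippet below inserts two differently shaped documents into the same MongoDB collection using pymongo; the database, collection, and field names are made up for illustration.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client.demo.events  # hypothetical database/collection

# Two records with different sets of named values can live side by side.
events.insert_one({"type": "click", "user_id": 42, "url": "/home"})
events.insert_one({"type": "purchase", "user_id": 42, "amount": 19.9, "currency": "USD"})

# The application still relies on an informal schema: this query only works
# because the client code expects a "type" field to be present.
for doc in events.find({"type": "click"}):
    print(doc)
```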
Hadoop
Hadoop is the best-known public system for running MapReduce algorithms, but many modern databases, such as MongoDB, also support MapReduce as an option. It's worthwhile even in a fairly traditional system, since if you can write your query in a MapReduce form, you'll be able to run it efficiently on as many machines as you have available.
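To give a feel for what "writing a query in MapReduce form" means, here is a minimal word-count sketch in the Hadoop Streaming style, where the mapper and reducer are plain scripts reading stdin and writing stdout; the file names and local test command are my own assumptions, not taken from the book.

```python
# mapper.py: emit (word, 1) for every word on stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py: sum counts per word (Hadoop delivers mapper output sorted by key)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

Locally you can test the same pipeline with `cat input.txt | python mapper.py | sort | python reducer.py` before handing the scripts to a Hadoop Streaming job.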
MongoDB
Automatic sharding and MapReduce operations.
Features: queried with JavaScript over a JSON-like document structure.
Advantages: backed by a commercial company; automatic sharding; MapReduce.

CouchDB
Features: queried with JavaScript MapReduce functions.
It uses multi-version concurrency control (the client must handle write conflicts, and periodic garbage collection is needed to remove old revisions).
Disadvantage: there is no built-in horizontal scaling solution, although external ones exist.

Cassandra
Cassandra originated as an internal Facebook project and has become a standard distributed database solution. It is worth the time it takes to learn such a complex system for its powerful functionality and flexibility. Traditionally it was a long struggle just to set up a working cluster, but as the project matures that has become a lot easier. Consistent hashing solves the sharding problem, and the data structures are optimized for consistent write performance, at the price of occasionally slow reads.
Feature: you can control how many nodes must take part in a read or write, choosing the consistency level and trading consistency against speed.

Redis
Two features make Redis stand out: it keeps the entire database in RAM, and its values can be complex data structures.
Advantage: you can use a cluster to process massive amounts of data with complex data structures; however, sharding currently has to be implemented on the client side.

Bigtable
Bigtable is only available to developers outside Google as the foundation of the App Engine datastore. Despite that, as one of the pioneering alternative databases, it's worth looking at.

HBase
HBase was designed as an open source clone of Google's Bigtable, so unsurprisingly it has a very similar interface, and it relies on a clone of the Google File System called HDFS.
Hypertable
Hypertable is another open source clone of Bigtable.
Voldemort
An open source clone of Amazon's dynamo database created by LinkedIn, Voldemort has a classic three-operation key/value interface, but with a sophisticated backend architecture to handle running on large distributed clusters.
It uses consistent hashing to allow fast lookups of the storage locations for particular keys, and it uses versioning to handle inconsistent values.
Riak
Riak was inspired by Amazon's Dynamo database; it offers a key/value interface and is designed to run on large distributed clusters.
It also uses consistent hashing and a gossip protocol to avoid the need for the kind of centralized index server that Bigtable requires, along with versioning to handle update conflicts. Querying is handled using MapReduce functions written in either Erlang or JavaScript. It's open source under an Apache license, but there's also a closed source commercial version with some special features designed for enterprise customers.
Zookeeper
The ZooKeeper framework was originally built at Yahoo! to make it easy for the company's applications to access configuration information in a robust and easy-to-understand way, but it has since grown to offer a lot of features that help coordinate work across large distributed clusters. One way to think of it is as a very specialized key/value store, with an interface that looks a lot like a filesystem and that supports operations like watch callbacks, write consensus, and transaction IDs, which are often needed for coordinating distributed algorithms.
This has allowed it to act as a foundation layer for services like LinkedIn's Norbert, a flexible framework for managing clusters of machines. ZooKeeper itself is built to run in a distributed way across a number of machines, and it's designed to offer very fast reads, at the expense of writes that get slower the more servers are used to host the service.
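To make the "filesystem-like interface with watch callbacks" concrete, here is a minimal sketch using the third-party Python kazoo client (my choice of library, not the book's); the znode path and values are made up.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# znodes behave like files in a small hierarchical namespace.
zk.ensure_path("/app/config")
zk.set("/app/config", b"feature_x=on")

# The watch callback fires on registration and whenever the node's data changes.
@zk.DataWatch("/app/config")
def on_config_change(data, stat):
    print("config is now:", data, "version:", stat.version)

zk.set("/app/config", b"feature_x=off")  # triggers the watch again
zk.stop()
```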
Storage
S3
Amazon's S3 service lets you store large chunks of data on an online service, with an interface that makes it easy to retrieve the data over the standard web protocol, HTTP. One way of looking at it is as a file system that's missing some features like appending to, rewriting, or renaming files, and true directory trees. You can also see it as a key/value database, available as a web service and optimized for storing large amounts of data in each value.
Http://www.ibm.com/developerworks/cn/java/j-s3/
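A minimal sketch of that key/value view of S3, using the boto3 client (my choice of library, not the book's); the bucket and key names are placeholders.

```python
import boto3

s3 = boto3.client("s3")  # credentials come from the environment / AWS config

# "Write" a value: the key is the object name, the value is the body.
s3.put_object(Bucket="example-notes-bucket", Key="reports/2012-04.json",
              Body=b'{"visits": 1234}')

# "Read" it back. There is no append or rename; you overwrite whole objects.
obj = s3.get_object(Bucket="example-notes-bucket", Key="reports/2012-04.json")
print(obj["Body"].read())
```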
HDFS
An introductory article on HDFS: http://baike.baidu.com/view/3061630.htm
NoSQLfan posts about HDFS: http://blog.nosqlfan.com/tags/hdfs

Big data calculation
Getting the concise, valuable information you want from a sea of data can be challenging, but there has been a lot of progress around systems that help you turn your datasets into something that makes sense. Because there are so many different barriers, the tools range from rapid statistical analysis systems to enlisting human helpers: R, Yahoo! Pipes, Mechanical Turk, SOLR/Lucene, ElasticSearch, BigSheets, Tinkerpop.
NLP
Natural Language Processing (NLP) is a subset of data processing that's so crucial it earned its own section. Its focus is taking messy, human-created text and extracting meaningful information. As you can imagine, this chaotic problem domain has spawned a large variety of approaches, with each tool most useful for particular kinds of text. There's no magic bullet that will understand written information as well as a human, but if you're prepared to adapt your use of the results to handle some errors and don't expect miracles, you can pull out some powerful insights.
- Natural Language Toolkit
- OpenNLP
- Boilerpipe
- OpenCalais
MapReduce
The approach pioneered by Google, and adopted by many other web companies, is instead to create a pipeline that reads and writes to arbitrary file formats, with intermediate results passed between stages as files and the computation spread across many worker machines.
Hadoop, Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka, Azkaban, Oozie, Greenplum
Machine Learning
WEKA
WEKA is a Java-based framework and GUI for machine learning algorithms. It provides a plug-in architecture for researchers to add their own techniques, with a command-line and window interface that makes it easy to apply them to your own data.
Mahout
Mahout is an open source framework that can run common machine learning algorithms on massive datasets.
scikits.learn
It's hard to find good off-the-shelf tools for practical machine learning, but scikits.learn is a beautifully documented and easy-to-use Python package offering a high-level interface to many standard machine learning techniques. This makes it a very fruitful sandbox for experimentation and rapid prototyping, with a very easy path to using the same code in production once it's working well.
The book on Amazon China: http://www.amazon.cn/Big-Data-Glossary-Warden-Pete/dp/1449314597/qid=1333609610&sr=8-1#
Update: what follows is my reply to a colleague's mail, answering several questions about NoSQL.
Over the weekend I sorted out my NoSQL material and tried to answer the questions everyone has been asking about NoSQL. Since each point involves a lot of content, I'll give short answers first; the detailed write-up comes afterwards.
Short answers:
- Q: Why NoSQL? Why use NoSQL in place of a relational database?
A: Storing many-to-many relationships makes the data expand rapidly. With a relational database, you have to shard databases and tables just to solve the storage problem;
If you then want to compute over those relationships, it becomes a big computing job; with NoSQL you can use a MapReduce solution, as sketched below;
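A sketch of that MapReduce approach, using pymongo's legacy map_reduce helper against a hypothetical follows collection (the database, collection, and field names are my own assumptions; newer MongoDB versions favor the aggregation pipeline for this kind of counting):

```python
from pymongo import MongoClient
from bson.code import Code

db = MongoClient("mongodb://localhost:27017").social  # hypothetical database

# Count how many followers each user has in a many-to-many "follows" collection.
mapper = Code("function () { emit(this.followed_id, 1); }")
reducer = Code("function (key, values) { return Array.sum(values); }")

result = db.follows.map_reduce(mapper, reducer, "follower_counts")
for doc in result.find().sort("value", -1).limit(10):
    print(doc["_id"], doc["value"])
```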
- Q: Since we already use Redis, why bring in MongoDB?
A: Redis is essentially a key-data structure store; it does not support queries on complex conditions, nor MapReduce;
Redis is positioned as an in-memory database. If we put all user behavior data in memory, there is an obvious problem: we pay for cold data.
Even though memory is already quite cheap, the sensible split is still hot data in memory and cold data on disk.
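As a quick sketch of the "key-data structure" model mentioned above, using a recent redis-py client (the key names are made up): the values are rich structures you manipulate through commands, but there is no ad-hoc query language over them.

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Values can be real data structures, not just opaque strings.
r.hset("user:42", mapping={"name": "alice", "city": "Beijing"})
r.zadd("leaderboard", {"alice": 310, "bob": 280})

print(r.hgetall("user:42"))
print(r.zrevrange("leaderboard", 0, 1, withscores=True))

# But there is no equivalent of "find all users in Beijing older than 30":
# any such query has to be maintained by the application as extra keys/indexes.
```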
- Q: MongoDB is said to have excellent performance. How does it achieve that?
A: That claim usually comes from comparisons with relational databases; the underlying reason is that the theory behind the two is different;
A relational database must support complex SQL statements, a strict relational model, and ACID-level transactions;
NoSQL performs drastic subtraction on that basis: SQL becomes optional and strong transactions are not supported; the theory guiding its design is the CAP theorem of distributed systems.
System complexity and strong transaction guarantees are pared back in order to improve performance! You could call it an unfair competition: a swimmer in a sharkskin suit racing against one in a cotton coat;
- Q: Everyone is rather tangled up about sharding. How should we practice sharding in our project?
A: Sharding will greatly increase the complexity of the current system, and I do not recommend it for the initial release. If you want read/write splitting, use a master-slave replication cluster; if you only want to back up data to avoid a single point of failure, configure a replica set cluster; if you only want to relieve pressure on MongoDB, put a memory cache in front of it. A connection sketch follows after this answer;
Signals for introducing sharding:
(1) a single machine's disk is no longer big enough; (2) the write load can no longer be handled by a single node
MongoDB's automatic sharding is still a "nice to look at" feature, and I strongly oppose using it in production for now;
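A minimal sketch of the replica-set / read-splitting setup mentioned above, using pymongo; the host names, replica set name, and database are placeholders.

```python
from pymongo import MongoClient

# Connect to a replica set; writes always go to the primary, and with
# readPreference=secondaryPreferred reads are served from secondaries
# whenever one is available.
client = MongoClient(
    "mongodb://db1.example.com:27017,db2.example.com:27017/"
    "?replicaSet=rs0&readPreference=secondaryPreferred"
)

db = client.ugc  # hypothetical database name
db.posts.insert_one({"user_id": 42, "text": "hello"})   # goes to the primary
print(db.posts.count_documents({"user_id": 42}))        # may be read from a secondary
```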
- Q: In your opinion, how should we design the supporting architecture?
A: The product is UGC-oriented and the relationships between users will keep growing; storing and computing over such complex relationships suits neither Redis nor SQL Server on its own. I therefore suggest using SQL Server (or another relational database) to store the raw data at the bottom layer, MongoDB to store the relationship data and redundant resource data in the middle layer, and Redis as the memory cache at the top layer. A read-path sketch follows below;
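A sketch of the read path through those layers, using redis-py plus pymongo; all key, database, and collection names are invented for illustration.

```python
import json

import redis
from pymongo import MongoClient

cache = redis.Redis(host="localhost", port=6379)
mongo = MongoClient("mongodb://localhost:27017").ugc   # hypothetical middle layer

def get_user_relations(user_id, ttl=300):
    """Read-through cache: try Redis first, fall back to MongoDB, then repopulate."""
    key = f"relations:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)

    doc = mongo.relations.find_one({"user_id": user_id}, {"_id": 0})
    if doc is not None:
        cache.setex(key, ttl, json.dumps(doc))   # keep hot data in memory
    return doc

# The relational database (SQL Server in the proposal) remains the system of
# record; MongoDB holds the denormalized relationship data this function reads.
```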
Detailed analysis:
The relational databases we use most often actually consist of two important parts: SQL, and the relational data model grounded in set theory. Their characteristics are as follows:
- To support SQL, the database has to provide a complex engine, even if you only ever use the simplest features. The cost is similar to buying bandwidth: although usage is low most of the time, the purchase is still sized for the peak. We may only ever access data by primary key, yet the relational database still has to provide the underlying machinery for fully supporting SQL;
- The relational data model is very strict. With OOP now prevalent, this strict constraint is actually fairly convenient: developers can map business entities directly to DB tables.
- When a single-machine database reaches its capacity limit, it is very hard to scale. Usually tables have to be split by primary key, and once a table is split the relational paradigm is already violated, because data belonging to the same set is scattered across multiple tables.
- Relational databases generally support ACID transactions, namely:
  - A, Atomicity: either everything executes or nothing does
  - C, Consistency: data stays consistent throughout the transaction's execution
  - I, Isolation: two transactions do not affect each other
  - D, Durability: once a transaction completes, its results are persisted to disk.
The turning point from relational databases to NoSQL came when data started to be stored in a distributed fashion: relational databases gradually degenerated into query systems keyed on the primary key, which is a trait shared by almost all NoSQL products. The commonalities of most NoSQL products can be summarized as follows:
- Supporting SQL is no longer required; a simple key-value access model takes its place
- Drastic subtraction is performed relative to relational databases. For example, transactions are not supported: NoSQL products care far more about performance than about ACID, and generally provide only row-level atomic operations, meaning operations on the same key are serialized so the data is never corrupted. That policy is adequate in most scenarios, and crucially it improves execution efficiency enormously!
- NoSQL products tend to be restrained in design, and generally resist adding new features so as not to slide back down the old path of relational databases.
NoSQL products are designed on the foundation of distributed systems.
CAP theorem:
- C, Consistency: whether all nodes in the distributed system see the same data at the same time
- A, Availability: whether the system can keep serving requests normally when a node in the cluster runs into trouble
- P, Partition tolerance: whether the system can keep serving the outside world when nodes in the cluster lose contact with each other
Any distributed system can support only two of the three. Because of network latency and similar problems, P must be supported, so the real choice is between consistency and availability. To keep data consistent across all nodes, an operation can only be declared successful after every node has confirmed it, so once a node goes down, availability clearly cannot be guaranteed. NoSQL products can be divided into the following categories:
- Key-value: the value is an arbitrary blob, e.g. Voldemort
- Key-data structure: the value can be a richer data structure, e.g. Redis
- Key-document: the value is a document, generally stored in a JSON-like structure, e.g. MongoDB and CouchDB
What problem does sharding solve?
Sharding is the technique of distributing data and read/write requests across multiple machines (or nodes).
The same piece of data is stored on multiple nodes, which gives data redundancy. The goal of a sharding policy is to minimize the cost of redistributing and migrating data when nodes are added or removed;
Common sharding policies include: (1) consistent hashing; (2) a control module plus a routing table that dictates where each piece of data lives. A toy example of the routing-table approach follows below.
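A toy sketch of policy (2), a routing table consulted before every read or write; the node names and ranges are invented, and a real control module would also handle migration and failover.

```python
# Each entry maps a contiguous range of user ids to the node that stores them.
ROUTE_TABLE = [
    (0,         999_999,      "mongo-shard-a.example.com"),
    (1_000_000, 1_999_999,    "mongo-shard-b.example.com"),
    (2_000_000, float("inf"), "mongo-shard-c.example.com"),
]

def node_for(user_id):
    """Look up which node owns a given key; the control module updates the table
    when shards are split or moved, so clients only need this one indirection."""
    for low, high, node in ROUTE_TABLE:
        if low <= user_id <= high:
            return node
    raise KeyError(user_id)

print(node_for(1_500_000))  # -> mongo-shard-b.example.com
```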
Sharding will greatly increase the complexity of the system. With a small amount of data, you can cope by adding a memory cache layer or simply by separating reads from writes.