Big Data Daily Knowledge: Architecture and Algorithms — reading notes


Directory
  • 1 What We Talk About When We Talk About Big Data
  • 2 Data Partitioning and Routing
  • 3 Data Replication and Consistency
  • 4 Common Algorithms and Data Structures for Big Data
  • 5 Cluster Resource Management and Scheduling
  • 6 Distributed Coordination Systems
  • 7 Distributed Communication
  • 8 Data Channels
  • 9 Distributed File Systems
  • 10 In-Memory KV Stores
  • 11 Column Databases
  • 12 Large-Scale Batch Processing
  • 13 Stream Computing
  • 14 Interactive Data Analysis
  • 15 Graph Databases
  • 16 Machine Learning: Paradigms and Architecture
  • 17 Machine Learning: Distributed Algorithms *
  • 18 Incremental Computation
  • Appendix A: Hardware Architecture and Common Performance Metrics
  • Appendix B: Big Data Must-Read Literature
What We Talk About When We Talk About Big Data
    1. IBM's 3Vs (Volume, Velocity, Variety) + Value
    2. p7: using social networks to predict the Dow Jones Index by analyzing public sentiment on Twitter
Data Partitioning and Routing
    1. Membase (Couchbase): "virtual buckets"
    2. DHT consistent hashing
      1. Dynamo's "virtual nodes" (see the consistent-hashing sketch after this list)
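
A minimal sketch of consistent hashing with virtual nodes in the Dynamo style (the hash function, vnode count, and node names are my own illustrative choices, not from the book):

    import bisect, hashlib

    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    class ConsistentHashRing:
        def __init__(self, nodes, vnodes=100):
            # each physical node owns many virtual nodes spread around the ring
            self._ring = sorted((_hash(f"{node}#{i}"), node)
                                for node in nodes for i in range(vnodes))
            self._points = [h for h, _ in self._ring]

        def route(self, key: str) -> str:
            # walk clockwise to the first virtual node at or after hash(key)
            idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
            return self._ring[idx][1]

    ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
    print(ring.route("user:42"))

Adding or removing a physical node only remaps the keys owned by its virtual nodes, which is the point of the scheme.
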
Data Replication and Consistency
    1. CAP: CP or AP?
    2. ACID
    3. BASE
      1. Soft state = an intermediate state between stateful and stateless?
    4. Consistency model classification
      1. Strong: all processes immediately see the latest value after a write?
      2. Eventual: the "inconsistency window" (can the length of this window be bounded? Otherwise the guarantee is hollow)
        1. Monotonic reads: if a process reads version V of a data item, subsequent reads never return a version older than V (how is "subsequent" defined?)
        2. Monotonic writes: writes by the same process are guaranteed to be serialized?
      3. Causal
        1. "Read your writes"
          1. Session consistency
    5. Replica update policies
    6. Consistency protocols
      1. 2PC: coordinator/participant
        1. 3PC: addresses 2PC's long blocking by splitting the commit phase into pre-commit and commit
      2. Vector clocks
        1. Used to determine whether two events are causally related (see the sketch after this list)
      3. RWN (strong read consistency when R + W > N)
      4. Paxos
        1. Ensures consistency of log replica data?
      5. Raft
        1. Three sub-problems: leader election, log replication, safety
        2. Term?
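
A small vector-clock sketch (process names are invented) for deciding whether two versions are causally ordered or concurrent, plus the one-line quorum check behind R + W > N:

    def happens_before(vc_a: dict, vc_b: dict) -> bool:
        """True if the event stamped vc_a causally precedes the one stamped vc_b."""
        keys = set(vc_a) | set(vc_b)
        return all(vc_a.get(k, 0) <= vc_b.get(k, 0) for k in keys) and vc_a != vc_b

    def concurrent(vc_a: dict, vc_b: dict) -> bool:
        return not happens_before(vc_a, vc_b) and not happens_before(vc_b, vc_a)

    a = {"p1": 2, "p2": 0}
    b = {"p1": 2, "p2": 1}             # b saw everything a saw, then wrote once more
    c = {"p1": 1, "p2": 3}
    print(happens_before(a, b))        # True  -> b supersedes a
    print(concurrent(b, c))            # True  -> conflicting versions, need reconciliation

    def strong_read(n: int, r: int, w: int) -> bool:
        # read and write quorums overlap in at least one replica when R + W > N
        return r + w > n

    print(strong_read(n=3, r=2, w=2))  # True
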
Common Algorithms and Data Structures for Big Data
    1. Bloom filter (see the sketch after this list)
      1. Counting BF: each basic information unit is represented by multiple bits, which also enables deletion
    2. Skip list: randomized search/insert?
    3. LSM tree: converts a large volume of random writes into batched sequential writes
      1. e.g. LevelDB
    4. Merkle hash tree (BitTorrent? ...)
    5. LZSS
      1. Snappy: match length >= 4
    6. Cuckoo hashing
      1. Uses 2 hash functions; if both candidate buckets are occupied, the old element is kicked out and re-inserted elsewhere; if this leads to an infinite loop, re-select the hash functions
      2. Application: CMU SILT
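
A minimal Bloom filter sketch; the bit-array size, the number of hash functions, and the double-hashing trick used to derive them are illustrative choices of mine, not the book's:

    import hashlib

    class BloomFilter:
        def __init__(self, m_bits=1 << 20, k=5):
            self.m, self.k = m_bits, k
            self.bits = bytearray(m_bits // 8)

        def _positions(self, item: str):
            # derive k positions from one digest: h_i(x) = h1(x) + i * h2(x)
            d = hashlib.sha256(item.encode()).digest()
            h1 = int.from_bytes(d[:8], "big")
            h2 = int.from_bytes(d[8:16], "big") | 1
            return [(h1 + i * h2) % self.m for i in range(self.k)]

        def add(self, item: str):
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, item: str):
            # false positives are possible, false negatives are not
            return all((self.bits[p // 8] >> (p % 8)) & 1 for p in self._positions(item))

    bf = BloomFilter()
    bf.add("hello")
    print("hello" in bf, "world" in bf)   # True False (with high probability)

A counting Bloom filter replaces each bit with a small counter so that deletions become possible.
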
Cluster Resource Management and Scheduling
    1. Scheduler paradigms
      1. Centralized
      2. Two-level
      3. Shared-state
    2. Resource scheduling policies
      1. FIFO
      2. Fair
      3. Capacity
      4. Delay scheduling
      5. Dominant Resource Fairness (DRF): repeatedly allocate to the user whose dominant share is currently smallest (see the sketch after this list)
    3. Mesos: two-level scheduling
    4. YARN: supports preemption
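
A compact Dominant Resource Fairness sketch following the progressive-filling idea from the DRF paper: keep giving the next task to the user whose dominant share is smallest. The cluster capacity and per-task demands are the paper's illustrative numbers, not something from these notes:

    capacity = {"cpu": 9.0, "mem": 18.0}
    demands = {"user_a": {"cpu": 1.0, "mem": 4.0},   # memory is user_a's dominant resource
               "user_b": {"cpu": 3.0, "mem": 1.0}}   # cpu is user_b's dominant resource

    allocated = {u: {"cpu": 0.0, "mem": 0.0} for u in demands}
    used = {"cpu": 0.0, "mem": 0.0}

    def dominant_share(user):
        return max(allocated[user][r] / capacity[r] for r in capacity)

    while True:
        user = min(demands, key=dominant_share)        # smallest dominant share goes first
        need = demands[user]
        if any(used[r] + need[r] > capacity[r] for r in capacity):
            break                                      # that user's next task no longer fits
        for r in capacity:
            allocated[user][r] += need[r]
            used[r] += need[r]

    print(allocated)   # user_a ends with 3 tasks, user_b with 2: dominant shares equalize at 2/3
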
Distributed Coordination Systems
    1. Chubby lock service
      1. p93: absent failures, the system generally tries to renew the lease with the original master server
      2. KeepAlive mechanism
      3. Each Chubby cell's master stores its memory snapshots in another data center, avoiding circular dependencies?
    2. ZooKeeper
      1. Reads may return stale data; issue a sync operation before the read
      2. Replay log combined with fuzzy snapshots?
      3. Znodes: persistent/ephemeral
Distributed Communication
    1. Serialization and RPC frameworks
      1. Protocol Buffers and Thrift
      2. Avro: IDL described in JSON?
    2. Message queues
      1. ZeroMQ (lightweight, no message persistence) > Kafka (at least once) > RabbitMQ > ActiveMQ?
        1. ISR (in-sync replicas)
    3. Application-layer multicast
      1. Gossip
        1. Anti-entropy model (random flooding?): push / pull / push-pull (see the simulation after this list)
        2. p117: the rumor-mongering model can be understood through the "rejected confession" analogy: once node P learns that Q has already been updated (a rejection), the more rejections it accumulates the quieter it becomes, until it stops spreading entirely. Drawback: there is no guarantee that every node eventually receives the update (well, full coverage was never the goal!)
        3. Application: Cassandra cluster management
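
A toy push-pull anti-entropy simulation (cluster size and the convergence check are arbitrary): each round, every node gossips with one random peer and both keep the newer version, so a single update spreads to all nodes in roughly O(log N) rounds:

    import random

    N = 32
    state = [(0, "old")] * N          # each node holds (version, value)
    state[0] = (1, "new")             # node 0 has the fresh update

    def push_pull(a, b):
        newer = max(state[a], state[b])      # compare by version first
        state[a] = state[b] = newer

    rounds = 0
    while any(s != state[0] for s in state):
        for node in range(N):
            push_pull(node, random.randrange(N))
        rounds += 1

    print(f"all {N} nodes converged after {rounds} rounds")
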
Data Channels
    1. Log collection: Chukwa, Scribe
    2. Data bus: Databus, Wormhole
    3. Import and export: Sqoop?
Distributed File Systems
    1. GFS
      1. Next generation: Colossus?
    2. HDFS
      1. HA scheme: active/standby NameNode (ANN/SNN)
    3. Haystack: merges small photo files to cut down per-file metadata
    4. Storage layouts: hybrid row/column
      1. Dremel column store: Name.Language.Code? (data item, repetition level, definition level)?
      2. Hybrid storage: RCFile, ORCFile, Parquet
    5. Erasure codes / MDS* (see the toy parity example after this list)
      1. Reed-Solomon
      2. LRC
        1. Block locality and minimum code distance: for an MDS code configured as (N, M), they are N and M+1 respectively
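
A toy single-parity "erasure code" (N data blocks plus M = 1 XOR parity block), just to make the (N, M) / locality / distance discussion concrete; real systems use Reed-Solomon or LRC, which this sketch deliberately does not implement:

    from functools import reduce

    def xor_blocks(blocks):
        return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

    # N = 3 data blocks, M = 1 parity block: tolerates one lost block (distance M + 1 = 2)
    data = [b"AAAA", b"BBBB", b"CCCC"]
    parity = xor_blocks(data)

    # lose one data block, then rebuild it from the N surviving blocks (locality = N)
    lost = 1
    survivors = [blk for i, blk in enumerate(data) if i != lost] + [parity]
    recovered = xor_blocks(survivors)
    assert recovered == data[lost]
    print("recovered:", recovered)
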
In-Memory KV Stores
    1. RAMCloud
    2. Redis
    3. Membase
Column Databases
    1. BigTable
    2. PNUTS
    3. Megastore
    4. Spanner
      1. TrueTime: TT.now() returns a time interval, guaranteeing that the absolute time at which the event actually occurs falls within that interval? (see the sketch after this list)
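
A conceptual sketch of the TrueTime idea (the uncertainty bound EPSILON and the use of the local clock are placeholders; real TrueTime is backed by GPS and atomic clocks): TT.now() returns an interval [earliest, latest] guaranteed to contain real time, and "commit wait" delays until the chosen timestamp is definitely in the past:

    import time
    from collections import namedtuple

    TTInterval = namedtuple("TTInterval", ["earliest", "latest"])
    EPSILON = 0.005                      # assumed clock-uncertainty bound, in seconds

    def tt_now():
        t = time.time()                  # stand-in for a disciplined clock source
        return TTInterval(t - EPSILON, t + EPSILON)

    def commit(txn):
        s = tt_now().latest              # commit timestamp, no earlier than any real time so far
        while tt_now().earliest < s:     # commit wait: s must be provably in the past
            time.sleep(0.001)
        print(f"{txn} committed at timestamp {s:.6f}")

    commit("txn-1")
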
Large-Scale Batch Processing
    1. MapReduce
      1. A map task partitions its intermediate data into R parts, and the reduce tasks are then notified to fetch them (reduce can only start once all map tasks have completed?)
        1. Reduce-side pull rather than map-side push, to support fine-grained fault tolerance (incisive!! In effect, a synchronous blocking mode becomes asynchronous triggering)
        2. The reduce task groups intermediate results into <key, list<value>> and hands them to the user-defined reduce function
          1. Note that the user's reduce operates on global data (which may involve remote access ...)
      2. Optional combiner: the map side merges values with the same key, reducing network traffic (see the word-count sketch after this list)
      3. Computation patterns
        1. Summation
        2. Filtering
        3. Organizing data (partition strategy design)
        4. Join
          1. Reduce-side (note that the two datasets share the same key and differ in value type; the reducer has to tell them apart)
            1. Still a bit unclear here: what does the reducer do if it receives more data than fits in memory?
          2. Map-side (assuming L is large and R is small, a left join? R is read entirely into memory)
    2. DAG
      1. Dryad
        1. Graph structure description (distributed computing framework: how to automatically create the individual compute nodes and their topological connections from a global graph description?)
      2. FlumeJava and Tez*
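
A single-process word-count sketch of the MapReduce flow with an optional combiner (the input lines and the two "map tasks" are made up): map emits (word, 1), the combiner pre-aggregates within each map task, and reduce receives <key, list<value>> after the shuffle:

    from collections import defaultdict
    from itertools import groupby

    def map_fn(line):
        for word in line.split():
            yield word, 1

    def combiner(pairs):
        # map-side pre-aggregation of values sharing a key, to shrink shuffle traffic
        acc = defaultdict(int)
        for k, v in pairs:
            acc[k] += v
        return acc.items()

    def reduce_fn(key, values):
        yield key, sum(values)

    splits = [["the quick brown fox", "the lazy dog"], ["the dog barks"]]   # two map tasks

    intermediate = []
    for split in splits:
        pairs = (kv for line in split for kv in map_fn(line))
        intermediate.extend(combiner(pairs))        # one combiner pass per map task

    intermediate.sort()                             # "shuffle": group by key
    result = {k: v
              for key, group in groupby(intermediate, key=lambda kv: kv[0])
              for k, v in reduce_fn(key, [v for _, v in group])}
    print(result)   # {'the': 3, 'dog': 2, ...}
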
Stream Computing
    1. System architecture
      1. Master-slave: Storm
      2. P2P: S4
      3. Samza
    2. DAG topology (similar to a DirectShow filter graph)
      1. Compute nodes
      2. Data flow: MillWheel (key, value, timestamp); Storm (tuple, ...); S4 [key, attributes]
    3. Delivery guarantees
      1. Storm: XOR-based acking (see the sketch after this list)
      2. MillWheel: via state persistence (analogous to a static local variable in a C function ...)
    4. State persistence
      1. MillWheel and Samza: weak mode (node B sends an ACK to upstream A only after receiving an ACK from downstream node C)
        1. If C does not respond in time, B persists its state to shed the dependence on C
        2. Samza: a compute node can store its state information in a Kafka message queue
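
A sketch of the XOR trick behind Storm's acker (the random 64-bit tuple IDs are generated on the spot): every tuple ID is XORed into a per-spout-tuple checksum once when the tuple is anchored and once when it is acked, so the checksum returns to zero exactly when the whole tuple tree has been processed:

    import random

    class AckerEntry:
        """Tracks one spout tuple's tree with a single XOR checksum."""
        def __init__(self):
            self.checksum = 0

        def anchor(self, tuple_id):      # a tuple enters the tree
            self.checksum ^= tuple_id

        def ack(self, tuple_id):         # a tuple has been fully handled by a bolt
            self.checksum ^= tuple_id

        def fully_processed(self):
            return self.checksum == 0

    acker = AckerEntry()
    root = random.getrandbits(64)
    acker.anchor(root)

    child = random.getrandbits(64)       # a bolt emits a child anchored to the root
    acker.anchor(child)
    acker.ack(root)
    print(acker.fully_processed())       # False: the child is still in flight

    acker.ack(child)
    print(acker.fully_processed())       # True: safe to report the spout tuple as done
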
Interactive Data Analysis
    1. Hive
      1. Stinger improvements: vectorized query engine, CBO
    2. Shark
      1. "Partial DAG Execution" engine, essentially dynamic optimization of SQL queries
      2. Data co-partitioning: hash the join column so that records with the same key from different tables land on the same machine, avoiding later shuffle/network overhead
    3. The Dremel family
      1. Dremel: rather than converting user queries into MR jobs, an MPP-like mechanism scans the on-disk data directly (advanced compilation techniques?)
      2. PowerDrill: load the data to be analyzed into memory? Skip irrelevant data with clever data structures?
      3. Impala
        1. p262: impalad is written in C++, bypasses the NameNode to read HDFS directly, and query execution uses LLVM native code generation (impressive!)
        2. Although it looks good, it still needs improvement (fault tolerance, UDFs)
        3. Operators: Scan, HashJoin, HashAggregation, Union, TopN, Exchange
      4. Presto (similar to Impala)
    4. Hybrid system: HadoopDB
Graph Databases
    1. Online query systems
      1. Three-layer structure
      2. Facebook TAO
        1. * Data consistency
    2. Common graph mining problems
      1. PageRank
      2. Single-source shortest path
      3. Maximum bipartite matching
    3. Graph partitioning
      1. Edge-cut / vertex-cut: an optimization problem, though in practice random partitioning?
    4. Computational models (see the vertex-centric PageRank sketch after this list)
      1. Vertex-centric
      2. GAS (Gather-Apply-Scatter)
      3. Synchronous execution
        1. BSP
        2. MapReduce (iterative jobs must repeatedly write intermediate results to the file system, hurting efficiency)
      4. Asynchronous execution
        1. Data consistency: full > edge > vertex; sequential consistency (additional constraints)
    5. Graph databases: Pregel; Giraph (map-only, Netty); GraphChi (single machine, parallel sliding windows, PSW); PowerGraph (delta caching)
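
A minimal vertex-centric PageRank sketch in the Pregel/BSP style (toy graph; the damping factor 0.85 and 20 supersteps are arbitrary): each superstep, every vertex consumes the messages sent in the previous superstep, updates its rank, and sends rank / out_degree to its out-neighbours, with a barrier between supersteps:

    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}   # vertex -> out-neighbours
    DAMPING, SUPERSTEPS = 0.85, 20
    n = len(graph)

    rank = {v: 1.0 / n for v in graph}
    messages = {v: [] for v in graph}

    for step in range(SUPERSTEPS):
        # compute phase: process the previous superstep's messages
        new_rank = {}
        for v in graph:
            incoming = sum(messages[v])
            new_rank[v] = ((1 - DAMPING) / n + DAMPING * incoming) if step > 0 else rank[v]
        # communication phase: send rank / out_degree along out-edges
        messages = {v: [] for v in graph}
        for v, nbrs in graph.items():
            for u in nbrs:
                messages[u].append(new_rank[v] / len(nbrs))
        rank = new_rank      # barrier: all vertices move to the next superstep together

    print({v: round(r, 3) for v, r in sorted(rank.items())})
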
Machine Learning: Paradigms and Architecture
    1. Distributed machine learning
      1. MapReduce
      2. BSP
        1. Each superstep: distributed computation > global communication > barrier synchronization
      3. SSP (stale synchronous parallel)?
    2. Spark and MLbase
    3. Parameter Server *
Machine Learning: Distributed Algorithms *
    1. Logistic regression
    2. Parallel stochastic gradient descent (see the parameter-averaging sketch after this list)
    3. Matrix factorization: ALS-WR
    4. LambdaMART
    5. Spectral clustering
    6. Deep learning: DistBelief
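
A minimal data-parallel SGD sketch with parameter averaging for logistic regression (the synthetic data, 4 "workers", and the learning rate are all arbitrary choices): each epoch, every worker runs SGD locally on its shard starting from the current global model, and the driver averages the resulting weights:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(4000, 5))                        # synthetic features
    true_w = np.array([1.5, -2.0, 0.5, 0.0, 3.0])
    y = (1 / (1 + np.exp(-X @ true_w)) > 0.5).astype(float)
    shards = list(zip(np.array_split(X, 4), np.array_split(y, 4)))   # 4 workers

    def local_sgd(w, Xs, ys, lr=0.1):
        # one SGD pass over this worker's shard
        for xi, yi in zip(Xs, ys):
            grad = (1 / (1 + np.exp(-xi @ w)) - yi) * xi   # logistic-loss gradient
            w = w - lr * grad
        return w

    w = np.zeros(5)
    for epoch in range(10):
        w = np.mean([local_sgd(w.copy(), Xs, ys) for Xs, ys in shards], axis=0)

    acc = np.mean(((1 / (1 + np.exp(-X @ w))) > 0.5) == y)
    print(f"training accuracy after parameter averaging: {acc:.3f}")
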
Incremental Computation
    1. Percolator
      1. p371: "snapshot isolation" to resolve write conflicts?
    2. Kineograph
    3. DryadInc
Appendix A: Hardware Architecture and Common Performance Metrics

Appendix B: Big Data Must-Read Literature
