Big Data Daily Knowledge: Architecture and Algorithms
- 1 What we're talking about when it comes to big data
- 2 Data sharding and routing
- 3 Data replication and consistency
- 4 Common algorithms and data structures for big data
- 5 Cluster resource management and scheduling
- 6 Distributed coordination systems
- 7 Distributed communication
- 8 Data channels
- 9 Distributed file systems
- 10 In-memory KV databases
- 11 Column-family databases
- 12 Large-scale batch processing
- 13 Stream computing
- 14 Interactive data analysis
- 15 Graph databases
- 16 Machine learning: paradigms and architecture
- 17 Machine learning: distributed algorithms *
- 18 Incremental computation
- Appendix A Hardware architecture and common performance metrics
- Appendix B Big data must-read literature
What we're talking about when it comes to big data
- IBM's 3V (volume, velocity, variety) + value
- p7: predicting the Dow Jones index by analysing public sentiment on Twitter
Data sharding and routing
- Membase (Couchbase): "virtual buckets"
- DHT: consistent hashing
- Dynamo: "virtual nodes"
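A minimal sketch of the virtual-node idea shared by Membase's buckets and Dynamo's ring (class and parameter names are mine, not from either system):

```python
import hashlib
from bisect import bisect_right

class ConsistentHashRing:
    """Minimal consistent-hash ring with virtual nodes (Dynamo-style sketch)."""

    def __init__(self, nodes, vnodes=100):
        # Each physical node is hashed onto the ring many times ("virtual
        # nodes") so that adding/removing a node moves only a small,
        # evenly spread slice of keys.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.hashes = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def lookup(self, key):
        # A key is owned by the first virtual node clockwise from its hash.
        idx = bisect_right(self.hashes, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]
```

Removing a node only remaps the keys that node owned; everything else stays put, which is the whole point versus `hash(key) % n`.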
Data replication and consistency
- CAP: CP or AP?
- Soft state = an intermediate state between stateful and stateless?
- Consistency model classification
- Strong: every process sees the latest value immediately after a write completes?
- Eventual: the "inconsistency window" (can its length be bounded? Otherwise the guarantee is meaningless.)
- Monotonic reads: once a process has read version V of a data item, no subsequent read returns a version older than V (how is "subsequent" defined?)
- Monotonic writes: writes from the same process are guaranteed to be serialized?
- "Read your own writes."
- Replica update policies
- Consistency protocols
- 2PC: coordinator / participants
- 3PC: addresses 2PC's long blocking by splitting the commit phase into pre-commit and commit
- Vector clocks
- Used to determine whether two events have a causal relationship
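The causality test a vector clock supports can be sketched in a few lines (representing each clock as a dict of node → counter is my own choice):

```python
def vc_compare(a, b):
    """Compare two vector clocks (dicts mapping node -> counter).
    Returns 'before', 'after', 'equal', or 'concurrent'."""
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"    # a causally precedes b
    if b_le_a:
        return "after"     # b causally precedes a
    return "concurrent"    # no causal relationship -> potential conflict
```

"Concurrent" is the interesting outcome: neither clock dominates, so the system (Dynamo-style) must surface or resolve the conflict.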
- RWN (data consistency requires R + W > N: read and write quorums must overlap)
- Raft: ensuring consistency of replicated log data?
- 3 sub-problems: leader election, log replication, safety
Common algorithms and data structures for big data
- Bloom filter
- Counting BF: each basic information unit is represented by multiple bits (a small counter), which makes deletions possible
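A minimal plain Bloom filter for reference; a counting BF would replace each bit below with a small counter so elements can also be removed (names and parameters are illustrative):

```python
import hashlib

class BloomFilter:
    """Plain Bloom filter sketch: k hash positions per item, one bit each.
    False positives are possible; false negatives are not."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, item):
        # Derive k independent positions by salting one cryptographic hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def might_contain(self, item):
        # "Might": all k bits set can happen by collision (false positive).
        return all(self.bits[p] for p in self._positions(item))
```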
- Skiplist: randomized search/insert?
- LSM tree: converts a large number of random writes into batched sequential writes
- e.g. LevelDB
- Merkle hash tree (BitTorrent? Cassandra?)
- Snappy: minimum match length >= 4
- Cuckoo hashing
- With 2 hash functions: if both candidate buckets are occupied, kick out the old element and re-insert it; if the eviction chain turns into an endless loop, re-select the hash functions
- Application: CMU SILT
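A sketch of the insertion/eviction loop described above (two tables, two hash functions; the table size and `max_kicks` bound are arbitrary choices, and a real implementation would rehash instead of raising):

```python
import hashlib

class CuckooHash:
    """Cuckoo hashing sketch. An occupied bucket's old element is kicked out
    and re-inserted into its alternative table; a too-long eviction chain
    signals a cycle, which is normally resolved by new hash functions."""

    def __init__(self, size=101, max_kicks=50):
        self.size = size
        self.max_kicks = max_kicks
        self.tables = [[None] * size, [None] * size]

    def _h(self, which, key):
        digest = hashlib.md5(f"{which}:{key}".encode()).hexdigest()
        return int(digest, 16) % self.size

    def get(self, key):
        for which in (0, 1):
            slot = self.tables[which][self._h(which, key)]
            if slot is not None and slot[0] == key:
                return slot[1]
        return None

    def put(self, key, value):
        # Update in place if the key is already stored.
        for which in (0, 1):
            i = self._h(which, key)
            if self.tables[which][i] is not None and self.tables[which][i][0] == key:
                self.tables[which][i] = (key, value)
                return
        item, which = (key, value), 0
        for _ in range(self.max_kicks):
            i = self._h(which, item[0])
            if self.tables[which][i] is None:
                self.tables[which][i] = item
                return
            # Kick out the old occupant; it retries in its *other* table.
            item, self.tables[which][i] = self.tables[which][i], item
            which ^= 1
        raise RuntimeError("eviction cycle: re-select the hash functions")
```

Lookups touch at most two buckets, which is the property SILT exploits for memory-efficient indexing.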
Cluster resource management and scheduling
- Scheduling system paradigms
- Centralized
- Shared-state
- Resource scheduling policies
- Dominant Resource Fairness (DRF): always allocate to the user whose dominant-resource share is currently smallest
- Mesos: two-level scheduling
- YARN: supports preemption
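The DRF rule ("always serve the user with the smallest dominant share") can be sketched as follows; the 9-CPU/18-GB scenario is the example commonly used to illustrate DRF, and all names here are mine:

```python
def drf_allocate(capacity, demands, max_rounds=100):
    """Dominant Resource Fairness sketch: repeatedly grant one task to the
    user whose dominant share (largest allocated fraction of any resource)
    is currently smallest. Stops once that user's demand no longer fits;
    a fuller implementation would keep serving the remaining users."""
    n = len(demands)
    used = {r: 0.0 for r in capacity}
    alloc = [{r: 0.0 for r in capacity} for _ in range(n)]
    shares = [0.0] * n          # each user's dominant share
    tasks = [0] * n
    for _ in range(max_rounds):
        u = min(range(n), key=lambda i: shares[i])   # neediest user
        d = demands[u]
        if any(used[r] + d[r] > capacity[r] for r in capacity):
            break
        for r in capacity:
            used[r] += d[r]
            alloc[u][r] += d[r]
        shares[u] = max(alloc[u][r] / capacity[r] for r in capacity)
        tasks[u] += 1
    return tasks
```

With 9 CPUs / 18 GB and per-task demands of (1 CPU, 4 GB) vs (3 CPU, 1 GB), the dominant shares equalize at 3 and 2 tasks respectively.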
Distributed coordination systems
- Chubby lock service
- p93: absent failures, the system generally tries to grant the lease to the original master server again
- KeepAlive mechanism
- The master of each Chubby cell also stores memory snapshots in another data center, avoiding circular dependencies?
- Reads may return stale data; run a sync operation before reading
- Log replay combined with fuzzy snapshots?
- Znode: persistent / ephemeral
Distributed communication
- Serialization and RPC frameworks
- PB and Thrift
- Avro: describes the IDL in JSON?
- Message queues
- Throughput: ZeroMQ (lightweight, no message persistence) > Kafka (at-least-once delivery) > RabbitMQ > ActiveMQ?
- ISR (in-sync replicas)
- Application-layer multicast
- Anti-entropy model (random flooding?): push / pull / push-pull
- p117: if node p learns that node q already has the update, read it as "the confession was rejected"; the rumor-mongering model then says: the more rejections, the quieter the node gets, until it stops confessing entirely. Con: no guarantee that every node eventually receives the update (well, full coverage was never the goal!)
- Application: Cassandra cluster management
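The rumor-mongering behaviour above ("stop confessing after too many rejections") can be simulated in a few lines; parameters and names are illustrative, and the simulation shows why full coverage is not guaranteed:

```python
import random

def rumor_spread(n=100, stop_after_rejects=3, seed=42):
    """Rumor-mongering (push) sketch: every updated-and-still-active node
    pushes the update to random peers, and goes silent after being
    'rejected' (peer already had it) stop_after_rejects times."""
    random.seed(seed)
    updated = {0}          # node 0 starts the rumor
    rejects = {0: 0}       # active spreaders -> rejections so far
    while rejects:
        for node in list(rejects):
            peer = random.randrange(n)
            if peer in updated:
                rejects[node] += 1
                if rejects[node] >= stop_after_rejects:
                    del rejects[node]   # rejected too often: stop confessing
            else:
                updated.add(peer)
                rejects[peer] = 0       # the peer starts spreading too
    return len(updated)    # may be < n: coverage is not guaranteed
```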
Data channels
- Log collection: Chukwa, Scribe
- Data buses: Databus, Wormhole
- Import/export: Sqoop?
Distributed file systems
- Next generation: Colossus?
- HA scheme: ANN/SNN (active and standby NameNodes)
- Haystack: merges many small photo files, reducing metadata attributes
- Storage layouts: row-column hybrid
- Dremel column store: the Name.Language.Code example? (data item, repetition level, definition level)?
- Hybrid storage: RCFile, ORCFile, Parquet
- Erasure codes / MDS*
- Block locality and minimum code distance: for an MDS code with parameters (n, m) they are n and m+1 respectively
- TrueTime: TT.now() returns a time interval guaranteed to contain the instant at which the event actually occurs?
Large-scale batch processing
- A map task partitions its intermediate data into R parts and then notifies the reduce tasks to fetch them (reduce can start only after all map tasks have completed?)
- Reduce-side pull rather than map-side push, to support fine-grained fault tolerance (incisive!! In effect, a synchronous blocking mode becomes asynchronous triggering)
- The reduce task converts intermediate results into <key, list<value>> and hands them to the user-defined reduce function
- Note that the user's reduce here operates on global data (which may involve remote access...)
- Optional combiner: the map side merges values with the same key, reducing network traffic
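The map → combine → shuffle → reduce pipeline above, sketched on a toy word count (function names are mine, not Hadoop's API):

```python
from collections import defaultdict

def map_phase(doc):
    """Map: emit (word, 1) for every word."""
    return [(w, 1) for w in doc.split()]

def combine(pairs):
    """Map-side combiner: pre-aggregate values with the same key,
    cutting the volume of data shuffled over the network."""
    out = defaultdict(int)
    for k, v in pairs:
        out[k] += v
    return list(out.items())

def shuffle(all_pairs):
    """Group intermediate (key, value) pairs into key -> list<value>."""
    groups = defaultdict(list)
    for k, v in all_pairs:
        groups[k].append(v)
    return groups

def reduce_phase(key, values):
    """User-defined reduce: here, a plain sum."""
    return key, sum(values)

docs = ["big data big", "data systems"]
intermediate = [p for d in docs for p in combine(map_phase(d))]
result = dict(reduce_phase(k, vs) for k, vs in shuffle(intermediate).items())
```

Note the combiner only works because summation is associative and commutative; not every reduce function permits one.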
- Computation patterns
- Summation
- Data organization (partition strategy design)
- reduce-side join (note: the keys of the two data sets are the same; what differs is the value type, which the reducer must use to tell the sides apart)
- Still a bit fuzzy here: what does the reducer do if it receives more data than fits in memory?
- map-side join (assuming L is large and R is small, a left join?; R is read entirely into memory)
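A sketch of the map-side (broadcast) join just described, assuming the small relation R fits in each mapper's memory (the left-outer behaviour follows the note; names are illustrative):

```python
def map_side_join(big_rows, small_table, key="key"):
    """Map-side join sketch: the small relation R is broadcast and loaded
    fully into an in-memory index on each mapper; every row of the large
    relation L probes that index. Left outer join: unmatched L rows pair
    with None instead of being dropped. No shuffle phase is needed."""
    r_index = {row[key]: row for row in small_table}
    for l_row in big_rows:
        yield l_row, r_index.get(l_row[key])
```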
- DAG structure description (distributed computing frameworks: how to automatically create the individual compute nodes and their topology connections from a global graph description?)
- FlumeJava and Tez*
Stream computing
- System architecture
- Master-slave: Storm
- DAG topology (the topology here is similar to a DirectShow filter graph)
- Compute nodes
- Data flow: MillWheel (key, value, timestamp); Storm (tuple, ...); S4 [key, attributes]
- Delivery guarantees
- Storm "at-least-once": XOR-based acking
- MillWheel: via state persistence (equivalent to a static local variable in a C function...)
- State persistence
- MillWheel and Samza: weak mode (node B sends its ACK upstream to A only after receiving the ACK from downstream node C)
- If C does not respond in time, B persists its state to shed the dependence on C
- Samza: a compute node can keep its state information as a Kafka message queue
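Storm's XOR trick can be shown in a few lines: every tuple id is XORed into an accumulator once on emit and once on ack, and since x ^ x = 0 the accumulator returns to zero exactly when the whole tuple tree has been acked (class name is mine):

```python
class Acker:
    """Sketch of Storm's acker idea: one integer tracks a whole tuple tree.
    Each tuple id is XORed in when emitted and again when acked; the value
    is 0 iff every emitted tuple has been acked, regardless of order."""

    def __init__(self):
        self.val = 0

    def emit(self, tuple_id):
        self.val ^= tuple_id

    def ack(self, tuple_id):
        self.val ^= tuple_id

    def fully_processed(self):
        return self.val == 0
```

One caveat (true of the real mechanism too): with random 64-bit ids the check can in principle pass spuriously, but the probability is negligible.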
Interactive data analysis
- Stinger improvements: vectorized query engine, CBO
- "Partial DAG execution engine": essentially dynamic optimization of SQL queries
- Data co-sharding: hash records from different tables that share a join key onto the same machine, avoiding subsequent network transfer overhead such as shuffle
- Dremel family
- Dremel: instead of converting user queries into MR jobs, an MPP-like mechanism scans the data stored on disk directly (advanced compilation technology?)
- PowerDrill: loads the data to be analyzed into memory? Skips irrelevant data with clever data structures?
- p262: impalad is written in C++, bypasses the NameNode to read HDFS directly, and query execution uses LLVM native code generation (nb!)
- Although it looks good, it still needs improvement (fault tolerance, UDFs)
- Operators: Scan, HashJoin, HashAggregation, Union, TopN, Exchange
- Presto (similar to Impala)
- Hybrid systems: HadoopDB
Graph databases
- Online query class
- Three-layer architecture
- Facebook TAO
- * Data consistency
- Common graph mining problems
- Single-source shortest path
- Maximum matching of bipartite graphs
- Graph data partitioning
- Edge-cut / vertex-cut: an optimization problem; in practice, random partitioning?
- Computational models
- GAS (Gather-Apply-Scatter)
- Synchronous execution
- MapReduce (iterative computations must repeatedly write intermediate results to the file system, hurting system efficiency)
- Asynchronous execution
- Data consistency: full > edge > vertex; sequential consistency (an additional constraint)
- Graph systems: Pregel; Giraph (map-only, Netty); GraphChi (single machine, parallel sliding windows / PSW); PowerGraph (incremental cache)
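A Pregel-style single-source shortest path, the first of the mining problems above, written as one compute/message/barrier loop per superstep (a single-process sketch, not Pregel's actual API):

```python
def pregel_sssp(graph, source):
    """Pregel-style SSSP. graph: vertex -> list of (neighbour, weight).
    Each superstep: vertices that received a better distance update it
    ("compute") and send messages along out-edges; the loop stops when no
    messages remain, i.e. every vertex has voted to halt."""
    INF = float("inf")
    dist = {v: INF for v in graph}
    messages = {source: 0}
    while messages:
        next_messages = {}
        for v, d in messages.items():
            if d < dist[v]:                # vertex compute: relax distance
                dist[v] = d
                for u, w in graph[v]:      # message passing along out-edges
                    nd = d + w
                    if nd < next_messages.get(u, INF):
                        next_messages[u] = nd
        messages = next_messages           # barrier: next superstep
    return dist
```

Each `while` iteration corresponds to one BSP superstep; in a real Pregel run the vertices would be partitioned across workers and the barrier would be a global synchronization.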
Machine learning: paradigms and architecture
- Distributed machine learning
- Each superstep: distributed computation > global communication > barrier synchronization
- Spark and MLbase
- Parameter Server *
Machine learning: distributed algorithms *
- Logistic regression
- Parallel stochastic gradient descent
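A sketch of parallel SGD for logistic regression via simple model averaging (one local SGD pass per data shard, then average the models; this is only the simplest of several parallelization schemes, and all names are mine):

```python
import math

def sgd_shard(shard, w, lr=0.1, epochs=5):
    """Plain SGD for logistic regression on one worker's data shard.
    shard: list of (feature_vector, label) with label in {0, 1}."""
    w = list(w)
    for _ in range(epochs):
        for x, y in shard:
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            g = p - y                        # gradient of log-loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
    return w

def parallel_sgd(shards, dim):
    """One round of model averaging: each worker trains independently on
    its shard, then the models are averaged. Real systems iterate this,
    or exchange gradients more frequently (e.g. via a parameter server)."""
    models = [sgd_shard(shard, [0.0] * dim) for shard in shards]
    return [sum(ws) / len(models) for ws in zip(*models)]
```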
- Matrix factorization: ALS-WR
- Spectral clustering
- Deep learning: DistBelief
- p371: "snapshot isolation" to resolve write conflicts?
Appendix A Hardware architecture and common performance metrics
Appendix B Big data must-read literature
Notes on Big Data Daily Knowledge: Architecture and Algorithms