Notes on "Big Data Day Knowledge: Architecture and Algorithms"

Source: Internet
Author: User
Tags: ack, message queue

  • 1 What we're talking about when we talk about big data
  • 2 Data sharding and routing
  • 3 Data replication and consistency
  • 4 Common algorithms and data structures for big data
  • 5 Cluster resource management and scheduling
  • 6 Distributed coordination systems
  • 7 Distributed communication
  • 8 Data channels
  • 9 Distributed file systems
  • 10 In-memory KV
  • 11 Column databases
  • 12 Large-scale batch processing
  • 13 Stream computing
  • 14 Interactive data analysis
  • 15 Graph databases
  • 16 Machine learning: paradigms and architecture
  • 17 Machine learning: distributed algorithms *
  • 18 Incremental computation
  • 19 Appendix A: hardware architecture and common performance indicators
  • 20 Appendix B: must-read big data literature
What we're talking about when we talk about big data
    1. IBM's 3Vs (volume, velocity, variety) + value
    2. p. 7: social network data used to predict the Dow Jones index by analyzing public sentiment on Twitter.
Data sharding and routing
    1. Membase (Couchbase): "virtual buckets"
    2. DHT consistent hashing
      1. Dynamo: "virtual nodes"
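To make the "virtual node" idea concrete, here is a minimal sketch of a consistent-hash ring in which every physical node is placed on the ring many times. The class name `HashRing`, the `vnodes` parameter, the node names, and the use of MD5 are illustrative assumptions, not details taken from Dynamo or the book.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a ring position (MD5 is used only to spread keys)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self._points = []  # sorted ring positions of all virtual nodes
        self._owner = {}   # ring position -> physical node
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each physical node owns `vnodes` positions on the ring.
        for i in range(self.vnodes):
            p = _hash(f"{node}#{i}")
            self._owner[p] = node
            bisect.insort(self._points, p)

    def remove(self, node):
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key):
        # First virtual node clockwise from the key's position, wrapping around.
        i = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[i]]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}
ring.remove("node-b")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

In this sketch only the keys that lived on the removed node change owner, and the virtual nodes spread them over the remaining machines instead of dumping them on a single neighbor.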
Data replication and consistency
    1. CAP: CP or AP?
    2. ACID
    3. BASE
      1. Soft state = an intermediate state between stateful and stateless?
    4. Classification of consistency models
      1. Strong: all processes see the latest value immediately after a write?
      2. Eventual: "inconsistency window" (can the length of this window be bounded? Otherwise the guarantee means little.)
        1. Monotonic reads: once a process has read version V of a data item, none of its subsequent reads can return a version older than V (how exactly is "subsequent" defined?)
        2. Monotonic writes: a process's multiple writes are guaranteed to be serialized?
      3. Causal
        1. "Read your writes"
          1. Session
    5. Replica update policies
    6. Consistency protocols
      1. 2PC: coordinator/participants
        1. 3PC: addresses 2PC's long blocking by splitting the commit phase into pre-commit and commit
      2. Vector clocks
        1. Used to determine whether events are causally related
      3. RWN (consistency condition: R + W > N)
      4. Paxos
        1. Keeps replicated log data consistent?
      5. Raft
        1. 3 sub-problems: leader election, log replication, safety
        2. Term?
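As an illustration of two items in the list above, the sketch below shows how vector clocks can decide whether two events are causally related, and the R + W > N overlap condition. Clocks are modeled as plain dicts from node name to counter; the function names `vc_increment`, `vc_merge`, `vc_compare`, and `quorum_overlaps` are made up for this example.

```python
def vc_increment(clock, node):
    """Return a copy of `clock` with `node`'s counter advanced by one."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new


def vc_merge(a, b):
    """Element-wise maximum, used when a replica reconciles two versions."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


def vc_compare(a, b):
    """Classify a relative to b: 'equal', 'before', 'after', or 'concurrent'."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a causally precedes b
    if b_le_a:
        return "after"       # b causally precedes a
    return "concurrent"      # no causal relation: a conflict to resolve


def quorum_overlaps(n, r, w):
    """With N replicas, every read set intersects every write set iff R + W > N."""
    return r + w > n


v1 = vc_increment({}, "A")    # write handled by node A
v2 = vc_increment(v1, "B")    # later update via node B
v3 = vc_increment(v1, "C")    # independent update via node C
merged = vc_merge(v2, v3)     # reconciliation of the two branches
```

Here `v1` precedes both `v2` and `v3`, while `v2` and `v3` are concurrent, which is the case a system has to detect and reconcile.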
Common algorithms and data structures for big data
    1. Bloom filter
      1. Counting Bloom filter: each basic unit is represented by multiple bits (a counter) instead of a single bit
    2. Skip list: randomized search/insert?
    3. LSM tree: converts a large number of random writes into batched sequential writes
      1. e.g. LevelDB
    4. Merkle hash tree (BitTorrent?)
    5. LZSS
      1. Snappy: match length >= 4
    6. Cuckoo hashing
      1. Uses 2 hash functions; if both candidate buckets are occupied, an existing element is kicked out and re-inserted into its alternate bucket. If the evictions loop endlessly? Re-select the hash functions
      2. Application: CMU SILT
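The cuckoo hashing rule described above can be sketched as follows. This is a minimal illustration, not the SILT implementation: the class name `CuckooHashSet`, the `max_kicks` bound used to detect an eviction loop, and the choice to grow the table while re-selecting the hash functions are all assumptions made for the example.

```python
import random


class CuckooHashSet:
    """Set of hashable, non-None keys; each key lives in one of two candidate slots."""

    def __init__(self, capacity=11, max_kicks=32):
        self.capacity = capacity
        self.max_kicks = max_kicks
        self.slots = [None] * capacity
        self._reseed()

    def _reseed(self):
        # Two random seeds stand in for two independent hash functions.
        self.seeds = (random.getrandbits(32), random.getrandbits(32))

    def _slot(self, which, key):
        return hash((self.seeds[which], key)) % self.capacity

    def __contains__(self, key):
        return any(self.slots[self._slot(w, key)] == key for w in (0, 1))

    def add(self, key):
        if key in self:
            return
        for w in (0, 1):  # use an empty candidate slot if there is one
            i = self._slot(w, key)
            if self.slots[i] is None:
                self.slots[i] = key
                return
        which = 0
        for _ in range(self.max_kicks):
            i = self._slot(which, key)
            # Place the key and pick up whatever was stored there.
            key, self.slots[i] = self.slots[i], key
            if key is None:
                return
            # The evicted key must move to its other candidate slot.
            which = 1 if self._slot(0, key) == i else 0
        self._rehash(key)  # too many evictions: treat it as a loop

    def _rehash(self, pending):
        old = [k for k in self.slots if k is not None] + [pending]
        self.capacity = self.capacity * 2 + 1  # assumption: also grow the table
        self._reseed()                         # re-select the hash functions
        self.slots = [None] * self.capacity
        for k in old:
            self.add(k)


table = CuckooHashSet(capacity=5)
for k in range(200):
    table.add(k)
```

A lookup inspects at most two slots regardless of table size, which is the property that makes the scheme attractive for memory-efficient indexes.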
Cluster resource management and scheduling
    1. Scheduling system paradigms
      1. Centralized
      2. Two-level
      3. Shared-state
    2. Resource scheduling policies
      1. FIFO
      2. Fair
      3. Capacity
      4. Delay
      5. Dominant Resource Fairness (DRF): maximizes the smallest dominant-resource share currently allocated to any user
    3. Mesos: two-level scheduling
    4. YARN: supports preemption
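The DRF policy in the list above can be illustrated with a small progressive-filling loop: repeatedly give one more task to the user whose dominant share is currently lowest. The function name `drf_allocate` and the input shapes are hypothetical, and dropping a user once its next task no longer fits is a simplification chosen for this sketch. The numbers are the commonly cited example of 9 CPUs and 18 GB shared by two users.

```python
def drf_allocate(capacity, demands):
    """capacity: {resource: total}; demands: {user: {resource: per-task need}}.

    Returns the number of tasks launched per user.
    """
    used = {r: 0 for r in capacity}
    tasks = {u: 0 for u in demands}

    def dominant_share(user):
        # Largest fraction of any single resource held by this user.
        return max(tasks[user] * demands[user][r] / capacity[r] for r in capacity)

    active = set(demands)
    while active:
        # Pick the user with the lowest dominant share (ties broken by name).
        user = min(sorted(active), key=dominant_share)
        need = demands[user]
        if all(used[r] + need[r] <= capacity[r] for r in capacity):
            for r in capacity:
                used[r] += need[r]
            tasks[user] += 1
        else:
            active.remove(user)  # simplification: this user cannot grow further
    return tasks


allocation = drf_allocate(
    {"cpu": 9, "mem": 18},
    {"A": {"cpu": 1, "mem": 4},   # memory is A's dominant resource
     "B": {"cpu": 3, "mem": 1}},  # CPU is B's dominant resource
)
```

With these inputs user A ends up with 3 tasks and user B with 2, so both hold a dominant share of 2/3.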
Distributed coordination systems
    1. Chubby lock service
      1. p. 93: absent failures, the system generally tries to renew the lease for the existing master
      2. KeepAlive mechanism
      3. Each Chubby cell's master stores snapshots of its in-memory state in another data center, to avoid circular dependencies?
    2. ZooKeeper
      1. Reads may return stale data; issue a sync operation before the read
      2. Log replay combined with fuzzy snapshots?
      3. Znode: persistent/ephemeral
Distributed communication
    1. Serialization and RPC frameworks
      1. Protocol Buffers (PB) and Thrift
      2. Avro: IDL described in JSON?
    2. Message queues
      1. ZeroMQ (lightweight, no message persistence) > Kafka (at-least-once) > RabbitMQ > ActiveMQ?
        1. ISR (in-sync replicas)
    3. Application-layer multicast
      1. Gossip
        1. Anti-entropy model (random flooding?): push/pull/push-pull
        2. p. 117: if node P telling node Q about an update that Q already has is read as "confession rejected", then the rumor-mongering model can be understood as: the more rejections a node collects, the quieter it gets, until it stops confessing altogether. Cons: no guarantee that every node eventually receives the update (well, a maximum matching is not the goal here!)
        3. Application: Cassandra cluster management
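The push, pull, and push-pull variants of the anti-entropy model mentioned above can be compared with a toy simulation. This is only a sketch under simplifying assumptions: rounds are synchronous, every node contacts one uniformly random peer per round, and a single update starts at node 0. The function name `anti_entropy_rounds` and its parameters are invented for the example.

```python
import random


def anti_entropy_rounds(n, mode="push-pull", seed=0, max_rounds=1000):
    """Rounds until all n nodes (n >= 2) hold the update, or None if the cap is hit."""
    rng = random.Random(seed)
    has_update = [False] * n
    has_update[0] = True  # the update originates at node 0
    for rounds in range(1, max_rounds + 1):
        snapshot = list(has_update)  # state as of the start of the round
        for p in range(n):
            q = rng.randrange(n - 1)
            if q >= p:
                q += 1  # uniformly random peer other than p
            if mode in ("push", "push-pull") and snapshot[p]:
                has_update[q] = True  # p pushes its update to q
            if mode in ("pull", "push-pull") and snapshot[q]:
                has_update[p] = True  # p pulls the update from q
        if all(has_update):
            return rounds
    return None


rounds_needed = {m: anti_entropy_rounds(64, mode=m, seed=1)
                 for m in ("push", "pull", "push-pull")}
```

Unlike rumor mongering, nodes here never stop exchanging state, which is why anti-entropy eventually reaches every node at the price of continued traffic.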
Data channels
    1. Log collection: Chukwa, Scribe
    2. Data bus: Databus, Wormhole
    3. Import/export: Sqoop?
Distributed file systems
    1. GFS
      1. Next generation: Colossus?
    2. HDFS
      1. HA scheme: ANN/SNN (active/standby NameNode)
    3. Haystack: merges small photo files and reduces metadata attributes
    4. Storage layouts: row/column hybrid
      1. Dremel columnar storage: Name.Language.Code? (value, repetition level, definition level)?
      2. Hybrid storage: RCFile, ORCFile, Parquet
    5. Erasure codes / MDS*
      1. Reed-Solomon
      2. LRC
        1. Block locality and minimum code distance: for an MDS code with configuration (n, m), these are >= n and m+1 respectively
In-memory KV
    1. RAMCloud
    2. Redis
    3. Membase
Column databases
    1. BigTable
    2. PNUTS
    3. Megastore
    4. Spanner
      1. () returns a time interval guaranteed to contain the time at which the event actually occurred?
Large-scale batch processing
    1. MapReduce
      1. Each map task partitions its intermediate data into R parts and then notifies the reduce tasks to fetch them (can reduce start only after all map tasks have finished?)
        1. Reduce-side pull rather than map-side push, to support fine-grained fault tolerance (insightful! In effect, synchronous blocking becomes asynchronous triggering)
        2. The reduce task groups the intermediate results by key into <key, list<value>> and passes them to the user-defined reduce function
          1. Note that the user's reduce operates on global data here (which may involve remote access ...)
      2. Optional combiner: merges values with the same key on the map side, reducing network traffic
      3. Computation patterns
        1. Summation
        2. Filtering
        3. Data organization (partitioning strategy design)
        4. Join
          1. Reduce-side (note that the two data sets share the same key; what differs is the value type, so the reducer has to tell them apart)
            1. Still a bit confused here: what if the reducer receives more data than fits in memory?
          2. Map-side (assume L is large and R is small; left join?): R is read entirely into memory
    2. DAG
      1. Dryad
        1. Graph structure description (distributed computing framework: how are the individual compute nodes and topology connections created automatically from a global graph description?)
      2. FlumeJava and Tez*
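The two join patterns noted under MapReduce can be sketched in a single process. This is a toy model, not Hadoop code: `shuffle` stands in for the framework's group-by-key step, the "U"/"O" tags are how the reducer tells the two value types apart, and the sample tables, function names, and inner-join semantics are assumptions made for the illustration.

```python
from collections import defaultdict

users = [(1, "ann"), (2, "bob")]                        # small table R
orders = [(100, 1, 9.5), (101, 1, 3.0), (102, 3, 7.0)]  # large table L


def map_users(record):
    user_id, name = record
    yield user_id, ("U", name)  # tag the value with its source table


def map_orders(record):
    order_id, user_id, amount = record
    yield user_id, ("O", (order_id, amount))


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(key, values):
    # Same key, different value types: split them by tag, then pair them up.
    names = [v for tag, v in values if tag == "U"]
    placed = [v for tag, v in values if tag == "O"]
    for name in names:
        for order_id, amount in placed:
            yield key, name, order_id, amount


def map_side_join(large, small):
    lookup = dict(small)  # the small table is held entirely in memory
    for order_id, user_id, amount in large:
        if user_id in lookup:
            yield user_id, lookup[user_id], order_id, amount


pairs = [kv for u in users for kv in map_users(u)]
pairs += [kv for o in orders for kv in map_orders(o)]
joined = sorted(row for key, values in shuffle(pairs).items()
                for row in reduce_join(key, values))
joined_map_side = list(map_side_join(orders, users))
```

Note that `reduce_join` buffers every value for a key in memory, which is exactly the concern raised in the notes; the map-side variant avoids the shuffle entirely but only works while the small table fits in memory.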
Stream computing
    1. System architecture
      1. Master-slave: Storm
      2. P2P: S4
      3. Samza
    2. DAG topology (the topology here is similar to a DirectShow filter graph)
      1. Compute nodes
      2. Data streams: MillWheel (Key, Value, Timestamp); Storm (Tuple, ...); S4 [Key, Attributes]
    3. Delivery guarantees
      1. Storm "delivered once": XOR
      2. MillWheel: via state persistence (equivalent to a static local variable in a C function ...)
    4. State persistence
      1. MillWheel and Samza: weak mode (node B sends an ACK to upstream node A only after receiving an ACK from downstream node C)
        1. If C does not respond in time, B persists its state to get rid of the dependence on C
        2. Samza: a compute node can write its state information to Kafka as a message queue
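The XOR trick behind Storm's delivery guarantee can be sketched as follows: the tracker keeps one running value per root tuple, XORs in every tuple ID when it is created and again when it is acked, and the value returns to zero exactly when every tuple in the tree has been acked. The class `AckTracker` and its method names are hypothetical and do not reproduce Storm's API; the small fixed IDs are for readability, whereas a real system would use random 64-bit IDs.

```python
class AckTracker:
    """Toy tracker holding one running XOR value per root tuple."""

    def __init__(self):
        self.ack_val = {}  # root id -> XOR of all created-but-unacked tuple ids

    def emit_root(self, root_id, tuple_id):
        self.ack_val[root_id] = tuple_id

    def ack(self, root_id, acked_id, new_ids=()):
        """Ack one tuple and register the tuples emitted while processing it.

        Returns True once the whole tuple tree has been processed.
        """
        value = self.ack_val[root_id] ^ acked_id
        for tuple_id in new_ids:
            value ^= tuple_id
        self.ack_val[root_id] = value
        return value == 0


t1, t2, t3 = 0x1111, 0x2222, 0x4444  # illustrative ids
tracker = AckTracker()
tracker.emit_root("root", t1)
results = [
    tracker.ack("root", t1, new_ids=(t2, t3)),  # first bolt splits t1 into t2, t3
    tracker.ack("root", t2),                    # t2 fully processed
    tracker.ack("root", t3),                    # t3 fully processed
]
```

The appeal of the design is constant memory per root tuple, no matter how large the tuple tree grows; a root whose value never reaches zero before a timeout is treated as failed and replayed.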
Interactive data analysis
    1. Hive
      1. Stinger improvements: vectorized query engine, CBO
    2. Shark
      1. "Partial DAG Execution", which is essentially dynamic optimization of SQL queries
      2. Data co-partitioning: records from different tables that share the same join-column key are hashed to the same machine, so later joins avoid network overhead such as shuffle
    3. Dremel family
      1. Dremel: instead of converting user queries into MR tasks, an MPP-like mechanism scans the data stored on disk directly (advanced compilation techniques?)
      2. PowerDrill: loads the data to be analyzed into memory? Skips irrelevant data with a clever data structure?
      3. Impala
        1. p. 262: impalad is written in C++, bypasses the NameNode to read HDFS directly, and uses LLVM to generate native code for query execution (impressive!)
        2. Although it looks good, it still needs improvement (fault tolerance, UDFs)
        3. Operators: Scan, HashJoin, HashAggregation, Union, TopN, Exchange
      4. Presto (similar to Impala)
    4. Hybrid systems: HadoopDB
Graph databases
    1. Online query systems
      1. Three-layer structure
      2. Facebook TAO
        1. * Data consistency
    2. Common graph mining problems
      1. PageRank
      2. Single-source shortest path
      3. Maximum matching in bipartite graphs
    3. Graph data partitioning
      1. Edge-cut / vertex-cut: an optimization problem, but in practice just random partitioning?
    4. Computational models
      1. Vertex-centric
      2. GAS (Gather-Apply-Scatter)
      3. Synchronous execution
        1. BSP
        2. MapReduce (iterative computation has to write intermediate results to the file system many times, hurting efficiency)
      4. Asynchronous execution
        1. Data consistency: full > edge > vertex; sequential consistency (additional constraint)
    5. Graph systems: Pregel, Giraph (map-only, Netty), GraphChi (single machine, parallel sliding windows, PSW), PowerGraph (incremental caching)
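The vertex-centric, synchronous (BSP) model listed above can be illustrated with single-source shortest paths run in supersteps: each active vertex reads its incoming messages, updates its own value, and sends messages that only become visible after the barrier. This is a single-process sketch in the spirit of Pregel, not its API; the function name `bsp_sssp`, the adjacency-list format, and the sample graph are assumptions, and edge weights are assumed non-negative so that the loop terminates.

```python
import math


def bsp_sssp(edges, source):
    """edges: {vertex: [(neighbor, weight), ...]}; returns (distances, supersteps)."""
    vertices = set(edges) | {v for outs in edges.values() for v, _ in outs}
    dist = {v: math.inf for v in vertices}
    inbox = {source: [0]}  # messages delivered at the start of the superstep
    supersteps = 0
    while inbox:  # computation halts when no messages are in flight
        outbox = {}
        for vertex, messages in inbox.items():  # each active vertex works independently
            best = min(messages)
            if best < dist[vertex]:
                dist[vertex] = best
                for neighbor, weight in edges.get(vertex, []):
                    outbox.setdefault(neighbor, []).append(best + weight)
        inbox = outbox  # barrier: messages are only visible in the next superstep
        supersteps += 1
    return dist, supersteps


graph = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 2), ("d", 6)],
    "c": [("d", 1)],
    "d": [],
}
dist, supersteps = bsp_sssp(graph, "a")
```

Because all state lives in the vertices and messages between supersteps, nothing has to be written to a file system per iteration, which is the efficiency argument the notes make against iterating with MapReduce.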
Machine learning: paradigms and architecture
    1. Distributed machine learning
      1. MapReduce
      2. BSP
        1. Each superstep: distributed computation > global communication > barrier synchronization
      3. SSP?
    2. Spark and MLbase
    3. Parameter server *
Machine learning: distributed algorithms *
    1. Logistic regression
    2. Parallel stochastic gradient descent
    3. Matrix factorization: ALS-WR
    4. LambdaMART
    5. Spectral clustering
    6. Deep learning: DistBelief
Incremental computation
    1. Percolator
      1. p. 371: "snapshot isolation" to resolve write conflicts?
    2. Kineograph
    3. DryadInc
Appendix A: hardware architecture and common performance indicators
Appendix B: must-read big data literature
