Your browser doesn't support the features required by impress.js, so you are presented with a simplified version of this presentation.

For the best experience please use the latest Chrome, Safari or Firefox browser.

MongoDB
        distributed transactions
     top to bottom

Henrik Ingo

HighLoad++ 2019

@h_ingo    henrikingo.github.io/presentations/

Table of Contents

@h_ingo    henrikingo.github.io/presentations/

Circa 2007...

Cassandra sidebar

What about RDBMS?

  • Oracle RAC
    • Not webscale
  • MySQL
    • Replication -> Galera Cluster
    • MySQL Fabric
    • Vitesse?
  • PostgreSQL
    • Logical Replication in 2017
    • Sharding: Citus

WHY?

WHY?

Multi Version Concurrency Control

Write Conflict

> startTransaction()
> find()
{ "_id" : 1, "balance" : 100 }
{ "_id" : 2, "balance" : 50 }
> update({ _id:1}, {$set: {balance:100-10} })
>find()
{ "_id" : 1, "balance" : 90 }
{ "_id" : 2, "balance" : 50 }
> startTransaction()
> find()
{ "_id" : 1, "balance" : 100 }
{ "_id" : 2, "balance" : 50 }
> update({ _id:1}, {$inc: {balance:100-11} })
WriteCommandError({
        "errorLabels" : ["TransientTransactionError"],
        "operationTime" : Timestamp(1572204799, 5),
        "errmsg" : "WriteConflict",
        "code" : 112,   ...
})
> commitTransaction()
> db.accounts.find()
{ "_id" : 1, "balance" : 90 }
{ "_id" : 2, "balance" : 50 }

Write Skew

> db.supportteam.find()
{ "_id" : "Alice", "status" : "oncall" }
{ "_id" : "Bob", "status" : "oncall" }



> startTransaction()
> findOne({_id:"Bob"}).status
oncall
> update({_id:"Alice"}, {$set: {status:"free"} })
> startTransaction()
> findOne({_id:"Alice"}).status
oncall
> update({_id:"Bob"}, {$set: {status:"free"} })
> commitTransaction()
> commitTransaction()
> db.supportteam.find()
{ "_id" : "Alice", "status" : "free" }
{ "_id" : "Bob", "status" : "free" }

Serializable

Avoids any and all anomalies, incl write skew.

Spanner sidebar

  • How do you even do distributed serializable?
    • Classic solution: dependency graph of all transactions
    • Not very distributed
  • Google: We just order trx by wall clock timestamp
    • You can't trust server clock!!
    • Google: We put atomic clocks in every data center
    • Corbett et.al, 2012
  • As of 2017 GCP offers TrueTime as a service

Replication

(Durability)

  • Ongaro, 2014: Raft replication      (Ingo, 2015)
  • MongoDB predates Raft, but is similar:
    • Single primary, many secondaries
    • Initial sync
    • Majority based
    • Transaction log
  • Differences
    • Pull vs push
    • Apply first, then replicate

Distributed Snapshot Isolation

  • Basic concept:
    • Clients can ask to read a specific MVCC snapshot Tn
    • Ask for same Tn from each shard
  • So now we just need:
    • Storage engine must use externally provided Tn
    • Writes Tn and Tn+1 must commit in same order on all shards
    • Cluster-wide clock

Lamport clock

  • Tyulenev et.al, 2019. Lamport, 1978.
  • MongoDB: Monotonically increasing timestamp
  • Gossip: If A communicates with B, pass time
    • Causality: event on A happened before event on B
    • Partial ordering: If A and B don't communicate, no causality
    • Clients pass time too
  • Causal Consistency is the strongest partition-tolerant consistency level (jepsen.io)

Understanding causality

MongoDB tunable consistency

  • Typically choosing sync vs async replication is a global setting. Or not an option at all!
  • MongoDB: Replication internals async, but each client can choose sync or async semantics.
  • Even writes vs reads tunable separately. For example, linearizable read + non-durable write...
  • Schultz et.al, 2019

Two Phase Commit (2PC)

  • Mohan et.al, 1986.
  • PREPARE -> all YES -> COMMIT
    • Prepare is binding: Must be durable
    • Commit must succeed
    • Crash recovery is non-trivial
    • MongoDB: 2x majority commit
  • Used
    • Between shards
    • Replication! (MySQL NDB Cluster)
    • Between different databases (1 Oracle, 1 DB2)

Summary

What did we learn today:

 

MVCC, Snapshot Isolation, Write Skew, Serializeable/SSI, Raft, Lamport, Causal Consistency, Tunable consistency, 2PC.

Bonus slides: NoSQL typology, Cassandra, Dynamo, CAP, LSM, Spanner.

@h_ingo    henrikingo.github.io/presentations/