## The cost of
# MongoDB ACID transactions
## in theory and practice
Henrik Ingo
Senior Performance Engineer, MongoDB
Highload++ 2018
-----
# Agenda
* MongoDB features:
  * write concern, read preference, read concern
* Durability cost
  * Theory
  * Practice
* Consistency level cost
  * Theory
  * Practice
-----
# Write Concerns (Durability)
![Replica Set with different write concerns](images/write_concerns.svg)
-----
# Write Concern (Durability)
* `w:0`
* `w:1` (async)
* `j:true` (fsync)
* `w:2`
* `w:majority` (sync)
* `w:N`
* `w:majority + j:true`
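For illustration, a minimal PyMongo sketch of choosing durability per collection (the connection string and collection names are assumptions, not from the benchmark):

```python
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

# Connection string is illustrative; point it at your own replica set.
client = MongoClient('mongodb://localhost:27017/?replicaSet=rs0')

# Default durability: w:1, acknowledged by the primary only, no fsync.
fast = client.hl.hltest

# Acknowledged only once a majority has replicated and journaled the write:
safe = client.hl.get_collection(
    'hltest', write_concern=WriteConcern(w='majority', j=True))

fast.insert_one({'n': 1})
safe.insert_one({'n': 1})
```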
-----
# Write Concern Notes
* `w:0` really? Yes, really.
* Default: no durability
* Durability is a client-side option: power to developers!
* Replication beats fsync for durability
* `writeConcernMajorityJournalDefault` - the ugly duckling:
  * server-side setting
  * fsync by default (safe by default, wtf?)
  * makes `w:majority` and `w:2` behave differently
  * Follows the Raft paper
-----
# Consistency (SQL: Isolation)
* readPreference
  * primary (CP-ish)
  * secondary (eventual consistency, AP)
* readConcern
  * local (read uncommitted, AP)
  * majority (read committed)
  * linearizable (serializable)
  * snapshot
* sessions
  * causal consistency
* transactions (ACID)
  * Atomic for multi-document, multi-statement operations
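These, too, are client-side options; a minimal PyMongo sketch (connection string and names are illustrative, not from the benchmark):

```python
from pymongo import MongoClient, ReadPreference
from pymongo.read_concern import ReadConcern

client = MongoClient('mongodb://localhost:27017/?replicaSet=rs0')

# Eventual consistency (AP): read locally visible data from a secondary.
ap = client.hl.get_collection(
    'hltest',
    read_preference=ReadPreference.SECONDARY,
    read_concern=ReadConcern('local'))

# Read committed: return only majority-acknowledged data.
rc = client.hl.get_collection('hltest', read_concern=ReadConcern('majority'))
```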
-----
# Write Concern cost
![Replica Set with different write concerns](images/write_concerns.svg)
-----
# Write Concern cost (theory)
| setting | latency (theory) |
|:--------|-----------------:|
| `w:0` | 0 |
| **`w:1`** | 1 rtt |
| `w:2` | 2 rtt |
| `w:N` | 2 rtt |
| `j:true` | 1 fsync |
| `w:2 j:true` | 2 rtt + 1 fsync |
| `w:majority` *(1)* | 2 rtt + 1 fsync |
| `w:majority j:false` | 2 rtt |
| `w:majority j:true` | 2 rtt + 1 fsync |
*1) See [writeConcernMajorityJournalDefault](https://docs.mongodb.com/manual/reference/write-concern/#acknowledgement-behavior)*
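Since the rows compose additively, expected latencies are easy to estimate; a toy calculation with assumed (not measured) values:

```python
rtt = 0.5    # ms, assumed round trip between nodes
fsync = 1.0  # ms, assumed journal flush time

print('w:1               :', 1 * rtt, 'ms')
print('w:majority j:false:', 2 * rtt, 'ms')
print('w:majority j:true :', 2 * rtt + fsync, 'ms')
```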
-----
# Test cluster 1 (Same AZ+PG)
![Test cluster 1](images/cluster1.svg)
-----
# Simple update (single threaded)
```python
db.hltest.update_one( {'_id': 1}, {'$inc': {'n': 1}} )
```
* Note: several results would differ for a multi-threaded, high-load test. This test gives insight into the basic behavior of the implementation.
* [> Full benchmark code](single_threaded.py)
-----
# Write Concerns
![Graph](images/durability1.png)
-----
# Observations
* `j:true` is 2x slower than `w:1`. This is reasonable!
* `w:majority` is slower than `j:true`!
  * Reason: to ensure data integrity, the oplog must be flushed before new entries become visible.
  * Replication cannot possibly be faster than fsync.
  * Using `j:true` actually makes replication faster (by 25%)
* Note: you still MUST use `w:majority` for durability. `j:true` does not ensure durability across the cluster during failover.
-----
# Test cluster 2 (Multi-AZ)
![Test cluster 2](images/cluster2.svg)
-----
# Write Concerns
![Graph](images/durability2.png)
-----
# Observations
* Largely similar to the single-AZ results
* Use of SSL adds 0.15 ms to `w:0`
* fsync and replication latency dominate, so client2 is not disadvantaged except with `w:1`
-----
# Test cluster 3 (Multi-region)
![Test cluster 3](images/cluster3.svg)
-----
# Write Concerns
![Graph](images/durability3.png)
-----
# Observations
* Geographical latency dominates everything!
* Secondaries have uneven RTTs, so `w:2` and `w:3` differ
* Even `w:0` has its limits: average = 5 ms as TCP buffers fill up.
* Poor 99th percentile for client2 with `j:true w:2`
-----
# Atlas Latency Calculator
![World map with latencies](images/AtlasLatencyMap.png)
-----
# How many
# **Isolation Levels**
# do you know?
-----
# Consistency Levels
![Jepsen map of isolation levels](images/jepsen-consistency-map.svg)
[(C) Kyle Kingsbury jepsen.io/consistency](https://jepsen.io/consistency)
-----
# Consistency Levels & MongoDB
![Jepsen map with MongoDB isolation levels overlaid](images/mongodb-consistency-map.svg)
-----
# Consistency Levels Details
_r:majority_ has no latency overhead, but it burdens the _MVCC_ storage engine, which must keep older snapshots in RAM.
_Linearizable_ consistency is implemented by turning reads into no-op writes with `w:majority`.
_Causal sessions_ are implemented by passing the latest timestamp to clients. If all clients were to (telepathically) exchange these timestamps with each other, the result would be equivalent to _Linearizable_ isolation.
MongoDB transactions require a session. Therefore transactions provide both _Snapshot Isolation_ and _Causal Consistency_.
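A minimal PyMongo sketch of a causal session (`client` and `db` assumed from the benchmark setup; `causal_consistency=True` is in fact the default, shown here for clarity):

```python
with client.start_session(causal_consistency=True) as s:
    db.hltest.update_one({'_id': 1}, {'$inc': {'n': 1}}, session=s)
    # This read is causally ordered after the write above, even if it
    # is routed to a secondary that must first wait to catch up.
    doc = db.hltest.find_one({'_id': 1}, session=s)
```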
-----
# Simple transaction (1 thread)
```python
import random

# Move a random amount between two accounts; the invariant is that
# the total across both documents stays at 200.
amount = random.randint(-100, 100)
result1 = db.hltest.find_one( {'_id': 1} )
db.hltest.update_one( {'_id': 1}, {'$inc': {'n': -amount}} )
result2 = db.hltest.find_one( {'_id': 2} )
db.hltest.update_one( {'_id': 2}, {'$inc': {'n': amount}} )
# aggregate() returns a cursor; take its single result document:
result = next( db.hltest.aggregate( [ { '$group': { '_id': 'foo', 'total': { '$sum': '$n' } } } ] ) )
assert result['total'] == 200
```
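For comparison, the transactional variant wraps the same statements in a session; a sketch assuming `client` and `db` from the earlier setup:

```python
with client.start_session() as s:
    with s.start_transaction():
        db.hltest.update_one({'_id': 1}, {'$inc': {'n': -amount}}, session=s)
        db.hltest.update_one({'_id': 2}, {'$inc': {'n': amount}}, session=s)
        result = next(db.hltest.aggregate(
            [{'$group': {'_id': 'foo', 'total': {'$sum': '$n'}}}], session=s))
    # start_transaction() commits on clean exit and aborts on exception.
assert result['total'] == 200
```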
* Note: several results would differ for a multi-threaded, high-load test. This test gives insight into the basic behavior of the implementation.
* [> Full benchmark code](single_threaded.py)
-----
# Consistency levels (1)
![Graph](images/consistency1.png)
-----
# Observations
* No latency overhead from `session`
* `session` is faster than `linearizable`
* A transaction is a little faster than no transaction!
  * There is only one `w:majority` round trip, at `commitTransaction`
* No errors from the invariant, yet...
* [PYTHON-1668](https://jira.mongodb.org/browse/PYTHON-1668): PyMongo was ignoring default connection settings for transactions; `w` & `r` must be set explicitly in `start_transaction()` (see the sketch below)
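A sketch of the workaround: pass the concerns to `start_transaction()` explicitly instead of relying on the `MongoClient` defaults (`client` and `db` assumed from earlier slides):

```python
from pymongo.read_concern import ReadConcern
from pymongo.write_concern import WriteConcern

with client.start_session() as s:
    # Work around PYTHON-1668: set w & r per transaction.
    with s.start_transaction(read_concern=ReadConcern('snapshot'),
                             write_concern=WriteConcern('majority')):
        db.hltest.update_one({'_id': 1}, {'$inc': {'n': 1}}, session=s)
```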
-----
# Consistency levels (2)
![Graph](images/consistency2.png)
-----
# Observations
* Invariant errors with:
  * `w:1 r:local rp:secondary`
  * `w:m r:m rp:secondary`
* No errors when using a session!
* (`linearizable` cannot read from a secondary.)
-----
# Consistency levels (3)
![Graph](images/consistency3.png)
-----
# Observations
* Same results as before, but more accentuated:
  * A transaction is now fully 2x faster than without one
    * Because there are 2 updates
  * `readPreference:secondary` now clearly benefits client2
-----
# Summary of main findings
* To avoid eventual consistency effects, use `linearizable`, `session`, or `transaction`.
  * `readPreference:secondary` is a benefit in global clusters
  * Note: for full ACID guarantees, use a transaction.
  * `session` is similar to `linearizable` but has no latency overhead.
* `w:majority` was slower than `j:true`.
  * You still must use `w:majority`.
* A transaction was faster than no transaction!
  * (But `readConcern:snapshot` has more overhead on a large and busy server, so overall transactions may or may not be faster.)
-----
# Future
* Multi-shard transactions (4.2)
  * Based on two-phase commit.
  * Need 2x `w:majority` commits on each shard. Better make them faster :-)
* Remove the need for fsync before oplog reads
* Transactions > 16 MB
* A similar test for read-only transactions (test `readConcern:snapshot`)