Designing Data-Intensive Applications

The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

Martin Kleppmann

Publisher: O'Reilly, 2017, 590 pages

ISBN: 978-1-449-37332-0

Keywords: Information Systems, IT Architecture

Last modified: April 1, 2021, 9:05 p.m.

Data is at the center of many challenges in system design today. Difficult issues need to be figured out, such as scalability, consistency, reliability, efficiency, and maintainability. In addition, we have an overwhelming variety of tools, including relational databases, NoSQL datastores, stream or batch processors, and message brokers. What are the right choices for your application? How do you make sense of all these buzzwords?

In this practical and comprehensive guide, author Martin Kleppmann helps you navigate this diverse landscape by examining the pros and cons of various technologies for processing and storing data. Software keeps changing, but the fundamental principles remain the same. With this book, software engineers and architects will learn how to apply those ideas in practice, and how to make full use of data in modern applications.

  • Peer under the hood of the systems you already use, and learn how to use and operate them more effectively
  • Make informed decisions by identifying the strengths and weaknesses of different tools
  • Navigate the trade-offs around consistency, scalability, fault tolerance, and complexity
  • Understand the distributed systems research upon which modern databases are built
  • Peek behind the scenes of major online services, and learn from their architectures

  1. Foundations of Data Systems
    1. Reliable, Scalable, and Maintainable Applications
      • Thinking About Data Systems
      • Reliability
        • Hardware Faults
        • Software Errors
        • Human Errors
        • How Important Is Reliability?
      • Scalability
        • Describing Load
        • Describing Performance
        • Approaches for Coping with Load
      • Maintainability
        • Operability: Making Life Easy for Operations
        • Simplicity: Managing Complexity
        • Evolvability: Making Change Easy
      • Summary
    2. Data Models and Query Languages
      • Relational Model Versus Document Model
        • The Birth of NoSQL
        • The Object-Relational Mismatch
        • Many-to-One and Many-to-Many Relationships
        • Are Document Databases Repeating History?
        • Relational Versus Document Databases Today
      • Query Languages for Data
        • Declarative Queries on the Web
        • MapReduce Querying
      • Graph-Like Data Models
        • Property Graphs
        • The Cypher Query Language
        • Graph Queries in SQL
        • Triple-Stores and SPARQL
        • The Foundation: Datalog
      • Summary
    3. Storage and Retrieval
      • Data Structures That Power Your Database
        • Hash Indexes
        • SSTables and LSM-Trees
        • B-Trees
        • Comparing B-Trees and LSM-Trees
        • Other Indexing Structures
      • Transaction Processing or Analytics?
        • Data Warehousing
        • Stars and Snowflakes: Schemas for Analytics
      • Column-Oriented Storage
        • Column Compression
        • Sort Order in Column Storage
        • Writing to Column-Oriented Storage
        • Aggregation: Data Cubes and Materialized Views
      • Summary
    4. Encoding and Evolution
      • Formats for Encoding Data
        • Language-Specific Formats
        • JSON, XML, and Binary Variants
        • Thrift and Protocol Buffers
        • Avro
        • The Merits of Schemas
      • Modes of Dataflow
        • Dataflow Through Databases
        • Dataflow Through Services: REST and RPC
        • Message-Passing Dataflow
      • Summary
  2. Distributed Data
    1. Replication
      • Leaders and Followers
        • Synchronous Versus Asynchronous Replication
        • Setting Up New Followers
        • Handling Node Outages
        • Implementation of Replication Logs
      • Problems with Replication Lag
        • Reading Your Own Writes
        • Monotonic Reads
        • Consistent Prefix Reads
        • Solutions for Replication Lag
      • Multi-Leader Replication
        • Use Cases for Multi-Leader Replication
        • Handling Write Conflicts
        • Multi-Leader Replication Topologies
      • Leaderless Replication
        • Writing to the Database When a Node Is Down
        • Limitations of Quorum Consistency
        • Sloppy Quorums and Hinted Handoff
        • Detecting Concurrent Writes
      • Summary
    2. Partitioning
      • Partitioning and Replication
      • Partitioning of Key-Value Data
        • Partitioning by Key Range
        • Partitioning by Hash of Key
        • Skewed Workloads and Relieving Hot Spots
      • Partitioning and Secondary Indexes
        • Partitioning Secondary Indexes by Document
        • Partitioning Secondary Indexes by Term
      • Rebalancing Partitions
        • Strategies for Rebalancing
        • Operations: Automatic or Manual Rebalancing
      • Request Routing
        • Parallel Query Execution
      • Summary
    3. Transactions
      • The Slippery Concept of a Transaction
        • The Meaning of ACID
        • Single-Object and Multi-Object Operations
      • Weak Isolation Levels
        • Read Committed
        • Snapshot Isolation and Repeatable Read
        • Preventing Lost Updates
        • Write Skew and Phantoms
      • Serializability
        • Actual Serial Execution
        • Two-Phase Locking (2PL)
        • Serializable Snapshot Isolation (SSI)
      • Summary
    4. The Trouble with Distributed Systems
      • Faults and Partial Failures
        • Cloud Computing and Supercomputing
      • Unreliable Networks
        • Network Faults in Practice
        • Detecting Faults
        • Timeouts and Unbounded Delays
        • Synchronous Versus Asynchronous Networks
      • Unreliable Clocks
        • Monotonic Versus Time-of-Day Clocks
        • Clock Synchronization and Accuracy
        • Relying on Synchronized Clocks
        • Process Pauses
      • Knowledge, Truth, and Lies
        • The Truth Is Defined by the Majority
        • Byzantine Faults
        • System Model and Reality
      • Summary
    5. Consistency and Consensus
      • Consistency Guarantees
      • Linearizability
        • What Makes a System Linearizable?
        • Relying on Linearizability
        • Implementing Linearizable Systems
        • The Cost of Linearizability
      • Ordering Guarantees
        • Ordering and Causality
        • Sequence Number Ordering
        • Total Order Broadcast
      • Distributed Transactions and Consensus
        • Atomic Commit and Two-Phase Commit (2PC)
        • Distributed Transactions in Practice
        • Fault-Tolerant Consensus
        • Membership and Coordination Services
      • Summary
  3. Derived Data
    1. Batch Processing
      • Batch Processing with Unix Tools
        • Simple Log Analysis
        • The Unix Philosophy
      • MapReduce and Distributed Filesystems
        • MapReduce Job Execution
        • Reduce-Side Joins and Grouping
        • Map-Side Joins
        • The Output of Batch Workflows
        • Comparing Hadoop to Distributed Databases
      • Beyond MapReduce
        • Materialization of Intermediate State
        • Graphs and Iterative Processing
        • High-Level APIs and Languages
      • Summary
    2. Stream Processing
      • Transmitting Event Streams
        • Messaging Systems
        • Partitioned Logs
      • Databases and Streams
        • Keeping Systems in Sync
        • Change Data Capture
        • Event Sourcing
        • State, Streams, and Immutability
      • Processing Streams
        • Uses of Stream Processing
        • Reasoning About Time
        • Stream Joins
        • Fault Tolerance
      • Summary
    3. The Future of Data Systems
      • Data Integration
        • Combining Specialized Tools by Deriving Data
        • Batch and Stream Processing
      • Unbundling Databases
        • Composing Data Storage Technologies
        • Designing Applications Around Dataflow
        • Observing Derived State
      • Aiming for Correctness
        • The End-to-End Argument for Databases
        • Enforcing Constraints
        • Timeliness and Integrity
        • Trust, but Verify
      • Doing the Right Thing
        • Predictive Analytics
        • Privacy and Tracking
      • Summary

Reviews

Designing Data-Intensive Applications

Reviewed by Roland Buresund

Very Good ******** (8 out of 10)

Last modified: Jan. 8, 2024, 2:46 p.m.

This is not really about applications per se, but more about the underlying storage and communication systems that an application may utilize, and how they work.

With that said, it makes a very good overview, with the right amount of detail for someone to understand the pros and cons of all the areas discussed, without going into too many specific implementation details. If nothing else, you will get a thorough overview of all the current buzzwords that builders of infrastructure like to bandy around.

I am sure it will be outdated sometime in the future, but for now, this is one of the best books on the subject that you can buy. Be warned: this is only for a technically interested audience that likes to understand the nitty-gritty of the current systems proposed as the back end of many applications.
