Skip to main content

Database Basics

In general, there are two types of databases: relational and non-relational.

ACID (Atomicity, Consistency, Isolation, and Durability)

  • Atomicity ensures a transaction is treated as a single unit (either all operations execute or none do), which prevents data corruption in case of failure with rolling back transaction.
  • Consistency: Maintains database integrity by ensuring transactions only bring the database from one valid state to another, preventing unintended consequences.
  • Isolation: Ensures concurrent transactions do not interfere with each other, making them appear as if executed sequentially, preserving data accuracy.
  • Durability: Guarantees that once a transaction is committed, its changes persist permanently, even in case of system failures.

Relational Databases

Pros:

  • Data Integrity: Enforces strict relationships and constraints, reducing inconsistencies and anomalies.
  • Powerful Querying: Supports complex queries, joins, and aggregations using SQL.
  • ACID Compliance: Ensures reliable transactions with Atomicity, Consistency, Isolation, and Durability.
  • Structured Data: Organizes data in a predefined schema, making retrieval and manipulation efficient.
  • Standardization: SQL is widely used, ensuring compatibility across different systems.

Cons

  • Scalability Limitations: Vertical scaling can be costly and less efficient for handling massive data volumes.
  • Complex Schema Management: Rigid schemas make changes and updates challenging.
  • Performance: Handling large joins and complex queries can slow down response times.
  • Hardware & Maintenance Costs: Requires significant resources for optimization, storage, and administration.
  • Not Ideal for Unstructured Data: Struggles with handling semi-structured or unstructured data like images or JSON.

No-SQL Databases

All NoSQL databases can be divided into following categories:

No-SQL DB TypeDescriptionUse CaseProducts
Key-Value storeStore data as key-value pairs, enabling fast lookups and horizontal scaling. Keys are unique, while values can be simple or complex.Session management, Caching,DynamoDB, Redis, MemCache
Document databaseStore hierarchical data structures, usually store and retrieve documents in formats like XML, JSON, BSONProduct attributes in e-commerce, CMS like blogs or video platforms, etc.MongoDB, Google Cloud Firestore
Graph Databasesuse the graph data structure to store data, where nodes represent entities, and edges show relationships between entitiesdata regulation and privacy, machine learning research, financial services-based applications, etc.Neo4j, OrientDB, InfiniteGraph
Column-Oriented DatabasesStore data in columns instead of rowsaggregation and data analytics queriesHBase, Hypertable, Amazon Redshift, Cassandra
Time-Series DatabasesTime-series databases are specifically designed to store, retrieve, and analyze data points that are indexed by time.sensors, IoT devices, server metrics, financial market data, logs, etc.

Data Replication

  • Synchronous replication - waits for acknowledgments from secondary nodes, then primary node reports success to the client.
  • Asynchronous replication - primary node doesn’t wait for the acknowledgment from the secondary nodes to report success to the client

Data replication models:

  • Primary-secondary replication
  • Multi-leader replication
  • Peer-to-peer or leaderless replication

Primary-Secondary Replication

Use When: systems with heavy reads

"Primary-Secondary Replication" Replication methods: Write ahead log (WAL) - when transaction is initiates, it initially is recorded into transactional log. Which is executed on primary, then secondary instances Logical (row-based) replication - changes are captured at individual rows level, then replicated to the secondary nodes

Multi-Leader Replication

Use When: globally distributed applications

"Multi-Leader Replication" Conflict Resolution Tactics:

  • Avoid Conflicts by redirecting collocating client's writes to the same primary node
  • Last-write-wins nodes using timestamp to identify most recent update. However, in multi node scenario clock synchronization could be challenging
  • Custom application handler is implemented to resolve conflicts

Peer-to-Peer Replication

Such replication uses Quorums to ensure data persistence.

w+r>nw+r > n, where nn - # of nodes, ww - # of update nodes before update is considered successful, rr - # of node to read from.

For example: 2(writes)=2(reads)>3(nodes)2 (writes)=2 (reads) > 3 (nodes)

Data Partitioning (Sharding)

Data partitioning is the process of dividing a large dataset into smaller, more manageable parts, called partitions, to improve performance, scalability, and availability. It helps distribute data across multiple nodes in a database, reducing query load and enabling parallel processing.

Sharding is a type of partitioning, whihc imply data distribution across multiple servers

Vertical Partitioning:

Splits columns into different tables. Use cases: optimize data retrival by splitting columns with large objects/ long text into separate table

Horizontal Partitioning (Sharding):

Divides rows across multiple tables or databases.

Key-range based sharding

Hash-based Sharding

Hash-based sharding uses a hash function to devide rows between shards.

To optimize shards load usually use consistent hashing.

Partitioning Challenges

Secondary Indexes

Thre are 2 approaches:

  • indexes by document - each shard keeps its own indexes, which require multiple shard interaction on read.
  • indexes by the term - we mantain a global index for secondary terms that encompasses data from all partitions. Optimize read opeartions, however a single write in this approach affects multiple partitions

Routing

Clients need to determine which node to contact when making requests, especially after rebalancing (adding/removing shards).

Three approaches exist:

  1. Clients query any node, which forwards the request if needed.
  2. A routing tier directs requests to the correct node.
  3. Clients store partitioning details and contact the appropriate node directly.

ZooKeeper helps by tracking cluster changes, notifying nodes and the routing tier about updates. (used by **HBase, Kafka, etc.)