Database Basics
In general, there are two types of databases: relational and non-relational.
ACID (Atomicity, Consistency, Isolation, and Durability)
- Atomicity ensures a transaction is treated as a single unit (either all operations execute or none do), which prevents data corruption in case of failure with rolling back transaction.
- Consistency: Maintains database integrity by ensuring transactions only bring the database from one valid state to another, preventing unintended consequences.
- Isolation: Ensures concurrent transactions do not interfere with each other, making them appear as if executed sequentially, preserving data accuracy.
- Durability: Guarantees that once a transaction is committed, its changes persist permanently, even in case of system failures.
Relational Databases
Pros:
- Data Integrity: Enforces strict relationships and constraints, reducing inconsistencies and anomalies.
- Powerful Querying: Supports complex queries, joins, and aggregations using SQL.
- ACID Compliance: Ensures reliable transactions with Atomicity, Consistency, Isolation, and Durability.
- Structured Data: Organizes data in a predefined schema, making retrieval and manipulation efficient.
- Standardization: SQL is widely used, ensuring compatibility across different systems.
Cons
- Scalability Limitations: Vertical scaling can be costly and less efficient for handling massive data volumes.
- Complex Schema Management: Rigid schemas make changes and updates challenging.
- Performance: Handling large joins and complex queries can slow down response times.
- Hardware & Maintenance Costs: Requires significant resources for optimization, storage, and administration.
- Not Ideal for Unstructured Data: Struggles with handling semi-structured or unstructured data like images or JSON.
No-SQL Databases
All NoSQL databases can be divided into following categories:
No-SQL DB Type | Description | Use Case | Products |
---|---|---|---|
Key-Value store | Store data as key-value pairs, enabling fast lookups and horizontal scaling. Keys are unique, while values can be simple or complex. | Session management, Caching, | DynamoDB, Redis, MemCache |
Document database | Store hierarchical data structures, usually store and retrieve documents in formats like XML, JSON, BSON | Product attributes in e-commerce, CMS like blogs or video platforms, etc. | MongoDB, Google Cloud Firestore |
Graph Databases | use the graph data structure to store data, where nodes represent entities, and edges show relationships between entities | data regulation and privacy, machine learning research, financial services-based applications, etc. | Neo4j, OrientDB, InfiniteGraph |
Column-Oriented Databases | Store data in columns instead of rows | aggregation and data analytics queries | HBase, Hypertable, Amazon Redshift, Cassandra |
Time-Series Databases | Time-series databases are specifically designed to store, retrieve, and analyze data points that are indexed by time. | sensors, IoT devices, server metrics, financial market data, logs, etc. |
Data Replication
- Synchronous replication - waits for acknowledgments from secondary nodes, then primary node reports success to the client.
- Asynchronous replication - primary node doesn’t wait for the acknowledgment from the secondary nodes to report success to the client
Data replication models:
- Primary-secondary replication
- Multi-leader replication
- Peer-to-peer or leaderless replication
Primary-Secondary Replication
Use When: systems with heavy reads
Replication methods:
Write ahead log (WAL) - when transaction is initiates, it initially is recorded into transactional log. Which is executed on primary, then secondary instances
Logical (row-based) replication - changes are captured at individual rows level, then replicated to the secondary nodes
Multi-Leader Replication
Use When: globally distributed applications
Conflict Resolution Tactics:
- Avoid Conflicts by redirecting collocating client's writes to the same primary node
- Last-write-wins nodes using timestamp to identify most recent update. However, in multi node scenario clock synchronization could be challenging
- Custom application handler is implemented to resolve conflicts
Peer-to-Peer Replication
Such replication uses Quorums to ensure data persistence.
, where - # of nodes, - # of update nodes before update is considered successful, - # of node to read from.
For example:
Data Partitioning (Sharding)
Data partitioning is the process of dividing a large dataset into smaller, more manageable parts, called partitions, to improve performance, scalability, and availability. It helps distribute data across multiple nodes in a database, reducing query load and enabling parallel processing.
Sharding is a type of partitioning, whihc imply data distribution across multiple servers
Vertical Partitioning:
Splits columns into different tables. Use cases: optimize data retrival by splitting columns with large objects/ long text into separate table
Horizontal Partitioning (Sharding):
Divides rows across multiple tables or databases.
Key-range based sharding
Hash-based Sharding
Hash-based sharding uses a hash function to devide rows between shards.
To optimize shards load usually use consistent hashing.
Partitioning Challenges
Secondary Indexes
Thre are 2 approaches:
- indexes by document - each shard keeps its own indexes, which require multiple shard interaction on read.
- indexes by the term - we mantain a global index for secondary terms that encompasses data from all partitions. Optimize read opeartions, however a single write in this approach affects multiple partitions
Routing
Clients need to determine which node to contact when making requests, especially after rebalancing (adding/removing shards).
Three approaches exist:
- Clients query any node, which forwards the request if needed.
- A routing tier directs requests to the correct node.
- Clients store partitioning details and contact the appropriate node directly.
ZooKeeper helps by tracking cluster changes, notifying nodes and the routing tier about updates. (used by **HBase, Kafka, etc.)