Introduction to NoSQL
NoSQL stands for "Not Only SQL."
In modern computing systems, vast amounts of data are generated daily across the internet.
A significant portion of this data is handled by relational database management systems (RDBMS). The relational model, proposed by E. F. Codd in his 1970 paper "A Relational Model of Data for Large Shared Data Banks," simplified data modeling and application programming.
Practical applications have proven that the relational model is highly suitable for client-server programming, delivering far more benefits than anticipated. Today, it is the dominant technology for structured data storage in web and business applications.
NoSQL is a movement in database technology. The term first appeared in 1998, and the movement gained significant momentum in 2009. Advocates of NoSQL promote non-relational data storage as an alternative to the widespread use of relational databases.
Relational Databases Follow ACID Rules
A transaction, analogous to a real-world transaction, has four characteristics:
1. A (Atomicity)
Atomicity is straightforward; all operations within a transaction must either complete entirely or not at all. A transaction is successful only if all operations succeed; any failure results in a rollback.
For example, in a bank transfer from account A to B, two steps are involved: 1) Withdraw 100 from A; 2) Deposit 100 to B. Both steps must complete together, or neither should, to avoid discrepancies like missing funds.
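The transfer described above can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the `accounts` table, the `transfer` and `balances` helpers, and the "balance may not go negative" rule are hypothetical and do not come from any particular banking system.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from `src` to `dst` as a single transaction."""
    try:
        # `with conn` commits if the block succeeds and rolls back
        # every statement in it if an exception escapes.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            row = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if row is None or row[0] < 0:
                raise ValueError("insufficient funds or unknown source")
            cur = conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            if cur.rowcount != 1:
                raise ValueError("unknown destination account")
        return True
    except ValueError:
        return False

def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 0)])
conn.commit()

ok = transfer(conn, "A", "B", 100)     # both steps succeed and are committed
failed = transfer(conn, "A", "B", 50)  # A would go negative, so nothing changes
print(ok, failed, balances(conn))
```

The second call withdraws from A before the check fails, yet the rollback undoes that withdrawal, so no money goes missing.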
2. C (Consistency)
Consistency ensures the database remains in a consistent state; the operation of a transaction does not violate existing consistency constraints.
For instance, with an integrity constraint a+b=10, if a transaction changes a, it must also change b to maintain a+b=10, or the transaction fails.
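As a rough sketch, the a+b=10 constraint can be expressed as a `CHECK` constraint in SQLite. The `pair` table and the `try_update` helper are made up for this example. SQLite evaluates `CHECK` constraints per statement, so in this sketch both columns have to change in the same `UPDATE`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pair (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER, "
    "CHECK (a + b = 10))"
)
conn.execute("INSERT INTO pair VALUES (1, 4, 6)")
conn.commit()

def try_update(conn, sql):
    """Run one statement as a transaction; report whether it was accepted."""
    try:
        with conn:  # rolls back if the constraint is violated
            conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False

# Changing only a would leave a + b = 13, so the database rejects it.
changed_a_only = try_update(conn, "UPDATE pair SET a = 7 WHERE id = 1")
# Changing a and b together keeps a + b = 10, so it is accepted.
changed_both = try_update(conn, "UPDATE pair SET a = 7, b = 3 WHERE id = 1")

state = conn.execute("SELECT a, b FROM pair WHERE id = 1").fetchone()
print(changed_a_only, changed_both, state)
```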
3. I (Isolation)
Isolation means concurrent transactions do not interfere with each other. If a transaction reads data that another transaction is modifying, it sees those changes only after the other transaction has committed.
For example, during a transfer from A to B, if B checks their account before the transaction completes, they won't see the new deposit.
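The following sketch reproduces that situation with two SQLite connections to the same temporary database file. The `writer` and `reader` names, the table layout, and the `read_b` helper are assumptions made for illustration. It relies on SQLite's default rollback-journal mode, in which uncommitted changes stay private to the writing connection.

```python
import os
import sqlite3
import tempfile

def read_b(conn):
    """Return B's balance as seen by this connection."""
    # fetchall() exhausts the statement, so the reader holds no lock afterwards.
    rows = conn.execute(
        "SELECT balance FROM accounts WHERE name = 'B'"
    ).fetchall()
    return rows[0][0]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bank.db")
    writer = sqlite3.connect(path)  # performs the transfer
    reader = sqlite3.connect(path)  # B checking the account

    writer.execute(
        "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)"
    )
    writer.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 0)]
    )
    writer.commit()

    # The transfer has started but is not committed yet.
    writer.execute("UPDATE accounts SET balance = balance - 100 WHERE name = 'A'")
    writer.execute("UPDATE accounts SET balance = balance + 100 WHERE name = 'B'")
    seen_before_commit = read_b(reader)  # still the old balance

    writer.commit()
    seen_after_commit = read_b(reader)   # the deposit is now visible

    writer.close()
    reader.close()

print(seen_before_commit, seen_after_commit)
```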
4. D (Durability)
Durability ensures that once a transaction is committed, its changes are permanently saved in the database, even in the event of a system failure.
Distributed Systems
A distributed system consists of multiple computers and software components connected via a computer network (local or wide area).
Because they are built on networks, distributed systems rely on software to provide high cohesion and transparency. The distinction between a network and a distributed system therefore lies in the higher-level software, particularly the operating system, rather than in the hardware.
Distributed systems can operate on various platforms, including PCs, workstations, LANs, and WANs.
Advantages of Distributed Computing
Reliability (Fault Tolerance):
A significant advantage of distributed systems is reliability; a single server failure does not affect the others.
Scalability:
Distributed systems can scale by adding more machines as needed.
Resource Sharing:
Sharing data is essential for applications like banking and reservation systems.
Flexibility:
These systems are highly flexible: new services are easy to install, implement, and debug.
Faster Speed:
With the computing power of multiple machines, distributed systems offer faster processing speeds.
Open System:
Being an open system allows local or remote access to services.
Higher Performance:
Compared to centralized computer networks, clusters offer higher performance and better cost efficiency.
Disadvantages of Distributed Computing
Troubleshooting:
Faults are harder to diagnose and troubleshoot.
Software:
Limited software support is a major drawback.
Network:
The network infrastructure can suffer from transmission problems, high load, and data loss.
Security:
The open nature of distributed systems poses risks related to data security and sharing.
What is NoSQL?
NoSQL refers to non-relational databases. Sometimes expanded as "Not Only SQL," it covers a category of database management systems that differ from traditional relational databases.
NoSQL is used for storing extremely large datasets (e.g., Google or Facebook collecting trillions of bits of data daily). These data stores do not require a fixed schema and can scale out horizontally with minimal effort.
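To make "no fixed schema" and "scale out horizontally" more concrete, here is a minimal in-memory sketch. The `ShardedStore` class and its methods are hypothetical and not the API of any real NoSQL product. Each shard is a plain dict standing in for a separate machine, and a key's hash decides which shard holds it.

```python
import hashlib

class ShardedStore:
    """Toy key-value store that spreads keys over several shards."""

    def __init__(self, shard_count):
        # Each dict stands in for one machine in the cluster.
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        # A stable hash of the key selects the shard.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.shards)
        return self.shards[index]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key, default=None):
        return self._shard_for(key).get(key, default)

store = ShardedStore(shard_count=4)
# The two records have different fields; no schema is declared anywhere.
store.put("user:1", {"name": "Ada"})
store.put("user:2", {"name": "Lin", "tags": ["admin"], "city": "Oslo"})

print(store.get("user:2")["city"])
```

With simple modulo placement, changing the number of shards moves most keys to a different shard. Real systems often use techniques such as consistent hashing to limit that movement.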
Why Use NoSQL?
Today, accessing and extracting data through third-party platforms (e.g., Google, Facebook) is straightforward. User data such as personal information, social networks, locations, user-generated content, and logs has multiplied. SQL databases are no longer well suited to analyzing data at this scale; NoSQL databases are better equipped to handle these vast datasets.
Examples
- Social networking sites
- Wikipedia pages
RDBMS vs NoSQL
RDBMS
NoSQL
Brief History of NoSQL
The term NoSQL was first used in 1998 by Carlo Strozzi for a lightweight, open-source relational database that did not offer SQL functionalities.
In 2009, Johan Oskarsson of Last.fm initiated a discussion about distributed open-source databases, and Eric Evans from Rackspace reintroduced the term NoSQL, which at that time primarily referred to non-relational, distributed, and non-ACID database design patterns.
The "no:sql(east)" conference held in Atlanta in 2009 marked a milestone, with the slogan "select fun, profit from real_world where relational=false;". Therefore, the most common interpretation of NoSQL is "non-relational," emphasizing the advantages of Key-Value Stores and document databases, rather than simply opposing RDBMS.
CAP Theorem
In computer science, the CAP theorem, also known as Brewer's theorem, states that it is impossible for a distributed computing system to simultaneously provide all three of the following guarantees:
- Consistency (all nodes see the same data at the same time)
- Availability (every request receives a response about whether it succeeded or failed)
- Partition Tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)
The core of the CAP theorem is that a distributed system cannot simultaneously and effectively satisfy consistency, availability, and partition tolerance; it can only do well in two of these three areas at the same time.
Based on the CAP theorem, NoSQL databases fall into three categories according to the two guarantees they satisfy:
- CA - Single-site clusters that satisfy consistency and availability, usually with limited scalability.
- CP - Systems that satisfy consistency and partition tolerance, usually at some cost to performance.
- AP - Systems that satisfy availability and partition tolerance, usually with weaker consistency requirements.
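The difference between CP and AP behavior can be sketched with a toy two-replica model. Everything here is a simplifying assumption for illustration: the `TwoNodeCluster` class, its `partitioned` flag, and the `run` helper are invented, and a real CP system may also refuse reads on the isolated side of a partition.

```python
class TwoNodeCluster:
    """Toy model: two replicas joined by a link that can be cut."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = [{}, {}]
        self.partitioned = False

    def write(self, replica, key, value):
        if not self.partitioned:
            # Healthy network: every replica receives the write.
            for data in self.replicas:
                data[key] = value
            return True
        if self.mode == "CP":
            # Prefer consistency: refuse a write the peer cannot see.
            return False
        # Prefer availability: accept locally and let replicas diverge.
        self.replicas[replica][key] = value
        return True

    def read(self, replica, key):
        return self.replicas[replica].get(key)

def run(mode):
    cluster = TwoNodeCluster(mode)
    cluster.write(0, "x", 1)
    cluster.partitioned = True           # the link between replicas is cut
    accepted = cluster.write(0, "x", 2)  # a client writes to replica 0
    return accepted, cluster.read(0, "x"), cluster.read(1, "x")

cp_result = run("CP")
ap_result = run("AP")
print(cp_result, ap_result)
```

In CP mode the write is rejected and both replicas still agree. In AP mode the write succeeds, but the two replicas return different values until the partition heals.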
Advantages/Disadvantages of NoSQL
Advantages:
- High scalability
- Distributed computing
- Low cost
- Flexible architecture, semi-structured data
- No complex relationships
Disadvantages:
- Lack of standardization
- Limited query functionality (so far)
- Eventual consistency is not intuitive for programming
BASE
BASE stands for Basically Available, Soft state, Eventually consistent. The term was defined by Eric Brewer.
BASE is the set of relaxed availability and consistency requirements that NoSQL databases usually adopt:
- Basically Available - The system stays available for most requests, even during partial failures.
- Soft state - Also called a flexible transaction. "Soft state" can be understood as "connectionless," while "hard state" is "connection-oriented."
- Eventual Consistency - Replicas may disagree temporarily but converge to the same state over time. Reaching a consistent state is also the ultimate goal of ACID.
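Eventual consistency can be illustrated with two replicas that accept writes independently and later reconcile. This is a minimal sketch under stated assumptions: the `LwwReplica` class and `sync` function are hypothetical, the caller supplies a version number with each write, and versions are assumed to be unique per key (a real system would need a tie-breaking rule).

```python
class LwwReplica:
    """Replica that keeps the highest-versioned value per key."""

    def __init__(self):
        self.store = {}  # key -> (version, value)

    def write(self, key, value, version):
        current = self.store.get(key)
        # Last write wins: only a newer version replaces the stored one.
        if current is None or version > current[0]:
            self.store[key] = (version, value)

    def read(self, key):
        entry = self.store.get(key)
        return None if entry is None else entry[1]

def sync(a, b):
    """Exchange entries in both directions so the replicas converge."""
    for src, dst in ((a, b), (b, a)):
        for key, (version, value) in list(src.store.items()):
            dst.write(key, value, version)

r1, r2 = LwwReplica(), LwwReplica()
r1.write("cart", ["book"], version=1)         # reaches only r1
r2.write("cart", ["book", "pen"], version=2)  # reaches only r2

before = (r1.read("cart"), r2.read("cart"))   # replicas disagree
sync(r1, r2)
after = (r1.read("cart"), r2.read("cart"))    # replicas agree
print(before, after)
```

Between the writes and the call to `sync`, a client can read stale data from `r1`. That window is the behavior the "eventual consistency is not intuitive" drawback refers to.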
ACID vs BASE
ACID | BASE |
---|---|
Atomicity | Basically Available |
Consistency | Soft state |
Isolation | Eventual consistency |
Durability | |
NoSQL Database Classification
Type | Representative Examples | Characteristics |
---|---|---|
Columnar Storage | HBase, Cassandra, Hypertable | Data is stored by column. This makes it convenient to store structured and semi-structured data, facilitates compression, and gives a significant I/O advantage to queries that touch only one or a few columns. |
Document Storage | MongoDB, CouchDB | Stores documents in a JSON-like format. Individual fields can be indexed, which provides some of the functionality of a relational database. |
Key-Value Storage | Tokyo Cabinet/Tyrant, Berkeley DB, MemcacheDB, Redis | Looks up values quickly by key. The store generally does not care about the format of the value and accepts it as is. (Redis includes additional functionality.) |
Graph Storage | Neo4j, FlockDB | Well suited to storing graph relationships, which traditional relational databases handle inefficiently and inconveniently. |
Object Storage | db4o, Versant | Operates on the database through syntax similar to object-oriented languages and accesses data as objects. |
XML Database | Berkeley DB XML, BaseX | Stores XML data efficiently and supports XML query languages such as XQuery and XPath. |
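The document-storage row mentions indexing certain fields. The sketch below shows one way that idea might look in memory. It is not the API of MongoDB, CouchDB, or any other product: the `DocumentCollection` class is invented for illustration, and it assumes indexed values are hashable and document ids are sortable strings.

```python
from collections import defaultdict

class DocumentCollection:
    """Toy collection of JSON-like documents with one indexed field."""

    def __init__(self, indexed_field):
        self.indexed_field = indexed_field
        self.docs = {}                 # doc_id -> document (a dict)
        self.index = defaultdict(set)  # field value -> set of doc_ids

    def insert(self, doc_id, doc):
        # When replacing a document, drop its old index entry first.
        old = self.docs.get(doc_id)
        if old is not None and self.indexed_field in old:
            self.index[old[self.indexed_field]].discard(doc_id)
        self.docs[doc_id] = doc
        if self.indexed_field in doc:
            self.index[doc[self.indexed_field]].add(doc_id)

    def find(self, value):
        """Return documents whose indexed field equals `value`."""
        ids = sorted(self.index.get(value, ()))
        return [self.docs[doc_id] for doc_id in ids]

posts = DocumentCollection(indexed_field="author")
posts.insert("p1", {"author": "ada", "title": "Hello"})
posts.insert("p2", {"author": "lin", "title": "Notes", "tags": ["db"]})
posts.insert("p3", {"author": "ada", "title": "Again", "likes": 3})

titles = [doc["title"] for doc in posts.find("ada")]
print(titles)
```

The lookup goes through the index rather than scanning every document, which loosely mirrors what a secondary index does in a relational database.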
Who Is Using NoSQL
- Mozilla
- Adobe
- Foursquare
- Digg
- McGraw-Hill Education
- Vermont Public Radio