The Simplest Implementation of a Database

❮ Verilog Time Delay Svn Co Github Dir ❯

The Simplest Implementation of a Database

Category Programming Technology

Among all application software, databases are perhaps the most complex.

The MySQL manual is over 3,000 pages, the PostgreSQL manual is over 2,000 pages, and the Oracle manual is even thicker than the combined pages of the former two.

However, creating a simple database from scratch is not difficult. There is a post on Reddit that explains the principles in just a few hundred words. Below is my content based on this post.

1. Data Stored in Text Format

The first step is to write the data you want to store into a text file. This text file is your database.

For ease of retrieval, data must be divided into records, with each record having a fixed length. For example, if each record is 800 bytes long, the start position of the 5th record would be at 3200 bytes.

Most of the time, we don't know the position of a record, only the value of the primary key. To read data in this case, you can compare records one by one. However, this is too inefficient. In practice, databases often store data in B-tree format.

2. What is a B-tree?

To understand the B-tree, you must start with the binary search tree.

A binary search tree is a data structure with very high search efficiency. It has three characteristics:

(1) Each node has at most two subtrees. (2) The left subtree values are less than the parent node, and the right subtree values are greater. (3) To find a target value among n nodes, generally only log(n) comparisons are needed.

The structure of a binary search tree is not suitable for databases because its search efficiency is related to the number of levels. The deeper the data, the more comparisons are needed. In extreme cases, n data items require n comparisons to find the target value. For databases, each level down means another read from the hard drive, which is very detrimental because the read time from the hard drive is much longer than the data processing time. The fewer times a database reads from the hard drive, the better.

The B-tree is an improvement on the binary search tree. Its design idea is to concentrate related data together so that multiple pieces of data can be read at once, reducing the number of hard drive operations.

The B-tree also has three characteristics:

(1) A node can hold multiple values. For example, in the diagram, one node holds four values. (2) A new level is not added unless the data is full. In other words, the B-tree aims for fewer levels. (3) There is a strict order relationship between the values in the child nodes and the parent node. Generally, if a parent node has a values, there will be a+1 child nodes. For example, in the diagram, the parent node has two values (7 and 16), corresponding to three child nodes: the first with values less than 7, the last with values greater than 16, and the middle with values between 7 and 16.

This data structure is very beneficial for reducing the number of hard drive reads. If a node can hold 100 values, a 3-level B-tree can hold 1 million data items. If it were a binary search tree, it would need 20 levels! Assuming the operating system reads one node at a time and the root node is kept in memory, the B-tree would only need two hard drive reads to find a target value among 1 million data items.

3. Indexes

Storing the database in B-tree format only solves the problem of finding data by "primary key." To search by other fields, you need to create indexes.

An index is a B-tree file with a certain field as the key. Suppose there is an "employee table" containing the employee number (primary key) and name fields. You can create an index file for the name, which stores names in B-tree format, each name followed by its position in the database (i.e., which record). When searching for a name, you first find the corresponding record number from the index, then read from the table.

This index search method is called "Indexed Sequential Access Method" (ISAM). There are multiple implementations (such as C-ISAM and D-ISAM libraries), and using these code libraries, you can write a simple database.

4. Advanced Features

After deploying the basic data access (including indexing), you can also implement some advanced features.

(1) SQL Language is the common operation language for databases, so you need an SQL parser to convert SQL commands into corresponding ISAM operations.

(2) Database Joins refer to the connection relationship between two tables in the database through "foreign keys." You need to optimize this operation.

(3) Database Transactions involve batch processing a series of database operations, and if any step fails, the entire operation fails. Therefore, you need an "operation log" to roll back operations in case of failure.

(4) Backup Mechanism: Save a copy of the database.

(5) Remote Operations: Allow users to operate the database from different machines via TCP/IP protocol.

Original source: http://www.ruanyifeng.com/blog/2014/07/database_implementation.html

** Click to Share Notes

Cancel

❮ Verilog Time Delay Svn Co Github Dir ❯