A Brief Tale of the Evolution of Database Systems
From Clay to Zeros and Ones
Although it may not seem so today, databases were used to store and organize information long before computers were invented. Ancient civilizations pressed wedge-shaped marks into clay to keep inventory records of food, armor, and more. Over time, as more and more data had to be stored, databases grew in size and complexity, and so did the methods used to index and retrieve data.
The amount of data being produced grew exponentially. One of the early attempts to estimate the extent of this phenomenon was made in 1944 by Fremont Rider, who in his work The Scholar and the Future of the Research Library estimated that by the year 2040 the Yale Library would hold over 200 million books on more than 6,000 miles of shelves.
What Rider could not predict were the consequences of the digital era and the way it would shape how we store and manage data. When computers became more affordable in the early 1960s, private corporations saw the potential of using them to store their data, and a need arose for more elaborate data storage structures.
File Processing Systems (FPS)
The first computer systems that stored, retrieved, and manipulated data were called File Processing Systems (FPS) and used files organized in a specific hierarchy. The files stored documents and data of various sorts and were arranged by category, making them accessible to different applications. This approach, similar to the way we store files on our computers today, was relatively simple to use and cost-effective, as it didn’t require any third party to provide those functionalities. However, FPS had certain disadvantages, such as redundancy (the same information could be duplicated in different places) and security issues (restricting user access was difficult).
The problems inherent to FPS encouraged the creation of the first Database Management Systems (DBMS), which introduced more complex algorithms and data structures to overcome those obstacles.
Pointing the Right Direction
In the first DBMSs, data was structured navigationally. In such systems, relationships between records were formed by ‘pointers’ or ‘paths’ connecting them, and data was accessed by navigating through those connections. Two early navigational data models were the hierarchical model and the network model.
The hierarchical model was introduced by IBM with its Information Management System (IMS), previously known as Information Control System and Data Language/Interface (ICS/DL/I). This tree-like structure consisted of parent nodes pointing to child nodes: each child record could have only one parent, whereas each parent record could have one or more child records. IMS was one of the world’s first database management systems, and NASA used it to keep track of purchase orders for the Apollo 11 Moon mission.
The structure that IMS used was simple but inflexible, as it allowed a child to have only one parent (a one-to-many relationship). A more flexible approach, the network model, allowed child nodes to have multiple parents. It grew out of the Integrated Data Store (IDS) that Charles Bachman developed at General Electric, and it was standardized by CODASYL (Conference on Data Systems Languages).
DBMSs built on the hierarchical and network models gained in popularity, but they weren’t without flaws. Such systems relied on complex queries consisting of nested loops to perform even the simplest operations. Another problem was that in navigational databases the logical and physical structures were intertwined: the user needed to know the schema in order to retrieve data, and every change in the schema required rewriting the query code.
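To make the nested-loop problem concrete, here is a minimal Python sketch of a navigational query. It does not reproduce IMS or any CODASYL system; the `Record` class, the company/department/employee hierarchy, and all names are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Record:
    """A record that reaches related records only through explicit pointers."""

    kind: str
    name: str
    children: list[Record] = field(default_factory=list)


def build_database() -> Record:
    # Hypothetical hierarchy: company -> departments -> employees.
    root = Record("company", "Acme")
    engineering = Record("department", "Engineering")
    sales = Record("department", "Sales")
    engineering.children += [Record("employee", "Ada"), Record("employee", "Grace")]
    sales.children += [Record("employee", "Edgar")]
    root.children += [engineering, sales]
    return root


def employees_in(root: Record, department_name: str) -> list[str]:
    # The query hard-codes the path company -> department -> employee.
    # If the schema gained another level (say, "division"), this function
    # would have to be rewritten.
    found = []
    for department in root.children:
        if department.name != department_name:
            continue
        for employee in department.children:
            found.append(employee.name)
    return found


if __name__ == "__main__":
    print(employees_in(build_database(), "Engineering"))
```

Note that the caller must know the exact shape of the tree: the traversal logic and the storage layout are one and the same.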
As databases grew in size, managing the relationships between records became laborious, and navigational techniques gradually fell out of favor, giving way to a new model that emerged in 1970: the Relational Data Model.
While working at IBM, the mathematician E.F. Codd introduced a new model that he claimed would solve some of the inherent problems of existing DBMSs. His Relational Data Model was a revolutionary idea: the logical and physical structures of a database should be completely decoupled. Programmers would no longer need to decide on a physical storage structure beforehand; instead, the system would figure out the best way to store the data for them. Rather than nested, hierarchical, or networked structures, the data describing each kind of entity was represented by abstract objects called relations. Those relations, otherwise called tables, had a unique key identifying each row and were connected with each other by data fields rather than by ‘pointers’.
Moreover, Codd proposed that instead of writing complex queries that had to be rewritten over and over again, programmers could use a high-level, declarative language to tell the system what answer they wanted, and the database system would figure out the best way to compute it. All of this made data easier to access, merge, and modify.
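As a rough illustration of the declarative style, the sketch below uses SQLite through Python's standard `sqlite3` module. SQLite is a modern engine, not one of the systems discussed here, and the `department` and `employee` tables are hypothetical. The tables are linked by a key field (`department_id`) rather than by pointers, and the query states only which rows are wanted.

```python
import sqlite3


def build_database() -> sqlite3.Connection:
    # An in-memory database keeps the example self-contained.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE department (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE employee (
            id            INTEGER PRIMARY KEY,
            name          TEXT NOT NULL,
            department_id INTEGER NOT NULL REFERENCES department(id)
        );
        INSERT INTO department (id, name) VALUES (1, 'Engineering'), (2, 'Sales');
        INSERT INTO employee (id, name, department_id) VALUES
            (1, 'Ada', 1), (2, 'Grace', 1), (3, 'Edgar', 2);
        """
    )
    return conn


def employees_in(conn: sqlite3.Connection, department_name: str) -> list[str]:
    # The query says WHAT is wanted; the engine decides HOW to find it
    # (scan order, join strategy, index use).
    rows = conn.execute(
        """
        SELECT employee.name
        FROM employee
        JOIN department ON employee.department_id = department.id
        WHERE department.name = ?
        ORDER BY employee.name
        """,
        (department_name,),
    ).fetchall()
    return [name for (name,) in rows]


if __name__ == "__main__":
    print(employees_in(build_database(), "Engineering"))
```

The query names tables and columns but says nothing about how rows are laid out or reached on disk, so the physical storage could change without touching it.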
His theory became the foundation for several query languages (computer languages designed to send queries to a DBMS in order to manage data), such as QUEL and SQL.
Codd never commercialized his idea; IBM already had another very lucrative product, IMS, and was slow to support the relational model. However, his work became the basis for multiple relational database management systems (RDBMS) that are widely in use to this day. His 13 rules, humorously referred to as “Codd’s Twelve Commandments” (they are numbered starting from 0), state what is required of a DBMS for it to be considered relational.
The first software system sold as a relational database was the Multics Relational Data Store, released in June 1976, followed shortly by IBM’s System R, Oracle, and INGRES (the predecessor of POSTGRES).
During the 1980s and 1990s, relational databases grew increasingly dominant, introducing features such as indexes (which speed up data retrieval), table joins (which combine rows from multiple tables), and transactions (which group several operations into a single logical unit of work). At the same time, SQL became the primary query language used with RDBMSs.
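A small sketch can show what an index and a transaction look like in practice. It again uses Python's standard `sqlite3` module as a stand-in for an RDBMS; the `account` table, the `transfer` helper, and the balances are hypothetical. The transfer credits one account and debits another as a single logical operation: if the debit violates the `CHECK` constraint, the credit is rolled back as well.

```python
import sqlite3


def build_bank() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE account (
            id      INTEGER PRIMARY KEY,
            owner   TEXT NOT NULL,
            balance INTEGER NOT NULL CHECK (balance >= 0)
        );
        -- The index lets the engine find an owner without scanning every row.
        CREATE UNIQUE INDEX account_owner_idx ON account (owner);
        INSERT INTO account (owner, balance) VALUES ('alice', 100), ('bob', 20);
        """
    )
    return conn


def transfer(conn: sqlite3.Connection, source: str, target: str, amount: int) -> bool:
    """Move `amount` from source to target as one all-or-nothing transaction."""
    try:
        # `with conn` commits if the block succeeds and rolls back if an
        # exception escapes it.
        with conn:
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE owner = ?",
                (amount, target),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE owner = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit would have made the balance negative; the credit above
        # has been undone by the rollback.
        return False


def balances(conn: sqlite3.Connection) -> dict[str, int]:
    return dict(conn.execute("SELECT owner, balance FROM account"))


if __name__ == "__main__":
    bank = build_bank()
    print(transfer(bank, "alice", "bob", 30), balances(bank))
    print(transfer(bank, "alice", "bob", 500), balances(bank))
```

The credit is deliberately issued before the debit so that a failed transfer has something to undo, which is what makes the rollback visible.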
Although relational database management systems are still widely used today, the beginning of the 21st century introduced new challenges. Designed to run on a single machine, RDBMSs were extremely difficult to scale out, which became essential with the rise of the Internet: no single server could handle the abundance of data coming from millions of users across the network. In addition, the logical structure of relational databases wasn’t very convenient to use with object-oriented programming languages such as Java or C++, as data had to be remodeled to fit into rows and columns. RDBMSs were also large, complex, and costly.
The Rise of NoSQL
The creators of relational database systems tried to add new features to their products to make them easier to scale. However, this didn’t stop a new alternative from appearing: NoSQL databases, which aimed to solve some of the problems inherent to single-node databases. The term ‘NoSQL’ was first used in 1998 by Carlo Strozzi to describe a lightweight, open-source RDBMS that bypassed SQL altogether. The movement grew in strength as its proponents argued that in most cases key-value storage is all you need. “Relational databases give you too much. They force you to twist your object data to fit a RDBMS. [NoSQL-based alternatives] just give you what you need,” said Jon Travis, principal engineer at Java toolmaker SpringSource and a presenter at the NoSQL confab. The models built on the idea of NoSQL were not limited to key-value stores: document databases, graph databases, and wide-column stores developed over time, and all were subsumed into the NoSQL category.
NoSQL systems were often open-source, more affordable, optimized for retrieval and append operations, and able to scale horizontally, but not without a cost. To achieve that, their creators had to sacrifice some of the features that SQL database systems provided, such as transactions, joins, or SQL itself.
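The trade-off can be sketched with a toy key-value store that spreads its keys over several "nodes". This is an illustration only, not how any particular NoSQL product works: the `ShardedKeyValueStore` class is hypothetical, each node is just an in-memory dictionary standing in for a server, and the simple hash-modulo placement is an assumption (real systems typically use more elaborate schemes so that adding a node does not move most keys).

```python
import hashlib
from typing import Any


class ShardedKeyValueStore:
    """Toy key-value store that spreads keys across in-memory 'nodes'."""

    def __init__(self, node_count: int) -> None:
        if node_count < 1:
            raise ValueError("node_count must be at least 1")
        # Each dict stands in for a separate server.
        self.nodes: list[dict[str, Any]] = [{} for _ in range(node_count)]

    def _node_for(self, key: str) -> dict[str, Any]:
        # A stable hash sends a given key to the same node every time.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.nodes)
        return self.nodes[index]

    def put(self, key: str, value: Any) -> None:
        self._node_for(key)[key] = value

    def get(self, key: str, default: Any = None) -> Any:
        return self._node_for(key).get(key, default)


if __name__ == "__main__":
    store = ShardedKeyValueStore(node_count=3)
    store.put("user:1", {"name": "Ada"})
    store.put("user:2", {"name": "Grace"})
    print(store.get("user:1"))
    print([len(node) for node in store.nodes])
```

Reads and writes touch exactly one node, which is what makes adding machines straightforward. The same property is why anything spanning several keys, such as a join or a multi-key transaction, has no natural home in this design.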
The Time Is Now
The history of databases doesn’t stop here. Both relational database systems, which are still the most widely used, and NoSQL databases have developed over time and do their best to mitigate their weak points. Other models have built upon those ideas. NewSQL, a concept introduced in 2010, is both distributed and relational. Around the same time, hybrid systems (Hybrid Transactional-Analytical Processing) emerged; like NewSQL, they were both distributed and relational, but on top of that they aimed to bridge the gap between transaction processing and analytics.
New solutions appear constantly, and the importance of data has never been greater than it is right now. From mobile cooking applications to libraries, banks, cars, and fridges, data is all around us, and the systems that manage it are crucial components of how most things work. As a developer, it’s easy to go for either the newest, shiniest option or the one that everyone is using, without giving a second thought to where it came from. However, learning about the origin of things helps us better understand how they work, make better choices, and maybe even become the ones who shape their future development. After all, as Confucius once said: “Study the past if you would define the future.”