How to learn data modeling
In this blog post, I want to share with you my take on how to learn data modeling based on my own experience in the field. Before I get to…
In this blog post, I want to share with you my take on how to learn data modeling based on my own experience in the field. Before I get to tips on specific areas and resources to study, I’ll go over how I see the main data modeling phases and data model types to make sure we’re talking the same language as ambiguity sometimes exists.
Basic terminology
I know you probably already know these terms, but if you don’t, here you go:
Data is a collection of facts and statistics about the world around us, such as your height, distance to a beach, order status, or a temperature reading.
A data model is an abstract representation or blueprint that defines the structure, relationships, and properties of data we care about to help us understand it, organize it, and manipulate it.
Data modeling is then the process of creating and designing a data model.
A database in the computer world, is typically a piece of software that allows us to efficiently store, retrieve, and manipulate the data. The data model defines how the data will be organized, what types of data will be stored, and how different pieces of data relate to each other within the database.
And why is data modeling important? Because anywhere you want to efficiently work with data, you need a data model — not just in databases, but databases are what I’ll focus on in this post.
Three stages of data model design
When it comes to data modeling, there are three frequently discussed stages — or levels of abstraction — in data model development:
Conceptual data model — a non-technical, high-level description of relevant entities and relationships.
Logical data model — includes more details, such as attributes we want to track for each entity. Still business-friendly.
Physical data model — a database-specific model that governs how the data will be stored.
In simpler terms, it progresses from: “We’ll have customers buying products in our great new app” (conceptual) to “Each customer can buy multiple products, and each product can be bought by multiple customers. Also, Customer ID uniquely identifies each customer…” (logical) to “Customer ID will be an auto-incrementing integer because that’s what works best with the database XY” (physical).
All these models can be captured and documented in various ways, but one of the most frequently used is an Entity Relationship Diagram (ERD).
Note: What is a semantic model?
A popular term nowadays is the semantic model. The details may vary vendor by vendor, but it’s a flavor of a logical data model meant to abstract users from the complex and technology-specific physical model. For example, the analytics platform GoodData offers a semantic model/layer that the end-users — oftentimes business users — can interface with and build analytics on top of without worrying about the complexities of the underlying physical data model. There is much more that could be said about it, but I’ll leave that for another blog post — let me know if you‘d like to learn more!
How technical do you need to be?
In the initial stages of data model development, you don’t need to have a deep understanding of databases. What is more important at that point is that you have great communication skills, can facilitate data discovery sessions with business owners, and understand the business rules. Only when you get to the level of the logical model should you have some measure of understanding of the different modeling techniques related to different types of data applications (e.g., transactional database vs. data warehouse), and definitely at the physical data model level, you have to be an expert on the selected database.
That said, when you have small teams, oftentimes all levels of the modeling are handled by the same person, and some stages can be skipped or at least not formally documented.
For the purpose of this blog post I’ll assume you want to be able to do the data modeling end to end.
Types of data models
Apart from the different phases of the data model design (conceptual, logical, physical), there are also different types of models based on use case and technology.
Different use cases
When developing your data model, you have to know what it will be used for. Will it be used to store application data for an eShop? Will it be used to perform data analytics on top of your company’s data from various data sources? Will it be used to support navigation and help find the best paths between locations? Based on the use case, you’ll pick a different data modeling approach and technology.
Two very frequently used approaches to data modeling are (although there are many more):
Transactional data (OLTP) — is a type of database processing that focuses on managing and executing high volumes of real-time transactional operations efficiently. For example, storing data for an eShop. The most frequently used data modeling approach here is called the 3rd normal form.
Analytics data (OLAP) — is a type of database processing that focuses on analyzing large volumes of historical data to support complex business intelligence and decision-making tasks. The most frequently used data modeling approach here is called dimensional data modeling.
At the latest, at the logical data model level, you’ll have to take into account the selected data modeling approach.
Different types of databases
Most databases are meant to be used for specific use cases. Some store data in tables, some store data as documents, some store data as graphs, some as key-values, etc. Some databases scale better, some support many concurrent users accessing the data at the same time, some are better for frequent writes, some are better for running complex queries, etc.
In short, based on your use case, you’ll narrow down the list of databases that support it well. You might still be left with a large number of databases that, for example, support data analytics and can handle concurrent users (if those are your requirements). You can then narrow down the list further based on pricing, etc.
The database you select will then have an impact on the physical data model design. The logical data model should ideally look the same no matter which specific database you’re using.
There are two terms I have to mention here as you’ll run into them soon and frequently. At a high level, people categorize databases as either relational databases or NoSQL databases (that is how all databases that are not relational databases are called). The concept of a relational database is very old and still massively used. Relational databases are the databases with tables that store structured data, which you can query using SQL query language, and you’ll probably find at least one of them in every company. Examples might include MySQL, PostgreSQL, Snowflake,…
NoSQL databases came around to help us work with large volumes of semi-structured and unstructured data. Examples are MongoDB, Redis, Couchbase,…
There are many subtypes of NoSQL databases — graph, document, key-value store, and others.
Ok, I think that’s enough about terminology… uff. Feel free to comment if something doesn’t make sense to you or if you think about it differently.
What to learn?
Now you know that there are three commonly recognized data models based on the stage (conceptual, logical and physical) and that you don’t need to be very technical to work on the conceptual models, and possibly might also wing a logical one, but you do need to be very technical to design the physical data models.
In some companies, you’ll have different roles that work on different levels of the data model — for example, you might have a business analyst or data architect working on the conceptual and logical data model while a data engineer works on the physical model. In other companies, it will be the same person doing all three.
You also know that the logical data model varies based on the use case (e.g., transactional data vs analytics) but shields you from the specifics of the selected database. The physical data model is designed based on the specifics of the selected database.
Assuming you want to eventually be able to tackle all three levels of data model design, this is where I’d recommend you start:
Basics of computer and cloud architecture: how is data stored and processed by computers and moved across networks. At least have a basic idea before you move forward.
Database Fundamentals: Understand the basics of databases, how they are structured, and the different types of databases and their uses (relational, NoSQL, etc.).
Dive into relational DBs (great start unless you already know you want to do something else):
- Learn to use a database (manipulate and query data using SQL).
- Learn the principles of data modeling for transactional databases (3rd normal form). You can learn the basics from the book “Database Design Solutions” by Rod Stephens.
- Learn the principles of data modeling for analytics (dimensional data modeling). A good book, even though a bit old, is “Star Schema The Complete Reference” by Christopher Adamson.Pick a technology (specific DB) and deep dive into that. If you don’t have a specific use case you want to focus on, go for one of the most used databases.
Optionally, if you’re interested in a specific industry, research examples of data models for the industry. There are tons of examples of data models for fintech, eCommerce etc.
A little bit out of order but very important and hard to teach would be business requirements analysis. Basically, learning how to gather and analyze business requirements to ensure the data model aligns with the organization’s needs.
Data Security and Privacy: Understand the importance of data security and privacy regulations to design secure and compliant data models.
If you’d like to get somebody else’s list of resources check out this curated list of books and courses from a data engineering Reddit community.
Whatever you do try not to get stuck in a tutorial hell :)
Summary
I hope this gave you some idea on where to start. One more thing I want to note is that data modeling doesn’t exist in a vacuum. Frequently it is a part of data management or data engineering roles. Some nice books that cover more than data modeling are:
“Fundamentals of Data Engineering” by Joe Reis and Matt Housley
“Designing Data-Intensive Applications” by Martin Kleppmann