the blog for developers

Essential storage tradeoff: Simple Reads vs. Simple Writes

3009523653 dccee74f34 m Essential storage tradeoff: Simple Reads vs. Simple Writes
Photo by Robert Scoble

When I studied computer science with a focus on databases, the holy scripture said only to use normalized data. For my exam I had to study 2NF and 3NF. And now it seems this is all wrong for large scale operations. From my education, I still have a strong feeling towards the pureness of normalized databases – and detect this feeling in many developers. In reality denormalized schemas are the way to go.

Why the former rise of normalized schemas for relational databases? Normalized data is pure. There are no repetitions, no duplication and data can easily be added, changed and removed. The database enforces a consistent state. But to read data from several tables you need to join them, which means complex and – often – slow reads. The tradeoff.

The rise of NoSQL storages

Joining doesn’t scale to millions of concurrent users, and it’s rumored that some companies ban joins completely. NoSQL databases, which include key/value stores and document databases, drop the notion of normalized data. Those with the biggest momentum are Cassandra, Project Voldemort, Tokyo Tyrant, Redis and CouchDB.  They store data or documents in denormalized chunks which leads to easy reads. But if the data needs to be present in several documents, this means complex writes.

Simple Reads vs. Simple Writes

Essentially you exchange simple-writes/complex-reads with complex-writes/simple-reads. In most modern web architectures with many more reads than writes – depending on your business model – it’s no wonder people prefer the second tradeoff.

An example

Let’s take a look at an – contrived – example. Suppose we want to store a shopping list. Each shopping list can be tagged and also have friends who can see the shopping list. With a normalized database one would perhaps model this problem like this:

Normalized example for databases

Compare the normalized form with a document schema:

Document based example for NoSQL
We added the tags and friends (in human readable form) to the shopping list, and the friends to the user document.

Read and write use cases

Let’s examine two use cases:

  1. Adding, removing and changing the name of a friend
  2. Reading the shopping list

With the normalized schema, adding and removing a friend to a shopping list is easy: Just insert a row or remove a row in the Friends table. Changing the name of a friend is also easy, just change it in the User table. Reading the shopping list at least involves a join with Tags and so isn’t a one-chunck read access. If we want to display the friends for a shopping list too, there even will be a second join.

With the document schema, adding and removing a friend is more complex. We need to read and write the shopping list document. We also need to read and write the user document. Reading is simple though. Just read the shopping list document, there is all the information to display a shopping list with the tags and the friends. Changing names is most difficult, we would need to change all shopping lists. This is the primary argument for normalized databases. But in reality, how often do users change their name compared to viewing a shopping list? And most of those slow operations can be done in a second, asynchronous step (see picture above).

Be aware of this tradeoff and plan or change your architecture accordingly. But what do you think?

You can leave a Reply here. Of course, you should follow me on twitter here.

You can share this post!
Do you want to tell others about this article? Use the social bookmark icons to submit this artice to the service of your choice. Thanks.

About the author: Stephan Schmidt has more than 15 years of internet technology experience and 10 years experience in agile. He was head of development, consultant and CTO and is a speaker, author and blog writer. He specializes in organizing and optimizing software development helping companies by increasing productivity with lean software development and agile methodologies. Want to know more? All views are only his own.

14 Tweets

Leave a reply.

Comments

Stefan Schubert

Hey Stephan,

very neat and clean presentation of the trade-off those two principles.
Unfortunately there is more involved in such a decision, but hopefully you’ll get into that later ^^

Chears
Stefan

@Stefan: Thanks. Well yes, there are many other tradeoffs. I’m writing a follow up post about the “Dark side of NoSQL” – mainly the lack of adhoc reporting and adhoc data fixing.

Cheers
Stephan

PS: And many more things

Joe

In applications I have worked on, data consistency has been the highest concern. The performance loss to joins has been minimal in almost every case, and the ability to change\add data without having to remember to do it in 10 different tables is essential.

I will likely stick to Normalized Data for the majority of my applications, but I will have to keep an open mind towards DeNormalized data as well, I suppose.

@Joe: Yes, this might be the case under some circumstances and some applications. Those were certainly the mainstream in the 80s and 90s. Internal applications where 1000 concurrent users (flight terminals?) are a high number.

But they only scale with massive caching. Today many applications experiences millions of concurrent users. Then you need to trade consistency with scalability.

As I’ve said to Stefan, the downside is “mainly the lack of adhoc reporting and adhoc data fixing.”. Which sometimes is a major issue.

So for user data, payment data, orders it’s still a good idea to keep your data normalized.

Cheers
Stpehan

+1 this is a great high-level summary of the architectures. The characterisation of a simple trade-off is a good starting point.

Rusty Wright

A good example of where this played out was Twitter, back when it kept falling on its face because it was using an sql database for storing everything.

Yeroc

I think it’s worth mentioning that some SQL databases (Oracle for example) provide out-of-the-box solutions for this problem as well and arguably in a more seamless manner. For example, creating a materialized view based on a complex query gives you the benefits of a denormalized schema. This, in conjunction with query re-writing could mean your application would continue to operate normally without any knowledge of the denormalization going on behind the scenes. Meanwhile, at the application level inserts and updates don’t need to be concerned about manually updating the data across multiple tables since that will be handled automatically via the materialized view mechanism.

@Yeroc: Yes, you are right.

“I think it’s worth mentioning that some SQL databases (Oracle for example) provide out-of-the-box solutions for this problem as well and arguably in a more seamless manner. For example, creating a materialized view based on a complex query gives you the benefits of a denormalized schema.”

There are other benefits of chunks in documents and values though. Distribution comes to mind, which is inherently more difficult (might I say impossible above certain levels) to achieve with Oracle. And of course you can have denormalized data in every SQL database.

Curt Adams

My experience with non-normalized data is that it will become inconsistent, guaranteed. You have to have a normalized backbone for important data. Complex views of normalized data are very useful and I’ve been using a materialized view strategy for 25 years, long before I ever heard of any formal descriptions. I have never been able to get any other system to work.

In terms of chunking, I find extremely complex “materialized views” are very useful. I haven’t used the Oracle system, but I’ve written some pretty complicated update routines that I would be surprised to see in standardized packages. So I think your goals can be best realized (in most environments) with sophisticated systems comparable to materialized views – documents that update automatically based on an underlying verifiable data system.

@Curt: But materialized views are complex writes – done by the DB – to denormalize data. The same I write about in the post. Aren’t they?

Again a beautiful and nice posting, beside the content, which is very good!

For ad-hoc reporting we BI experts are currently facing the arising of coulumn based and column indexed storages, mainly in combination with in-memory technology.

As developer one does data modelling in his / her best way, in preparation for access things are split into columns and links between. Far away from 3NF or documents.

I supose it’s worth having a look at during your further investigations. ‘am looking forward.

Regarsd
Rayk

Eric Z. Beard

I’ve been studying this issue closely for a while now. I have built and maintain a system with hundreds of millions of records that’s growing at about 25% per quarter consistently, so I’m obsessed with data storage. My application relies on hundreds of queries that need to run in real-time against all of that transactional data – no offline cubes or Hadoop clusters. I’m considering a jump to NoSql, but the lack of ad-hoc queries against live data is just a killer. I write probably a dozen ad-hoc queries a week to resolve support issues, and they normally need to run “right now!” I might be analyzing tens of millions of records in several different tables or fixing some field that got corrupted by a bug in the software. How do you do that with a NoSql system?

My solution for the moment is sharded databases, with a healthy dose of de-normalized aggregates sprinkled throughout the model, which are kept in synch with triggers or controlled data access via stored procedures. There’s no chance of data being different in two places.

And there’s no edict on how to write queries – if you join 20 tables and it runs fast, fine. If you need to write parallelized queries to return raw data from one table at a time and then join them on an application server, then that’s fine too.

I think a lot of people new to this game don’t get how good a SQL database can be. On a single shard (a cheap 2U box), I can run 5 million customers, which is something like 150Gb of data, and everything runs fast enough to keep people happy. In a NoSql system, how many nodes would you need to support both transactional and reporting applications for that much data?

If people haven’t, now would be a great time to go and review Brewer’s CAP Theorem!

http://www.julianbrowne.com/article/viewer/brewers-cap-theorem

Curt Adams

If the denormalized views are constructed from a normalized database, the normalized data isn’t dropped. It’s still there, and can be used to verify or reconstruct documents. My understanding of the NoSQL movement is that it abandons the goal of a normalized data structure entirely.

[...] Code Monkeyism: Essential storage tradeoff: Simple Reads vs. Simple Writes [...]

[...] Code Monkeyism: Essential storage tradeoff: Simple Reads vs. Simple Writes [...]

[...] The second is a rejection of the strong typing of relational schemas, which make changes to data models, which are inevitable,  disastrously difficult to achieve.  It also makes writing to the data store a complex process. [...]

[...] an old post about the read-/write-cost tradeoff. For most applications reads occur way more often than writes. Therefore this application should [...]

artsrc

This discussion seems to miss the logical/physical distinction.

Whatever you think of the relational model as a logic user model, remember that it is intended as only a logical model.

Of course pointers for foreign keys, co-locating related entities, and relaxing constraints, can help performance. Those things are a part of the physical model.

The weakness is that to a greater or lesser degree, the SQL products do not support some
of the physical models you want. That seems to me to be purely an implementation issue.

Daniel

Even in the document-based version, updating items on it can be very easy and quick depending on what solution you use. For example, if you are using MongoDB, you can just tell it to update the inner content of an element inside a document, like:

shoppinglist.friends[0].name = ‘Johnny’

And you can also store references to other columns, so if your list of friends can be a list of references to other documents describing users (kinda like a relational database would do), so all you would need is, too, to update the user document.

You can also have server-side functions written in javascript to do these kind of updates, so that you need not transfer info around server client, just call the server-side function and it will do the update locally. You can also send plain javascript as a query, like an UPDATE, to do what is needed.

Peter

This illustrates the emerging issue of relational data modeling vs object data modeling… the evidence is that a paradigm shift is imminent, unfortunately I think the writing is on the wall for use developers and sys admins alike, we will come to miss the simplicity of the relational model.

dave

think’s, it’s sobering !

but the problem is frameworks and ORM are also developped for relationnel datas

for example in PHP (yeah I know ! but now with frameworks and architecture it’s powerful for “web2.0″ or social, see drupal or facebook!) there is a very good ORM like Doctrine for data components.
but if you want to use these new model, I suppose you have to throw all ORM you liked so muche and make yourself all data work. paintful !

don’t know if it’s supported in Rails ?

for me the problem is not if we can use these database, but if they are integrated in application frameworks.

[...] Bases de données dénormalisées : Article très instructif sur la nouvelle génération de bases de données dites [...]

Leave a Reply

What people wrote somewhere else:

My new blog post : RT @codemonkeyism “Essential storage tradeoff: Simple Reads vs. Simple Writes” http://bit.ly/DoBPM

This comment was originally posted on Twitter

good post on current trends towards nosql databases http://bit.ly/T4Uk (by @codemonkeyism)

This comment was originally posted on Twitter

RT @codemonkeyism My new blog post : RT @codemonkeyism “Essential storage tradeoff: Simple Reads vs. Simple Writes” http://bit.ly/DoBPM

This comment was originally posted on Twitter

RT @codemonkeyism: My new blog post : RT @codemonkeyism “Essential storage tradeoff: Simple Reads vs. Simple Writes” http://bit.ly/DoBPM

This comment was originally posted on Twitter

What other trade-offs are there between NoSQL and SQL, normalized and denomarlized? Blog post here: http://bit.ly/DoBPM :-)

This comment was originally posted on Twitter

RT @codemonkeyism: … trade-offs are there between NoSQL and SQL, normalized and denomarlized? Blog post here: http://bit.ly/DoBPM :-)

This comment was originally posted on Twitter

RT @codemonkeyism “Essential storage tradeoff: Simple Reads vs. Simple Writes” http://bit.ly/DoBPM

This comment was originally posted on Twitter

RT @sbtourist: RT @codemonkeyism “Essential storage tradeoff: Simple Reads vs. Simple Writes” http://bit.ly/DoBPM #nosql

This comment was originally posted on Twitter

Essential storage tradeoff: Simple Reads vs. Simple Writes via (@codemonkeyism) http://bit.ly/2RH2Fm

This comment was originally posted on Twitter

RT @codemonkeyism What other trade-offs are there between NoSQL and SQL, normalized and denomarlized? Blog post: http://bit.ly/DoBPM :-)

This comment was originally posted on Twitter

Essential storage tradeoff: Simple Reads vs. Simple Writes – 传统关系型数据存储和KV/文档数据存储的本质区别 http://tinyurl.com/lgyohw

This comment was originally posted on Twitter

I like @codemonkeyism’s explanation of web-scale storage approaches and the NoSQL meme. http://bit.ly/MN101

This comment was originally posted on Twitter

Non-Relational DB http://bit.ly/eATda

This comment was originally posted on Twitter

Essential storage tradeoff: Simple Reads vs. Simple Writes: http://rurl.org/1y6m

This comment was originally posted on Twitter

Additional comments powered by BackType

Guide to CodeMonkeyism

Over the last 4 years I wrote many articles on this blog. To make it easier for you to find the relevant ones, I've organized them into topics.

Top 10

6 reasons why my VC funded startup did fail

Go Ahead: Next Generation Java Programming Style

Java Interview questions: Write a String Reverser

The dark side of NoSQL

7 Bad Signs not to Work for a Software Company or Startup

Is Java dead?

Scala vs. Clojure

Never, never, never use String in Java

No future for functional programming in 2008 – Scala, F# and Nu

Clojure vs Scala, Part 2

Java Developer

Is Java Dead?

Go Ahead: Next Generation Java Programming Style

Be careful with magical code

All variables in Java must be final

Never, never, never use String in Java

Bending Java: More readable code with methods that do nothing?

NoSQL Guy

NoSQL: The Dawn of Polyglot Persistence

The dark side of NoSQL

Essential storage tradeoff: Simple Reads vs. Simple Writes

Sharding destroys the goals of your relational database

The unholy legacy of databases

Startup/CTO

Development Dream Teams

6 reasons why my VC funded startup did fail

American vs. European style of Software Development

12 Things to Reduce Your Lead Time and Time to Market

The high cost of overhead when working in parallel

Essential storage tradeoff: Simple Reads vs. Simple Writes

Job Seeker

Another Good (Java) Interview Question

7 Bad Signs not to Work for a Software Company or Startup

Java Interview questions: Write a String Reverser (and use Recursion!)

Java Interview questions: Multiple Inheritance

As a Manager: What I value in developers

Top 10 Tips (+1) to Get a Pay Raise

Agilist

What Developers Need to Know About Agile

5 Practices Better to Change in Your Scrum Implementation

Scrum is not about engineering practices

ScrumMaster and ZenMaster: The joke of certification

What is Trans-Scrum?