SCALE 8x: Relational vs. non-relational
PostgreSQL hacker Josh Berkus set out to do some "mythbusting" about
differences in database technologies in his talk
at SCALE 8x. While there are plenty of differences between the various
approaches taken by database systems, those are not really the ones
that are being highlighted by the technical press. In particular, the
so-called "NoSQL movement" makes for a great soundbite, but is "not
very informative or accurate
". Berkus went on to survey the current
database landscape while giving advice on how to approach choosing
a database for a particular application.
This is a "more exciting time
" to be a "database
geek
" than ever before, he said. Looking back seven years to 2003,
he noted that there were essentially seven different free choices, all of
which are SQL-based. In 2010, there are "dozens of new databases
breeding like rabbits
", with some 60 choices available. As an
example of how quickly things are moving, Berkus noted that while he was in
New Zealand at linux.conf.au, where a colleague was giving a related talk,
two new databases
were released.
Mythbusting
Berkus likened the NoSQL term to a partition
that is created by putting dolphins, clown fish, and 1958 Cadillacs on
one side and octopuses, Toyota Priuses, and redwood trees on the
other—labeled as the "NoFins" group. The non-relational
databases that are lumped together as NoSQL have "radically
different
"
organizations and use cases. But, that's not just true of the
non-relational databases, it's also true for the various
relational databases as well.
Another myth that he pointed out was the "revolutionary" tag that gets associated with all of the new types of databases. Once again, that is a convenient soundbite that isn't accurate. He has not seen a new database algorithm since 2000, and all of the new crop of database systems are new implementations and combinations of earlier techniques. The new systems are not revolutionary, just evolutionary.
As an example, he put up a slide with the following description of a
database: "A database storing application-friendly formatted objects,
each containing collections of attributes which can be searched through a
document ID, or the creation of ad-hoc indexes as needed by the
application.
" He noted that it applies equally well to one of his
current favorites, CouchDB, which was created in 2007, and to the Pick
database system—the original object of the description—which
was created in 1965.
Instead of a revolution, what we are seeing now is a "renaissance of
non-relational databases
". That description is far more accurate,
Berkus said, and is a better way to view the change. It is a "big
thing
" that is going to "change the way that people use
databases
", so it is important to label it correctly.
Another myth is that non-relational databases are "toys", which is something that is often pushed by people who work on relational systems. Berkus pointed out that many SCALE sponsors would disagree: Google using Bigtable, Facebook using Memcached, Amazon with Dynamo, and so on.
The other side of that myth is that relational databases will become
obsolete. Unsurprisingly, that myth is often promulgated by those who work on
non-relational databases, and it is something that the relational community
has heard before. Berkus pointed to a keynote speech in 2001 proclaiming
that relational databases would be replaced with XML databases. He then
asked if anyone even remembered or used XML databases; when even the
crickets were silent, he pointed out that various relational and
non-relational databases had hybridized with XML databases, incorporating
the best features of XML databases into existing systems. He predicted
that "over the next five years, we will see more
hybridization
" between different types of database technologies.
"Relational databases are for when you need ACID transactions
"
was myth number five. Support for transactions is "completely
orthogonal
" to the relational vs. non-relational question. There
are systems like Berkeley DB and Amazon Dynamo that provide robust
transactions in non-relational databases, as well as MS Access and MySQL
that provide SQL without transactions.
The final myth that needs busting is the Lord of the Rings inspired "one ring theory of database
use
", Berkus said. There is "absolutely no reason
" to choose one
database for all of one's projects. He recommends choosing the database
system that fits the needs of the application, or to use more than one, such as
MySQL with Memcached or PostgreSQL with CouchDB. Another alternative is to
use a hybrid, like MySQL NDB, which puts a distributed object database as a
back-end to MySQL, or HadoopDB which puts PostgreSQL behind the Hadoop
MapReduce implementation.
So, what about relational vs. non-relational?
Relational databases provide better transaction support than non-relational databases do, mostly because of the age and maturity of relational databases, Berkus said. Transaction support is something that many open source people don't know about because the most popular database (MySQL) doesn't implement it. Relational databases enforce data constraints and consistency because that is the basis of the relational model. There are other benefits of today's relational databases, he said, including complex reporting capabilities and vertical scaling to high-end hardware. He also noted that horizontal scaling was not that well-supported and that relational databases tend to have a high administrative overhead.
On the question of SQL vs. Not-SQL, Berkus outlined the tradeoffs. SQL promotes portability, multiple application access, and has ways to manage database changes over time. There are many mature tools to work with SQL, but SQL is a full programming language that must be learned to take advantage of it. Not-SQL allows fast interfaces to the data, without impedance-matching layers, which in turn allows for faster development. Typically, there are no separate database administrators (DBAs) for Not-SQL databases, with programmers acting in that role.
"It's always a tradeoff
", Berkus said, but one place that a
SQL-relational database makes the most sense is where you have
"immortal data
". If the data being stored has a life
independent of the specific application and needs to be available to new
applications down the road, SQL-relational is probably the right choice.
How to choose
For other situations, you need to define the "features you actually
need to solve that particular problem
" plus another list of features
you'd like, "then go shopping
". Chances are, he said, there
is a database or combination of databases that fits your needs. He then
went on to some specific application requirements, suggesting possible
choices of database or databases to satisfy them.
-
I need a database for my blog: "
use anything
", including MySQL, PostgreSQL, SQLite, CouchDB, flat files, DBase III, etc. Pick "whatever is easiest to install
" because "it doesn't matter
". - I need my database to unify several applications and keep them
consistent: For example a data warehousing application written C/C++
with reporting tools in Ruby and Rails, should use an OLTP SQL-Relational
database
like PostgreSQL. He also couldn't resist noting that the PostgreSQL 9
alpha was released the day before: "
download it and test it out
". - I need my application to be location aware: a geographical
database, such as PostGIS, is needed. Geographical databases allow queries like "what's near"
and "what's inside".
- I need to store thousands of event objects per second on embedded
hardware: db4object is probably the right choice, but SQLite might also
be considered.
- I need to access 100K objects per second over thousands of web
connections: Memcached is a distributed in-memory key-value store,
which is used by all of the biggest social networks. It can be used as a
supplement to a back-end relational database. He also mentioned Redis and
TokyoTyrant as possible alternatives.
- I have hundreds of government documents I need to serve on the web
and mine for data: It's hard to get the government to release the data,
so the structure of the data may not come with it, which means that the structure must be
derived from examining the documents. For that, he suggests CouchDB.
- I have a social application and I need to know
who-knows-who-knows-who-knows-who-knows-who: This is a very hard
problem for relational databases and what's needed is a graphing database
such as Neo4j. Long chains of relationships are difficult for relational
databases, but graphing databases, used in conjunction with another
database, can handle these kinds of queries, as well as queries to find
items "you
may also like".
- and so on ...
The slides [PDF]
from Berkus's talk have additional examples. The basic idea is that
"different database systems do better at different tasks
" and
it is impossible for any database system to do everything well, "no
matter what a vendor or project leader may claim
". For those who
are looking for open source solutions, he recommended the Open Source Database survey which Selena
Deckelmann has put together. While it is, as yet, incomplete, it does
list around a dozen lesser-known database systems.
It is clear from the talk that it is an exciting time to be a
database developer—or user for that matter. There are many different
options to choose
from, each with their own strengths and weaknesses, some of which can be
combined in interesting ways. It is also
very clear that there are many more axes to the database graph than just the
overly simplified SQL vs. NoSQL axis that seems to dominate coverage of these
up-and-coming database systems.
Index entries for this article | |
---|---|
Conference | Southern California Linux Expo/2010 |
Posted Mar 4, 2010 3:19 UTC (Thu)
by Webexcess (guest, #197)
[Link]
Posted Mar 4, 2010 5:20 UTC (Thu)
by ringerc (subscriber, #3071)
[Link] (1 responses)
... if you're talking about ancient MySQL 3.x, sure. It'd be rather more accurate to say "... because the default MyISAM storage system used by the popular MySQL database doesn't support it." I only hope the quote is inaccurate or an oversimplified paraphrase, because I'd expect better of him. I'm a PostgreSQL user for good reasons, but I still get bugged by people pushing outdated misinformation about MySQL. It has more than its fair share of *real* faults and limitations; no need to hang onto ones that're thankfully in the past :-P
Posted Mar 4, 2010 15:30 UTC (Thu)
by felixfix (subscriber, #242)
[Link]
Posted Mar 4, 2010 5:35 UTC (Thu)
by ringerc (subscriber, #3071)
[Link]
Also: Whichever database system(s) you choose, you need to learn about them and understand them. This is particularly true for more powerful (and complex) systems like relational databases, but still applies to even simple key-value stores. You don't just need to understand the system, though, but also established techniques and ideas about how to best use the system(s) you've chosen. I see this a lot on the PostgreSQL mailing lists. For example, people will ask "why is my bulk data load so slow" and it'll turn out they're doing a million individual INSERT statements, one per record, over a high latency network - and they're running each as a standalone transaction. Each sequentially-executed statement requires at least one network round trip, plus an fsync() to make sure the data has hit disk before the server can reply with a completion notice, so of course it's slow. Get them to use COPY, or batch the inserts into large multi-VALUEd blocks inside one transaction and suddenly it's a thousand times as fast. In addition to that sort of thing - learning about how to best use the technology you've chosen - you also need to understand how to manage the data. Some simpler non-relational databases may not require formal schema definitions, but you still need to understand how data will be stored, retrieved, pruned, etc - and failing to plan for that will lead to a nightmare down the track.
Posted Mar 4, 2010 9:07 UTC (Thu)
by irabinovitch (guest, #30346)
[Link]
Posted Mar 4, 2010 14:14 UTC (Thu)
by nye (subscriber, #51576)
[Link] (5 responses)
It seems a truism that any product described by its adherents as 'revolutionary' isn't.
(Off the top of my head I'm having a hard time even trying to come up with any examples of revolutionary software changes - this is in part the reason why I think sw patents are fundamentally wrong.)
Posted Mar 4, 2010 17:30 UTC (Thu)
by iabervon (subscriber, #722)
[Link] (3 responses)
When something claims to be revolutionary, the claim is generally really that its evolution has just made the revolutionary change available to people who want a system that works and don't want to redesign their application regularly to account for recent discoveries.
Posted Mar 8, 2010 0:32 UTC (Mon)
by dmag (guest, #17775)
[Link] (2 responses)
- Ability to just "add nodes" to help performance, without massive administration overhead. This is non-trivial: You need merkel trees, vector clocks, gossip protocols, etc. See Cassandra and Riak (and the Dynamo paper).
- Ability to answer queries quickly - The RDBMS has tons of overhead, such as query parsing, index selection, query optimization, etc. Some of the Key-Value stores can answer a query before an RDBMS is finished figuring out what the query is.
- Ability to forget about the database enforcing schema -- that's the application's domain. (In the old days, the database had to enforce constraints because it was the "integration point". Now every app has an API, so the database is only used by one app.)
- Ability to forget about writing to disk. Writing to disk means you have to run disk-based recovery (fsck, special recovery code, etc.) It's slow and especially useless in the cloud (on EC2, you can instantly loose your disk and you don't get it back). Instead, if you want HA, you should do replication with fail-over (in multiple data centers.)
- Ability to forget about consistency - It's *much* easier to achieve HA if you drop consistency. Consistency isn't always a business requirement. (See Amazon, see airline overbooking policies)
Posted Mar 12, 2010 0:54 UTC (Fri)
by robert_s (subscriber, #42402)
[Link] (1 responses)
Posted Mar 12, 2010 19:44 UTC (Fri)
by dmag (guest, #17775)
[Link]
No, I didn't mean that. I wouldn't want Postgres to add all of those features, because it would be lots of extra complexity to get some "simple" features.
I was pointing out some limitations of current systems (since most people are still in the denial stage, saying "what's wrong with the RDBMS?") The fundamental concept behind the NoSQL movement is "we need new tools in the toolbox", and there are a lot of "new tools" to explore.
Posted Mar 7, 2010 7:46 UTC (Sun)
by njs (guest, #40338)
[Link]
rsync? Diffie-Helman/RSA?
The GPL?
But yeah, it's like the old rule about fields whose name contains the word "science"...
Posted Mar 5, 2010 6:03 UTC (Fri)
by skissane (subscriber, #38675)
[Link] (3 responses)
Likewise, any query language you want, you can use to query a SQL database. Once more, you just need to write the parser for your favourite query language, and some code to execute it.
In both cases, you may be able to reuse the target systems query engine to some extent, but may need to supplement it with your own code in some cases. For example, if I want to map SQL to LDAP, LDAP has equivalents to SELECT, FROM, WHERE, ORDER BY (if I use the sort control); but LDAP has no equivalents to JOIN or aggregates, so you'd have to write your own code for those.
SQL is not my favourite language -- if I was designing it, I would make the syntax much less COBOL-ish. I think its a pity we don't use something like QUEL or Tutorial D or BS12's query language. On the other hand, its the industry standard.
But please, no more trying to distinguish databases based on whether they support SQL or not. Any database can support SQL, or any other query language you want -- at most, its just a question of whether the database has that support builtin, or if you have to build it yourself. Its just an interface issue.
And also, too many people attack "relational" databases when they really mean to attack various real world implementations of relational theory. Most of their gripes are really with the implementations rather than the theory, and often the solution is not to junk the theory, but to improve the accuracy of its implementations.
Posted Mar 7, 2010 14:17 UTC (Sun)
by kleptog (subscriber, #1183)
[Link] (2 responses)
For example, suppose your database consists of a objects with attributes, dynamically created, which can contains references to other objects, not necessarily of the same types. Maybe there are no "types" at all and all objects are exactly as they come. They may reference other objects arbitrarily.
While you might try to make something that looks like SQL for such a database, but it's unlikely to be practical.
I think NoSQL should be reserved for databases which don't follow the relational model at all, and thus are unsuitable for SQL even if you wanted it to.
Posted Mar 7, 2010 16:53 UTC (Sun)
by bronson (subscriber, #4806)
[Link] (1 responses)
I had this last year: the data itself fit perfectly into tables with foreign keys. Problem is, we were trying to allow marketing people to slice and dice it in fairly arbitrary ways, and their needs would change from week to week.
This was typical data warehouse-type stuff. Compute the monetary total of all orders for customers from this region. If that's above a value provided by marketing, then what's the average zip code and standard deviation for the remaining customers who have dogs, etc. (you get the idea)
I wish I had done the whole thing in MongoDB using map-reduce. I think it would have been a lot faster, both to develop and to run. I wouldn't have to spend as much time figuring out which indices, counter caches, and denormalization that would be needed to make this week's reports complete in time.
So, even if your data model is nicely tabular, that doesn't mean your usage patterns will be!
Posted Mar 11, 2010 10:35 UTC (Thu)
by gvy (guest, #11981)
[Link]
Posted Mar 5, 2010 14:52 UTC (Fri)
by rgoates (guest, #3280)
[Link]
Posted Mar 5, 2010 17:37 UTC (Fri)
by dps (guest, #5725)
[Link]
If you want fast access to small portion of a large data set then databases obviously fit. Whatever the interface all databases offer efficient ways to find the portion of the data you want. The relational and security features of SQL databases are potentially valuable for some applications.
While I have not explored them in detail what I have seen of XML and NoSQL databases strikes me as inferior to the SQL alternative in almost all respects. The only exception is Berkley DB which is good in many instances where SQL would be serious overkill.
Posted Mar 5, 2010 19:13 UTC (Fri)
by dps (guest, #5725)
[Link]
I have actual experience that demonstrates rolling back many updates of several tables works as advertised. If you are looking for things missing from MySQL then OLTP support is probably a good place to start looking.
I suspect that the OLTP support is MySQL is somewhere between nonexistent and minimal but my knowledge might be out of date.
Posted Mar 5, 2010 19:15 UTC (Fri)
by einstein (guest, #2052)
[Link] (2 responses)
Just for the record, from the official mysql documentation: http://dev.mysql.com/doc/refman/5.0/en/commit.html
Comparing the postgres of tomorrow with the mysql of 1997, way to go!
Posted Mar 5, 2010 19:34 UTC (Fri)
by jake (editor, #205)
[Link] (1 responses)
Wow, I am going to have to disagree. While Josh was a tiny bit PostgreSQL partisan in his talk, he was most certainly not unstable or embittered. If my report makes him seem that way, then I reported poorly, for which you (and he) have my apologies.
jake
Posted Mar 7, 2010 23:01 UTC (Sun)
by tack (guest, #12542)
[Link]
Your wording most certainly did not "lump together MySQL and MS Access." The only legitimate complaint concerning MySQL (InnoDB supports transactions) has been pointed out and defended (MySQL+MyISAM is the default, and therefore "the most popular database").
Posted Mar 9, 2010 14:28 UTC (Tue)
by smitty_one_each (subscriber, #28989)
[Link]
Posted Mar 11, 2010 14:05 UTC (Thu)
by jwmittag (guest, #43097)
[Link]
I always thought the NoSQL moniker was a quite unfortunate choice of
naming, because it conveys exactly the opposite of what the whole
NoSQL movement (another not quite so fortunate moniker) is all about:
choosing the right tool for the job. Ruling out an entire class of
databases is certainly not about choosing the right tool for the
job. In fact, it is exactly what the NoSQL movement complains
about. A much better name would be the "Not-automatically-SQL"
movement or "Not-only-SQL" movement. Or, as Ben Scofield calls it, "polyglot
persistence" (in analogy to the polyglot programming movement, which
has been very successful in bringing the point across that there are
domains where Java might not be automatically the best choice). What
polyglot persistence is really about is changing the persistence
question from "Which Edition of Oracle is the right tool for job" to "Which
data organization, query algorithms, atomicity level, consistency level,
isolation level, durability level is the right tool for the job". It is
also about offering fast, scalable, dependable, robust, mature
persistence solutions for every point in that design space, but the first
goal is the more important one, IMHO.
PostgreSQL conference
"Transaction support is something that many open source people don't know about because the most popular database (MySQL) doesn't implement it."
SCALE 8x: Relational vs. non-relational
SCALE 8x: Relational vs. non-relational
SCALE 8x: Relational vs. non-relational
Session audio is available online as an mp3
SCALE 8x: Relational vs. non-relational
SCALE 8x: Relational vs. non-relational
SCALE 8x: Relational vs. non-relational
SCALE 8x: Relational vs. non-relational
SCALE 8x: Relational vs. non-relational
SCALE 8x: Relational vs. non-relational
SCALE 8x: Relational vs. non-relational
SCALE 8x: Relational vs. non-relational
SCALE 8x: Relational vs. non-relational
SCALE 8x: Relational vs. non-relational
I've met at least two bright examples of non-relational RDBMS which did shine where My|Pg or Ora just would not:
Regarding hierarchical RDBMS
Basically, if you have to do things like "this attribute holds a value changing the *meaning* of that attribute", then you just reinvented a hierarchical database where adding another leaf might be less pain and overhead. And you might have wanted to look a bit wider. :)
There was e.g. a discussion on sql.ru describing the details of a migration off a "legacy" hierarchical system to Java and Oracle -- which "doubled the performance" (forgetting to mention the need to go dual Xeon 51xx and external storage from something like dual Pentium with SCSI).
SCALE 8x: Relational vs. non-relational
SCALE 8x: Relational vs. non-relational
MySQL *does* have transactions
SCALE 8x: Relational vs. non-relational
SCALE 8x: Relational vs. non-relational
> of the success of the competition and consumed with putting them down
> at every opportunity.
SCALE 8x: Relational vs. non-relational
Minor MS Access point
http://msdn.microsoft.com/en-us/library/ms807955.aspx
Not that we'd go admitting familiarity with Redmond products, mind you.
The unfortunate "NoSQL" moniker and Polyglot Persistence