The Google Mirror Site
 
 
 

Monotonically decreasing jackassery

Warning: nerdy post. If you come here for laughs, you will not find them in this post. Unless you like laughing at ignorance, in which case, strap on your chuckle boots and start wading.


I feel like I'm constantly having a discussion about whether or not database schema should have an numeric, auto-increment, primary keys. I'm usually on the side of 'yes, stupid' but for some reason I find that I'm in charge of managing a lot of databases that do not have one of these columns for each table in the database. Not that we don't have primary keys, of course, but rather that those primary keys are often made out of the intrinsic properties of the data being stored, and not something imposed from outside.

I've only recently learned that the question at hand is whether a table should have a Natural or Surrogate primary key. Natural primary keys are built directly from your data: If we are storing some collection for a user of a system, perhaps the natural key would be the name of the collection plus the uid of the user. This makes a lot of sense, kind of - the name of the collection shouldn't change, the number of the user shouldn't change, and so those things together should make a pretty solid primary key. Trivial!

Surrogate keys seem kind of dumb at first glance. The most common (to the best of my knowledge) surrogate key is a unique integer id assigned to every row in your table. It's definitely wasteful, since it appears to necessarily denormalize your database (as I understand it, anyway. If there is another collection of columns in the db that uniquely identifies the record, than having a separate unique column is redundant).

Even though surrogate keys might seem dumb, and natural keys might seem to make sense, I really have to keep reminding myself never ever to let anyone talk me into using a natural key for a table again. Ever. You know why? Because when someone tells you that some piece of data is not going to change, they are lying to you. They are probably not being malicious, but it doesn't make it any better. So you might find yourself in a situation where (hypothetically), someone wants some portion of a natural key changed, so the only way to do it is to delete the record and create a new one that looks just like it, but has a different value in that portion of the natural key; but some other programmer has (supposedly) leveraged the fact that, since primary keys do not change, he can use the create date of the record to sort the data. But after you delete and recreate the record, you need to go in there and manually change the create date of the record, which is a disaster.

I will accept that you probably shouldn't be using the create date as a sorting key in your application code, but here we (putatively) are. If a surrogate key was used, you could change ANYTHING in the record without any problems, since the (single, unique) primary key wouldn't change. As a surrogate, it is NECESSARILY never going to change - it exists in a space that's completely orthogonal to your data, and so will never change if your requirements change.

Then the only problem is keeping some jackass programmer from using it to sort in his app code. Attack that next - we strive for monotonically decreasing jackassery in our code, so take it one step at a time.

Reference: Some page on ASP FAQ.com, which I liked because not only does it use the phrase "pissing contest", but it also provides a link to Usenet.

TrackBack

TrackBack URL for this entry:
http://www.alltooflat.com/cgi-bin/mt-tb.cgi/226

Comments (2)

Here's a good rule of thumb: ALWAYS use a surrogate key with a generated value.

Trust me. You will be much happier. There is (pretty much) no such thing as a natural key that never changes.

There are 2 very pragmatic reasons to use surrogate keys too:

1. All the ORM frameworks rely on them by default. You can certainly override them, but your life will be much simpler if you just go with the flow.

2. Every programmer can obviously understand what a primary surrogate key is. No documentation necessary. Trying to explain why the union of SSN+birthday+usernum is the primary key DOES take documentation (which doesn't get read and leads to bugs).

Ok, so once you accept this, the next thing to wrap your head around is the distinction between object identity and object equality. In this case, the id represents identity and the business keys (read: member variables) are used for equality (and hashcode). So keep ID out of your equals() code.

I recommend reading some of the Hibernate docs (either the reference guide pdf or the Manning book) and a chapter or 2 on ActiveRecord. Much easier to understand why these things are important in the context of an application framework that in ER diagrams like on Sun's DAO site.

Oh, and speaking of generated IDs, remember that they may or may not be consecutive integers. When you have a cluster of databases, it is important that these keys don't collide. Usually this is solved by using a UUID, a central key generation service, or different offsets and increments between the keys (read docs on multimaster MySQL replication for more).

And on a final note, use an ORM library.

shelly:

you know how sometimes you run across postmodern poetry when you're browsing the racks at your local indie bookseller, and some "staff" person has recommended it, and it's obviously pretentious nonsense, but you read one poem that somehow is FANTASTICALLY BEAUTIFUL nonsense?

this post is a lot like that for me. i have no idea what the fuck you're talking about, but it's really pretty and rhythmic and somewhere i know it translates into something that happens in the sort-of physical world. and that elevates it into some kind of metawondermetaland. good work.

Post a comment

(If you haven't left a comment here before, you may need to be approved by the site owner before your comment will appear. Until then, it won't appear on the entry. Thanks for waiting.)

Join All Too Flat now!
Site Map [rss] Huge Huge! © 2005 Contact The Webmaster
Donate to help Alltooflat with the bandwidth bills