Monday, May 19, 2014

logic delete and data integrity in RDBMS


Something came to my mind on this topic and gave it some thought and I would like to share with you.. Please jump in with your take on it, I am do not think I have the golden solution for it.  Okay let me get it started.

 In the database world, data integrity is second to none in term of importance. More specifically there are Entity integrity and Referential integrity are the two most vital integrities of any database.

Entity integrity ensure there is no duplicated rows in the table, in a relational database this is enforced by defining primary key and unique constraint.

Referential integrity ensure that the reference value are valid value in the master table.  Take an example, you want to ensure that the state code in the address table is a value state code in the State table.  In a relational database this is enforced by defining foreign key.

Those are working well.  Wait, they working well before logic delete is implemented. They stop working as soon as logic delete gets into the picture.

Consider, this case, in our database we have a table call Course which lists all courses offered by the college.  From business point of view, the course code is the unique of the table. There is a table call Student, which stores all students if the college. There is a table call [Course Election] which stored all courses elected by all students.

Before we logic delete is implemented, the primary key for Course table is [Course Code], the primary key for Student is [Student Id], the primary key for [Course Election] is [Student Id] and [Course Code] combine .   Then we define the foreign keys in [Couse Election] for [Course Id] in reference [Course] table and define foreign key in [Course Election] for [Student Id] in reference [Student] table.

With this design, no row could be added to [Course Election] without valid [Student Id] or [Course Code], for each [Student Id] and [Course Code] combine, there is only one row is allowed in [Course Election] Table. These are all good. But let’s add logic delete flag in Student table and Course table.

This is how we design it to work, when user update the information for a student, the original row will be marked as Deleted and a new row is to be inserted. The similar logic is implemented for Course Table.  From the surface, it looks neat. I have one table storing the current data as well as history data.  But if you take a closer look you would find you created more problems than you plan to solve.  First of all, what would be the primary key for course and Student tables? Just   [Student Id] and [Course Code] would not work, because the same student may have many rows, with many deleted rows and one active row.  How about added the active flag as part of primary key? That wouldn’t work either. Because there could be many deleted row for the same student.  How about adding a creation TimeStamp column into the table and make it part of primary key? Well the table now accept many deleted row for the same person as long the rows are not created at the same time, but it also accept multiple active row for the same student as long as there are not created at the same time. This is not what you want to see.

Enough about the problem on primary key. The situation on foreign key is not in any way better. Let’s say we have student got deleted because he or she has transferred to another college, the row is still in the table for the student, it is just that the Deleted Flag is true.  From foreign key point of view, there is no problem we added an election for a specific course for the transferred student.  But from business point of view, that is against the business rule.

The conclusion is that as soon as logic delete is implemented, our hope of using primary key, and foreign key to enforce Entity integrity and Referential integrity is gone for good.

The question still, the business requirement needs to store the history information on these important tables, how do we do it?  It is a valid question for any data model designer.  Well, this is my solution to the business problem.

I will define the [Course] Table and [Student] Table as they are, no logic delete flag, and I will design the [Course Election] Table as pre normal. This is so that the primary key and foreign keys are working as expected. Then I will design [Couse Audit]  and [Student Audit] Table. These audit tables have 2 set of columns, one for current values and one for previous values and with a set of columns recording, who, when and what kind of information. In these tables, the primary key will be a guid or an identity column. No foreign keys no unique constraint.

Then I create a trigger for the table to handle AFTER DELETE, INSERT, UPDATE to insert into the Audit table with the following query


Select *
From inserted I
FULL OUTER JOIN Deleted on I.Id = D.Id

With this design, the data history is preserved, the data integrity is enforced by the database schema, the coding (whether it is  .Net coding or store procedure coding ) are simplified

Historically speaking, logic delete was one of the legacy carried form Mainframe system in the time there was no relational database yet. If we carry this legacy item into our modern database design without fully understanding of the impact of such legacy, we might got ourselves in trouble… be careful. You want your modern database designed as modern database which data integrity is enforced by the database engine.      

No comments:

Post a Comment