IxiRancid Posted September 27, 2005 Posted September 27, 2005 Is there a way to find/match non-primarykey records in two different tables? problem: TableA - contains records ('John', 'Dhough', ' ') TableB - contains same records as TableA ('John', 'Dhough', 'D') both tables do not have primary keys when a transaction is made a record from TableA is replicated into TableB, however it has a filed marked 'D', matching record from TableB must be then deleted. Major problem: as the rows are identical (except 'D'), no primary key, but there could be up to 25 SAME, IDENTIC records in TableB: 1. 'John', 'Dhough', ' ' 2. 'John', 'Dhough', ' ' 3. 'John', 'Dhough', ' ' So, which one to delete? I've been looking at some RID (record identifier), but can't find anywhere if these RID's are perhaps equal to some byte? Quote
kejpa Posted September 27, 2005 Posted September 27, 2005 The easiest way is to use a primary key. ALL tables should have a primary key, or at least a unique. How else can you distinguish between two records? Make sure you have primary keys for your tables and your problems are gone. Along with a lot of other hard-to-trace anomalities ;) HTH /Kejpa Quote
IxiRancid Posted September 27, 2005 Author Posted September 27, 2005 Yes I know, however this is a 15 year old DB2 system built by outsourcing company and this wasn't meant to be a big transactional base dynamicaly used, more as a storage. So now we got this problem... I though that something could be done using RID's or any other suggestions. This is an issue however: records in TableB ARE ALL EQUAL, so why even bother which one to delete? Just delete one and that doesn't have any effect on others ;) Quote
kejpa Posted September 27, 2005 Posted September 27, 2005 Don't know much about DB2 and RID's. By using simple SQL you can't distinguish between identical records. If you try to delete one you delete'em all. Some database engines support "top"/"limit" or other keyword for restricting the number of affected records, the DB2 syntax and options am I unfortunately unaware of. /Kejpa Quote
IxiRancid Posted September 27, 2005 Author Posted September 27, 2005 Hmm... I wonder how DB2 or SQL actualy inserts records? With no primary key or grouping specified it probably just adds row one after another? However in what order does a normal SELECT * FROM TABLE return these rows? I'll try tomorrow. Thanks for suggestions! Quote
kejpa Posted September 28, 2005 Posted September 28, 2005 As I said, I've never used DB2 my experience is from Access, SQL and mySQL. If you insert three records they appear as 1, 2, 3. If you then delete #2 and insert a 4th and a 5th the order usually will be 1, 4, 3, 5. Records are inserted where there is "empty" space, if there is no primary key defined. So there's no telling which row was inserted last :( HTH /Kejpa Quote
*Experts* Nerseus Posted September 28, 2005 *Experts* Posted September 28, 2005 If this were me, I'd add an "identity" column to your table first. You can make this field up yourself if you want - just make sure it's a unique value for every row. Then you're left with deleting the duplicates. That shouldn't be that hard. What I generally do is something like the following. This assumes you have two fields you want to match on, FirstName and LastName: SELECT MIN(MyNewID) AS MyNewID, FirstName, LastName INTO #temp FROM Table1 GROUP BY FirstName, LastName DELETE FROM Table1 WHERE MyNewID NOT IN (SELECT MyNewID FROM #temp) This puts all matching records in a temp table (works in SQL Server, not sure about DB2). The GROUP BY will get you a single row - you said you don't care which one. The MIN(MyNewID) is to, again, get a single ID per matching set of values. I chose MIN but you could use any aggregate function. You said it didn't matter, so I prefer MIN - it may find it faster if the table has an index. If you CAN'T add a column to the existing table, I'd suggest first copying the table to a new table where you CAN add the unique column. Do your cleanup there then TRUNCATE the original table and copy everything back. -ner Quote "I want to stand as close to the edge as I can without going over. Out on the edge you see all the kinds of things you can't see from the center." - Kurt Vonnegut
IxiRancid Posted September 28, 2005 Author Posted September 28, 2005 Nice and constructive suggestion, however there are numerous applications that use this table and the other replicated on SQL. I don't think this would work in a short term... @Kejpa: that's interesting, didn't know this is the way it works. Does this mean that SQL actualy has some reference where a record was ("empty space")? --- If you insert three records they appear as 1, 2, 3. If you then delete #2 and insert a 4th and a 5th the order usually will be 1, 4, 3, 5 --- Quote
kejpa Posted September 29, 2005 Posted September 29, 2005 In some sense yes. Access use diskspace for this, a small database where you have made a number of deletes, inserts and updates can easily be over 10MB then when you compact it it shrinks back to fit a diskette. SQL server (SQL is a language you talk to databases with) uses some other means for detecting "freed space" Haven't studied how mySQL does it, I nowadays always use primaries (<LOL> yeah, right!) /Kejpa Quote
*Experts* Nerseus Posted September 29, 2005 *Experts* Posted September 29, 2005 In SQL Server with a clustered index, the engine stores rows on the disk in the order defined by the clustered index. The clustered index does not have to match the primary key, but often does. If you had a clustered index then inserting "row 2" where 4 million rows currently exist (1, followed by 3 through 4 million) would take a LONG time as the engine would have to move rows 3 through 4 million "up" on the disk to make room for row 2. To go back to your original question - you need to identify and delete duplicate rows but you can't modify the table? The only solution that comes to mind is to use a cursor to loop over the rows and delete inside the loop. That seems really nasty compared to adding a column for a one time cleanup task - you could always remove the column if you're worried it will cause issues. -ner Quote "I want to stand as close to the edge as I can without going over. Out on the edge you see all the kinds of things you can't see from the center." - Kurt Vonnegut
IxiRancid Posted September 29, 2005 Author Posted September 29, 2005 Heh :) it is nasty, specialy with the deadline we have. But we'll probably use outsourcing for that. And yes, we cannot change the table structure. This was another issue: as I said it really doesn't matter which (identical) record we delete, but we would need to make an logging table which records and when were deleted. That wouls solve this issue, but this is a large operation and we have limited resources in networking and daily deadlines itself. I'll let you know what the solution was/will be :) Quote
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.