Wednesday, March 21, 2012

primary key error with INSERT INTO

Using the following t-sql statement on table with a primary key [DateTime], I get a primary key violation. How can I avoid adding duplicate records?

INSERT INTO [destSchema].[destTable]

SELECT t2.*

FROM [srcSchema].[srcTable] t2

LEFT JOIN [destSchema].[destTable] t1

ON t2.[DateTime] = t1.[DateTime]

WHERE (t1.[DateTime] IS NULL) AND (t1.[DateTime] <> t2.[DateTime])

ORDER BY t1.[DateTime];

Is [destTable].[DateTime] the primary key?

Code Snippet

INSERT INTO [destSchema].[destTable]

SELECT

t2.*

FROM

[srcSchema].[srcTable] t2

LEFT OUTER JOIN

[destSchema].[destTable] t1

ON

t2.[DateTime] = t1.[DateTime]

WHERE

t1.[DateTime] IS NULL

|||

Yes

|||

The dupe data can be coming from t2. So, you will have to decide what you want to insert into t1.

This query will give you a list of dupe dates.

Code Snippet

select t2.[DateTime]

from [srcSchema].[srcTable] t2

where not exists(select 1 from [destSchema].[destTable] t1 where t2.[DateTime] = t1.[DateTime])

group by t2.[DateTime]

having count(*)>1

|||

The code to list dupe dates works great. However, the other code generates the same primary key error that I′ve been getting all along:

Msg 2627, Level 14, State 1, Line 1

Violation of PRIMARY KEY constraint 'PK_destTable'. Cannot insert duplicate key in object 'destSchema.destTable'.

The statement has been terminated.

|||

If the code lists dupes, you will have to clean your data in table t2 before you insert it into t1. The bottom line, you have to guarantee the data from t2 is unique before you commit inserting into t1 - this involves either deleting the duped data or only selecting a row for each name. Else, you will get the primary constraint violation. This is by design.

Only you know your data, you will have to make the choice of what to insert into t1. If you post DDL + sample data + expected result here, we might be able to offer a solution.

|||

The following are sample fields in the source table, actual field names vary depending on the source but they all contain DateTime (Field0):

[DateTime] [datetime] NOT NULL,

[Field1] [decimal](10, 2) NULL,

[Field2] [decimal](10, 2) NULL,

[Field3] [decimal](10, 2) NULL,

[Field4] [decimal](10, 2) NULL

Here is some sample data from the source table that demonstrates the problem (non-black lines indicates duplicated rows). Note that the DateTime value is duplicated but the other fields contain different values:

2005-11-28 18:21:00,498.70,498.70,498.70,498.70
2005-11-28 18:22:00,498.50,498.50,498.50,498.50
2005-11-28 18:22:00,502.90,502.90,502.90,502.90
2005-11-28 18:23:00,498.40,498.40,498.40,498.40
2005-11-28 18:26:00,498.30,498.30,498.30,498.30
2005-11-28 18:26:00,502.70,502.70,502.70,502.70
2005-11-28 18:28:00,502.70,502.70,502.70,502.70
2005-11-28 18:28:00,498.40,498.40,498.40,498.40
2005-11-28 18:30:00,502.60,502.60,502.60,502.60
2005-11-28 18:31:00,498.30,498.30,498.30,498.30
2005-11-28 18:32:00,502.60,502.60,502.60,502.60
2005-11-28 18:33:00,502.60,502.60,502.60,502.60
2005-11-28 18:34:00,502.60,502.60,502.60,502.60
2005-11-28 18:36:00,502.50,502.50,502.50,502.50
2005-11-28 18:39:00,502.40,502.40,502.30,502.30
2005-11-28 18:39:00,498.10,498.10,498.00,498.00

Desired results:

2005-11-28 18:21:00,498.70,498.70,498.70,498.70
2005-11-28 18:22:00,498.50,498.50,498.50,498.50
2005-11-28 18:23:00,498.40,498.40,498.40,498.40
2005-11-28 18:26:00,498.30,498.30,498.30,498.30
2005-11-28 18:28:00,502.70,502.70,502.70,502.70
2005-11-28 18:30:00,502.60,502.60,502.60,502.60
2005-11-28 18:31:00,498.30,498.30,498.30,498.30
2005-11-28 18:32:00,502.60,502.60,502.60,502.60
2005-11-28 18:33:00,502.60,502.60,502.60,502.60
2005-11-28 18:34:00,502.60,502.60,502.60,502.60
2005-11-28 18:36:00,502.50,502.50,502.50,502.50
2005-11-28 18:39:00,502.40,502.40,502.30,502.30

Any and all help appreciated.

|||

Here you go.

Code Snippet

create table #tmp([DateTime] [datetime] NOT NULL,
[Field1] [decimal](10, 2) NULL,
[Field2] [decimal](10, 2) NULL,
[Field3] [decimal](10, 2) NULL,
[Field4] [decimal](10, 2) NULL)
go
create table #tmp2([DateTime] [datetime] NOT NULL,
[Field1] [decimal](10, 2) NULL,
[Field2] [decimal](10, 2) NULL,
[Field3] [decimal](10, 2) NULL,
[Field4] [decimal](10, 2) NULL)

go
insert #tmp
select '2005-11-28 18:21:00',498.70,498.70,498.70,498.70
union all select '2005-11-28 18:22:00',498.50,498.50,498.50,498.50
union all select '2005-11-28 18:22:00',502.90,502.90,502.90,502.90
union all select '2005-11-28 18:23:00',498.40,498.40,498.40,498.40
union all select '2005-11-28 18:26:00',498.30,498.30,498.30,498.30
union all select '2005-11-28 18:26:00',502.70,502.70,502.70,502.70
union all select '2005-11-28 18:28:00',502.70,502.70,502.70,502.70
union all select '2005-11-28 18:28:00',498.40,498.40,498.40,498.40
union all select '2005-11-28 18:30:00',502.60,502.60,502.60,502.60
union all select '2005-11-28 18:31:00',498.30,498.30,498.30,498.30
union all select '2005-11-28 18:32:00',502.60,502.60,502.60,502.60
union all select '2005-11-28 18:33:00',502.60,502.60,502.60,502.60
union all select '2005-11-28 18:34:00',502.60,502.60,502.60,502.60
union all select '2005-11-28 18:36:00',502.50,502.50,502.50,502.50
union all select '2005-11-28 18:39:00',502.40,502.40,502.30,502.30
union all select '2005-11-28 18:39:00',498.10,498.10,498.00,498.00
go
;with cte
as
(select *,
row_number() over(partition by [datetime] order by [datetime] ) r
from #tmp
)
insert #tmp2
select [Datetime],Field1,Field2,Field3,Field4
from cte
where r=1
and not exists(select 1 from #tmp2 t2 where t2.[Datetime]=cte.[Datetime])
go
select * from #tmp2
go
drop table #tmp, #tmp2

|||

With mycte

as

(SELECT myDatatime, f1, f2, f3, f4 FROM

(SELECT myDatatime, f1, f2, f3, f4, ROW_NUMBER() OVER(partition by myDatatime ORDER BY f1) as RowNum

FROM dupDateTimedata) t

WHERE RowNum=1)

SELECT * INTO dupDateTimedataRemoved

FROM mycte

|||

limno, "order by f1" will not give you the "top 1"...i.e. you will get this instead of the desired row.

2005-11-28 18:39:00.000 498.10 498.10 498.00 498.00

|||

Thanks oj for pointing this out. The problem is even with order by [datetime], we may still not get the right result.

We may need a little more clarification from rwbta to confirm your result.

My intention was by using Partion by datetime then I will keep the samllest number for f1 within the same datetime rows.

|||

Since we partition by datetime, order by datetime again will force the engine to generate the rownumber based on the logical order of the rows which we then select only the first row. Essentially, it is equivalent to "select top 1 * from tb" - this is what was asked by the OP as the desired result.

|||

This certainly turned out more complicated than I imagined. What additional information is needed?

Just as a summary, my original intention was to insert records into a new or existing table without including duplicate DateTime (primary key) values. If that's not possible, I would like to remove records in the source table which contain duplicate DateTime values.

Since the fields are not likely to contain exactly the same values in the duplicated records, DISTINCT won't work. Only the DateTime values are duplicated, inserting only the first occurrence of a duplicated DateTime would be acceptable. Or, alternatively, deleting subsequent duplications in the source table.

|||

Below is what I have done to resolve this problem. Add a primary key ID to the source table to aid in identification of duplicate DateTime's. Then delete duplicates. After that I can insert into a new or existing table.

Add PK ID:

ALTER TABLE srcSchema.srcTable

ADD

DataID int NOT NULL IDENTITY(1, 1),

CONSTRAINT PK_srcTable PRIMARY KEY(DataID)

Delete Duplicates:

DELETE FROM

t1

FROM

srcSchema.srcTable t1

INNER JOIN

(

SELECT

MIN(DataID) AS DataID,

[DateTime]

FROM

srcSchema.srcTable

GROUP BY

[DateTime]

HAVING

COUNT(*) > 1

) t2

ON(

t1.[DateTime] = t2.[DateTime]

AND

t1.DataID <> t2.DataID

)

No comments:

Post a Comment