I have an issue where, based on a value in a column, I need to do some processing of the previous and current row. The data flow is already sorted. I tried creating a Script data flow transformation to do this, but it isn't working right and debugging it is a pain. Would anyone know the best way to do this, or have some helpful pointers? I tried "firing" information events to help debug, but that doesn't help when the error I get back is a stack overflow.
An example of what I'm trying to do: process the sorted incoming rows for each person. Each person can have multiple rows. Based upon a "status" column in each row, do different processing on the previous or current row. Some pseudocode:
- if prev.PersonID = current.PersonID
  - if status = 1: change prev.PersonDate to today + 60 days
  - if status = 2: change current.PersonDate to prev.PersonDate, then change prev.PersonDate to today + 1 day
- else
  - send rows to output
Though, if you have a variable number of rows per PersonID, I'm not sure how well you can implement this in SSIS. Jay or some of the other script guys around here might have to chime in.|||
Yeah, each person can have one or more rows; there isn't a fixed number of rows per person. I looked at the link and it wouldn't work in my situation. I'm not sure there's a good way to do it in straight SQL either. Does anyone have suggestions on how to do it in SQL, if possible, without cursors?
I think I'm going to play around with SQL for a little bit and see if I can come up with something. However, executing an OLE DB Command against each of 1+ million records is going to slow the process down significantly, I suspect. I wonder if there is a batch-update way of doing this?
|||I've done this many times. Usually I need to do some type of complex aggregation of multiple rows and only output a single row for a distinct entity. It's a great benefit that your data is already sorted. I wrote the code below mostly from memory, so there may be some syntax problems, but hopefully you get the idea. You cache each row until you've seen the next one, so you're always writing one row behind the current one. You have to override FinishOutputs so you can write out your last row.
Code Snippet
Public Class ScriptMain
    Inherits UserComponent

    ' Holds the columns of the previous row until the next row has been seen.
    Private Class BufferClass
        Public PersonID As Integer
        Public Status As Integer
        Public PersonDate As DateTime
    End Class

    Private PreviousPersonID As Integer = -1
    Private Buffer As BufferClass = Nothing

    ' Writes the cached (previous) row to the output and clears the cache.
    Public Sub WriteBuffer()
        If Not Buffer Is Nothing Then
            With NewRecordsBuffer
                .AddRow()
                ' Copy the persisted values from the class to the output buffer.
                .PersonID = Buffer.PersonID
                .Status = Buffer.Status
                .PersonDate = Buffer.PersonDate
            End With
            Buffer = Nothing
        End If
    End Sub

    Public Overrides Sub FinishOutputs()
        ' Write out the last cached row before the outputs are closed.
        WriteBuffer()
        MyBase.FinishOutputs()
    End Sub

    Public Overrides Sub Input_ProcessInputRow(ByVal Row As InputBuffer)
        Dim PreviousPersonDate As DateTime
        Dim SamePerson As Boolean = (Row.PersonID = PreviousPersonID)

        If SamePerson Then
            ' Modify the cached previous row.
            If Row.Status = 1 Then
                ' today + 60 days (DateAdd takes interval, number, then date)
                Buffer.PersonDate = DateAdd("d", 60, DateTime.Today)
            ElseIf Row.Status = 2 Then
                PreviousPersonDate = Buffer.PersonDate
                Buffer.PersonDate = DateAdd("d", 1, DateTime.Today)
            End If
        End If

        ' Write the previous row.
        WriteBuffer()

        ' Cache the current row.
        Buffer = New BufferClass
        With Buffer
            ' Fill the class with the columns that need to be persisted.
            .PersonID = Row.PersonID
            .Status = Row.Status
            ' The copied date only applies when the previous row was the same person.
            If Row.Status = 2 AndAlso SamePerson Then
                .PersonDate = PreviousPersonDate
            Else
                .PersonDate = Row.PersonDate
            End If
        End With

        PreviousPersonID = Row.PersonID
    End Sub
End Class
|||
Thanks! I'll give this a shot. This is similar to what I was doing, except I was getting a stack overflow problem and I'm not sure why. The only difference was that I was buffering all of each person's rows (each person had 1 or more rows, with the most being 14), but only changing data in the current and previous rows. When the current row was a different person, it would output all the buffered rows.
Anyways I'll give this a shot and see.
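For reference, here is a minimal sketch of that batch-per-person variant: buffer every row for the current person in a list, then flush the whole list when the PersonID changes (and again in FinishOutputs). It assumes the same column names, output name, and method signatures as JayH's example above (PersonID, Status, PersonDate, NewRecordsBuffer, Input_ProcessInputRow); it is not the poster's actual implementation, just one way the batch approach could be wired up.
Code Snippet
Imports System.Collections.Generic   ' the rest of the SSIS-generated scaffolding is assumed

Public Class ScriptMain
    Inherits UserComponent

    ' Same column layout as the example above (assumed, not the poster's actual code).
    Private Class RowCache
        Public PersonID As Integer
        Public Status As Integer
        Public PersonDate As DateTime
    End Class

    ' All rows buffered for the person currently being processed.
    Private PersonRows As New List(Of RowCache)
    Private PreviousPersonID As Integer = -1

    ' Flush every buffered row for the finished person to the output.
    Private Sub FlushPerson()
        For Each r As RowCache In PersonRows
            With NewRecordsBuffer
                .AddRow()
                .PersonID = r.PersonID
                .Status = r.Status
                .PersonDate = r.PersonDate
            End With
        Next
        PersonRows.Clear()
    End Sub

    Public Overrides Sub FinishOutputs()
        ' Don't forget the last person's rows.
        FlushPerson()
        MyBase.FinishOutputs()
    End Sub

    Public Overrides Sub Input_ProcessInputRow(ByVal Row As InputBuffer)
        If Row.PersonID <> PreviousPersonID Then
            ' New person: write out everything buffered for the previous one.
            FlushPerson()
        End If

        Dim current As New RowCache
        current.PersonID = Row.PersonID
        current.Status = Row.Status
        current.PersonDate = Row.PersonDate

        ' Previous row of the same person, if any.
        If PersonRows.Count > 0 Then
            Dim prev As RowCache = PersonRows(PersonRows.Count - 1)
            If Row.Status = 1 Then
                prev.PersonDate = DateTime.Today.AddDays(60)
            ElseIf Row.Status = 2 Then
                ' Copy the previous date before overwriting it.
                current.PersonDate = prev.PersonDate
                prev.PersonDate = DateTime.Today.AddDays(1)
            End If
        End If

        PersonRows.Add(current)
        PreviousPersonID = Row.PersonID
    End Sub
End Class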
|||JayH, does this process normally take a lot of time? It is extremely slow for me.|||Hmm, I took out the FireInformation method I was using to debug (just one line) and now it is 1000 times faster.
JayH, I was wondering if it would be faster to output each row individually or to output each person's batch of rows?
|||It works! Thanks JayH! I implemented mine so that it handles batches rather than just the previous row. As far as performance between each row or a batch, I'm not sure, but implementing it as a batch is faster than a Merge Join transformation.|||
thames wrote:
It works! Thanks JayH! I implemented mine so that it handles batches rather than just the previous row. As far as performance between each row or a batch, I'm not sure, but implementing it as a batch is faster than a Merge Join transformation.
I'm glad you got it going. I doubt you'll find any performance difference between lagging only one row and lagging all the rows for the PersonID, especially since you're only expecting a max of four rows per PersonID. I think the code is probably simpler if you only do it for one row, but it's just a matter of preference.