Tuesday, March 20, 2012

Primary Key and Table Design Question

What would be a reasonable primary key to use in the following scenario?
I need to provide new functionality to an existing Web site. This new
functionality will be questionnaires (containing from 3 to 35 questions
each). Visitors to the site will open a questionnaire, answer the
questions, then click a "Submit" button. When the page is submitted to the
Web server, responses will need to be saved to the database (SQL 2K). Users
will not be logged in. For the sake of this question, please assume we have
dealt elsewhere with the issue of individual users submitting the same
survey multiple times (and other such issues not directly related to the
table design required to support this new functionality). Administrative
pages will enable the site's administrators to (1) define new Surveys, (2)
create new questions for each survey, and (3) retrieve and review responses
to existing surveys.
Three entities are apparent to me: "Surveys", "Survey Questions", and "Survey Response Sets".
"Surveys" would have a corresponding table that describes each survey
(title, subject, start_date, end_date, etc).
"Survey Questions" would have a corresponding table that holds things like
question_text, presentation_sequence, etc.
"Survey Response Sets" would have a corresponding table that holds responses
to each question.
I see a one-to-many relationship from Surveys to Survey Questions, and from
Survey Questions to Survey Response Sets.
Given this scenario, what would you use as the primary key for each of these
tables? In a former life I would have used an IDENTITY property for each
table - but I've painfully realized the downsides of going that route. So,
now that I'm trying to get away from IDENTITY, I'm wondering what would make
sense for my scenario. There is no standardized or well-known industry
identifier for Survey IDs, Question IDs, or Survey Response Set IDs, nor is
there any legacy system I'm converting from that already has a PK for me to
use.
Thanks!

What form do the responses take? Multiple choice? Free-form text? Or
something else? As you aren't recording names, it would seem a bit strange to
allow entirely free-form responses (there's probably little you can do to
analyze such data in the database anyway), but you haven't mentioned any
other scheme. As you aren't identifying the individual users, I assume you
are only interested in the total number of times each reply is given, hence
the "response_tally" column in the following first guess at a logical
design:
CREATE TABLE surveys
(survey_no INTEGER PRIMARY KEY,
 survey_title VARCHAR(50) NOT NULL UNIQUE,
 survey_subject VARCHAR(50) NOT NULL,
 start_date DATETIME NOT NULL,
 end_date DATETIME NOT NULL,
 CHECK (start_date <= end_date))

CREATE TABLE survey_questions
(survey_no INTEGER NOT NULL REFERENCES surveys (survey_no),
 sequence INTEGER NOT NULL CHECK (sequence > 0),
 question_text VARCHAR(255) NOT NULL,
 PRIMARY KEY (survey_no, sequence))

CREATE TABLE survey_responses
(survey_no INTEGER NOT NULL,
 sequence INTEGER NOT NULL,
 FOREIGN KEY (survey_no, sequence) REFERENCES survey_questions (survey_no, sequence),
 response_text VARCHAR(50) NOT NULL, /* constraints ? */
 response_tally INTEGER NOT NULL, /* number of times this answer was given */
 PRIMARY KEY (survey_no, sequence, response_text))
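
One way a submitted answer might be recorded under this design - just a sketch; the procedure name and parameters are illustrative, and a production version would handle the race between the UPDATE and the INSERT:

CREATE PROCEDURE record_response
    @survey_no INTEGER,
    @sequence INTEGER,
    @answer VARCHAR(50)
AS
    -- bump the tally if this exact answer has already been given for this question
    UPDATE survey_responses
    SET response_tally = response_tally + 1
    WHERE survey_no = @survey_no
      AND sequence = @sequence
      AND response_text = @answer

    -- otherwise record the answer for the first time
    IF @@ROWCOUNT = 0
        INSERT INTO survey_responses (survey_no, sequence, response_text, response_tally)
        VALUES (@survey_no, @sequence, @answer, 1)
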
David Portas
SQL Server MVP

Thank you so much David for your response. I understand that I didn't give a
whole lot of detail about the project's overall objectives... That's a judgement
call I made based on wanting, most particularly, to learn alternative ways to
implement a primary key as something other than an IDENTITY property
(in cases where I don't want to use a natural key). I didn't want a natural
key here because the question_text column, which would be a candidate, is not
only a varchar but may be quite long in some cases.
So your response shows me something I'd be comfortable using, since integers
are used in the primary key. Now, continuing with your DDL, where would
I get the actual integer values to use for [survey_no]? I have seen some of
you experts recommend a "numbers table" - would that be appropriate in this
scenario?
FWIW, these "surveys" are really not very static. It's not as though we can say
they will all take a specific format, have a pre-determined number of
questions, or that each question will be of any particular data type. For a
sample of the sort of thing we're implementing, you can look at this one:
http://www.jaguarwoman.com/order.html What we want to do is present a form
to the site's visitor, control as best we can the number of times
a given user/visitor can submit the form, and then store the results for
later reporting. Some such forms will be simple info-request forms like the one
at the above URL; others will be actual surveys or questionnaires with
Likert-scale-type responses, upon which we'll be performing statistical analyses.
And yes - we are most certainly NOT treating these as anything near
scientific (unless that particular Web site does force login with a valid
ID/password...). Given that these surveys/forms/questionnaires are
potentially so different per customer Web site, I didn't think it would be
useful to post DDL for each possible one - HOWEVER, each implementation would
likely involve some variation of the three tables described in the OP, for
which you provided a "best guess" DDL given that you can't read my mind
: )
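
For example, under your proposed design I imagine the kind of reporting we'd run later would look something like this (just a sketch against your tables, with survey 1 picked arbitrarily):

-- response breakdown per question for one survey
SELECT q.sequence, q.question_text, r.response_text, r.response_tally
FROM survey_questions q
INNER JOIN survey_responses r
    ON r.survey_no = q.survey_no
   AND r.sequence = q.sequence
WHERE q.survey_no = 1
ORDER BY q.sequence, r.response_tally DESC
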
Thanks!
"David Portas" <REMOVE_BEFORE_REPLYING_dportas@.acm.org> wrote in message
news:8NSdnSiIaJDU9QPfRVn-pA@.giganews.com...
> What form do the responses take? Multiple choice? Free form text? Or
> something else? As you aren't recording names it would seem a bit strange
> to allow entirely free format responses (there's probably little you can
> do to analyze such data in the database anyway) but you haven't mentioned
> any other scheme. As you aren't identifying the individual users I assume
> you are only interested in the total number of times each reply is given,
> hence the "response_tally" column in the following first-guess at a
> logical design:
> CREATE TABLE surveys (survey_no INTEGER PRIMARY KEY, survey_title
> VARCHAR(50) NOT NULL UNIQUE, survey_subject VARCHAR(50) NOT NULL,
> start_date DATETIME NOT NULL, end_date DATETIME NOT NULL, CHECK
> (start_date<=end_date))
> CREATE TABLE survey_questions (survey_no INTEGER NOT NULL REFERENCES
> surveys (survey_no), sequence INTEGER NOT NULL CHECK (sequence>0),
> question_text VARCHAR(255) NOT NULL, PRIMARY KEY (survey_no, sequence))
> CREATE TABLE survey_responses (survey_no INTEGER NOT NULL, sequence
> INTEGER NOT NULL, FOREIGN KEY (survey_no, sequence) REFERENCES
> survey_questions (survey_no, sequence), response_text VARCHAR(50) NOT NULL
> /* constraints ? */, response_tally INTEGER NOT NULL /* number of times
> this answer was given */, PRIMARY KEY (survey_no, sequence,
> response_text))
> --
> David Portas
> SQL Server MVP
> --

I've worked with a design that uses two procedures: GetNextSurrogateKey and
GetNextBlockSurrogateKey. The first reserves the next key value and returns
it in an output parameter. The second reserves a specified number of key
values and returns the first of them in an output parameter. The next key
value for each table is stored in a table with one record per surrogate-keyed
table, and GetNextSurrogateKey increments that next-key field. The problem
with this approach is two-fold: (1) it increases the probability of deadlocks,
and (2) it reduces concurrency. Calls to GetNextSurrogateKey must occur in
the same order in every transaction--in other words, you have to get the next
key for TableA before getting the next key for TableB in every transaction
that occurs against the database. Even if you use the WITH (ROWLOCK) hint,
SQL Server may still take a page lock, which effectively blocks inserts into
all tables whose surrogate-key record resides on that page. This can lead to
deadlocks that are really hard to debug, or at a minimum to waiting on records
locked by another process. The implementation I saw padded the records so
that they were stored one per page, in a logically flawed attempt to get
around this.
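
To make the pattern concrete, here is a minimal sketch of the key table and GetNextSurrogateKey (names and shapes are illustrative, not the actual implementation I worked with):

CREATE TABLE next_keys
(table_name SYSNAME PRIMARY KEY,
 next_key INTEGER NOT NULL)
GO

CREATE PROCEDURE GetNextSurrogateKey
    @table_name SYSNAME,
    @key INTEGER OUTPUT
AS
    -- assign and increment in a single statement so the counter row is held only briefly;
    -- even so, all inserts for a given table serialize on this one row
    UPDATE next_keys
    SET @key = next_key = next_key + 1
    WHERE table_name = @table_name
GO

-- usage:
DECLARE @id INTEGER
EXEC GetNextSurrogateKey @table_name = 'surveys', @key = @id OUTPUT
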
I prefer to use IDENTITY columns to avoid the above pitfalls. It requires
extra code on the client to obtain the identity value(s), and it's painful
when you're inserting records into related tables en masse, but in my opinion
the benefits outweigh the subsequent maintenance and debugging nightmares
that are sure to ensue.
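
The extra code I mean is roughly this sketch (assuming, say, surveys.survey_no were redefined as an IDENTITY column; the literal values are only illustrative):

DECLARE @new_survey_no INTEGER

INSERT INTO surveys (survey_title, survey_subject, start_date, end_date)
VALUES ('Customer feedback', 'Service quality', '20120401', '20120430')

-- SCOPE_IDENTITY() returns the value generated in this scope;
-- @@IDENTITY could instead pick up a value generated by a trigger
SET @new_survey_no = SCOPE_IDENTITY()

-- @new_survey_no is then used for the related survey_questions inserts
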
If you're dead set against using IDENTITY columns, you could write an
extended stored procedure or COM object to implement the above
GetNextSurrogateKey pattern. The xp would execute outside the current
connection, which would prevent the concurrency and deadlock issues described
above. A COM object would scale better, because it could maintain a pool of
open connections and so minimize the overhead of initiating a new connection
for each call.
"Jeffrey Todd" wrote:

> Thank you so much David for your response. I understand that I didn't give
a
> whole lot about the project's overall objectives... That's a judgement cal
l
> I made based on my wanting, most particularly, to learn alternative ways t
o
> implement a primary key that is as something other than an IDENTITY proper
ty
> (in cases where I don't want to use a natural key). I didn't want a natura
l
> key here because the question_text column, which would be a candidate, not
> only will be a varchar, but it may be quite long in some cases.
> So, your response shows me something I'd be comfortable using - as integer
s
> are used in the primary key. Now, continuing with your DDL, from where wou
ld
> I get the actual integer values to use for [survey_no]? I have seen some o
f
> you experts recommend a "numbers table" Would that be appropriate in this
> scenario?
>
> FWIW, these "surveys" are really not very static. It's not like we can say
> that they all will take a specific format, have a pre-determined number of
> questions, each of which is of any certain data type. For a sample of the
> sort of thing we're implementing, you can look at this one:
> http://www.jaguarwoman.com/order.html What we want to do is present a form
> to the site's visitor, control to the best extent we can the number of tim
es
> a given user/visitor can submit the form, and then store the results for
> later reporting. Some such forms will be simple info request forms like at
> the above URL, others will be actual surveys or questionnaires with Likert
> scale-type responses, upon which we'll be performing statistical analyses.
> And yes - we are most certainly NOT treating these as anything near
> scientific (unless that particular Web site does force login with a valid
> ID/password...). Given that these surveys/forms/questionnaires are
> potentially so different per customer Web site, I didn't think it would be
> useful to post DDL for each possible one - HOWEVER each implementation wou
ld
> likely involve some variation of the three tables described in the OP, and
> for which you provided a "best guess" DDL given that you can't read my min
d
> : )
> Thanks!
>
>
> "David Portas" <REMOVE_BEFORE_REPLYING_dportas@.acm.org> wrote in message
> news:8NSdnSiIaJDU9QPfRVn-pA@.giganews.com...
>

Thank you Brian for your thoughtful response. It is very helpful to have
that insight before I go off and paint myself into another corner with
problems as bad as the ones an IDENTITY implementation would have.
<<but in my opinion the benefits outweigh the subsequent maintenance and
debugging nightmares that are sure to ensue>>
From the research I have done in my efforts to get away from IDENTITY and do
things "properly", I've discovered that a lot of work will have to be done
regardless of which approach is chosen (IDENTITY vs. anything else). There
are substantial problems to be mitigated with intelligent decision-making
in every approach I've seen - even with the highly touted "natural keys"
(cascading updates notwithstanding).
So I guess for my scenario, if you were the one having to implement it,
you'd go with IDENTITY. Maybe I should reconsider and go with IDENTITY too.
The thing I hate most about IDENTITY is that everything falls apart when
migrating data from one DB to another, something I want to be able to do
without having to think about changing primary key values.
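
For example (a sketch only, assuming survey_no were an IDENTITY column and "source_db" stands in for the database being copied from), even preserving the original key values on a copy means extra ceremony on every load:

SET IDENTITY_INSERT dbo.surveys ON

INSERT INTO dbo.surveys (survey_no, survey_title, survey_subject, start_date, end_date)
SELECT survey_no, survey_title, survey_subject, start_date, end_date
FROM source_db.dbo.surveys

SET IDENTITY_INSERT dbo.surveys OFF

-- then reseed so locally generated values don't collide with the copied ones
DBCC CHECKIDENT ('dbo.surveys')
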
Anyone else? Thoughts, rants, opinions, perspectives, or suggestions on my
particular scenario are greatly appreciated.
-JT
"Brian Selzer" <BrianSelzer@.discussions.microsoft.com> wrote in message
news:48354E54-A332-40B1-A9C8-B5AA78A3843F@.microsoft.com...
> I've worked with a design that uses two procedures: GetNextSurrogateKey
> and
> GetNextBlockSurrogateKey. The first reserves the next key value and
> returns
> it in an output parameter. The second reserves a specified number of key
> values and returns the first key value in an output parameter. The next
> key
> value for each table is stored in a table with one record per Surrogate
> Key
> table. GetNextSurrogateKey increments the next key value field. The
> problem
> with this approach is two-fold: (1) it increases the probability of
> deadlocks
> and (2) it reduces concurrency. Calls to GetNextSurrogateKey must occur
> in
> the same order in every transaction--in other words, you have to get the
> next
> key for TableA before getting the next key for TableB in each transaction
> that occurs against the database. Even if you use the WITH ROWLOCK hint,
> the
> optimizer may escalate to a PAGE LOCK, which effectively blocks inserts
> into
> tables whose SurrogateKey record resides on that page. This can lead to
> deadlocks which are really hard to debug, or at a minimum waiting for
> records
> locked by another process. The implementation I saw padded the records so
> that they were stored one per page in a logically flawed attempt to get
> around this.
> I prefer to use IDENTITY columns to avoid the above pitfalls. It requires
> extra code on the client to obtain the identity value(s), and it's painful
> when you're inserting records into related tables en mass, but in my
> opinion
> the benefits outweigh the subsequent maintenance and debugging nightmares
> that are sure to ensue.
> If you're dead set against using IDENTITY columns, you could write an
> extended stored procedure or COM object to implement the above
> GetNextSurrogateKey pattern. The xp would execute outside the current
> connection, which would prevent the concurrency and deadlock issues
> described
> above. A COM object would scale better, because it would minimize the
> overhead associated with initiating a new connection for each call,
> because
> it could maintain a pool of open connections..
> "Jeffrey Todd" wrote:
>
