  1. #1
    Join Date
    Aug 2006
    Posts
    4

    Dimensional Design of a Factless Fact with too many attributes

    I am exploring the design options for the dimensional model of a Health Care System. In this design we don't have any financial data or any sort of concrete measurements. The business process works as follows:

    The patient enters the Hospital or Health Services Center and answers some questions on an assessment form. Based on the answered questions, it is determined what kind of services are needed by the patient. This entire process is termed the "in-take assessment" and focuses solely on assessment.

    Mostly the reports will be measuring the count of assessments based on various dimensions. In this model we have Patient, Assessing Organization, Diagnosis (multivalued), Time, User (the one who assesses), Assessment Type, Care Product, etc. as Dimensions and the Actual Assessments as the Factless Fact. The assessment form consists of objective questions (defined values/radio button choices - mostly int or string) and open-ended questions (free text). There are approximately 100 questions in the form. The reporting criteria (where clause) can be one of the above mentioned dimensions or one of the answers given by the patient on the form. If I create one dimension per Q/A it will be a centipede design with too many dimensions.

    Now my question is

    *whether the assessment Fact should contain all answers in one row in different question columns (in which case, it will be a very big table - approximately 120 columns in one row with 30,000 Assessments each month) or

    *should the questions be in a dimension, in which case the dimension will be as large as the fact, in terms of rows. It will act more like a fact table. It is like having a Primary Key in the Fact table Assessment and Foreign Key in the Assessment Q/A table or

    *should I try to come up with various combinations of possible answers for these questions and then put them in a dimension, in which case the dimension again will be very large or

    *should I distribute the questions in combinations of 20 each and then apply possible answer combinations in 5 different dimensions or

    *should I create 1 dimension per question/answer (centipede design)?
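
    To make the first two options concrete, here is a rough sketch of the two candidate grains (all names and datatypes are hypothetical, for illustration only):

        -- Option 1: one wide row per assessment, one column per question
        CREATE TABLE AssessmentFactWide (
            AssessmentId    int          NOT NULL PRIMARY KEY,
            PatientKey      int          NOT NULL,
            OrganizationKey int          NOT NULL,
            AssessmentDate  datetime     NOT NULL,
            Q001_Answer     varchar(50)  NULL,
            Q002_Answer     varchar(50)  NULL
            -- ... roughly 120 such columns, one per question
        );

        -- Option 2: one row per question answered (atomic grain)
        CREATE TABLE AssessmentAnswerFact (
            AssessmentId int NOT NULL,
            QuestionKey  int NOT NULL,
            AnswerKey    int NOT NULL,
            PRIMARY KEY (AssessmentId, QuestionKey)
        );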

    Any thoughts/suggestions/guidance will be really appreciated.

  2. #2
    Join Date
    Dec 2003
    Location
    Canada
    Posts
    710
    I suggest going for the single large dimension or small set of dimensions grouped by question categories. It may seem cumbersome at first but in the long run it will pay off as the warehouse facts accumulate. Forget the centipede as the number of joins will defeat any benefit the dimensional model offers. Definitely do not put the answers in the fact table.

  3. #3
    Join Date
    Aug 2006
    Posts
    4

    Factless Fact

    Hi certus

    Thanks a lot. I appreciate your time and effort.

    What are your thoughts on the following approach:

    Fact table with more atomic granularity: 1 row per assessment question, with two joins, to a Question table and an Answer table.
    In this case the Fact will end up with 100 rows per assessment. And each patient is assessed multiple times, let's assume an average of 5. So I will end up with 500 rows per patient. And they normally assess 30 thousand people per month. So this means 500 * 30,000 = 15,000,000 rows per month and 15,000,000 * 12 = 180,000,000 per year. This design is good in terms of having only two joins. But in terms of the number of rows, performance might still be a problem.

    I guess testing might be the answer. What are your thoughts on this?

    Gary
    Last edited by gary8877; 08-16-06 at 15:20.

  4. #4
    Join Date
    Dec 2003
    Location
    Canada
    Posts
    710
    Yes, I think that is the right approach.

  5. #5
    Join Date
    Sep 2002
    Location
    Sydney, Australia
    Posts
    255

    Re Post 1

    Gary

    Evidently, you have done some actual work, and achieved some clarity before posting, thank you. I will try to give you a worthy response. The reporting requirement should be considered/designed independently of the db design (it may well be served by one report tool, then another, etc). My responses consider both requirements, independently. I note your thoughtful consideration of the where clause.

    1 If you're designing a relational db, use relational design
    2 If you're evaluating a relational report tool (dimensions) keep dimensional in mind (but do not design the db dimensionally). If the db design is pure relational with dimensional requirements in mind, you do not have to have grunt (price), or interpretation (restructuring dimensions & facts, parsing, coding) in the reporting tool.
    3 If your system is the slightest bit successful, it will grow (new dimensions, new possible facts), requiring changes to code segments, so avoid code at all costs

    whether the assessment Fact should contain all answers in one row in different question columns (in which case, it will be a very big table - approximately 120 columns in one row with 30,000 Assessments each month)
    Absolutely not! That is an Unnormalised (non-relational) flat file and you should consider an alternate career path.

    should the questions be in a dimension, in which case the dimension will be as large as the fact, in terms of rows. It will act more like a fact table. It is like having a Primary Key in the Fact table Assessment and Foreign Key in the Assessment Q/A table
    Absolutely! That is what you will get if you perform a normalisation exercise (otherwise it is neither "relational" nor a "database").

    Normalised Database
    Consider this ER Diagram Dimension Db.pdf, which is the result of my normalisation of the requirement in your post. I am using pure IDEF1X plus a couple of my extensions (in this case for the dimensional requirement). It actually has the smallest and fastest data storage requirement. Now you have pure and simple facts; dimensions that are uncomplicated and do not need parsing (no code segments). "One fact in one place."
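
    In rough DDL terms (the PDF is not reproduced inline, so the datatypes and exact column names below are guesses, not the actual model), the reference and constraining layers look something like:

        -- Top level (blue): pure reference tables
        CREATE TABLE AssessmentReference (
            AssessmentId   int          NOT NULL PRIMARY KEY,
            AssessmentName varchar(60)  NOT NULL
        );
        CREATE TABLE QuestionReference (
            QuestionId   int           NOT NULL PRIMARY KEY,
            QuestionName varchar(120)  NOT NULL   -- short question text, usable as a report heading
        );
        CREATE TABLE AnswerReference (
            AnswerId   int          NOT NULL PRIMARY KEY,
            AnswerName varchar(60)  NOT NULL      -- eg. Male, Female, Undetermined
        );

        -- Middle level (green): constraining tables
        CREATE TABLE AssessmentQuestion (         -- the Questions actually on a given form
            AssessmentId int NOT NULL REFERENCES AssessmentReference (AssessmentId),
            QuestionId   int NOT NULL REFERENCES QuestionReference (QuestionId),
            PRIMARY KEY (AssessmentId, QuestionId)
        );
        CREATE TABLE AnswerValid (                -- the Answers actually valid for a given Question
            QuestionId int NOT NULL REFERENCES QuestionReference (QuestionId),
            AnswerId   int NOT NULL REFERENCES AnswerReference (AnswerId),
            PRIMARY KEY (QuestionId, AnswerId)
        );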

    Report Column (or Row) Headings
    Additionally, if you put the Answer name-values (eg. Male, Female, Undetermined) in a name column in AnswerReference (as opposed to defining constraints in the AnswerValid or elsewhere), and the question text (short name, whatever) in QuestionReference:
    • your report tool will not have any code in it (which will require changing every time you change the questions/answers),
    • plus this will provide column or aggregate row (dimension) names in the reports without having to (additionally) place them in the report definition or the report tool's "dictionary".
    A decent report tool will handle this (produce the required SQL for each report) without needing any additional (parsing) code.

    Evaluate Growth
    In each of the tables, evaluate the effect of growth/changes to forms/questions/possible-answers/actual-answers, over time - You will just have to add rows without changing the db definition.

    Consider, in contrast, the nightmare that growth will cause in the unnormalised creature: every time you add something (a question, a series of answers) you will have to change the DDL. No wonder some nutcases come up with multiple databases!

    Constraining Tables
    The middle layer (green) is required, as they define/constrain facts, but they are still true reference tables (not facts themselves) in the context of their children. This ensures that a fact (Answer) is actually valid (exists in the form) for the particular Assessment and for the particular Question, otherwise you will lose Referential Integrity. If you do not believe in storing normalised question components and answer components then it is the top level (blue) that you can remove, not the intermediate level (green). But this will result in a messy set of reference tables.

    Consider: a single question (eg. Sex; How often do you have sex; How often do you drink alcohol) may appear in many assessment forms; an answer may appear in many questions (eg. Male/Female/Undetermined; Once a week/twice a week///seven days a week). Therefore I have normalised the question content and the answer content, therefore the middle level exists: AssessmentQuestion is the actual set of Questions for any one given form; AnswerValid is the actual set of Answers for any one given Question.

    BTW, this is a first cut, it is not perfect, the table names can probably be improved (naming is very important to understanding).

    Keys
    If you start with IDEF1X as a standard, then your handling (migration, propagation) of keys from the parent table to the child table is governed. This is a Good Thing. With a tiny bit of cleverness, if you choose the keys carefully, you can ensure dimensional requirements are met, without leaving the relational paradigm or (more strictly) the IDEF1X standard. Have a look here Dimension Key.pdf. With IDEF1X, the FK in the child is the PK of the Parent, and called the same name. In this model, the FK in the child also forms its PK beautifully, ie. without columns added for uniqueness. It also allows for minimal indexing in Answer.

    Of course, you do not need two columns (roles) for QuestionId in Answer as it will be the same (parent) value: that column is part of the FK to both AssessmentQuestion and AnswerValid (and thus to QuestionReference).
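
    A minimal sketch of the Answer table along those lines (the PatientId and AssessmentDate columns are assumptions added here only to identify the particular assessment event; the actual model may name or key this differently):

        CREATE TABLE Answer (
            PatientId      int      NOT NULL,
            AssessmentDate datetime NOT NULL,
            AssessmentId   int      NOT NULL,
            QuestionId     int      NOT NULL,   -- one column, shared by both FKs below
            AnswerId       int      NOT NULL,
            PRIMARY KEY (PatientId, AssessmentDate, AssessmentId, QuestionId),
            FOREIGN KEY (AssessmentId, QuestionId)
                REFERENCES AssessmentQuestion (AssessmentId, QuestionId),
            FOREIGN KEY (QuestionId, AnswerId)
                REFERENCES AnswerValid (QuestionId, AnswerId)
        );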

    should I try to come up with various combinations of possible answers for these questions and then put them in a dimension, in which case the dimension again will be very large or
    should I distribute the questions in combinations of 20 each and then apply possible answer combinations in 5 different dimensions
    Only if you want the job of maintaining the database and coding reports for the rest of your natural life. Denormalisation means code at every point (database maint program; every code segment in the report tool) that has to handle the ugly beast and parse every fact out of the fact+fact+fact... where it is stored. Reports will take hours and days, and you will need very fat matchboxes (Client machines).

    Denormalisation
    Note, I make the distinction between Unnormalised (you have not done the task of Normalisation) and Denormalised (you have done the task but then gone backwards or messed it up). I have never seen a "Denormalised" design that did not have a massive maintenance burden. I have worked on several that were "denormalised for speed" which were a joke, as the truth was denormalisation caused additional overhead in every retrieval. Denormalisation is always due to the limits of the designer's brain, and nothing else. Remember, joining multiple tables is ordinary in a relational db. You can use Views to simplify (flatten) things for users or reports, that's what they are for.

    should I create 1 dimension per questions/answer (centipede design)
    [That's actually what I have in Normalised Database.] But the problem with your question relates to semantics. The term Dimension relates to reports, not databases; we are designing a relational db to serve the reports, so we need to store the data relationally (which means normalisation and "more tables" than the uninitiated would expect). On the relational or storage side, do not mention or try to apply "dimensions". But the reports only require dimensions, and they are dimensions to the user (who is really only interested in the statistical ability of the db). On the report/dimension side, do not mention normalisation or intermediate tables.

    The middle layer of tables (green) can be invisible to the report tool/user, it can do straight joins direct from Answer to AnswerReference, QuestionReference and AssessmentReference (assuming you use the IDEF1X standard and do not mutilate the Keys). Have a look at this diagram Dimension Rpt.pdf. Note Dimension Key diagram: they are pure joins and they "work"; you do not have to go through the intermediate layer if you know the PK of the (uppermost) parent table. This is where the value of IDEF1X can be seriously appreciated (and the insanity of IDENTITY columns as well).
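
    For example, a typical statistical query can go straight from the fact to the reference tables, with the name columns doubling as report headings (column names as per the sketch above; the filter value is hypothetical):

        SELECT  qr.QuestionName,
                ar.AnswerName,
                COUNT(*) AS AnswerCount
        FROM    Answer a
        JOIN    QuestionReference qr ON qr.QuestionId = a.QuestionId
        JOIN    AnswerReference   ar ON ar.AnswerId   = a.AnswerId
        WHERE   qr.QuestionName = 'Sex'
        GROUP BY qr.QuestionName, ar.AnswerName;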

    Aggregation
    Of course, the whole idea of dimensional reporting is in the ability to aggregate the (lower level) dimensions into higher levels. If you store only the lowest or atomic level, which I have done, then there is no limitation: you do the aggregation on the server and ship the smallest result set to the client. Or drag the smallest amount of the Answer table over to the client, once, and slice it and dice it as you wish.

    It is common practice to implement summary tables (and other good methods) to provide aggregation, but this is only valid for data that is not changing (eg. for historic values). Again, Denormalisation for this purpose is simply wrong.

    I do not know what a "centipede" design is, so I cannot say.

    Have fun

    Cheers
    Last edited by DerekA; 08-19-06 at 02:42.
    Derek Asirvadem
    Senior Sybase DBA/Information Architect derekATsoftwaregemsDOTcomDOTau
    Anything worth doing is worth doing Right The First Time
    Spend your money on standards-compliant development or spend 10 times more fixing it

  6. #6
    Join Date
    Sep 2002
    Location
    Sydney, Australia
    Posts
    255

    "Large" Tables

    What's wrong with millions of rows? It is quite ordinary in a database or warehouse. I do not understand why some people think there is a performance hit on a large table; in fact the opposite is true: for contentious systems, a larger table provides more spread (less contention); no one user is going to read the entire table, they read parts, usually parts which are spread out, via indexes (anyone still doing table scans should be taken out the back and gently put out of their misery). Just keep the indexes tight and index stats maintained, reorged, whatever. Only one level is added to the depth of a B-tree if a 16m row table doubles to 32m rows. Most vendors allow partitioning of tables and placement of these partitions on multiple disks, but that is only relevant for large-scale processing of such tables (DSS), not OLTP. (Given an HP starting point), I only bother to tune a table after it gets past a million rows. I currently have two with over 16 billion, in the above atomic fact structure - response time is seconds for most reports; aggregation gets close to a minute toward month end due to new data being spread out; I rebuild the indexes once a month. Any commercial (not freeware) SQL engine except NecroSoft will handle it, but really it deserves a machine a bit bigger than a postage stamp or a matchbox (or many matchboxes stacked together).
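
    On Gary's platform (SQL Server 2005), that monthly maintenance amounts to something like the following; Sybase ASE has its own equivalents (reorg rebuild, update index statistics). Table name as per the earlier sketch:

        ALTER INDEX ALL ON dbo.Answer REBUILD;        -- monthly index rebuild
        UPDATE STATISTICS dbo.Answer WITH FULLSCAN;   -- keep the optimiser's statistics current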

    Cheers
    Derek Asirvadem
    Senior Sybase DBA/Information Architect derekATsoftwaregemsDOTcomDOTau
    Anything worth doing is worth doing Right The First Time
    Spend your money on standards-compliant development or spend 10 times more fixing it

  7. #7
    Join Date
    Dec 2003
    Location
    Canada
    Posts
    710
    Derek, you have obviously made a considerable effort to answer the question, however, parts of your answer are misleading. First, you talk about "pure IDEF1X", which is a form of notation, yet you use crowsfoot notation. Second, you imply that not using surrogate keys is "pure" when it is only the difference between dependent and independent relationships. In fact your example is one of the few cases where not using surrogate keys is advisable, however there are alternate designs for the case that make your model irrelevant. Third, your explanation does not provide for cases where valid values are freeform text or scalar.

    "Dimensional" is one of the most whored words in the database vocabulary and you just added to it. Dimensional data modeling was developed specifically due to performance limitations of both databases and hardware and the need for users to have simpler models for reporting. Teradata and NCR hardware are showing signs of making dimensional models a thing of the past, but while they are with us let's do them justice. Source data is translated into an operational data store and an operational data store is translated into a data warehouse. You do not perform transactional operations within a datawarehouse. It is simply a repository for results that is queried.

    The problem domain for this question is very small. There are only two dimensions and one fact table. The number of rows is really irrelevant as far as any RDBMS on the market is concerned. The number of joins is so small they are irrelevant as well. The problem does beg the question, "Is a data warehouse at all necessary?"
    Attached Thumbnails: untitled.JPG, untitledA.JPG

  8. #8
    Join Date
    Aug 2006
    Posts
    4
    Hi Derek

    Clearly you have taken a lot of time and put in a lot of effort to reply to my question. I must say that your effort is worth lots of gratitude and has exceeded my expectations. I do agree with many of your thoughts. I think it was my mistake to not clearly specify my question in the beginning, and please accept my apologies for any miscommunication.

    There is only one form that every patient fills. However, each patient can be asked to fill the same form multiple times over a span of time and hence multiple assessments for the same patient.

    I agree that the less code we have to manage or write, the better. And descriptive column names can be represented in the report without any new name mapping between the db and the report. And also that an un-normalized structure is the most difficult to manage, and a de-normalized structure is more difficult to manage than a normalized design.

    And yes, an answer might be the same in more than one question, as you said - e.g. once a week, twice daily, etc. And storing them once in a normalized design will save disk space. I am not an expert in IDEF1X design specifications and will leave that for now, until I finish my homework about it. And I agree that cross joining answers/questions in combinations of 10 tables is a tedious task and will be tough to manage. But at the same time it offers 1 row per assessment and all the answers in the assessment. It will be a lot faster than reading 120 rows per assessment.

    Your statement

    "I have never seen a "Denormalised" design that did not have a massive maintenance burden. I have worked on several that were "denormalised for speed" which were a joke, as the truth was denormalisation caused additional overhead in every retrieval. Denormalisation is always due to the limits of the designer's brain, and nothing else. Remember, joining multiple tables is ordinary in a relational db. You can use Views to simplify (flatten) things for users or reports, that's what they are for."

    is worth consideration. We are using MS SQL Server 2005 as the RDBMS. This will be a reporting data store (mart). We need to consider a very important point - performance of queries. The ETL process will run nightly – so inserts and updates do not outweigh selects.

    The whole idea of de-normalization is to reduce the number of joins and hence reduce data page reads from multiple tables and nested loops, hash and merge operations in the lower level of the RDBMS System.

    There are lots of other hardware factors involved - disk rpms, memory size, virtual memory size and contention, CPU, etc. So in order to go the best route – we need to tune all these and the software design elements as well. You might have only one or two joins, but a disk bottleneck due to contention between different users or applications can affect performance, whereas another query with 4 joins can return results faster - so all elements need to be tuned for best performance.

    More joins mean that the system has to search for more rows in more than one table and this has to be done on inserts, updates and selects. A value missing in the PK means that you cannot insert that value in the FK and for that decision, the system needs to read the PK column in memory from another table.

    Views are compiled at run-time. They have a performance penalty since most of the time we don’t just do Select * from view. We perform more complex processing on the select and joins that views are performing underneath. So performance is adversely affected.

    If you use materialized views, it leads to redundancy of data. These redundancies will double the storage – table and views contain the same data twice. De-normalization also has redundancy but it does not require double the space as materialized views, and then space is required for indexes on top of the materialized views. So the whole storage saving is better in de-normalization overall.
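
    For reference, a materialized (indexed) view on SQL Server looks roughly like this; the unique clustered index is what physically stores the second copy of the data (names are illustrative only):

        CREATE VIEW dbo.AnswerCounts
        WITH SCHEMABINDING
        AS
        SELECT  QuestionId,
                AnswerId,
                COUNT_BIG(*) AS AnswerCount   -- COUNT_BIG is required in an indexed view with GROUP BY
        FROM    dbo.Answer
        GROUP BY QuestionId, AnswerId;
        GO
        CREATE UNIQUE CLUSTERED INDEX IX_AnswerCounts
            ON dbo.AnswerCounts (QuestionId, AnswerId);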

    I agree with Certus on this. De-normalization was developed to address performance needs. The only difference between dimensional and relational models is that dimensional is 2nd Normal Form and relational is mostly 3rd NF.

    And "size does matter". That’s why the whole concept of indexing had to be developed.

    Reference tables in your design will surely help in the staging area, where we perform all these checks for validation and cleansing. And I also agree the most atomic level is the best in terms of scalability and maintenance. But performance will pay the penalty. Maybe an aggregated table is the answer, but testing is certainly needed.

    Once again - I appreciate your invaluable thoughts and effort that you have put in. I now am certainly looking at most atomic granularity because of scalability.

    Thanks
    Gary

  9. #9
    Join Date
    Dec 2003
    Location
    Canada
    Posts
    710
    Consider this design. It will provide you with adequate uniqueness.
    Attached Thumbnails: untitled.JPG

  10. #10
    Join Date
    Aug 2006
    Posts
    4

    Factless Fact with too many dimensions

    Thanks again Certus - that's exactly what I had in mind.

    -Gary

  11. #11
    Join Date
    Sep 2002
    Location
    Sydney, Australia
    Posts
    255

    Post 7

    All

    Come on, guys & dolls, you appear to be treating my post as the final solution. It is not. It is my first effort at answering an admittedly non-specific question. The reason I spent time on it is that this is an area that I see commonly badly implemented, causing a "reporting database" or datawarehouse to be added on, which (if it were reasonably implemented in the first place) would be entirely unnecessary. Certainly where the database has many tables, or the DSS requirements are intensive, the database and the data warehouse should be separated and optimised for their specific use (transactional db vs DSS), but 80% of the time the transactional db and DSS are simple and can be handled from the same db, and more importantly, the same box. Most of what I have said above is an attempt to avoid duplicating the db and to allow for DSS/Statistical queries from the single database. Plus the question is a good example.

    Let's see if I can continue with that process in mind. Some items below are more relevant than others, they are in posting order.

    Certus
    Quote Originally Posted by Certus
    First, you talk about "pure IDEF1X", which is a form of notation,
    I disagree. IDEF1X is not merely a notation, it is a Method, and therefore a Standard (references in texts elsewhere): if you give 10 relational db designers a job to design a relational db for a specific requirement, they will come up with 10 different designs, all "relational". If you tell them they have to comply with IDEF1X, they will come up with 3 or 4 models. The IDEF1X Standard makes the application of the rules (the Method) of normalisation stricter; it is a narrower form of the relational (Codd & Date et al) paradigm. Which is why the US Navy came up with it, and all DoD departments use it.

    Separately, being another Standard, it has another nomenclature or notation of its own. The rest of the notation is easy enough for most to understand, but the cardinality (ratio & dependency of the relational line) is cumbersome. In the early rendition of models, this is irrelevant anyway; all we need is the ratio and nullability [1:0-1 .. 1:0-n .. 1:1 .. 1:n]. Most immediately understand either crows feet or double arrows.

    Therefore, I use the IDEF1X Method & Standard for normalisation, the task of modelling, and I draw my models with a mix of Notations, based on need and the audience, using quick-draw tools or a full-blown modelling tool (I do not give the full ERA model to all):
    - square & round corners to show Dependence as per IDEF1X
    - boxes with no attributes for higher levels (ER only)
    - crows feet (IEEE notation) for relations
    - colours for meaningful grouping (Reference, Business Entity, Transaction, History ...)
    - hierarchical positioning for immediate rendition of dependency (top-to-bottom only)
    The later renditions, as the model progresses, using a modelling tool like ERwin or PowerDesigner (having left the OmniGraffle quick-draw renditions behind), have full attribute definitions (ERA). Even these tools allow for application of the IDEF1X Standard, but let you choose IDEF1X or IEEE notation for the relation lines.

    yet you use crowsfoot notation.
    Just because that Notation is understood by more people (I am not a purist, even though I may look and sound like one!).

    Second, you imply that not using surrogate keys is "pure" when it is only the difference between dependent and independent relationships.
    Absolutely not, I disagree, but I am not going to explain that here. The considerations in choosing surrogate keys are much more than that.

    In fact your example is one of the few cases where not using surrogate keys is advisable.
    I agree. And I avoid surrogate keys in all but the top (reference) level.
    Third, your explanation does not provide for cases where valid values are freeform text or scalar.
    Look, it is not the finished product, it is an example/answer only. Feel free to add attributes, as long as you put them in the right (1:1) table after due consideration.

    Eg. Unless I knew you were going to use high-end DB/DSS tools (eg. Sybase puts the text columns in a separate physical table, invisible to the developer/user), I would hesitate to put text or image columns in the Answer table, due to the query/speed requirements, even though that is the correct place for the relational requirement.

    "Dimensional" is one of the most whored words in the database vocabulary
    Totally agree.
    and you just added to it.
    Then I failed completely in my explanation. I was purposely trying to stay away from the word as used in the relational or db sense; I identified the word as relevant for reporting/warehousing (which is where it came from, and the bastards have done a poor job of implementing it back into dbs); and I was attempting to show the relevance of designing a relational db using relational (not dimensional) methodology, PLUS adding dimensional (reporting, not relational or db) requirements to it.

    There are only two dimensions and one fact table.
    True to a point. Gary has asked for two. I have identified a third as required for growth/change. Gary has implied several more (Person, Hospital ...)
    The problem domain for this question is very small. The number of rows is really irrelevant as far as any RDBMS on the market is concerned. The number of joins is so small they are irrelevant as well. The problem does beg the question, "Is a data warehouse at all necessary?"
    That is exactly my point. Basically we agree. Gary does not need a warehouse. However, he does need a db to put the data into transactionally first ... and then he needs to perform DSS/statistical (dimensional) queries on it.

    You do not perform transactional operations within a datawarehouse. It is simply a repository for results that is queried
    That's where we disagree. When the db is NOT so large, I do transactional operations on it, as well as provide very decent DSS/warehouse capability, from the same db, rather than having one for loading transactions, and another for querying/stats.

    If and when the warehouse requirement warrants, you can spend another wad, and add a warehouse layer/product/db. However, make no mistake, again, the most important issue when you do that, is whether the source db is normalised or not. Badly normalised (unnormalised or denormalised) designs will require intermediate load tables to be created and generally impede both the loading process and the query speed.

    Your Diagram One
    This is virtually identical to [the compared tables in] my diagram, so obviously I agree with it. I disagree with:
    - your column names
    - two duplicate columns in the Fact table (simply unresolved)
    It is IDEF1X notation, but not an application of IDEF1X standard or method. If one wishes to compare the key rendition/migration, then compare with post 5 Dimension Key, I have named the columns according to standard and identified the keys only.

    Your Diagram Two
    I disagree with this because it:
    - freely and unnecessarily uses surrogate keys in the intermediate level, where the parent keys are already (a) required, and (b) form uniqueness (c) three next-sequential problems are introduced
    - for the DSS/Statistical queries, it forces joins (extra joins) between the Fact table and the intermediate tables, in order to get to the reference (top level) tables. My design (the reporting/DSS/statistical query requirement is rendered in post 5, Dimension Rpt) is therefore superior in terms of both storage and speed. Note my discussion re Keys.
    - again the column names
    In the Fact table, you end up with exactly the same number (three) of keys (for the compared tables) that I end up with (refer post 5 Dimension Key), but not the same exact keys. My keys are (a) a result of application of the IDEF1X method and standard [forget the notation], and (b) therefore retain their RELATION to the top-level Reference tables (refer Dimension Rpt). Note my discussion Keys

    Certus, I love your "one fact in one place" signature, it is the goal of normalisation, but you are contradicting it.

    Cheers
    Derek Asirvadem
    Senior Sybase DBA/Information Architect derekATsoftwaregemsDOTcomDOTau
    Anything worth doing is worth doing Right The First Time
    Spend your money on standards-compliant development or spend 10 times more fixing it

  12. #12
    Join Date
    Sep 2002
    Location
    Sydney, Australia
    Posts
    255

    Post 8

    Quote Originally Posted by gary8877
    There is only one form that every patient fills. However, each patient can be asked to fill the same form multiple times over a span of time and hence multiple assessments for the same patient.
    Look, I do not mean to offend, but if you take this as gospel (even though the user promises it), you are nuts. Can you absolutely confirm that:
    - between now and system delivery the form will not change
    - between system delivery and end-of-life the form will not change
    Plus, if you sell the system to another state or health services network, you are going to look silly if you can only handle the form from the original state and date. Just because the user has concrete ideas does not mean you should implement them as concrete limits in the system or db. If you did, you will implement a 120-column Answer table, and every time there is a new Answer to an existing Question, or a new Question is added:
    - change ddl
    - unload/reload table
    - mess around with new indexes vs old, etc
    - change all code

    Plus, you cannot handle a second form or three. I proposed my normalisation of AnswerReference more for the ability to change and handle multiple forms/states/health areas, than for speed/storage requirements, but it does give you the fastest speed on the Answer table as well.
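
    Using the table names from the earlier sketch, adding a new question to the existing form is just rows, no DDL, for example (identifiers are made up):

        INSERT INTO QuestionReference (QuestionId, QuestionName)
            VALUES (101, 'Do you smoke');
        INSERT INTO AssessmentQuestion (AssessmentId, QuestionId)
            VALUES (1, 101);
        INSERT INTO AnswerValid (QuestionId, AnswerId)    -- reuse the existing Yes/No answer rows
            SELECT 101, AnswerId
            FROM   AnswerReference
            WHERE  AnswerName IN ('Yes', 'No');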

    And I agree that cross joining answers/questions in combinations of 10 tables is a tedious task and will be tough to manage. But at the same time it offers 1 row per assessment and all the answers in the assessment. It will be a lot faster than reading 120 rows per assessment.
    That last item is completely untrue. Let's handle this item independently of anything else. This is a common myth. Given the same on-board server resources, query speed is not based on the number of rows, it is based on the number of disk blocks read from disk and brought into memory (data cache); every commercial db is based on processing a page, which has multiple rows. Without arguing about finer points re column width, the
    120-column table x rows / rows per page
    will take the same space as the
    1-column table x 120 x rows / rows per page
    because it is approximately the same number of pages for the table. Of course, on large tables, indexing and WHERE clause construction are key performance issues, but the number of pages, as a fundamental requirement, is the same. If you are performing aggregation you will be tending towards table scans rather than index scans anyway.
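
    As a rough illustration with assumed figures (8 KB pages, about 20 bytes per stored answer, per-row overhead ignored):

        wide:   120 columns x ~20 bytes = ~2,400 bytes/row -> ~3 rows/page   -> 30,000 rows/month    = ~10,000 pages
        narrow: keys + value            = ~24 bytes/row    -> ~330 rows/page -> 3,600,000 rows/month = ~11,000 pages

    Roughly the same number of pages either way.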

    Your statement ... re Denormalisation ... is worth consideration. We are using MS SQL Server 2005 as the RDBMS. This will be a reporting data store (mart). We need to consider a very important point - performance of queries. The ETL process will run nightly – so inserts and updates do not outweigh selects.

    The whole idea of de-normalization is to reduce the number of joins and hence reduce data page reads from multiple tables and nested loops, hash and merge operations in the lower level of the RDBMS System.

    There are lots of other hardware factors involved - disk rpms, memory size, virtual memory size and contention, CPU, etc. So in order to go the best route – we need to tune all these and the software design elements as well. You might have only one or two joins, but a disk bottleneck due to contention between different users or applications can affect performance, whereas another query with 4 joins can return results faster - so all elements need to be tuned for best performance.

    More joins mean that the system has to search for more rows in more than one table and this has to be done on inserts, updates and selects. A value missing in the PK means that you cannot insert that value in the FK and for that decision, the system needs to read the PK column in memory from another table.
    You've said a lot here, and we can go on for days on the subject matter, most are not relevant at the big end of town, I admit they are relevant on the boxes/RDBMS you have. Suffice it to say that MS SQL Server <insert year> does one thing on paper and something else on chips. Let me just respond to the "no of joins" issue, which I addressed in my post.

    Another Myth. Joins are ordinary - that is how you relate data in a relational db. Joins are not "processing", parsing data in order to perform a join is processing. Therefore keep the joins (note migrated keys) pure. A logical table is simply a View (not the meaning below) of multiple joined physical tables, and commercial RDBMS are designed for that. Multiple joins of NORMALISED data are effortless and absolutely do not affect processing or access speed. Consider my diagram, note that the reference tables are tiny, say the fact table is 16 million rows. On commercial (I exclude MS as a commercial possibility) DBMS, I would cache the entire content of all reference tables (configuration, not tuning), therefore zero access penalty. 16m rows by 200 rows per page (or 1.6m rows by 20 rows per page) would stream through a configured cache (what you guys call pinning [the tail on the donkey]), and join with members already in the cache. What is the problem ? Contention for the table is not undesirable, as the contending users would be served faster (than if they were serial), as the intent of the cache IS to share.

    You may think that denormalised tables cause less joins, the fact is, in any real (as opposed to imagined, evaluated) implementation, they actually cause more joins, as well as joins to other denormalised tables and therefore more processing. The problem is in processing, not joins. In my RPT diag, I am attempting to show that for reports, you can dispense with the intermediate tables and do straight joins to the reference tables, causing less tables and joins in your select. Secondly, this is only possible if the data is truly normalised, and keys are migrated (from the parent to the child and thus to the grandchild) properly.

    Once on a MS SQL Server/box, I rewrote a set of "slow" reports. Took it down from 14 hours to 15 minutes. No change to hardware or platform software or data. Replaced 16 selects of denormalised tables (4 joins each) with four selects of 16 joins each. It was not magic, all I did was (a) normalise the db, (b) in order to end up with four selects of 16 joins (the then maximum, so it was a design criterion). Ok, my SQL was a bit tighter, but there were no tricks.

    DO NOT be afraid of the "more" joins resulting from normalisation [thin tables], DO be afraid of denormalisation [fat tables with duplication]. Often the fear of normalisation is actually a fear of transaction design, trying to do too much at one time, ending up putting it all in one "table", rather than doing that properly and putting the bits of the transaction in the correct places [tables].

    Views are compiled at run-time.
    Not on my server, not since 1997 !
    They have a performance penalty since most of the time we don’t just do Select * from view. We perform more complex processing on the select and joins that views are performing underneath. So performance is adversely affected.
    Look, views have their place, you are not supposed to do processing in a view.
    I do not allow views of views.

    If you use materialized views, it leads to redundancy of data. These redundancies will double the storage – table and views contain the same data twice.
    Well of course, "materialised views" is just double speak for an intermediate physical table (that is intended to overcome the problems of denormalisation in the source, where unplanned joins are made). Good for the vendor, hideous for the DBA. Just another name for automatically created denormalised (data duplication) tables. Dog's breakfast with champagne.

    De-normalization also has redundancy but it does not require double the space as materialized views, and then space is required for indexes on top of the materialized views. So the whole storage saving is better in de-normalization overall.
    Not overall, only when compared to 100% redundancy of materialised views. Normalisation means zero redundancy and zero redundant indexes, so denormalisation is insanity (penalty for data storage; data processing; data loading) when compared to normalisation.

    I agree with Certus on this. De-normalization was developed to address performance needs.
    That's hilarious, Denormalisation was not "developed" (there are no methods or standards for it); that's like saying fat people were developed to overcome the mental speed of thin people. If denormalisation was allowed, 10 different modellers will come up with 100 different denormalised models. I put it another way (in post 5): denormalisation is merely the resulting implementation, limited by the capability of the developer, due to the lack of understanding and experience. They may well have had performance IN MIND, but the fact is Denormalisation IMPEDES performance. In every site that I have handled this problem, the developers never hung around long enough to prove that Denormalisation was faster, or to correct it when it was proved slower. The developer keeps ranting about the drunken good times he had (myth), some other poor bastard had to clean up the footpath the next morning (fact).

    The only difference between dimensional and relational models is that dimensional is 2nd Normal Form and relational is mostly 3rd NF.
    I do not believe in the "dimensional" model. Details above. It is not progressing, so I would not use it.
    2NF by definition means normalisation has not been completed. In Codd & Date's (and my) book, 3NF is the minimum before you can call it either "relational" or a "database"; 4NF, as applicable, must be considered.

    Cheers
    Derek Asirvadem
    Senior Sybase DBA/Information Architect derekATsoftwaregemsDOTcomDOTau
    Anything worth doing is worth doing Right The First Time
    Spend your money on standards-compliant development or spend 10 times more fixing it

  13. #13
    Join Date
    Sep 2002
    Location
    Sydney, Australia
    Posts
    255

    Post 8 continued

    Quote Originally Posted by gary8877
    And "size does matter". That’s why the whole concept of indexing had to be developed.
    If you understand that, then what in God's name are you doing trying to shoehorn a database into a matchbox? Or a stack of matchboxes (with the added communication and memory contention problems)? And trust MicroShaft (who have never delivered what they promised) to deliver what they promise now, and get shafted in the process? What on Earth makes people think that next time it will somehow be different? All they do is promise to extend the "capability" of tiny boxes, while remaining chained to them. Sixteen 1000cc racing motorcycles with the best inter-driver comms will never do the job of a pickup truck, just watch them turning corners or emptying a [physical] warehouse. Why not consider a box sized for the job, supplied by many vendors?

    Reference tables in your design will surely help in the staging area, where we perform all these checks for validation and cleansing.
    They are not a "help", they are demanded for Referential Integrity. Not only during the cleansing, but for an ordinary relational database.
    And I also agree the most atomic level is the best in terms of scalability and maintenance. But performance will pay the penalty.
    That's a thought, not a fact. My argument is atomic fact tables are in fact (by experience, not unapplied thought) faster, not slower. This is detailed above.

    If the scalability you keep mentioning relates to DSS/statistical queries, and it is a design criterion, then you cannot afford to denormalise. This is a forum on Concepts & Design [of Relational DBMS], so I can accept that it is more theoretical than practical (which the other forums are). Nevertheless, I find the idea of people validating denormalisation (which is AGAINST the standards, concepts and design of the relational paradigm) entirely contradictory. I accept that that is due to lack of experience. Therefore, I will not enter into any further arguments or discussion re the alleged "benefits" of fat duplicate data, unless the writer uses a real production implementation, as I am doing.
    Maybe an aggregated table is the answer, but testing is certainly needed.
    Model and test both with a full volume of data. An aggregate table can be added, if necessary, and it is simple if you already have the atomic table, but the design should not be started with an aggregate table in mind. Even if you did denormalise, you should first execute a correct 3NF normalisation, then denormalise (note post 5 re Denormalisation), otherwise it is Unnormalised.
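
    If testing does show an aggregate is needed, it can be derived from the atomic Answer table afterwards, for example one month's summary (SELECT ... INTO creates the table on SQL Server; names are illustrative):

        SELECT  AssessmentId,
                QuestionId,
                AnswerId,
                COUNT(*) AS AnswerCount
        INTO    AnswerSummary200608
        FROM    Answer
        WHERE   AssessmentDate >= '20060801' AND AssessmentDate < '20060901'
        GROUP BY AssessmentId, QuestionId, AnswerId;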

    Once again - I appreciate your invaluable thoughts and effort that you have put in. I now am certainly looking at most atomic granularity because of scalability.
    Thanks for the feedback.
    Derek Asirvadem
    Senior Sybase DBA/Information Architect derekATsoftwaregemsDOTcomDOTau
    Anything worth doing is worth doing Right The First Time
    Spend your money on standards-compliant development or spend 10 times more fixing it

  14. #14
    Join Date
    Sep 2002
    Location
    Sydney, Australia
    Posts
    255

    Post 9

    1 I continue to have a serious problem with the naming of columns, esp. key columns, but if I overlook that, this diagram is perfect for the datawarehouse/DSS component of the requirement. It is basically the same as the Dimension Rpt diagram.

    2 For the RDBMS side, which requires transactional control, referential integrity, etc, which requires a relational db, which requires normalisation, I think it does not supply the requirement. Just one Eg. there are no intermediate tables which identify/define the actual valid (for verification/RI purposes) Question/Answer or Assessment/Question combinations. Note the keys are identified in Dimension Key.

    Cheers
    Derek Asirvadem
    Senior Sybase DBA/Information Architect derekATsoftwaregemsDOTcomDOTau
    Anything worth doing is worth doing Right The First Time
    Spend your money on standards-compliant development or spend 10 times more fixing it

  15. #15
    Join Date
    Dec 2003
    Location
    Canada
    Posts
    710
    gary8877 asked for a dimensional data warehouse.

    Let's stop doing backflips and give him one.

    There's mud in your eye. eh.
