Results 1 to 11 of 11
  1. #1
    Join Date
    Mar 2008
    Posts
    11

    Question Wide versus long

    I have two problems. The first is a general ignorance. The second is as follows:

    The setup:
    I have two tables: restaurants, characteristics.

    These restaurants 'possess' the characteristics: owner, phone number, and Zagat rating (and about 100 other characteristics). In total, I am dealing with about 200,000 data points ('cells', if this were a spreadsheet, which it is not).

    The problem:
    I am trying to decide if characteristics is better suited to being wide or long.

    The characteristics are pretty well suited for creating a wide table with many columns; however, there is one caveat: many of these characteristics change on a variably variable basis (that is, different restaurants' properties change arbitrarily often). Furthermore, different characteristics belonging to the same restaurant change at different times. A restaurant's Zagat rating may change in May, and then it gets bought out in July, for example.

    I would like to keep a history of these changes, and keep track of who made the change (yes, there is a 3rd table, but I believe we can essentially keep that out of this discussion). I don't just want to overwrite the Zagat rating; I'd like to be able to track changes in Zagat rating over time, for example.

    The way that I've framed this problem for myself is "wide" (many columns, few rows) versus "long" (many rows, few columns). The benefit of wide is that it makes lookups very pleasant. The problem is that it makes it difficult to keep track of who changed which "cell", and it also makes it difficult to add in "new old values" (for example, in 2008, I might want to go back and add in a Zagat rating from 2005).

    The benefit of long is that I get the sense that it will help me better keep track of changes. For example, with "long", I can simply add a new row saying "Restaurant1|5 stars|date", which will make it easy to update, and easy to show either the latest rating or the history of all ratings. I get the feeling that this makes insertion and tracking a breeze, but at the expense of more intensive lookups in terms of queries and CPU.

    I'd love to get any of your opinions on this matter before I arbitrarily decide which to choose.
    Last edited by nineinchnailgun; 03-09-08 at 01:57.

  2. #2
    Join Date
    Jun 2004
    Location
    Arizona, USA
    Posts
    1,848
    As a general rule, it is easier to work with data that reside in a tall table structure, rather than a 'wide' structure.

    SQLQ's aggregate functions, for instance, work on columns, not rows.
    Lou
    使大吃一惊
    "Lisa, in this house, we obey the laws of thermodynamics!" - Homer Simpson
    "I have my standards. They may be low, but I have them!" - Bette Middler
    "It's a book about a Spanish guy named Manual. You should read it." - Dilbert


  3. #3
    Join Date
    Mar 2008
    Posts
    11
    I guess my Stata days have made me think of many columns as "wide" and many rows as "long" or "tall".

    Anyway, I agree that the many-column approach allows for more computational power and simplicity, but I'm not sure how to strike a balance between that and the need for keeping a history of relevant modifications to the data.

  4. #4
    Join Date
    Jun 2007
    Location
    London
    Posts
    2,527
    I vote tall as well

    Code:
    restaurants
       restaurant_id
       ...
    
    characteristics
       restaurant_id
       characteristic
       characteristic_value
       upd_user_id
       from_time
       to_time
    
    users
       user_id
       ...
    This allows you to change keep a track of all historic values, when they were changed and who by. I'd store all the characteristics in one table for simplicity. This way does not use the relational integrity available from the database but it's easy to test via the code that the characteristics and their values are valid. It also makes the code quite easy to write and easy to add new characteristics etc in the future.

  5. #5
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    want relational integrity mike? add a characteristics table

    stop with the code, already

    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  6. #6
    Join Date
    Mar 2008
    Posts
    11
    Quote Originally Posted by mike_bike_kite
    I vote tall as well

    Code:
       ...
    This allows you to change keep a track of all historic values, when they were changed and who by. I'd store all the characteristics in one table for simplicity. This way does not use the relational integrity available from the database but it's easy to test via the code that the characteristics and their values are valid. It also makes the code quite easy to write and easy to add new characteristics etc in the future.
    Thanks for the feedback. Are you sure it doesn't use relational integrity? Perhaps I'm misunderstanding, but I thought that tracking the restaurant_id would be sufficient.



    So far, I'm seeing 1 vote for many columns and 1 for many rows.

  7. #7
    Join Date
    Mar 2008
    Posts
    11
    Quote Originally Posted by r937
    want relational integrity mike? add a characteristics table

    stop with the code, already

    r937 - I was looking through some of your old columns and this one intrigued me: Using GROUP BY to collapse rows

    This looks like a combination of the many-row and many-column approaches. Like the many-column approach, I *think* it still gives me a good amount of power for performing comparisons, calculations, etc. (but please let me know if this is wrong). And, like the many-row approach, I can track history. I'm not yet sure that this is the best approach for me, but I have a sense that it's got a combination of advantages without adding any new disadvantages.

  8. #8
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    What you call "tall", we know as EAV, and what you call "wide" we know as Relational.

    EAV is very fast to enter, you can create data in the blink of an eye. However, EAV is almost unstructured so retrieving data from an EAV is nearly chaos incarnate.

    A Relational Model means that you have many columns, one of which can be something like "effective date" and another of which can be "user that created this row" which gives you the history that you wanted, in a clean and highly structured way.

    Mike_bike_kite is a general proponent of the EAV model. The vast majority of us avoid EAV like the very plague (because it has affected us much like a plague in the past).

    I straddle the fence, and can see reasons why a seasoned professional trying to deal with truly chaotic data (where there is no inherant rhyme, reason, or structure) might accept the risks associated with an EAV model. I would never recommend it to anyone with less than five years of experience dealing with databases, or to anyone that had another reasonable choice. Out of thousands of data structures that I've modeled over a twenty plus year period, I've successfully used EAV twice and have replaced EAV systems with relational ones about a hundred times.

    -PatP

  9. #9
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    sorry, nine-inch (may i call you nine-inch? heh)

    your best bet for help on this topic is mike, who is comfortable with EAV schemes

    EAV = entity-attribute-value (worth searching one)

    me, i would use a wide table, with clearly defined columns

    edit: pat beat me to the reply, nice that we actually agree on something for a change
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  10. #10
    Join Date
    Mar 2008
    Posts
    11
    Wow, thank you both for such thoughtful replies. Pat, thanks especially for translating my thoughts into useful lingo for these boards; that will help me quite a bit going forward. r937, you can call me nine-inch, nin, or, hell, even 'n' will do.

    I'm still thinking of using a wide table with clearly defined columns, somewhat following your (r937) "Using GROUP BY to collapse rows" article that I linked to in my last post.

    Once I use GROUP BY, can I perform most of the same operations that I normally would (i.e., SELECTing restaurants where the rating is >3)?

    Example (for proof of concept, not the real schema):

    Code:
    restaurant    year   zagat   nyt       submitter
    ----          ------ ------  ------   --------
    bellsbury     2008    5                    2
    bellsbury     2007    4                    33
    tacoby        2007           2             12
    tacoby        2008    5                    17
    bellsbury     2008           4             84
    tacoby        2007    3                    7
    I need to collapse these down to:
    bellsbury, 2008, 5, 4
    bellsbury, 2007, 4,-
    tacoby, 2008,5,-
    tacoby, 2007,3,2

    I'm assuming I can use GROUP BY on `restaurant`, then gathering the other values (even if they are strings???) with MAX() to get this result.

    Could I then perform normal operations, such as:
    SELECT WHERE year=2008 GROUP BY restaurant

    to get:
    bellsbury, 2008, 5, 4
    tacoby, 2008, 5, -

    ?

    So at this point, my question isn't about that syntax per se (sure, it's probably wrong); the question is, is there any syntax that will ultimately let me manipulate a table set up like this as if it were a normally populated table with one row per restaurant? If so, this seems like a workable possible solution that maintains the relational power (but not integrity) of columns, and the history-keeping power of an EAV method. However, I don't know for certain, which is why I ask.

    FWIW, I can now see how relational integrity may be compromised here (what's to stop someone, in theory, from adding another bellsbury 2008 zagat rating?).

  11. #11
    Join Date
    Jun 2007
    Location
    London
    Posts
    2,527
    I know I'm on my own by ever proposing an EAV solution (even where it suits the problem area) but I'll supply a few examples to keep the thread balanced.

    Code:
    select Name from Restaurant where characteristic = 'rating' and to_date = MAX_DATE and convert( int, characteristic_value ) > 3
    I'd use a procedure to store values which :
    • checks that the characteristic is valid (have a lookup table of valid characteristics and any meta data required)
    • checks that the characteristic value is valid (have a general lookup table where you can provide valid values for any characteristic)
    • ensures that the from and to dates are maintained.

    You can have one just one insertion date which means you don't need to maintain the (from and to) dates but it does make the extraction of values slightly more complicated. I'd recommend having a function to pull a characteristic from the database for a given id - coding then gets very easy.
    Code:
    select	Name
    from	Restaurant
    where	convert( int, get_characteristic( id, 'rating' ) ) > 3
    The interesting bit is when you want to look at all the old values of rating (plus who added each value and when) for a given restaurant. Remember that intermingled with the changes to rating might be changes to other fields where the rating hasn't altered. Also what happens if someone introduces a new characteristic that needs to be stored? with the tall (EAV) approach you just add some data and with the wide approach you need to alter your database structure. I think it gets very messy with the wide design but it's simplicity itself with the tall (EAV) approach.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •