Results 1 to 15 of 15
  1. #1
    Join Date
    Nov 2010
    Posts
    12

    How to Implement Complex Data in a DB

    I am trying to figure out how to implement a database that will store information about a wide range of resources (books, web sites, etc.) for an audience of teachers. The problem I'm having is trying to conceptualize the best way to store the data which I'm importing from a wide variety of source files, each of which contains somewhat different information about each resource. In addition, some resources are offered in different formats from different vendors, but contain the same "work of intellectual property" (for my example, I'm using the book "The Adventures of Huckleberry Finn" by Mark Twain, which is available as a book from many vendors, and as a text file in many locations on the web).

    The "tricky part" seems to be finding an effective way to store information about the relationships between resources, including resources that are "about" other resources. I'm coming up with a complex system of attributes and relationships, but I can't really figure out the "right" way to implement a database to store and retrieve the information. I'm also having "overloading" problems, such as how to deal with title variations for a book, or Pen Names and Real Names for authors, multiple authors, SKUs and isbns, etc.

    Below is an example, using a pseudo-XML format, to try to define the information I'm trying to store:

    Code:
    <work>
      <id>Adventures of Huckleberry Finn</id>
      <title><prefix>The</prefix> <main>Adventures of Huckleberry Finn</main></title>
      <person><role>Author</role> <name><first>Mark</first> <surname>Twain</surname></name></person>
      <date><event>Publication</event> <year>1884</year></date>
    </work>	
    
    <work>
      <id>Using The Adventures of Huckleberry Finn in the Classroom</id>
      <title>Using The Adventures of Huckleberry Finn in the Classroom</title>
      <person><type>Publisher</type> <name>Teacher Created Resources</name>
      <attribute><length>48 pages</length></attribute>
      <description>Reproducible worksheets and resources for teaching Huck Finn.</description>
    </work>
    
    <person>
      <id>Mark Twain</id> <id>Samuel Clemens</id>
      <name><type>Main</type><type>Pen Name</type> <first>Mark</first> <surname>Twain</surname></name>
      <name><type>Real Name</type> <first>Samuel</first> <surname>Clemens</second> </name>
      <date><event>Birthdate</event> <month>Nov</month> <day>30</day> <year>1835</year></date>
      <date><event>Death</event> <month>Apr</month> <day>21</day> <year>1910</year></date>
      <place><type>Birthplace</type> <city>Florida</city> <state>MO</state></place>
      <place><type>Childhood residence</type> <city>Hannibal</city> <state>Missouri</state></place>
    </person>
    
    <resource>
      <id><type>isbn-10</type><value>0812504224</value></id>
      <id><type>isbn-13></type> <value>978-0812504224</value></id>
      <work><id>Adventures of Huckleberry Finn</id></work>
      <attribute><type>Publisher</type> <name>Aerie</name></attribute>
      <attribute><type>Edition</type> <value>Unabridged</value></attribute>
      <attribute><type>Format</type> <value>Mass-Market Paperback</value></attribute>
      <attribute><type>Length</type> <value>320 pages</value></attribute>
    </resource>
    
    <resource>
      <id><type>url</type><value>http://etext.virginia.edu/toc/modeng/public/Twa2Huc.html</value></id>
      <work><id>Adventures of Huckleberry Finn</id></work>
      <publisher><name><University of Virginia Library</name></publisher>
      <description>Full text of Huck Finn, by chapters or the entire work on one page.</description>
    </resource>
    
    <resource>
      <id><type>URL</type><value>http://www.amazon.com/Adventures-Huckleberry-Finn-Mark-Twain/dp/0812504224</value></id>
      <resource><id><type>isbn-10</type><value>0812504224</value></id></resource>
      <name><type>Vendor</type> <value>Amazon.com</value></name>
      <attribute><price>$13.99</price></attribute>
    </resource>
    
    <resource>
      <id><type>purchase-url</type><value>http://www.teachercreated.com/products/a-guide-for-using-the-adventures-of-huckleberry-finn-in-the-classroom-0564?sku=0564EN</value></id>
      <work><id>Using the Adventures of Huckleberry Finn in the Classroom</id></work>
      <person><type>Vendor</type> <value>TeacherCreated.com</value></person>
      <attribute><price>$13.99</price></attribute>
      <attribute><version>Enhanced E-book</version></attribute>
      <attribute><format>PDF</format></attribute>
    </resource>
    I'm considering several options, such as using the free version of MarkLogic's Mark Server ("native XML" database), but my hunch is that I really need to find an implementation using MySQL.

    Any feedback or suggestions are welcome!
    Last edited by MarkWelch; 11-10-10 at 18:59.

  2. #2
    Join Date
    Nov 2010
    Posts
    12
    I've just spent some time googling "Entity-Attribute-Value" and I definitely see this as an option, but I'm extremely concerned about the potential complexity and CPU load of queries against data stored using an EAV model, especially if I want a relatively complex collection of classes or types. I'm definitely starting to succumb to buzzword-and-jingo-overload, as I try to comprehend each author's precise meaning when using terms like "class" and "resource."

    I'm less concerned about the loads associated with metadata, in part because I don't expect to implement many constraints, but I suspect that perhaps I'm being too simplistic or optimistic about the need for metadata.

    This decision process is starting to look like a much bigger project than I'd initially anticipated, and I'm definitely thinking that it might make sense to spend some time looking at a tool like MarkLogic Server.

  3. #3
    Join Date
    Nov 2010
    Posts
    12
    After reflecting on this for a while, I've started to come up with a design that seems likely to work. I used MySQL Workbench to diagram the table structure (attached to this post as a PDF).

    I invite any feedback or suggestions.
    Attached Files Attached Files

  4. #4
    Join Date
    Nov 2010
    Posts
    12
    And yet another update, this time an image showing the tables and relationships.
    Attached Thumbnails Attached Thumbnails lp-diagram2.jpg  

  5. #5
    Join Date
    Nov 2010
    Posts
    12
    And here is my latest attempt to organize the data in a useful way.
    Attached Thumbnails Attached Thumbnails lp-diagram3.jpg   lp-sc.jpg  
    Last edited by MarkWelch; 11-13-10 at 02:09.

  6. #6
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    Quote Originally Posted by MarkWelch View Post
    And here is my latest attempt to organize the data in a useful way.
    that ERD looks pretty good

    i guess the test now is to see how well your data fits that model

    have you tried loading any data into those tables?
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  7. #7
    Join Date
    Nov 2010
    Posts
    12
    Thanks for your feedback.

    > have you tried loading any data into those tables? <

    That's what made me realize that the earlier data structure wasn't workable: when I started to flowchart the data-import process. I knew I'd need to add at least one more table, but I couldn't figure out what to do.

    When I realized that my efforts to change to the data structure were not progressing, I changed strategies: I mocked up some display pages (since the entire database is intended to sit behind a web site; the mock-up is attached in my prior post).

    This made it much easier for me to identify the data that needed to be broken into a separate table ("products," which is a refined version of my original concept of "resources"), and then to identify the need for two other tables ("versions" and "merchants") to provide somewhat richer data that will be displayed via onMouseOver textboxes.

    My next step, today, is to try to write an import script, and then once I have a small set of data in the database, write the PHP code to display the intended page content.

    Unfortunately, I've also discovered that there are some very huge gaps in the source data; to fill those gaps, I must add a huge set of new "editorial" tasks, which will require that I create a set of manual data-entry and content-editing pages (which I'd hoped to avoid, at least in the early stages).

    "Every simple project, once begun, will rapidly expand in complexity until it exceeds the available time and resources."

  8. #8
    Join Date
    Jun 2007
    Location
    London
    Posts
    2,527
    Any feedback or suggestions are welcome!
    I only had time for a quick glance at your design so forgive me if I missed something.
    • Where do you store what the valid attributes for a given table or can anything be stuck in any attribute table?
    • Why do you have multiple tables that are all pretty much identical ie the attribute tables and the relationship table?
    • Why is price a varchar?
    • Shouldn't the version table have a date?
    • Shouldn't category just be a simple hierachy? if so then the design is wrong?
    • What does the value field in the attribute tables hold?
    • You mentioned having multiple keys for works but I don't see that in your design.

    I'm afraid the design looks a bit of a mess to me.

  9. #9
    Join Date
    Nov 2010
    Posts
    12
    Mike, these are all very good questions, and you've given this more than just a "quick glance."

    > * Where do you store what the valid attributes for a given table or can anything be stuck in any attribute table? <

    For now, anything can be stuck in any attribute table. Yes, this could easily become a very serious problem, especially when I reach the point of tasking other people to designate attributes and values.

    > * Why do you have multiple tables that are all pretty much identical ie the attribute tables and the relationship table? <

    > * What does the value field in the attribute tables hold? <

    None of the tables should store duplicate data. The work-attribute table will list attributes for a particular work (e.g. for a specific work-id/fk, attribute="Publisher"&value="McGraw-Hill" or attribute="PublicationDate"&value="1884" or attribute="isbn-10"&value="123456789X").

    In contrast, the relationship tables signal a relationship between a work and a person, a category, or another work. For example, the work called "A Guide for Using Adventures of Huckleberry Finn in the Classroom" would have an entry in the r-work-work table showing the relationship that it is "about" another work, "Adventures of Huckleberry Finn," and the "value" of that relationship might represent how strongly it is about (for example, this relationship might be considered 100% while another book which discusses 15 great American novels might have an {"about" "Huck Finn"} relationship valued at "20%"). (I still haven't figured this out).

    Then when I generate a page about "The Adventures of Huckleberry Finn," it will include sections for other works (and products) that are "about" the novel, as well as links for derivative works (annotated or abridged versions), links for the novel itself (to buy the book, or download full text from various sources), and a link to the "prequel" (The Adventures of Tom Sawyer).

    * Why is price a varchar?

    And a long varchar, at that. It's because I am seeing too much complexity in pricing (e.g. one item is "$8.75 + $4.75 S&H (order by mail only)," while another might be offered for "$16.99" by Vendor A, but Vendor B has a price of "$17.99 with free shipping." I want lots of flexibility, and I don't intend to allow sorting or searching by price. I must also deal with non-US pricing for vendors that are located in Canada, UK, AU, and NZ.

    My first screen-capture of my mockup page didn't show the variety of data -- attached is another screen capture showing more of what's below the fold.

    > * Shouldn't the version table have a date? <

    No, because I'm not using "version" the way you think; the publication/edition date will be an attribute of the "work." The field product.version reflects versions like "Ebook (PDF)," "CD-ROM," "Mass Market Paperback," "Hardcover," "Workbook," etc.

    > * Shouldn't category just be a simple hierachy? if so then the design is wrong? <

    Nope, it's not hierarchical (a taxonomy), it's a mushy combination of several taxonomies and ontologies. There will actually be several different "category systems" merged into the category, some of which are hierarchical, but any category may have parent, child, sibling, or identity relationships with other categories (and the strength of a relationship will also be an issue). For example, one set of categories will be "the California Content Standards" for various disciplines (English language arts, social studies). Another will be genres of literature; others will be the simplistic category systems assigned by publishers for a particular work (e.g. "US History: Civil War"). The r-category-category table is quite likely to become a HUGE table, and the editorial task of "defining relationships between categories" and "defining relationships between works and categories" will be immense.

    The category system is large and overlapping because that's how knowledge is, and I hope to have a system that allows teachers to search using whatever category system they choose (among those that I implement), and hopefully my system will allow teachers to retrieve information using one category system (for example, if the teacher searches for lesson plans about "Egyptian Architecture," then hopefully my category-relationships will allow my site to display lessons that were classified under the California Social Studies History, Grade 6 Content Standard 2.5 (students should be able to "Discuss the main features of Egyptian art and architecture.")

    > * You mentioned having multiple keys for works but I don't see that in your design. <

    I think you're talking about my "overloading" language in my first post, when I was trying to deal with multiple SKUs for a single product (e.g. isbn, upc); that's why I went with the EAV system in the "-attribute" tables.

    > I'm afraid the design looks a bit of a mess to me. <

    Me, too. ;-)

    And it's going to get more complex, as I realize I forgot to incorporate my intended "scoring system" to use in sequencing items on results pages. This includes grouping results ("Resources for Teachers" and "Resources for Students") and sequencing results (based on editorial evaluation of quality of each resource, or strength of the relationship between an item and the work or category searched).
    Attached Thumbnails Attached Thumbnails lp-sc2.jpg   lp-diagram4.jpg  
    Last edited by MarkWelch; 11-13-10 at 23:24.

  10. #10
    Join Date
    Jun 2007
    Location
    London
    Posts
    2,527
    Quote Originally Posted by MarkWelch View Post
    For now, anything can be stuck in any attribute table.
    This means that you'll end up with bad data in there pretty quickly and if you can't trust the data then there's no real point having the database. You also mentioned having no duplicate data but there's nothing stopping you (or a colleague) creating two attributes fname and first_name which means you'll end up with duplicate data.
    Quote Originally Posted by MarkWelch View Post
    CPU load of queries against data stored
    At the moment the only way you can search for a given key is to search every attribute in every attribute table - something tells me this is going to be slow. Putting all your attributes into a single table with a type field simplifies the searches (one table) then having a meta data table about the attributes could limit your searches making them faster.

    Quote Originally Posted by MarkWelch View Post
    I've just spent some time googling "Entity-Attribute-Value"
    At the moment you only appear to have a couple of different types of entity and I just can't see why it wouldn't be easier (and more correct) to use a normal 3NF design. I'm not against EAV myself in the right place but I personally think you'll come unstuck.

    Perhaps our (and your) first questions should have been:
    • What is the aim of your system?
    • Will people use it if they can't sort by price?
    • What benefits does it offer over simply typing "Huckleberry Finn" into Amazon? or Google? or Froogle?
    Last edited by mike_bike_kite; 11-14-10 at 10:46.

  11. #11
    Join Date
    Nov 2010
    Posts
    12
    Quote Originally Posted by mike_bike_kite View Post
    Perhaps our (and your) first questions should have been:
    • What is the aim of your system?
    • Will people use it if they can't sort by price?
    • What benefits does it offer over simply typing "Huckleberry Finn" into Amazon? or Google? or Froogle?
    The aim of my system is to provide a convenient, useful resource for K-12 teachers to use to identify and access "relevant and useful" lesson plan resources which they can use in their classrooms.

    Some background:
    • I briefly explored a career in teaching, from 2002-2004. I quit in February 2004 for a variety of reasons (discussed at "Thoughts on NOT becoming a teacher").
    • After I quit teaching, I decided to compile a list of Literature lesson-plan resources for secondary (grades 7-12) teachers; I published that list on a site called LitPlans.com; the data for that site was stored in a single Excel spreadsheet (one big table with many columns and 25,000 rows of data).
    • I sold that web site to another company in mid-2007 (and agreed not to compete for two years).
    • That web site appears substantially unchanged in more than 3 years.


    I'm now returning to this niche, but planning a more ambitious index of lesson-plan resources for K-12 teachers in all subjects.

    Google generally doesn't provide useful results because it's a resource for "everyone," not just for teachers, and Google offers no genuine quality evaluation. If a teacher searches for "Huck Finn Lesson Plan," and there's a crappy page with the title "Huck Finn Lesson Plan," Google will likely show that page first in the result, and won't show most of the resources listed in my mock-up page. Google returns many results which are "relevant but not useful," but in the first few pages of results, Google never returns many truly relevant results for which the search phrase doesn't appear on the page. Perhaps most important, Google doesn't return results for a specific audience, but for "everyone."

    And from my experience in the classroom, I know that relevance and quality are the primary concerns, with price being secondary. Teachers, especially first-year teachers, are poor in two ways: they have limited time and limited money; when I was teaching, I was glad to spend $20 or $30 for a resource that I knew would save me many hours of time. (During my half-year spent teaching full-time, I spent thousands of dollars on resources for my classroom.)

    In my experience, the majority of "free" resources are a waste of time, but of course there are thousands of exceptions; hopefully my directory will be able to present that "evaluation" information, and will be able to sort results so that the most "relevant and useful" materials appear first, with less-useful and less-relevant materials appearing last.

    Teachers are the primary audience for my web site; I think I have a pretty good idea of what information they'll seek, and I have some ideas about how I think they'll want to access that information.

    In my first project, students were an incidental, unintended secondary audience; the new site will deal more explicitly with the different needs of students, but they're still not a primary audience.

    Finally, I strongly believe that to be successful (that is, to be most "relevant and useful" for teachers), my directory will require a substantial editorial role. I intend to use automation when possible (for example, to identify 404 and 501 errors), but each "collection of resources" that are added into the database will require careful editorial attention from a human being.
    Last edited by MarkWelch; 11-14-10 at 15:40.

  12. #12
    Join Date
    Nov 2010
    Posts
    12
    Quote Originally Posted by mike_bike_kite View Post
    ... you'll end up with bad data in there pretty quickly and if you can't trust the data then there's no real point having the database. You also mentioned having no duplicate data but there's nothing stopping you (or a colleague) creating two attributes fname and first_name which means you'll end up with duplicate data.
    Of course you're right, although your example references one of the "important fields" which won't appear in the EAV table except as "variants" (i.e. "Mark|Twain" and "Samuel|Clemens" will appear in the persons table; in the r-persons-persons table (EERV), an entry will reflect that "Mark|Twain" (entity1) is the "Pen Name of:" (relationship) to "Samuel|Clemens" (entity2). Yes, I realize there's no "value" for this relationship.

    I'm working on a list of the attributes that I would intend to appear in each EAV (*-attribute) table, and I certainly might choose to create a separate *-allowable-attributes table to constrain each of the EVA tables, but I don't think any of the typical "attributes" occur frequently enough to justify including them as fields in a table.

    Quote Originally Posted by mike_bike_kite View Post
    At the moment the only way you can search for a given key is to search every attribute in every attribute table - something tells me this is going to be slow. Putting all your attributes into a single table with a type field simplifies the searches (one table) then having a meta data table about the attributes could limit your searches making them faster.
    My intent is that a particular attribute will appear only in one -attribute table; perhaps more important, I will probably only search for a particular attribute in one -attribute table.

    I actually don't intend to let site visitors "search the database," but instead they'll search against a site index (a completely separate database; the current database probably won't not even be accessible from the web). Instead, I'll define a fixed set of pages to be generated from my database, and ftp'ed to my web host. I assume that at some point I'll shift so the whole database runs on my web server, and pages are dynamically generated, but that's not my intent at launch.

    Quote Originally Posted by mike_bike_kite View Post
    At the moment you only appear to have a couple of different types of entity and I just can't see why it wouldn't be easier (and more correct) to use a normal 3NF design. I'm not against EAV myself in the right place but I personally think you'll come unstuck.
    You might be right, but every day I discover a new "attribute" that I think might be useful to include in the database, and I am acutely aware that most "attributes" occur very sparsely among the records for a particular entity. Note that I spun off "versions" and "merchants" as separate tables from the "products" table, which also has no EAV (entity-attribute-value) or EERA (entity-entity-relationship-value) table.
    Last edited by MarkWelch; 11-14-10 at 15:30.

  13. #13
    Join Date
    Nov 2010
    Posts
    12
    Attached is an updated database diagram, along with my preliminary list of likely attributes and relationships.
    Attached Thumbnails Attached Thumbnails lp-diagram5.jpg   lp-attributes-and-relationships.jpg  

  14. #14
    Join Date
    Nov 2010
    Posts
    12
    I've made some significant changes to my overall database structure. Most significant, I've tossed out the "category" structure entirely, leaving that for implementation at a later time, while I concentrate for now on implementing a system that works with one category of data.
    Attached Thumbnails Attached Thumbnails li-diagram.jpg  

  15. #15
    Join Date
    Nov 2010
    Posts
    12
    FYI - after a week of experimenting, I'm still satisfied with the database structure shown in the attachment to my prior post.

    I've also been very, very happy with the (free) MySQL Workbench software, which allowed me to diagram the database (including relationships) and then generate SQL code to create the database and tables. After importing some test data into the database, I used phpmyadmin to export that data to a SQL script so I could restore it each time I tweaked the database design. I also spent some time playing with phprunner, which seems quite nice and powerful but doesn't seem like the right solution for my coding needs.

    After spending much of the past week "cleaning data" (using Microsoft Excel), I'm now starting to code the main import script and the display script.

    Many thanks to those who responded!
    Last edited by MarkWelch; 11-23-10 at 23:22.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •