  1. #1
    Join Date
    Feb 2013
    Posts
    5

    Too many columns? Is there a better way?

    Hi all - new to the forum - new to databases in general. Go easy on me here.

    Not a programmer by trade - I'm an electrical engineer who has been tasked with a daunting data task, and I was hoping maybe someone here would give me a few pointers before I get too far down what is perhaps the wrong road.

    I have the following scenario - we are building a relatively complex piece of industrial equipment that will be sold and delivered to customers around the world. Each of these units consists of various PLCs, interfaces, etc. that all speak Modbus/TCP. I have a piece of electronic hardware that gathers data from 125 sensors and inputs and runs the machine based on that data.

    The database challenge comes in here - the powers that be want all 125 pieces of sensor and input data logged to a database once a second, every second, covering the past 365 days. I've programmed a communication library that accepts the incoming Modbus/TCP connection on an embedded industrial PC mounted in the main control cabinet. I'm running CentOS 6 and MySQL as part of a LAMP stack on the industrial PC (it's got a 256 GB SSD for local storage).

    The data coming in is all signed short integers (16-bit, i.e. SMALLINT in MySQL terms). I'm using MySQL Connector/C++ to populate the DB.

    My "data" storage plan initially was simply:
    Main table - "LoggedData"
    127 columns -
    the first one is a sequential field (INT)
    the second one is a timestamp of when the entry was made (TIMESTAMP)
    next 125 columns - all SMALLINT - one for each sensor/input in the machine. This is my main concern....
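
    In rough DDL terms, that plan looks something like this (just a sketch - the column names are placeholders for the real sensor tags, and the DEFAULT on the timestamp is one way to fill it in automatically):
    Code:
    CREATE TABLE LoggedData (
        ID        INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,  -- sequential field
        LogTime   TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,      -- when the entry was made
        Input_001 SMALLINT NOT NULL,
        Input_002 SMALLINT NOT NULL,
        -- ... Input_003 through Input_124 ...
        Input_125 SMALLINT NOT NULL                                  -- one column per sensor/input
    );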

    Also, I was going to write a daily batch job that parses the main data table for every entry whose timestamp falls on the "00" second (the on-the-minute record) and populates Table 2, which is a bit different:
    Table 2 - "SummaryData"
    12 columns -
    the first one is a sequential key field (INT)
    the second one is the timestamp (TIMESTAMP)
    next 10 columns - 10 selected values out of the 125 that indicate critical pieces of information (e.g. whether the machine is running or stopped, the speed at which it's running, temperatures of some critical components, input voltage and current, etc.).
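
    The core of that batch job would boil down to a single statement, something like this (again just a sketch - the two Input_ columns stand in for whichever 10 values get picked, and the names follow the sketch above):
    Code:
    INSERT INTO SummaryData (LogTime, Input_001, Input_002 /* ... the 10 chosen inputs ... */)
    SELECT LogTime, Input_001, Input_002 /* ... */
    FROM LoggedData
    WHERE SECOND(LogTime) = 0
      AND LogTime >= CURDATE() - INTERVAL 1 DAY
      AND LogTime <  CURDATE();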

    The thought there (I know the second table is redundant to the first in some ways) is that writing reporting programs against the full data set would be cumbersome when we finally get around to hiring our app developers, who will be writing the reporting and web monitoring interfaces. Really, the main large data set is only of concern to engineering for machine maintenance and monitoring, while the subset is useful to the machine owner/operators, the dealers, the sales folks who want to know how the machines are used, and so on...

    Am I on the right road? Should I re-think this??? Suggestions?

  2. #2
    Join Date
    Feb 2004
    Location
    New Zealand
    Posts
    1,424
    What I would do is have the main table with only 4 fields.

    I would store the data "down", i.e. each sensor reading gets its own record:


    MainTable
    MainTableID PK
    SensorName (or it could be the sensor's IP address)
    LDate (log date/time)
    SValue (the value of the sensor)

    Then you can write queries to filter the data you want to see.
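
    Something like this, just as a rough sketch (names and types made up):
    Code:
    CREATE TABLE MainTable (
        MainTableID INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        SensorName  VARCHAR(32) NOT NULL,   -- or the sensor's IP address
        LDate       TIMESTAMP   NOT NULL,   -- log date/time
        SValue      SMALLINT    NOT NULL    -- the value of the sensor
    );

    -- filtering is then just a WHERE clause, e.g. one sensor for the last hour
    SELECT LDate, SValue
    FROM MainTable
    WHERE SensorName = 'Sensor_001'
      AND LDate >= NOW() - INTERVAL 1 HOUR
    ORDER BY LDate;
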
    Hope this helps.

    See clear as mud


    StePhan McKillen
    the aim is store once, not store multiple times
    Remember... Optimize 'til you die!
    Programming environment:
    Access based on my own environment: DAO3.6/A97/A2000/A2003/A2007/A2010
    VB based on my own environment: vb6 sp5
    ASP based on my own environment: 5.6
    VB-NET based on my own environment started 2007
    SQL-2005 based on my own environment started 2008
    MYLE
    YOUR PASSWORD IS JUST LIKE YOUR TOOTHBRUSH DON'T SHARE IT.

  3. #3
    Join Date
    Feb 2013
    Posts
    5
    Honestly, I did consider the single-row-per-sensor option right when I first started. What I didn't mention before, though, is that the data sent from the main process controller is essentially a single string of 125 short integers, once per second. My C++ code takes that string of integers and stuffs it into an array of short ints, one per sensor, which lets me use a single prepared statement and a single insert-and-commit to write the entire whack of data across one row.

    Basically, I get a single TCP/IP packet with 125 data points (250 bytes) once per second from the main process controller, all sharing one timestamp. I could create 125 rows in the table, all with the same timestamp, each holding a different data point, but I figured that adding 450,000 rows of data per hour with 450,000 inserts to my DB wouldn't really be any better than 3,600 inserts writing to 125 columns when it comes to DB complexity... Or am I really wrong, and that is the way to do it? I can easily rewrite my insert loop to do 125 individual inserts rather than one big insert... Thoughts?
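
    (I suppose even with one row per reading it could still be a single multi-row INSERT per packet rather than 125 separate statements - roughly like this, using the table layout suggested above with made-up names and values - so maybe it's really the row count rather than the statement count that I'm worried about.)
    Code:
    INSERT INTO MainTable (SensorName, LDate, SValue) VALUES
        ('Input_001', '2013-02-02 12:00:00', 123),
        ('Input_002', '2013-02-02 12:00:00', -45),
        -- ... one row per sensor, 125 in total ...
        ('Input_125', '2013-02-02 12:00:00', 0);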

    I also considered 125 tables, one per sensor, with the data in each FK'd back to a main table containing the timestamp, PK, and any other relevant info... But that would be a frackin' beast to administer.

  4. #4
    Join Date
    Jun 2003
    Location
    Toronto, Canada
    Posts
    5,516
    Quote Originally Posted by Hydraglass View Post
    The thought there (I know the second table is redundant to the first in some ways) is that writing reporting programs against the full data set would be cumbersome
    Your approach seems perfectly fine to me; there's nothing wrong with having a few hundred columns in a table, and in your case the model is warranted by the inputs (a single IP packet) and relatively simple reporting requirements.

    Depending on your performance constraints, you may not even need the summary table, since its results can be produced by a simple SELECT ... GROUP BY statement every time they are needed, so there's no development complexity there. I can imagine it might take a few seconds to execute on embedded hardware with 30 million rows to process, so if that's an issue, the summary table will help.
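
    For instance, the per-minute figures could be computed on the fly with something along these lines (column names are just placeholders; pick whichever aggregates make sense per sensor):
    Code:
    SELECT DATE_FORMAT(LogTime, '%Y-%m-%d %H:%i:00') AS minute_mark,
           AVG(Input_001) AS avg_input_001,
           MAX(Input_002) AS max_input_002
    FROM LoggedData
    WHERE LogTime >= NOW() - INTERVAL 1 DAY
    GROUP BY minute_mark
    ORDER BY minute_mark;
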
    ---
    "It does not work" is not a valid problem statement.

  5. #5
    Join Date
    Feb 2013
    Posts
    5
    Thanks again for the feedback - this was sort of what I suspected, but since my DB experience is pretty limited and my Google searches kept turning up "if you've got 100 columns you've done it wrong" posts all over the web, I wanted to run it past an audience of experts with a thousand times more expertise.

    There's a second reason for the summary table as well: once a day, the requirements call for a certain amount of logging data to be duplicated back to corporate HQ for integration into a performance monitoring portal website (yet to be built). I was thinking that if the local hardware did the work of building the summary table (which will also be queried by the HMI on the machine locally to produce some quick, simple reports of yesterday's performance), it would be trivial to also mirror that table back to HQ over a VPN, with minimal bandwidth usage for each individual machine. I've run all this past my management, but they know less about DB structure than I do.

  6. #6
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    In general, a table with 128+ columns is a sign of poor design, but in your case it makes perfect sense.

    I would recommend using UTC for your timestamp data, first and foremost to avoid issues with Daylight Saving Time (which often plays hob with industrial controller data).
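
    For what it's worth, with MySQL you can pin the session (or the whole server) to UTC so TIMESTAMP values go in and come out without any local-time conversion - a quick sketch:
    Code:
    -- per session:
    SET time_zone = '+00:00';

    -- or server-wide in my.cnf:
    -- [mysqld]
    -- default-time-zone = '+00:00'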

    I don't see much (if any) need for the summary table. I generally have the machinery collect raw data, then either export that raw data in toto to another system for processing, or summarize the data only as part of the extraction process. You obviously know your system far better than I do, but that has been a "time tested" methodology that has worked well for me for decades.

    While I'm certain that having a "database geek" available at the next desk would help with some issues and choices, you appear to have a clear understanding and a good practical approach to databases (and automation). Feel free to explain scenarios and ask questions, but you sure seem to be on the right track to me!

    -PatP
    In theory, theory and practice are identical. In practice, theory and practice are unrelated.

  7. #7
    Join Date
    Nov 2004
    Location
    out on a limb
    Posts
    13,692
    You don't say what DB you are using, but one thing to bear in mind is that all DBs have internal limits. Designing something with 128 columns may well not be a problem, but when you come to start using queries you may run into trouble.

    For instance, Access/JET has an internal limit of around 256 columns (never more, often less), and that restriction is (or at least was) also present in queries, i.e. whatever the limit is, it applies to table AND query design. As with all these things, you need to design the process rather than let it evolve, so there is little or no risk of hitting these limits partway into the project when you have already committed to code development.
    I'd rather be riding on the Tiger 800 or the Norton

  8. #8
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Quote Originally Posted by healdem View Post
    you don't say what DB you are using
    The OP's reference to the LAMP stack implies MySQL.

    -PatP

  9. #9
    Join Date
    Feb 2013
    Posts
    5
    Pat was right - I'm running CentOS 6, MySQL, MySQL Connector/C++, and a Modbus slave connector library for C++ from modbusdriver.com.

    Our interface programmer (when we hire him/her) will be building an iPhone app, an Android app, and an HTML5 web interface to the summary data, which will hopefully include all sorts of nice reports and graphs, plus the ability to pull some near-real-time data via VPN back from the machinery itself.

    I finished my initial implementation this weekend. After some consultation with the boss and a few of my associates, I talked them into scaling back a few of the data points we are monitoring, just to make it a little more reasonable. My main table is 102 columns: the first is an auto-incremented unique serial key, the second is a TIMESTAMP using DEFAULT CURRENT_TIMESTAMP (the RTC on the hardware is set to UTC and uses NTP to keep its clock set), and the next 100 are the 100 data points coming in via the Modbus packet array. I tested it with one of the PLC controllers I brought home for the weekend, and I could push the whole 100-data-point Modbus packet in repeatedly at 100 ms intervals with my code and the DB keeping up no problem, so it should be a breeze at 1 second.
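
    The C++ side just binds the 100 values from the packet array to one prepared statement per second; the SQL text behind it is basically this (a sketch - column names are the same placeholders as earlier, the middle columns and placeholders are elided, and the timestamp column fills itself in via its DEFAULT):
    Code:
    INSERT INTO LoggedData
        (Input_001, Input_002, /* ... Input_003 through Input_099 ... */ Input_100)
    VALUES
        (?, ?, /* ... */ ?);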

    The hardware it will be running on is a small industrial Core i7 with 16 GB RAM and a 256 GB SSD - it should have plenty of guts no matter how big the table gets. I plan on doing some data aging and archiving to keep the table from exceeding 2 GB. I've still got to do some math, but I think I'll end up keeping something like 30 days of 1-second data in the main table and then keeping the older stuff at 1-minute intervals in another table that holds a whole year's worth of data.
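
    The aging job will probably amount to a pair of statements along these lines (a rough sketch - the archive table name is a placeholder and the schedule is still TBD):
    Code:
    -- copy one record per minute of everything older than 30 days
    INSERT INTO ArchiveData
    SELECT * FROM LoggedData
    WHERE LogTime < NOW() - INTERVAL 30 DAY
      AND SECOND(LogTime) = 0;

    -- then age out the 1-second rows from the main table
    DELETE FROM LoggedData
    WHERE LogTime < NOW() - INTERVAL 30 DAY;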

    I'll update everyone on how it goes once it's actually running and logging. This was a major stumbling block to getting our prototypes shipped out to our first customers - everyone was pushing very hard to have this data logging in place before allowing the machinery to get turned on.

  10. #10
    Join Date
    Feb 2013
    Posts
    3
    Quote Originally Posted by Hydraglass View Post
    the data that is sent from the main process controller is essentially a single string of 125 short integers
    Why don't you leave the data like that and enter it into a single field in the DB?

    You didn't specify how you wanted this data to be retrieved, merely that the data was to be 'logged'.
    Why not leave the data unparsed and store it as-is if you're not concerned with being able to sort out when sensor 1 was off/on?

  11. #11
    Join Date
    Feb 2013
    Posts
    5
    Quote Originally Posted by TheOddPerson View Post
    Why don't you leave the data like that and enter it into a single field in the DB?

    You didn't specify how you wanted this data to be retrieved, merely that the data was to be 'logged'.
    Why not leave the data unparsed and store it as-is if you're not concerned with being able to sort out when sensor 1 was off/on?
    Moving forward, reporting and pulling out data is going to be very important for building client performance reports. If I just wanted the raw data I wouldn't bother with a DB - I'd just dump it into a log file and then have to write software to parse it anyway. The idea here is to get the strength of the SQL language, the reliability and speed of MySQL, and the flexibility to pull specific data out into reports, while keeping the response time to the data source (PLC) fast enough to keep its loop comms happy (I have 50 ms to ack the received data).

    I've made pretty good progress - I have the local server units running on 3 prototype machines now, 2 of which are offsite and one still here in our engineering design lab. Right now I'm logging 82 pieces of data continuously, 3,600 records per hour. Every 12 hours, each local server runs a SELECT for all entries with timestamps in the last 12 hours, but only one per minute, and writes the result to a .csv file. A shell script then takes over: it compresses the CSV and uses scp to copy it back to the main server at HQ (usually over a 3G modem connection). As soon as that file transfer is flagged as complete in a folder (by the arrival of a second "0 byte" file), another job takes over on the server, uncompresses the file, determines which piece of machinery it came from, and imports and appends it onto the table for that specific machine on the HQ main DB server.
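
    The 12-hour extract itself is roughly this (a sketch - the output path and the table/column names are stand-ins for the real ones):
    Code:
    SELECT *
    FROM LoggedData
    WHERE LogTime >= NOW() - INTERVAL 12 HOUR
      AND SECOND(LogTime) = 0
    ORDER BY LogTime
    INTO OUTFILE '/tmp/last12h.csv'
      FIELDS TERMINATED BY ',' ENCLOSED BY '"'
      LINES TERMINATED BY '\n';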

    The tables are a little bit of a pain at 102 columns, but when I need reports such as "Accumulator B and C Pressure for the last X hours" it's as braindead simple as making a query like:
    SELECT Time,Input_037,Input_038 FROM Unit00002 WHERE (whatever) ORDER BY (Time) DESC LIMIT (X*60);

    For a guy who had never written even one SQL query until 2 weeks ago, being able to keep it that simple is nice.

    I've got the OK to go hire someone now to develop all the apps, reporting tools, portals, and other good stuff that will go along with the data. Hopefully whoever I get doesn't have a heart attack when they see what I've done thus far, and doesn't feel the need to completely reinvent the wheel.

  12. #12
    Join Date
    Jun 2004
    Location
    Arizona, USA
    Posts
    1,848
    No, really - the only logical choices are a "wide" table as you've done it, or a parent/child arrangement, where the parent table contains the timestamp and ID, and the child table is "narrow", with an ID relating to the parent record, an instrument ID, and the data value.

    The latter approach can be more flexible for reporting, but it does require storing an extra foreign key field per instrument reading. (On the other hand, if there isn't an instrument reading for a given timestamp, no data at all is stored for that timestamp...)

    In the latter case, the equivalent SQL would be along the lines of
    Code:
    SELECT P.Time, C1.MV, C2.MV
    FROM ParentRec P
      LEFT JOIN ChildRecs C1
        ON P.ID = C1.ID AND C1.InputID = 37
      LEFT JOIN ChildRecs C2
        ON P.ID = C2.ID AND C2.InputID = 38
    WHERE (whatever)
    ORDER BY (Time) DESC
    LIMIT (X*60);
    Lou
    使大吃一惊
    "Lisa, in this house, we obey the laws of thermodynamics!" - Homer Simpson
    "I have my standards. They may be low, but I have them!" - Bette Middler
    "It's a book about a Spanish guy named Manual. You should read it." - Dilbert


  13. #13
    Join Date
    Jun 2004
    Location
    Arizona, USA
    Posts
    1,848
    When the time comes for archiving the historical data, you might want to think about a "reverse boxcar" algorithm. Essentially, your first two stored points (timestamp and value) are used to calculate a slope, and then, as long as the next measured value falls on the slope you've calculated (within a tolerance you specify), no further points are needed. Only when the data falls outside the tolerance band around the calculated value is another point stored, and a new slope calculated.

    A DCS (Distributed Control System) I used to work with used this approach: we routinely got about a 50:1 compression ratio, and still kept meaningful historical data.
    Lou

