My Database setup with huge amounts of time-series data. Experience? Suggestions?
** Note: To start out, let me just say that when it comes to computers I'm completely self-taught. I started learning web development and databases about 2 months ago; most of my experience is with Linux administration and general programming.
I am an electrician by trade and am in charge of (doing it myself) building an online metering system that consists of a device you install next to your main power panel, and a web portal where you can view your data. Inside the device is a small Linux-based computer that handles all the data logging, local storage (SQLite or MySQL), and real-time analytics. Once the computer has finished its calculations, it uploads the meter data to a cloud-based database (MySQL on DigitalOcean), where the data can be viewed from a web app (that I'm also building).
The device currently uploads 122 columns of floats* along with an incrementing id (which is probably not needed) and a datetime. A row is uploaded every 2 seconds, for a total of 43,200 rows per day. (Yes, I need it that frequently; I need some of it at 0.125-second intervals, or as fast as I can get it.)
The device also uploads calculated aggregations (sums, min, max, average, and some trig and integrals) for time periods of 60 s (1 min), 300 s (5 min), and 900 s (15 min), all of which are calculated from the 2-second data plus additional data stored in buffers on the device.
There are also additional tables with event data (1 row per event) and daily statistics (a single row per day).
*(Values range from 0 to 1,000,000, rounded to the 5th decimal place, i.e. 0.00001.)
example database layout per meter:
meter_data...................<- database name
----meter1_2ss..............<- 2 second data table with 124 columns
----meter1_60ss............<- 60 second data table with 210 columns
----meter1_300ss..........<- 300 second data table with 210 columns
----meter1_900ss..........<- 900 second data table with 210 columns
----meter1_events.........<- events table I haven't designed yet
----meter1_daily............<- daily statistics table I haven't implemented yet
----meter2_2ss..............<- 2 second data for another device
----meter2_60ss............<- 60... every meter has the same tables
I'm using InnoDB with the Barracuda file format and innodb_file_per_table=ON.
With this setup, one month's worth of data comes in at just over 1.3 million rows and a file size (with Barracuda compression enabled) of about 200 MB.
Just looking at the above, I noticed that at a minimum I could merge the 60-, 300-, and 900-second tables together and add an indexed time-period identifier.
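To make that merge concrete, here's a minimal sketch of one combined aggregates table keyed by timestamp plus a `period` column. This is only an illustration: the aggregate column names (`avg_w`, `min_w`, `max_w`) are made up, and sqlite3 stands in for MariaDB/MySQL so the snippet is runnable as-is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real MySQL/MariaDB server
conn.execute("""
    CREATE TABLE meter1_aggregates (
        ts      TEXT    NOT NULL,   -- datetime of the window start
        period  INTEGER NOT NULL,   -- window length in seconds: 60, 300, or 900
        avg_w   REAL,               -- hypothetical aggregate columns
        min_w   REAL,
        max_w   REAL,
        PRIMARY KEY (ts, period)
    )
""")
# An index leading with period lets "all 5-minute rows" queries skip the other windows
conn.execute("CREATE INDEX idx_agg_period ON meter1_aggregates (period, ts)")

conn.execute(
    "INSERT INTO meter1_aggregates VALUES ('2016-01-01 00:05:00', 300, 230.1, 228.9, 231.4)"
)
rows = conn.execute(
    "SELECT ts, avg_w FROM meter1_aggregates WHERE period = 300"
).fetchall()
```

One table instead of three means one schema to maintain, and the `WHERE period = ...` filter plus the index recovers the same per-window access pattern.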
For reference, I learned everything I know about databases on this project. I learn best by just jumping right in and then having to fix whatever I screw up (note: this only applies to my own projects...). I don't expect anyone here to teach me the nitty-gritty details of schema/database planning or design. That's on my personal to-learn list (although if you have suggested learning material or links, I'm grateful). When I designed this I knew just about nothing about databases in general, and now I know enough to know that this is probably a bad setup, and that when I can afford it I should hire someone to help me design a better system. But until then (could be a while) I'm stuck doing this by myself, and I need to come up with a design/method that will scale.

One of my goals is to store the data for as long as possible, and at a minimum 2 years. As of right now I have 2 devices set up and working, and I'm in the process of adding 10 more. Our company specializes in designing power-quality-improvement equipment, and we are going to be installing one of these devices with every system we sell, so I'll probably have 100 more by the end of the year.
Tools I'm using:
Python and SQLAlchemy (just for table creation for now; ORMs are just as hard to learn as SQL, especially when you don't even know much SQL)
The official Python MySQL connector, with the compressed protocol and prepared statements
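At this insert rate, the per-row round-trip to the server usually costs more than the insert itself, so buffering a few seconds of samples and sending them in one `executemany()` call helps a lot. The sketch below uses sqlite3 so it runs standalone, but `mysql.connector` cursors expose the same DB-API 2.0 `executemany` method; the table and column names here are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the cloud MySQL instance
# Trimmed to 2 of the 122 float columns for brevity
conn.execute("CREATE TABLE readings (ts TEXT, va REAL, vb REAL)")

# Buffer several 2-second samples on the device, then flush them in one call.
batch = [
    ("2016-01-01 00:00:00", 229.9, 230.2),
    ("2016-01-01 00:00:02", 230.1, 230.0),
    ("2016-01-01 00:00:04", 230.0, 229.8),
]
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", batch)
conn.commit()  # one transaction per batch instead of per row

count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
```

Batching also pairs well with the compressed protocol already in use, since larger payloads compress better than single rows.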
My concerns (yes, I know you probably have more):
For now, with 2-10 devices, this works. But what about when I have more: hundreds, thousands?
As it is set up now, with 100 devices the database will be between 150 GB and 240 GB within a year. We're also going to start selling this device standalone, so that number could shoot up fast. Right now it's set up on a cloud database that purges the 2-second data after 8 days, and replicates to a local database I set up at my office for long-term storage. That will only work until the replication maxes out our internet connection. What's the best cost-effective way to host this?
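For reference, the 8-day purge is a single windowed DELETE. Sketched here with sqlite3 and a fixed "now" so the result is reproducible; on MySQL/MariaDB the equivalent condition would be `ts < NOW() - INTERVAL 8 DAY`, typically run from a scheduled event or cron job.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (ts TEXT NOT NULL, watts REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)", [
    ("2016-01-01 00:00:00", 500.0),   # older than 8 days: purged
    ("2016-01-20 00:00:00", 510.0),   # inside the window: kept
])

# Delete everything older than 8 days relative to a fixed reference date
conn.execute("DELETE FROM readings WHERE ts < datetime('2016-01-21', '-8 days')")
remaining = conn.execute("SELECT ts FROM readings").fetchall()
```

On a large table, partitioning by day (MySQL range partitions) makes this cheaper still, since dropping a partition is far faster than a row-by-row DELETE, but that's an optimization beyond this sketch.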
Is MySQL the right database for this? Would something like PostgreSQL or MongoDB be better?
At 100 devices, the database will be processing around 4.43 million inserts a day. Can a database handle that many inserts? If not, is there a technology that would?
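A quick back-of-the-envelope check of that figure, from the upload intervals stated above (it comes out slightly above 4.43 million, assuming every aggregate window also produces a row):

```python
SECONDS_PER_DAY = 86_400

# Rows per device per day, from the upload intervals described above
rows_2s  = SECONDS_PER_DAY // 2     # 43,200 raw 2-second rows
rows_60  = SECONDS_PER_DAY // 60    # 1,440 one-minute aggregates
rows_300 = SECONDS_PER_DAY // 300   # 288 five-minute aggregates
rows_900 = SECONDS_PER_DAY // 900   # 96 fifteen-minute aggregates

per_device = rows_2s + rows_60 + rows_300 + rows_900   # 45,024 rows/day
total      = per_device * 100                          # ~4.5 million rows/day at 100 devices
per_sec    = total / SECONDS_PER_DAY                   # average inserts per second
```

The important number is the average rate: roughly 52 inserts/second, which is well within what a single MySQL/MariaDB instance on modest hardware can sustain, especially with batched inserts.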
I've come across schema designs where all the time-series data was placed in a single table with device identifiers. Would that work for my 2-second data? Is it advisable? (http://stackoverflow.com/questions/4...ational-or-non)
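That single-table design, sketched minimally (sqlite3 as a runnable stand-in; `meter_id` and `watts` are hypothetical names, with `watts` standing in for the 122 float columns):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE readings (
        meter_id INTEGER NOT NULL,  -- which device the sample came from
        ts       TEXT    NOT NULL,
        watts    REAL,              -- stand-in for the 122 float columns
        PRIMARY KEY (meter_id, ts)  -- keeps each meter's rows clustered together
    )
""")
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", [
    (1, "2016-01-01 00:00:00", 500.0),
    (2, "2016-01-01 00:00:00", 740.0),
    (1, "2016-01-01 00:00:02", 502.5),
])

# Per-meter queries are a simple indexed range scan on the composite key
meter1 = conn.execute(
    "SELECT ts, watts FROM readings WHERE meter_id = 1 ORDER BY ts"
).fetchall()
```

With InnoDB, a `(meter_id, ts)` primary key clusters each meter's rows physically together on disk, so per-meter time-range queries stay fast even as the number of devices grows, and you stop needing a new set of tables per meter.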
I neither need nor expect anyone to post a 5000-word explanation of how best to do anything. If you want to take the time to do that, that's fine, and thank you. I'm looking for basic advice and community experience, as it's pretty obvious I don't have any. I'll be grateful for whatever I get, even if I have to google for a week to understand it.
It looks like what you're really creating is one table of info, so the database aspects shouldn't be difficult. You might want to create a second table, being a list of devices, but that would still be simple as far as SQL goes.
I don't know much about NoSQL databases, but MongoDB may be a good choice since there's no complicated querying involved. Between PostgreSQL and MySQL, I would definitely go with PostgreSQL. From what I understand, MySQL is not fully ACID compliant, which means, especially with such frequent updates, you might have a lot of problems with data consistency.
Thanks for the reply. Looking back at my question, I was pretty desperate for information and ended up typing way more than I needed to.
I ended up using MariaDB and creating 3 tables: one for the time-series data, another for devices and system layouts, and lastly one for configuration and settings per meter. It seems to be working pretty well. I figured getting it up and running would teach me enough to know what changes I need to make. I'll look at switching to PostgreSQL, or whatever else I might find, on more of an as-needed basis.
One thing that worries me a little is multiple large queries on the single data table. Won't this degrade performance or block other queries? This database is accessible through a website, and once it's in production, many users will be querying it simultaneously.
If this is the case, would it be better to have a single table per company? (Each company can have as many meters as they want, but all their data goes to a single table.)
I don't know much at all about performance tuning, but there are ways of writing your queries to put some of the memory load on the client instead of the server. Users querying the table generally will not block other users' access to the table; writers will only lock the row(s) they're modifying. Check the MariaDB documentation on that, though.