Page 1 of 2 12 LastLast
Results 1 to 15 of 16
  1. #1
    Join Date
    Mar 2004
    Location
    Eindhoven, Netherlands
    Posts
    8

    Unanswered: Performance issue retrieving latest data

    Hi,


    I've got this table:
    CREATE TABLE Telemetry (
    TagID INT NOT NULL,
    DataType INT NOT NULL REFERENCES DataTypes,
    SeqNr INT AUTO_INCREMENT,
    Value FLOAT,
    DateIn DATETIME NOT NULL,
    PRIMARY KEY (TagId, DataType, SeqNr)
    );

    It contains telemetry data from tags that can deliver multiple kinds of data. Each time new data is received, the data is stored in this table.

    What I want is pretty simple: from all telemetry data series (tagid/datatype tuples) I want to see the latest measurement. I've succeeded using a single query but at a terrible speed. Not fast enough for user-interactivity. The performance really drops as the series get larger.

    I've tried LEFT JOIN queries in MySQL 4.0 and sub-select queries in MySQL 4.1-alpha (can't wait for it to be stable!)

    The sub-select is so much easier to read than the JOIN queries...:
    SELECT t1.tagid, t1.datatype, t1.value
    FROM Telemetry t1
    WHERE t1.seqnr = (
    SELECT MAX(t2.seqnr)
    FROM Telemetry t2
    WHERE t1.tagid=t2.tagid
    AND t1.datatype=t2.datatype
    );

    Is there a way to optimize for performance? Right now I'm using a workaround by iterating over all tuples "manually" and each time executing a query that sorts on the SeqNr descending and limiting the output to 1 row. To me, this seems ugly but it works.

    However I've got more complex queries coming up where I've got to do combine multiple of these series. Doing multiple queries "manually" would result in a lot of queries (not linear with the number of series that must be combined). So really need a better solution! Help!

    As a human, it's easy to see that the primary key can be used to find the entries that I'm looking for. Do a 'group by tagid, datatype' and take the one with the highest seqnr. Using that intelligence would make the query's performance linear with the number of series. Is there a way to make MySQL see things the way I do?


    Thanx!
    Mark

    PS. This seems to me as a very common problem that other people must have dealt with. Are there something like design/implementation patterns for databases?

  2. #2
    Join Date
    Oct 2003
    Location
    Slovakia
    Posts
    482

    Re: Performance issue retrieving latest data

    Originally posted by markoez
    Hi,


    I've got this table:
    CREATE TABLE Telemetry (
    TagID INT NOT NULL,
    DataType INT NOT NULL REFERENCES DataTypes,
    SeqNr INT AUTO_INCREMENT,
    Value FLOAT,
    DateIn DATETIME NOT NULL,
    PRIMARY KEY (TagId, DataType, SeqNr)
    );

    It contains telemetry data from tags that can deliver multiple kinds of data. Each time new data is received, the data is stored in this table.

    What I want is pretty simple: from all telemetry data series (tagid/datatype tuples) I want to see the latest measurement. I've succeeded using a single query but at a terrible speed. Not fast enough for user-interactivity. The performance really drops as the series get larger.

    I've tried LEFT JOIN queries in MySQL 4.0 and sub-select queries in MySQL 4.1-alpha (can't wait for it to be stable!)

    The sub-select is so much easier to read than the JOIN queries...:
    SELECT t1.tagid, t1.datatype, t1.value
    FROM Telemetry t1
    WHERE t1.seqnr = (
    SELECT MAX(t2.seqnr)
    FROM Telemetry t2
    WHERE t1.tagid=t2.tagid
    AND t1.datatype=t2.datatype
    );

    Is there a way to optimize for performance? Right now I'm using a workaround by iterating over all tuples "manually" and each time executing a query that sorts on the SeqNr descending and limiting the output to 1 row. To me, this seems ugly but it works.

    However I've got more complex queries coming up where I've got to do combine multiple of these series. Doing multiple queries "manually" would result in a lot of queries (not linear with the number of series that must be combined). So really need a better solution! Help!

    As a human, it's easy to see that the primary key can be used to find the entries that I'm looking for. Do a 'group by tagid, datatype' and take the one with the highest seqnr. Using that intelligence would make the query's performance linear with the number of series. Is there a way to make MySQL see things the way I do?


    Thanx!
    Mark

    PS. This seems to me as a very common problem that other people must have dealt with. Are there something like design/implementation patterns for databases?
    "The sub-select is so much easier to read than the JOIN queries...:" - Yes but also could be slower...
    What the explain plan says?
    Last edited by ika; 03-18-04 at 17:52.

  3. #3
    Join Date
    Mar 2004
    Location
    Eindhoven, Netherlands
    Posts
    8

    Re: Performance issue retrieving latest data

    Originally posted by ika
    "The sub-select is so much easier to read than the JOIN queries...:" - Yes but also could be slower...
    What the explain plan says?
    I'm sorry, I don't understand your question. What do you mean with "explain plan"?

  4. #4
    Join Date
    Oct 2003
    Location
    Slovakia
    Posts
    482

    Re: Performance issue retrieving latest data

    Originally posted by markoez
    I'm sorry, I don't understand your question. What do you mean with "explain plan"?
    Look at: http://www.mysql.com/doc/en/EXPLAIN.html
    also i recomend to you look at: http://www.mysql.com/doc/en/Query_Speed.html

  5. #5
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    This may be a silly question, but do you have an index on the received column, or better yet (id, received)?

    -PatP

  6. #6
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002

    Re: Performance issue retrieving latest data

    Originally posted by markoez
    As a human, it's easy to see that the primary key can be used to find the entries that I'm looking for. Do a 'group by tagid, datatype' and take the one with the highest seqnr. Using that intelligence would make the query's performance linear with the number of series. Is there a way to make MySQL see things the way I do?
    yes, the "group by" you describe is exactly how you solve this problem in the join approach
    PHP Code:
    select t1.TagID 
         
    t1.DataType
         
    t1.Value
      from Telemetry t1
    inner
      join Telemetry t2
        on t1
    .TagID    t2.TagID
       
    and t1.DataType t2.DataType
    group
        by t1
    .TagID
         
    t1.DataType
         
    t1.Value
    having t1
    .SeqNbr max(t2.SeqNbr
    once you see how this works, you will immediately recognize that it is the same as the subquery

    now, whether you find the join approach or the subquery approach easier to understand, is entirely up to you...

    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  7. #7
    Join Date
    Mar 2004
    Location
    Eindhoven, Netherlands
    Posts
    8

    Talking Re: Performance issue retrieving latest data

    Thanks everybody and especially r937. This is really the query I was looking for. I had found the GROUP BY part myself but didn't know how to extract the data from the table. Obviously, the JOIN and HAVING parts do the trick.

    And Ika, I'll definitely look at the articles you mentioned.


    Thanks all!

  8. #8
    Join Date
    Mar 2004
    Location
    Eindhoven, Netherlands
    Posts
    8

    Re: Performance issue retrieving latest data

    Looking at the query I was convinced it was going to work. But I was too optimistic. I get errors I don't know how to solve. I had to rework the query somewhat (at least I thought I had to). My query:
    PHP Code:
    select t2.TagID 
         
    t2.DataType
         
    t2.Value
      from Telemetry t1
    inner
      join Telemetry t2
        on t1
    .TagID    t2.TagID
       
    and t1.DataType t2.DataType
    group
        by t1
    .TagID
         
    t1.DataType
    having max
    (t1.SeqNr) = t2.SeqNr 
    Running this query I get: 'ERROR 1054: Unknown column 't2.SeqNr' in 'having clause''. And yes, there really is a column named SeqNr. When I change the HAVING clause into 'where t2.SeqNr = max(t1.SeqNr)' and put it before the 'group by' then I get the error 'ERROR 1111: Invalid use of group function'...

    Help! How do I link tables t1 and t2 together using max(t1.SeqNr)?

  9. #9
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    Code:
    ...
    group
        by t1.TagID
         , t1.DataType
         , t1.Value
    having t1.SeqNr = max(t2.SeqNr)
    first, the GROUP BY must contain all the columns of the SELECT

    second, since you are grouping by t1 columns, you must apply the aggegate functions only to t2 columns
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  10. #10
    Join Date
    Mar 2004
    Location
    Eindhoven, Netherlands
    Posts
    8
    first, the GROUP BY must contain all the columns of the SELECT
    I don't understand that. A telemetry series consists of telemetry values that are received from a certain tag with a certain datatype. Each measurement is made unique by the sequence number. I don't understand why the column value must be used in the group. For instance, if a telemetry series follows the temperature of your fridge it could follow a sinus-like pattern between 4 degrees Celcius and 7 degrees Celcius. The group by part should result in the latest temperature that was measured (the highest SeqNr of that series) and not in the highest sequence numbers where the temperature was last 4 degrees, 5 degrees, 6 degrees and 7 degrees. As I see it, that's what adding the Value culumn to the group by would do.

    second, since you are grouping by t1 columns, you must apply the aggegate functions only to t2 columns
    That was a typo. I definitely tried 'having t2.SeqNr = max(t1.SeqNr)'. It resulted in the same 'ERROR 1054: Unknown column 't2.SeqNr' in 'having clause'' error. Could it be that t2 may not be used in the HAVING clause at all? Can this be MySQL specific?

    I do think I understand how GROUP BY/HAVING can be used but I've only seen very simple use of it. Do you know a web-resource where I can read up on more advance use of this functionality?

    Maybe I don't see the strategy of your query correctly. This is how I see it. First, using the GROUP BY and MAX functionality we retrieve the highest sequence numbers of all telemetry series. Second, the obtained result is JOINed with another instance of the table to retrieve the telemetry values.

  11. #11
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    in a GROUP BY query, the SELECT list may contain column expressions and/or aggregate expressions

    the GROUP BY must include all non-aggregate columns in the SELECT list

    mysql does not enforce this but let's not go there for now,m that's a different thread and you are not ready for it

    so if you have SELECT A,B,SUM(C) then you must GROUP BY A,B

    in your case, the query i gave you joins each row of the table to every other row with the same TagID and DataType

    the joining operation is actually performed before the grouping

    in order to keep column names separate, you need aliases

    it may help to visualize this as two "copies" of the table being joined, a "t1" copy and a "t2" copy

    for each grouping (i.e. each row of the t1 copy of the table), the highest SeqNbr of all the rows from the t2 copy is selected in the HAVING clause

    note you can only apply aggregate functions in the HAVING clause to the t2 columns

    helps?
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  12. #12
    Join Date
    Mar 2004
    Location
    Eindhoven, Netherlands
    Posts
    8

    Talking

    helps?
    Yes, it does! Thank you very much. I think I've got a much better understanding of GROUP BY queries now and the way they can be used. I very much appriciate your time and patience with me.

    But I'm not where I want to be yet. Your query was near perfect. As I mentioned earlier, your query returned the error that t1.SeqNr was not known in the HAVING clause.

    With my renewed insight I tried to fix the problem. I think I got rid of the error by adding t1.SeqNr to the SELECT and GROUP BY clause:
    Code:
    select t1.TagID 
         , t1.DataType
         , t1.SeqNr
         , t1.Value
      from Telemetry t1
    inner
      join Telemetry t2
        on t1.TagID    = t2.TagID
       and t1.DataType = t2.DataType
    group
        by t1.TagID
         , t1.DataType
         , t1.Value
         , t1.SeqNr
    having t1.SeqNr = MAX(t2.SeqNr)
    This query seems to give me the answers I want (although I don't need the t1.SeqNr culumn but alas). However the reason I started this thread was that I had a performance problem and with this query I still do (having only an index on (TagId, Datatype, SeqNr) which should do the trick).

    I've got a test table with 2 tags with each 2 datatypes with each a 1000 values (and sequence numbers of course). Running the query above takes about 22 seconds. Really

    Doing this query:
    Code:
    select t1.TagID, t1.DataType, max(SeqNr)
    from Telemetry t1
    group by t1.TagID, t1.DataType
    takes about 0.02 seconds. It gives the key to the values I need to retrieve. Creating another query that uses these results and returns the values should take about the same amount of time. I had hoped that it would be possible to create a single query that would do the same things in the same amount of time.

    Do you still think it is possible to get that kind of performance with a single query?

  13. #13
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    you are right, SeqNbr did belong in both the SELECT and GROUP BY

    see, that's why i always advise people to test the sql i come up with, because i can't



    as far as performance goes, hey, if subqueries were allowed, you could time them and see

    there are several ways to "work around" not being able to write subqueries

    your simple query, the one that takes 0.02 seconds, all you need to do is join those results back to the same table, right? that's exactly how a subquery works under the covers, no?

    okay, so one workaround is to store those results as a temporary table, and run a second query

    just as you suggested

    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  14. #14
    Join Date
    Mar 2004
    Location
    Eindhoven, Netherlands
    Posts
    8
    your simple query, the one that takes 0.02 seconds, all you need to do is join those results back to the same table, right?
    True.
    that's exactly how a subquery works under the covers, no?
    You would expect that wouldn't you? Well, in the current MySQL 4.1-alpha this is not the case. At the start of the thread I gave my subquery-query. It is has the same performance issues... But maybe the optimization engine isn't optimal yet. The way to do something like this with a sub-query would be doing something like 'someIdField IN (SELECT ...)'. However when the key is not a single column, I don't know of a way to match multiple columns from the result of a single subquery (in my case TagId, DataType and SeqNr).

    okay, so one workaround is to store those results as a temporary table, and run a second query

    just as you suggested

    Always nice to see a guru agree with you. But really, since I'm quite the novice (as far as SQL goes) and the problem seems to trivial. I'ld expect that thousands of people to have ran into this problem before.

    I want to show you a totally different solution somebody else suggested to me. It's not a generic solution and I can't use it in this manner since my Value is a float but I guess it can be used in a couple of cases:
    Code:
    SELECT TagID, DataType, MAX(SeqNr), MAX(1000000*SeqNr+Value)%1000000 
    FROM Telemetry 
    GROUP BY TagID, DataType

  15. #15
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    oh that's just ugly

    (there's a "max concat" trick documented on the mysql site, maybe this came from there)

    as far as the subselect syntax goes, you said 'someIdField IN (SELECT ...)' but that's not the only way to do it

    you could use a correlated subquery, which would allow you to join on multiple columns
    PHP Code:
    select TagID 
         
    DataType
         
    SeqNr
         
    Value
      from Telemetry t1
     where SeqNr 

           ( 
    select max(SeqNr)
               
    from Telemetry 
              where TagID 
    t1.TagID
                
    and DataType t1.DataType 
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •