If this is your first visit, be sure to check out the FAQ by clicking the link above. You may have to register before you can post: click the register link above to proceed. To start viewing messages, select the forum that you want to visit from the selection below.

 
Go Back  dBforums > Database Server Software > MySQL > Performance issue retrieving latest data

Reply
 
LinkBack Thread Tools Search this Thread Display Modes
  #1 (permalink)  
Old 03-18-04, 12:32
markoez markoez is offline
Registered User
 
Join Date: Mar 2004
Location: Eindhoven, Netherlands
Posts: 8
Performance issue retrieving latest data

Hi,


I've got this table:
CREATE TABLE Telemetry (
TagID INT NOT NULL,
DataType INT NOT NULL REFERENCES DataTypes,
SeqNr INT AUTO_INCREMENT,
Value FLOAT,
DateIn DATETIME NOT NULL,
PRIMARY KEY (TagId, DataType, SeqNr)
);

It contains telemetry data from tags that can deliver multiple kinds of data. Each time new data is received, the data is stored in this table.

What I want is pretty simple: from all telemetry data series (tagid/datatype tuples) I want to see the latest measurement. I've succeeded using a single query but at a terrible speed. Not fast enough for user-interactivity. The performance really drops as the series get larger.

I've tried LEFT JOIN queries in MySQL 4.0 and sub-select queries in MySQL 4.1-alpha (can't wait for it to be stable!)

The sub-select is so much easier to read than the JOIN queries...:
SELECT t1.tagid, t1.datatype, t1.value
FROM Telemetry t1
WHERE t1.seqnr = (
SELECT MAX(t2.seqnr)
FROM Telemetry t2
WHERE t1.tagid=t2.tagid
AND t1.datatype=t2.datatype
);

Is there a way to optimize for performance? Right now I'm using a workaround by iterating over all tuples "manually" and each time executing a query that sorts on the SeqNr descending and limiting the output to 1 row. To me, this seems ugly but it works.

However I've got more complex queries coming up where I've got to do combine multiple of these series. Doing multiple queries "manually" would result in a lot of queries (not linear with the number of series that must be combined). So really need a better solution! Help!

As a human, it's easy to see that the primary key can be used to find the entries that I'm looking for. Do a 'group by tagid, datatype' and take the one with the highest seqnr. Using that intelligence would make the query's performance linear with the number of series. Is there a way to make MySQL see things the way I do?


Thanx!
Mark

PS. This seems to me as a very common problem that other people must have dealt with. Are there something like design/implementation patterns for databases?
Reply With Quote
  #2 (permalink)  
Old 03-18-04, 16:50
ika ika is offline
Registered User
 
Join Date: Oct 2003
Location: Slovakia
Posts: 482
Re: Performance issue retrieving latest data

Quote:
Originally posted by markoez
Hi,


I've got this table:
CREATE TABLE Telemetry (
TagID INT NOT NULL,
DataType INT NOT NULL REFERENCES DataTypes,
SeqNr INT AUTO_INCREMENT,
Value FLOAT,
DateIn DATETIME NOT NULL,
PRIMARY KEY (TagId, DataType, SeqNr)
);

It contains telemetry data from tags that can deliver multiple kinds of data. Each time new data is received, the data is stored in this table.

What I want is pretty simple: from all telemetry data series (tagid/datatype tuples) I want to see the latest measurement. I've succeeded using a single query but at a terrible speed. Not fast enough for user-interactivity. The performance really drops as the series get larger.

I've tried LEFT JOIN queries in MySQL 4.0 and sub-select queries in MySQL 4.1-alpha (can't wait for it to be stable!)

The sub-select is so much easier to read than the JOIN queries...:
SELECT t1.tagid, t1.datatype, t1.value
FROM Telemetry t1
WHERE t1.seqnr = (
SELECT MAX(t2.seqnr)
FROM Telemetry t2
WHERE t1.tagid=t2.tagid
AND t1.datatype=t2.datatype
);

Is there a way to optimize for performance? Right now I'm using a workaround by iterating over all tuples "manually" and each time executing a query that sorts on the SeqNr descending and limiting the output to 1 row. To me, this seems ugly but it works.

However I've got more complex queries coming up where I've got to do combine multiple of these series. Doing multiple queries "manually" would result in a lot of queries (not linear with the number of series that must be combined). So really need a better solution! Help!

As a human, it's easy to see that the primary key can be used to find the entries that I'm looking for. Do a 'group by tagid, datatype' and take the one with the highest seqnr. Using that intelligence would make the query's performance linear with the number of series. Is there a way to make MySQL see things the way I do?


Thanx!
Mark

PS. This seems to me as a very common problem that other people must have dealt with. Are there something like design/implementation patterns for databases?
"The sub-select is so much easier to read than the JOIN queries...:" - Yes but also could be slower...
What the explain plan says?

Last edited by ika; 03-18-04 at 16:52.
Reply With Quote
  #3 (permalink)  
Old 03-19-04, 03:31
markoez markoez is offline
Registered User
 
Join Date: Mar 2004
Location: Eindhoven, Netherlands
Posts: 8
Re: Performance issue retrieving latest data

Quote:
Originally posted by ika
"The sub-select is so much easier to read than the JOIN queries...:" - Yes but also could be slower...
What the explain plan says?
I'm sorry, I don't understand your question. What do you mean with "explain plan"?
Reply With Quote
  #4 (permalink)  
Old 03-19-04, 14:14
ika ika is offline
Registered User
 
Join Date: Oct 2003
Location: Slovakia
Posts: 482
Re: Performance issue retrieving latest data

Quote:
Originally posted by markoez
I'm sorry, I don't understand your question. What do you mean with "explain plan"?
Look at: http://www.mysql.com/doc/en/EXPLAIN.html
also i recomend to you look at: http://www.mysql.com/doc/en/Query_Speed.html
Reply With Quote
  #5 (permalink)  
Old 03-19-04, 14:50
Pat Phelan Pat Phelan is offline
Resident Curmudgeon
 
Join Date: Feb 2004
Location: In front of the computer
Posts: 12,606
This may be a silly question, but do you have an index on the received column, or better yet (id, received)?

-PatP
Reply With Quote
  #6 (permalink)  
Old 03-20-04, 08:08
r937 r937 is offline
SQL Consultant
 
Join Date: Apr 2002
Location: Toronto, Canada
Posts: 19,525
Re: Performance issue retrieving latest data

Quote:
Originally posted by markoez
As a human, it's easy to see that the primary key can be used to find the entries that I'm looking for. Do a 'group by tagid, datatype' and take the one with the highest seqnr. Using that intelligence would make the query's performance linear with the number of series. Is there a way to make MySQL see things the way I do?
yes, the "group by" you describe is exactly how you solve this problem in the join approach
PHP Code:
select t1.TagID 
     
t1.DataType
     
t1.Value
  from Telemetry t1
inner
  join Telemetry t2
    on t1
.TagID    t2.TagID
   
and t1.DataType t2.DataType
group
    by t1
.TagID
     
t1.DataType
     
t1.Value
having t1
.SeqNbr max(t2.SeqNbr
once you see how this works, you will immediately recognize that it is the same as the subquery

now, whether you find the join approach or the subquery approach easier to understand, is entirely up to you...

__________________
r937.com | rudy.ca
please visit Simply SQL and buy my book
Reply With Quote
  #7 (permalink)  
Old 03-22-04, 03:39
markoez markoez is offline
Registered User
 
Join Date: Mar 2004
Location: Eindhoven, Netherlands
Posts: 8
Talking Re: Performance issue retrieving latest data

Thanks everybody and especially r937. This is really the query I was looking for. I had found the GROUP BY part myself but didn't know how to extract the data from the table. Obviously, the JOIN and HAVING parts do the trick.

And Ika, I'll definitely look at the articles you mentioned.


Thanks all!
Reply With Quote
  #8 (permalink)  
Old 03-22-04, 05:36
markoez markoez is offline
Registered User
 
Join Date: Mar 2004
Location: Eindhoven, Netherlands
Posts: 8
Re: Performance issue retrieving latest data

Looking at the query I was convinced it was going to work. But I was too optimistic. I get errors I don't know how to solve. I had to rework the query somewhat (at least I thought I had to). My query:
PHP Code:
select t2.TagID 
     
t2.DataType
     
t2.Value
  from Telemetry t1
inner
  join Telemetry t2
    on t1
.TagID    t2.TagID
   
and t1.DataType t2.DataType
group
    by t1
.TagID
     
t1.DataType
having max
(t1.SeqNr) = t2.SeqNr 
Running this query I get: 'ERROR 1054: Unknown column 't2.SeqNr' in 'having clause''. And yes, there really is a column named SeqNr. When I change the HAVING clause into 'where t2.SeqNr = max(t1.SeqNr)' and put it before the 'group by' then I get the error 'ERROR 1111: Invalid use of group function'...

Help! How do I link tables t1 and t2 together using max(t1.SeqNr)?
Reply With Quote
  #9 (permalink)  
Old 03-22-04, 06:22
r937 r937 is offline
SQL Consultant
 
Join Date: Apr 2002
Location: Toronto, Canada
Posts: 19,525
Code:
...
group
    by t1.TagID
     , t1.DataType
     , t1.Value
having t1.SeqNr = max(t2.SeqNr)
first, the GROUP BY must contain all the columns of the SELECT

second, since you are grouping by t1 columns, you must apply the aggegate functions only to t2 columns
__________________
r937.com | rudy.ca
please visit Simply SQL and buy my book
Reply With Quote
  #10 (permalink)  
Old 03-22-04, 07:49
markoez markoez is offline
Registered User
 
Join Date: Mar 2004
Location: Eindhoven, Netherlands
Posts: 8
Quote:
first, the GROUP BY must contain all the columns of the SELECT
I don't understand that. A telemetry series consists of telemetry values that are received from a certain tag with a certain datatype. Each measurement is made unique by the sequence number. I don't understand why the column value must be used in the group. For instance, if a telemetry series follows the temperature of your fridge it could follow a sinus-like pattern between 4 degrees Celcius and 7 degrees Celcius. The group by part should result in the latest temperature that was measured (the highest SeqNr of that series) and not in the highest sequence numbers where the temperature was last 4 degrees, 5 degrees, 6 degrees and 7 degrees. As I see it, that's what adding the Value culumn to the group by would do.

Quote:
second, since you are grouping by t1 columns, you must apply the aggegate functions only to t2 columns
That was a typo. I definitely tried 'having t2.SeqNr = max(t1.SeqNr)'. It resulted in the same 'ERROR 1054: Unknown column 't2.SeqNr' in 'having clause'' error. Could it be that t2 may not be used in the HAVING clause at all? Can this be MySQL specific?

I do think I understand how GROUP BY/HAVING can be used but I've only seen very simple use of it. Do you know a web-resource where I can read up on more advance use of this functionality?

Maybe I don't see the strategy of your query correctly. This is how I see it. First, using the GROUP BY and MAX functionality we retrieve the highest sequence numbers of all telemetry series. Second, the obtained result is JOINed with another instance of the table to retrieve the telemetry values.
Reply With Quote
  #11 (permalink)  
Old 03-22-04, 08:24
r937 r937 is offline
SQL Consultant
 
Join Date: Apr 2002
Location: Toronto, Canada
Posts: 19,525
in a GROUP BY query, the SELECT list may contain column expressions and/or aggregate expressions

the GROUP BY must include all non-aggregate columns in the SELECT list

mysql does not enforce this but let's not go there for now,m that's a different thread and you are not ready for it

so if you have SELECT A,B,SUM(C) then you must GROUP BY A,B

in your case, the query i gave you joins each row of the table to every other row with the same TagID and DataType

the joining operation is actually performed before the grouping

in order to keep column names separate, you need aliases

it may help to visualize this as two "copies" of the table being joined, a "t1" copy and a "t2" copy

for each grouping (i.e. each row of the t1 copy of the table), the highest SeqNbr of all the rows from the t2 copy is selected in the HAVING clause

note you can only apply aggregate functions in the HAVING clause to the t2 columns

helps?
__________________
r937.com | rudy.ca
please visit Simply SQL and buy my book
Reply With Quote
  #12 (permalink)  
Old 03-22-04, 10:18
markoez markoez is offline
Registered User
 
Join Date: Mar 2004
Location: Eindhoven, Netherlands
Posts: 8
Talking

Quote:
helps?
Yes, it does! Thank you very much. I think I've got a much better understanding of GROUP BY queries now and the way they can be used. I very much appriciate your time and patience with me.

But I'm not where I want to be yet. Your query was near perfect. As I mentioned earlier, your query returned the error that t1.SeqNr was not known in the HAVING clause.

With my renewed insight I tried to fix the problem. I think I got rid of the error by adding t1.SeqNr to the SELECT and GROUP BY clause:
Code:
select t1.TagID 
     , t1.DataType
     , t1.SeqNr
     , t1.Value
  from Telemetry t1
inner
  join Telemetry t2
    on t1.TagID    = t2.TagID
   and t1.DataType = t2.DataType
group
    by t1.TagID
     , t1.DataType
     , t1.Value
     , t1.SeqNr
having t1.SeqNr = MAX(t2.SeqNr)
This query seems to give me the answers I want (although I don't need the t1.SeqNr culumn but alas). However the reason I started this thread was that I had a performance problem and with this query I still do (having only an index on (TagId, Datatype, SeqNr) which should do the trick).

I've got a test table with 2 tags with each 2 datatypes with each a 1000 values (and sequence numbers of course). Running the query above takes about 22 seconds. Really

Doing this query:
Code:
select t1.TagID, t1.DataType, max(SeqNr)
from Telemetry t1
group by t1.TagID, t1.DataType
takes about 0.02 seconds. It gives the key to the values I need to retrieve. Creating another query that uses these results and returns the values should take about the same amount of time. I had hoped that it would be possible to create a single query that would do the same things in the same amount of time.

Do you still think it is possible to get that kind of performance with a single query?
Reply With Quote
  #13 (permalink)  
Old 03-22-04, 11:05
r937 r937 is offline
SQL Consultant
 
Join Date: Apr 2002
Location: Toronto, Canada
Posts: 19,525
you are right, SeqNbr did belong in both the SELECT and GROUP BY

see, that's why i always advise people to test the sql i come up with, because i can't



as far as performance goes, hey, if subqueries were allowed, you could time them and see

there are several ways to "work around" not being able to write subqueries

your simple query, the one that takes 0.02 seconds, all you need to do is join those results back to the same table, right? that's exactly how a subquery works under the covers, no?

okay, so one workaround is to store those results as a temporary table, and run a second query

just as you suggested

__________________
r937.com | rudy.ca
please visit Simply SQL and buy my book
Reply With Quote
  #14 (permalink)  
Old 03-22-04, 11:30
markoez markoez is offline
Registered User
 
Join Date: Mar 2004
Location: Eindhoven, Netherlands
Posts: 8
Quote:
your simple query, the one that takes 0.02 seconds, all you need to do is join those results back to the same table, right?
True.
Quote:
that's exactly how a subquery works under the covers, no?
You would expect that wouldn't you? Well, in the current MySQL 4.1-alpha this is not the case. At the start of the thread I gave my subquery-query. It is has the same performance issues... But maybe the optimization engine isn't optimal yet. The way to do something like this with a sub-query would be doing something like 'someIdField IN (SELECT ...)'. However when the key is not a single column, I don't know of a way to match multiple columns from the result of a single subquery (in my case TagId, DataType and SeqNr).

Quote:
okay, so one workaround is to store those results as a temporary table, and run a second query

just as you suggested

Always nice to see a guru agree with you. But really, since I'm quite the novice (as far as SQL goes) and the problem seems to trivial. I'ld expect that thousands of people to have ran into this problem before.

I want to show you a totally different solution somebody else suggested to me. It's not a generic solution and I can't use it in this manner since my Value is a float but I guess it can be used in a couple of cases:
Code:
SELECT TagID, DataType, MAX(SeqNr), MAX(1000000*SeqNr+Value)%1000000 
FROM Telemetry 
GROUP BY TagID, DataType
Reply With Quote
  #15 (permalink)  
Old 03-22-04, 11:36
r937 r937 is offline
SQL Consultant
 
Join Date: Apr 2002
Location: Toronto, Canada
Posts: 19,525
oh that's just ugly

(there's a "max concat" trick documented on the mysql site, maybe this came from there)

as far as the subselect syntax goes, you said 'someIdField IN (SELECT ...)' but that's not the only way to do it

you could use a correlated subquery, which would allow you to join on multiple columns
PHP Code:
select TagID 
     
DataType
     
SeqNr
     
Value
  from Telemetry t1
 where SeqNr 

       ( 
select max(SeqNr)
           
from Telemetry 
          where TagID 
t1.TagID
            
and DataType t1.DataType 
__________________
r937.com | rudy.ca
please visit Simply SQL and buy my book
Reply With Quote
Reply

Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On