If this is your first visit, be sure to check out the FAQ by clicking the link above. You may have to register before you can post: click the register link above to proceed. To start viewing messages, select the forum that you want to visit from the selection below.

 
Go Back  dBforums > Database Server Software > MySQL > Terrible performance in a join ?

Reply
 
LinkBack Thread Tools Search this Thread Display Modes
  #1 (permalink)  
Old 07-18-09, 17:37
therunner therunner is offline
Registered User
 
Join Date: Jul 2009
Posts: 6
Terrible performance in a join ?

I have a system that logs different data from 2 different type of devices (we'll call them A and B). A logs data every 5 minutes, B logs every 15 minutes.
Once a month, the data of these 2 devices is used in a calculation. Basically, the value of each set of data from A is multiplied by the value of B.
However, since the timestamps are different, there are 3 data rows from A for each data row from B.

This is what the tables look like :
A
-
id int not null (PK)
name varchar(30)
B_id int not null (indexed)

Adata
-----
A_id int not null (PK)
logdatetime timestamp not null (PK)
data decimal(7,3)

B
-
id int not null (PK)
name varchar(30)

Bdata
-----
B_id int not null (PK)
logdatetime timestamp not null (PK)
data decimal(7,3)


The joins are below in the first reply.

Both take a long time, although we're talking about only 8640 rows from Adata and 2880 from Bdata.
Even when getting just 3 days (instead of 1 month), it takes 50 seconds !

This is all running on MySQL Cluster 7.0.6 on 2 data nodes and 2 MySQL nodes.

Does anyone have a clue what I'm doing wrong here ?

Last edited by therunner; 07-18-09 at 17:41.
Reply With Quote
  #2 (permalink)  
Old 07-18-09, 17:41
therunner therunner is offline
Registered User
 
Join Date: Jul 2009
Posts: 6
Formatted it a bit better :

Code:
select
    Adata.logdatetime,
    Adata.data,
    Bdata.data
from
    A join Adata
        on A.id = Adata.A_id
    left join Bdata
        on A.B_id = Bdata.B_id
        and
        floor(Adata.logdatetime / 900) * 900 = Bdata.logdatetime
where
    A.id = 280
    and
    Adata.logdatetime between "2009-06-01 00:00:00" and "2009-06-30 23:59:59"
Another one I tried :
Code:
select
    Adata.logdatetime,
    Adata.data,
    Bdata.data
from
    A join Adata
        on A.id = Adata.A_id
    left join Bdata
        on A.B_id = Bdata.B_id
        and
        Adata.logdatetime between Bdata.logdatetime and addtime(Bdata.logdatetime, "00:15:00")
where
    A.id = 280
    and
    Adata.logdatetime between "2009-06-01 00:00:00" and "2009-06-30 23:59:59"
Reply With Quote
  #3 (permalink)  
Old 07-19-09, 05:35
mike_bike_kite mike_bike_kite is offline
vaguely human
 
Join Date: Jun 2007
Location: London
Posts: 2,519
Quote:
Originally Posted by therunner
Does anyone have a clue what I'm doing wrong here ?
To be fair I haven't a clue what you're trying to do. I attempted to read your 1st post to see what you're trying to do but gave up. It would have helped if you described what you wanted in simple English rather than pseudo code, pseudo SQL or pseudo English. However I doubt if MySQL would use any indexes on logdatetime as you're using a function on the field and not it's actual value. I doubt if that's the only issue here though.

Mike
Reply With Quote
  #4 (permalink)  
Old 07-19-09, 07:56
therunner therunner is offline
Registered User
 
Join Date: Jul 2009
Posts: 6
First time I hear someone calling my English pseudo-English. But anyways... here's an attempt to explain things in 'simple' English :

There's 2 sets of physical devices (pieces of hardware) :
- Devices of type A : they log data every 5 minutes
- Devices of type B : they log different data every 15 minutes
Nothing I can do to change that...

Every month, the customer wants to see a report. This report needs to list the data captured from both devices.
The data from device A needs to be multiplied by the data from device B. But since they don't log at the same time, this is a problem.
Therefore, I want to get this :
Code:
A      B
00:00 00:00
05:00 00:00
10:00 00:00
15:00 15:00
20:00 15:00
25:00 15:00
30:00 30:00
So what I want to do is link every data row from B with the 3 data rows from A that are within those 15 minutes. That way, I can multiply the data from A with the data from B.

I haven't found any other solution (other than storing the B data rows 3 times, but that's just plain silly).

I hope this explains what I'm trying to do a bit better. Any help would be much appreciated !
Reply With Quote
  #5 (permalink)  
Old 07-19-09, 11:12
Pat Phelan Pat Phelan is offline
Resident Curmudgeon
 
Join Date: Feb 2004
Location: In front of the computer
Posts: 12,609
Not all versions of MySQL will support this code because of the sub-select within it, but my first attempt would be:
Code:
SELECT *
   FROM A AS a
   JOIN AData AS ad
      ON (ad.a_id = a.id)
   JOIN BData AS bd
      ON (bd.b_id = a.b_id
      AND bd.logdatetime = (SELECT Max(x1.logdatetime)
         FROM AData AS z1
         WHERE  x1.a_id = a.id
            AND x1.logdatetime <= bd.logdatetime)
   WHERE  280 = a.id
      AND Adata.logdatetime BETWEEN "2009-06-01 00:00:00" AND "2009-06-30 23:59:59"
-PatP
__________________
In theory, theory and practice are identical. In practice, theory and practice are unrelated.
Reply With Quote
  #6 (permalink)  
Old 07-19-09, 12:25
mike_bike_kite mike_bike_kite is offline
vaguely human
 
Join Date: Jun 2007
Location: London
Posts: 2,519
I should of said simple English. Hopefully Pat's solution works for you but, if not, can you post the create table DDL with indexes and we'll have another look.
Reply With Quote
  #7 (permalink)  
Old 07-20-09, 09:14
therunner therunner is offline
Registered User
 
Join Date: Jul 2009
Posts: 6
I'm afraid that query doesn't work well. I stopped it after it ran for over 20 minutes.

As requested, the DDL is below. Note that 'A' in fact is called 'channel' and 'B' is called 'pb' :
Code:
CREATE TABLE `channel` (
  `channel_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `channel_name` varchar(25) DEFAULT NULL,
  `pb_id` int(11) DEFAULT NULL,
  `channel_lastdownload` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`channel_id`),
  KEY `channel_pb_id` (`pb_id`)
) ENGINE=ndbcluster DEFAULT CHARSET=latin1;

CREATE TABLE `channeldata` (
  `channeldata_datetime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `channel_id` int(10) unsigned NOT NULL,
  `channeldata_data` decimal(6,3) DEFAULT NULL,
  PRIMARY KEY (`channeldata_datetime`,`channel_id`),
  KEY `channeldata_FKIndex1` (`channel_id`)
) ENGINE=ndbcluster DEFAULT CHARSET=latin1;

CREATE TABLE `pb` (
  `pb_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `pb_name` varchar(25) DEFAULT NULL,
  `pb_lastdownload` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`pb_id`)
) ENGINE=ndbcluster DEFAULT CHARSET=latin1;

CREATE TABLE `pbdata` (
  `pbdata_datetime` datetime NOT NULL,
  `pb_id` int(10) unsigned NOT NULL,
  `pbdata_data` decimal(7,3) DEFAULT NULL,
  PRIMARY KEY (`pbdata_datetime`,`pb_id`),
  KEY `pbdata_FKIndex1` (`pb_id`)
) ENGINE=ndbcluster DEFAULT CHARSET=latin1;
Reply With Quote
  #8 (permalink)  
Old 07-20-09, 12:42
mike_bike_kite mike_bike_kite is offline
vaguely human
 
Join Date: Jun 2007
Location: London
Posts: 2,519
I'm afraid I have to nip out this evening but I think I can see the first issue. The field pb_id is defined differently in the tables - this will upset the optimiser in most RDBMSs.
  • pb_id int(11) DEFAULT NULL
  • pb_id int(10) unsigned NOT NULL
Also the pbdata_datetime field is defined as a datetime and not as a timestamp. I'd certainly start by standardising the definition of fields across tables.
Reply With Quote
  #9 (permalink)  
Old 07-20-09, 13:12
therunner therunner is offline
Registered User
 
Join Date: Jul 2009
Posts: 6
I changed pb_id to unsigned int(10) everywhere. Also changed to timestamp.
Makes no difference in performance tough...
Reply With Quote
  #10 (permalink)  
Old 07-21-09, 14:14
mike_bike_kite mike_bike_kite is offline
vaguely human
 
Join Date: Jun 2007
Location: London
Posts: 2,519
Quote:
Originally Posted by therunner
I changed pb_id to unsigned int(10) everywhere. Also changed to timestamp.
Makes no difference in performance tough...
The solutions already given look correct but as you've already noted the join is causing you issues. Can you create some intermediary tables that allow a proper join and then run your report on these tables?

If this doesn't improve things then seeing as the customer only wants to see the report once a month then why not just run the report overnight (perhaps off a snapshot of the live tables)?
Reply With Quote
  #11 (permalink)  
Old 07-21-09, 17:17
therunner therunner is offline
Registered User
 
Join Date: Jul 2009
Posts: 6
In fact, I think Pat's join isn't correct.
I don't see the join of data from channeldata and pbdata working at all... there's no reference to the 15 minute difference between the data ?
Reply With Quote
  #12 (permalink)  
Old 07-21-09, 18:02
healdem healdem is online now
Jaded Developer
 
Join Date: Nov 2004
Location: out on a limb
Posts: 9,260
be very very careful about which date datatype you choose.
timestamp isn't the right one in my books.. you want a datetime datatype.
using timestamp may well lead to data corruption......
__________________
I'd rather be riding my Versys or my Tiger 800 let alone the Norton
Reply With Quote
  #13 (permalink)  
Old 07-21-09, 20:22
mike_bike_kite mike_bike_kite is offline
vaguely human
 
Join Date: Jun 2007
Location: London
Posts: 2,519
Quote:
Originally Posted by therunner
In fact, I think Pat's join isn't correct.
I don't see the join of data from channeldata and pbdata working at all... there's no reference to the 15 minute difference between the data ?
I guess you have the advantage of being able to run the SQL against some real tables where as we have to try and just picture the results. We're not particularly helped by your initial description of the problem, then we have one set of table/field names in your SQL and then another totally different set in your DDL. And sadly neither set of names make any sense to me. So you can't expect instant answers to your problem.

You don't need to specify the 15 minute gap, in fact hard coding this into the SQL is probably a mistake. It's normally better to simply look for values between one timestamp and the next greatest timestamp. Also as healdem says, if the value your getting is a datetime value then you should really be storing it as a datetime value rather than declaring it as a timestamp then treating it as an integer one moment and a datetime the next.
Reply With Quote
  #14 (permalink)  
Old 07-22-09, 06:54
mike_bike_kite mike_bike_kite is offline
vaguely human
 
Join Date: Jun 2007
Location: London
Posts: 2,519
This is my attempt at the code. I've split it into a few subselects to try and give the optomiser a fair chance at working out what it should do first. I don't use the more modern JOIN syntax but that's simply because I'm ancient.
Code:
select channel.channel_name, pb.pb_name, timedata.channeldata_datetime,
        timedata.pbdata_data,  channeldata.channeldata_data
from  ( select channeldata.channeldata_datetime, channeldata.channel_id,
               channeldata.channeldata_data, t1.data
        from   ( select pbdata.pbdata_datetime as from_time, t2.to_time, 
                        pbdata.pb_id, pbdata.pbdata_data
                 from   pbdata,
                        ( select min(b2.pbdata_datetime) as to_time
                          from   pbdata b2
                          where  b2.pb_id = pbdata.pb_id
                                 and b2.pbdata_datetime > pbdata.pbdata_datetime
                                 ) t2 
                 where  pbdata.pbdata_datetime between "2009-06-01 00:00:00" 
                                                   and "2009-06-30 23:59:59"
                        ) t1,
                channeldata
        where   channeldata.channel_id = 280
                and channeldata.channeldata_datetime >= t1.from_time
                and channeldata.channeldata_datetime < t1.to_time
                ) timedata,
      channel,
      pb
where channel.channel_id = timedata.channel_id
      and pb.pb_id = timedata.pb_id
No doubt I've screwed up on the field names a few times, you can only type channeldata.channeldata_data a few times before I start to loose it. Sadly after after typing all the field names in above I've now run out of energy to go on about how to name fields. I guess if you're happy to type channeldata.channeldata_datetime each time then all's fine - that's presuming you remember it's not actually a datetime.

Mike
Reply With Quote
  #15 (permalink)  
Old 07-22-09, 09:43
dav1mo dav1mo is offline
Registered User
 
Join Date: Dec 2007
Location: Richmond, VA
Posts: 782
going back to your first shot at the SQL. How about something like:
select
Adata.logdatetime,
Adata.data,
Bdata.data
from
A join Adata
on A.id = Adata.A_id
and Adata.logdatetime between "2009-06-01 00:00:00" and "2009-06-30 23:59:59"
left join Bdata
on A.B_id = Bdata.B_id
and
Adata.logdatetime >= Bdata.logdatetime + 14 minutes
where
A.id = 280
Reply With Quote
Reply

Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On