Results 1 to 5 of 5

Thread: Query Speed

  1. #1
    Join Date
    Apr 2007
    Posts
    130

    Unanswered: Query Speed

    Hi All,

    The execution time for the following query statement exceeds 25s. Please tell me of where I am wrong

    I have 3 tables DocsFeature, CandidateDocs, ExampleDoc.

    Generally, the query is used to find similar documents which are relevant to an example document. The example document is represented by a set of feature words listed in table ExampleDoc. Only those documents are SELECTED that share at least one word with the example document and whose DocID appear in CandidateDocs.

    1.
    DocsFeature contains feature words extracted from a collection of documents.

    DocsFeature has the following fields
    Code:
    DocID  char(8) (Serial Number)
    Word   varchar(20) (Feature words)
    Frequent  tinyint   (Occurrence times for a word in this document)
    Tag     char(2)     (v for verb, n for noun)
    DocsFeature has index on DocID and Word

    sample records are as follows
    Code:
    ....
    A111    Computer     9    n (Document A111 contains 'Computer',and it occurres 9 times)
    C27      Dolphin         5   n (Document C27 contains 'Dolphin',and it occurres 5 times)
    
    D21      Food            2    n (Document D21 contains 'Food',and it occurres 2 times)
    
    .....

    2.

    CandidateDocs contains a list of DocID, having only one field

    Code:
    DocID   char(8) (Serial Number)
    CandidateDocs has index on DocID

    3. ExampleDoc fields
    Code:
    Word   varchar(20) (Feature words)
    Frequent  tinyint   (Occurrence times for a word in the example doc)
    Tag     char(2)     (v for verb, n for noun)
    ExampleDoc has index on Word.

    I do the following query
    Code:
    SELECT 
           a.DocID, SUM(a.Frequent + e.Frequent) AS Freq 
    
    FROM 
            DocsFeature a, ExampleDoc e, CandidateDocs b 
    WHERE
            a.Word = e.Word     AND  a.DocID  = b.DocID
    There are 10 million records in CandidateDocs, about 90 records in ExampleDoc, and 20 thousand records in CandidateDocs.

    In this case, It would takes 25s.

    Please help me to improve the eifficiency of this statement.
    any help will be appreciated very much.

  2. #2
    Join Date
    Jun 2007
    Location
    London
    Posts
    2,527
    I didn't read your text but I did notice that the group by was missing:
    Code:
    SELECT a.DocID, SUM(a.Frequent + e.Frequent) AS Freq 
    FROM   DocsFeature a, ExampleDoc e, CandidateDocs b 
    WHERE  a.Word = e.Word     AND  a.DocID  = b.DocID
    GROUP  BY a.DocID
    You could make it a lot faster by restricting what you're showing rather than just joining every record to every other record. If you have 10m records in your tables it might be worth investing a little time and getting a little stronger at SQL.

    PS it would have been more helpful to simply have the create table and index statements.

  3. #3
    Join Date
    Apr 2007
    Posts
    130
    Quote Originally Posted by mike_bike_kite
    I didn't read your text but I did notice that the group by was missing:
    Code:
    SELECT a.DocID, SUM(a.Frequent + e.Frequent) AS Freq 
    FROM   DocsFeature a, ExampleDoc e, CandidateDocs b 
    WHERE  a.Word = e.Word     AND  a.DocID  = b.DocID
    GROUP  BY a.DocID
    You could make it a lot faster by restricting what you're showing rather than just joining every record to every other record. If you have 10m records in your tables it might be worth investing a little time and getting a little stronger at SQL.

    PS it would have been more helpful to simply have the create table and index statements.
    Thanks, mike_bike_kite.
    Yes, I forgot the trailing 'GROUP BY a.DocID'. I just typed those words from memory.

    the following are the create statements for DocsFeature and ExampleDoc,

    Code:
    CREATE TABLE `DocsFeature` (
      `DocID` varchar(8) COLLATE latin1_bin NOT NULL,
      `Word` varchar(20) COLLATE latin1_bin NOT NULL,
      `Freq` tinyint(3) unsigned NOT NULL DEFAULT '0',
      `TitleWord` tinyint(1) NOT NULL,
      `CommonWord` tinyint(1) NOT NULL,
      `NrNsNtNz` char(2) COLLATE latin1_bin NOT NULL,
      KEY `idx_DocID` (`DocID`),
      KEY `idx_Word` (`Word`),
      KEY `idx_NrNsNtNz_TitleWord` (`NrNsNtNz`,`TitleWord`),
      KEY `idx_TitleWord` (`TitleWord`)
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_bin | 
    
    
     CREATE TABLE `ExampleDoc` (
      `Word` varchar(20) DEFAULT NULL,
      `Freq` tinyint(3) unsigned NOT NULL DEFAULT '0',
      `TitleWord` tinyint(1) DEFAULT NULL,
      `CommonWord` tinyint(1) DEFAULT NULL,
      `NrNsNtNz` char(2) DEFAULT NULL,
      KEY `idx_Word` (`Word`),
      KEY `idx_NrNsNtNz` (`NrNsNtNz`)
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1
    As for CandidateDocs, it has only one column, DocID. It is created as a temp table on the fly when the program is running.

    The SQL statement which I am complaining is as follows
    Code:
    CREATE TABLE TempTable_1   
       AS SELECT   
          DocsFeature.DocID AS DocID, SUM( DocsFeature.Freq+  ExampleDoc.Freq) Score  
       FROM  
           DocsFeature, ExampleDoc, CandidateDocs
        WHERE 
             DocsFeature.Word = ExampleDoc.Word  
            AND CandidateDocs.DocID =  DocsFeature.DocID  
        group by 
                   DocsFeature.DocID
        HAVING COUNT(DocID) > 1;
    That is, I need to put the query result into a temp table.

    I have search and study many MySQL optimazition articals. However, up to now, i have not gain any significant improvement. Anyone can help me.
    Last edited by cy163; 08-13-09 at 11:44.

  4. #4
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    how long have you been working on this?

    i seem to recall helping you with it several times over what seems like several years

    have you learned how to do an EXPLAIN yet?
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  5. #5
    Join Date
    Apr 2007
    Posts
    130
    The EXPLAIN result is as follows

    Code:
    +----+-------------+----------------+------+------------------------+------------------------+---------+-------------------------------+------+---------------------------------+
    | id | select_type | table          | type | possible_keys          | key                    | key_len | ref                           | rows | Extra                           |
    +----+-------------+----------------+------+------------------------+------------------------+---------+-------------------------------+------+---------------------------------+
    |  1 | SIMPLE      | ExampleDoc     | ALL  | idx_Word               | NULL                   | NULL    | NULL                          |  102 | Using temporary; Using filesort | 
    |  1 | SIMPLE      | DocsFeature    | ref  | idx_DocID,idx_Word     | idx_Word               | 22      | MANAGERIE.ExampleDoc.Word  |   61 | Using where                     | 
    |  1 | SIMPLE      | CandidateDocs  | ref  | idx_DocIDCandidateDocs | idx_DocIDCandidateDocs | 8       | MANAGERIE.DocsFeature.DocID   |    1 | Using where; Using index        | 
    +----+-------------+----------------+------+------------------------+------------------------+---------+-------------------------------+------+---------------------------------+
    Is it possible for me to further improve the query?
    Last edited by cy163; 08-12-09 at 11:03.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •