  1. #1
    Unanswered: Performance issue

    I have a database in Postgres [8.3.3, RedHat WS4, Dual Core 2.4GHz, 4GB RAM] with two tables that look approximately like this:

    - A table of assets, each row with one column of unique-per-table values; think of it as a serial number

    - A table of events, each row with a column that references one of the assets

    I can't use referential integrity, because I hear about "new" assets by getting a row in the event table which doesn't have a corresponding asset yet. I have a little PHP script that runs periodically that tries to fill in the asset table with details from an external source, but it's an imperfect science: not all asset details can ever be known.

    I started with a query to find the list of assets to go try to look up:

    Code:
        SELECT DISTINCT thingId from events WHERE
          thingId NOT IN (SELECT thingId from assets);
    Right? Give me all the thingId that don't have a corresponding asset row.

    Worked fine. Later, the events table got bigger, and this slowed down. I read somewhere that DISTINCT can get expensive in Postgres, so I changed it to use GROUP BY -- this is apparently a standard trick in Postgres (and maybe other SQL databases?):

    Code:
        SELECT thingId from events WHERE
          thingId NOT IN (SELECT thingId from assets)
          GROUP BY thingId;
    This ran much faster, though of course the resulting list came out unsorted. Fine for my application.

    Yesterday, I must have crossed some threshold: this query now "runs forever" -- I let it run overnight, and it consumes 100% of a hyperthread.

    There are ~37k assets and ~12M events and the events table has ~45k unique assets (as counted by "select count(distinct thingId) from events"). This doesn't seem like a lot of data, and I was planning on having a lot more. Each row of both assets and events is on the order of 128 bytes.

    If I perform this operation "by hand" it runs in about 15 seconds (excuse the pseudo PHP; queryOrDie() just does what you think it does, prepare/execute, etc.):

    Code:
        // Build a lookup of every thingId we already have an asset row for.
        $q1 = "SELECT thingId FROM assets";
        $s1 = queryOrDie($db, $q1);
        $got = array();
        while ($r1 = $s1->fetch(PDO::FETCH_ASSOC))
            $got[$r1['thingId']] = true;

        // Collect the event thingIds missing from the lookup.
        // isset() rather than empty(): empty() would misclassify a
        // thingId of "0" or "" as missing.
        $q2 = "SELECT thingId FROM events GROUP BY thingId";
        $s2 = queryOrDie($db, $q2);
        $get = array();
        while ($r2 = $s2->fetch(PDO::FETCH_ASSOC))
            if (!isset($got[$r2['thingId']]))
                $get[] = $r2['thingId'];
    At this point, $get[] is an array of thingId that is the same as the result from the never-ending query above.

    Is there something I can do to figure out why this query is taking so long, or can someone spot a "rookie move" in my query? I obviously often do the equivalent of

    Code:
        SELECT * FROM events where thingId=?
    so I do have an INDEX on events.thingId.

    This seems like a pretty simple setup ... yet, I'm stumped. Ideas?

  2. #2
    Quote Originally Posted by jmhayes
    Yesterday, I must have crossed some threshold: this query now "runs forever" -- I let it run overnight, and it consumes 100% of a hyperthread.
    This sounds like there is something else going on here. With that number of rows in the tables, this should not take that long. Did you monitor the system to check what it is doing, e.g. disk I/O, or whether the OS was swapping because something else was running on the box? I think the Linux tool is named iostat (sorry, I'm not a Linux guy). You should also check the various pg_stat views; maybe something shows up there (e.g. a lock held by another transaction).

    As you are running 8.3, I would assume that you have autovacuum running. But if the SELECT takes that long, it might still be caused by too many dead tuples in the tables (left behind by updates or deletes).
    Maybe you need to increase the frequency or adjust other autovacuum settings in order to deal with the dead tuples.
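
    For example, you can check how many dead tuples each table is carrying (pg_stat_user_tables has n_dead_tup in 8.3) and clean up by hand; this is just a sketch using your table names:

    Code:
        -- How bloated are the two tables?
        SELECT relname, n_live_tup, n_dead_tup
        FROM pg_stat_user_tables
        WHERE relname IN ('events', 'assets');

        -- One-off cleanup plus fresh planner statistics:
        VACUUM ANALYZE events;
        VACUUM ANALYZE assets;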

    Introduction to VACUUM, ANALYZE, EXPLAIN, and COUNT - PostgreSQL Wiki
    Vacuuming - PostgreSQL Wiki

    Is there something I can do to figure out why this query is taking so long, or can someone spot a "rookie move" in my query?
    The first starting point is always the execution plan.
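
    For example (note that EXPLAIN ANALYZE actually executes the query, so be ready to interrupt it; plain EXPLAIN only prints the plan):

    Code:
        EXPLAIN ANALYZE
        SELECT thingId FROM events
        WHERE thingId NOT IN (SELECT thingId FROM assets)
        GROUP BY thingId;
    If the plan shows the NOT IN as a plain (non-hashed) SubPlan, Postgres is re-running the assets subselect for every candidate row, which would explain the runaway runtime.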

    You should also update to the latest 8.3 release. I had a quick look through the release notes, and there are at least three bug fixes for wrong execution plans involving an IN subselect (although they might not be relevant in your situation).
    As this is only a minor upgrade (8.3.3 -> 8.3.7), it is as simple as replacing the binaries (of course while the server is shut down).

    You might try with an outer join:

    Code:
        SELECT e.thingId
        FROM events e LEFT JOIN assets a ON e.thingId = a.thingId
        WHERE a.thingId IS NULL
        GROUP BY e.thingId;
    Maybe this gives a better plan (but my bet would be that it's a vacuum problem). Note that the anti-join test has to be on the assets side (a.thingId IS NULL): for events rows with no match, the columns coming from assets are NULL.
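
    Another rewrite worth a try is NOT EXISTS. On 8.3, a NOT IN subselect can only be turned into a hashed subplan when the whole subselect result fits in work_mem, so a correlated NOT EXISTS (probing the assets index once per row) sometimes fares better; this is a sketch, not a guaranteed win:

    Code:
        SELECT e.thingId
        FROM events e
        WHERE NOT EXISTS
          (SELECT 1 FROM assets a WHERE a.thingId = e.thingId)
        GROUP BY e.thingId;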

    so I do have an INDEX on events.thingId.
    Did you verify that you have an index, or are you just assuming that?

    You should also have an index on assets.thingId.
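
    You can check what actually exists instead of assuming; the index names below are made up for illustration:

    Code:
        -- List the indexes on both tables:
        SELECT tablename, indexname, indexdef
        FROM pg_indexes
        WHERE tablename IN ('events', 'assets');

        -- Create them if missing:
        CREATE INDEX events_thingid_idx ON events (thingId);
        CREATE INDEX assets_thingid_idx ON assets (thingId);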

    Take this question to the pgsql-performance mailing list; you will get much better and more detailed answers there (but be prepared to explain why you are not running the latest version).

    If you did not tune your postgresql.conf, that might also cause performance problems (e.g. because PG chooses a bad execution plan).
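
    In particular, the default work_mem in 8.3 is only 1MB, and the NOT IN subselect is only hashed when its whole result fits in work_mem. Rough starting points for a 4GB box (illustrative values, not tuned recommendations):

    Code:
        # postgresql.conf -- illustrative values only
        shared_buffers = 512MB      # the 8.3 default (32MB) is far too small
        work_mem = 32MB             # per sort/hash node; lets the subselect be hashed
        effective_cache_size = 3GB  # how much OS file cache the planner may assume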

    Check out: Performance Optimization - PostgreSQL Wiki.

    PG Tune could also be a good starting point.
