  1. #1
    Join Date
    Jun 2003
    Posts
    4

    Unanswered: Larg(ish)-scale pgsql?

    Anybody have any experience deploying PostgreSQL in a fairly sizable (but not enormous) data warehousing application - say half a terabyte or so?

    Just wondering if it's remotely possible to scale pgsql to that sort of degree, or if I'm dreaming and should go back to the world of high-priced proprietary DBMSs.

    Any personal stories, links to case studies, etc. helpful.

  2. #2
    Join Date
    Sep 2001
    Location
    Vienna
    Posts
    400

    data warehousing

    What do you want to hear?

    We were in Greece training some people on PostgreSQL; they take in about 1 billion records every day (logging data from a telco).

    A second example: a German data warehouse with BLOBs, 2 terabytes.

    The only problem with data that enormous is making sure you have enough space on the hardware.

    That is really the only issue you will have.

  3. #3
    Join Date
    Jun 2003
    Posts
    4
    Thanks for the reply; it's encouraging to know that others have had some success with larger pgsql databases. In addition to this example, I've come across one or two other similar instances in the PostgreSQL mailing list archives.

    Still, the overwhelming majority of pgsql deployments seem to be quite a bit smaller than what I am considering.

    I'm sorry that perhaps I didn't spell out my question explicitly enough, but I'm concerned about PostgreSQL's ability to handle the usual sticking points in warehousing applications: the efficiency of bulk-load and backup/restore operations, and query performance against very large, denormalized datasets.

  4. #4
    Join Date
    Sep 2001
    Location
    Vienna
    Posts
    400

    data warehousing

    OK, let's see what I can do.

    Maybe you could take a look at

    EFEU

    This is an OLAP-like tool you can use with Postgres. It's GPL.

    We will be working on a real OLAP application in the near future.

    We have been working with the Austrian social data, and I think performance in version 7.4 has increased dramatically for things like queries on denormalized datasets.

    Bulk loads you can handle with the COPY command. The performance depends on hardware and so on, but I think it is best if you generate test data that matches your needs and then benchmark COPY with it, as in the sketch below.
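
    For illustration, a minimal sketch of a COPY bulk load. The table and file names here are hypothetical placeholders, not from any deployment mentioned in this thread:

        -- hypothetical fact table for a telco-style warehouse
        CREATE TABLE call_log (
            call_id     bigint,
            caller      varchar(20),
            callee      varchar(20),
            started_at  timestamp,
            duration_s  integer
        );

        -- server-side bulk load from a tab-delimited file;
        -- one COPY is far faster than millions of single-row INSERTs
        COPY call_log FROM '/tmp/call_log.tsv';

    From psql you can use \copy instead, which reads the file on the client side rather than on the server.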

    Backup/restore:
    Use pg_dump; it gives a consistent backup (it runs in a serializable transaction), and you can also make a binary-format dump of your database.
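
    As a sketch, assuming a database simply named warehouse (a placeholder):

        # plain SQL dump
        pg_dump warehouse > warehouse.sql

        # custom ("binary") format, restorable with pg_restore
        pg_dump -Fc warehouse > warehouse.dump
        pg_restore -d warehouse warehouse.dump

    For a half-terabyte database, both the dump and the restore will take a while, so time them against realistic data before you commit to this approach.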

    If you want a commercial solution, NetVault recently released a PostgreSQL backup product.

    Hope that helps a bit.

  5. #5
    Join Date
    Mar 2003
    Posts
    5


    I'm having some issues with a very large Postgres DB on Linux: several tables with over 10 million records. One of these tables is difficult to vacuum; it requires several hours, and I have to stop some critical processes while it runs. If I don't vacuum for more than a day, there are easily over a million deletions pending. Right now the server it's running on is a dual 1.8 GHz machine with 2 GB of RAM. We're in the process of upgrading, but it seems like it's a hardware limitation we're hitting. Any thoughts?

  6. #6
    Join Date
    Sep 2001
    Location
    Vienna
    Posts
    400

    what version

    What version of Postgres are you using?

  7. #7
    Join Date
    Mar 2003
    Posts
    5
    7.3

    And some other possibly useful info:
    it seems that when the database is almost done with a very large query or vacuum, the load spikes on the database machine.

  8. #8
    Join Date
    Mar 2003
    Posts
    5
    I completed the vacuum on that heavily used table last night; it took 4 hours.

    I have this monster, another table that takes 3 hours, and one that takes over 90 minutes.

    I guess it's all about scheduling for me now. I have a process that queries the DB and runs for 2 to 5 hours depending on system load. The initial query for this process runs about 20-30 minutes. It needs to run constantly during the day, and when user load is low (at night) I'll run multiple copies of it to maximize my throughput.

    It seems as though I have hit the ceiling for this system.
    Any other experiences you all may have had would be good to hear about.

    Thanks.
    LayOut

  9. #9
    Join Date
    Sep 2001
    Location
    Vienna
    Posts
    400

    questions

    A few questions about this:

    How big is the database?

    How many rows are in the problem table?

    What does this process do?
    (SELECT, INSERT, COPY, UPDATE?)

  10. #10
    Join Date
    Mar 2003
    Posts
    5
    The database is approximately 60 GB.

    The table to be vacuumed has 8.7 million rows and 7 columns.

    The process that runs does a SELECT from this table, which takes about 20-30 minutes. As it loops through the records, it either updates or deletes each record (modifying about 200K records per 3-5 hour run). This process runs constantly throughout the day.

    One more daily process also modifies that table, adding roughly 1 million records.

    It is possible that the size of the database could be reduced in the future.

  11. #11
    Join Date
    Sep 2001
    Location
    Vienna
    Posts
    400

    problem

    How do you insert the 1 million rows daily: with INSERT or COPY?

    If you do it with INSERTs, you will have to vacuum the database after that step. That is the first thing.

    Second, the process runs all day long updating and deleting rows. That explains why the vacuum takes so long: by then there are a lot of dead row versions left behind by all those transactions.
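
    A minimal sketch of what that per-table vacuuming could look like, reusing the hypothetical call_log table from earlier in the thread (your table name will differ):

        -- vacuum only the heavily churned table instead of the whole
        -- database, ideally right after the daily bulk insert
        VACUUM ANALYZE call_log;

        -- VERBOSE reports how many dead row versions were reclaimed
        VACUUM VERBOSE call_log;

    Plain VACUUM (without FULL) does not take an exclusive lock, so it can run alongside normal reads and writes, just with some I/O impact.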

    Maybe your only chance is an automatic vacuum running in the background, but that is not implemented yet.

    I think there is a vacuuming daemon on gborg.postgresql.org;
    take a look at that, maybe it helps.

  12. #12
    Join Date
    Mar 2003
    Posts
    5
    The 1 million daily additions are done with INSERTs. I've thought about vacuuming after every process run, but it seems more time-efficient to do it once a day.

    I couldn't find the vacuum daemon you spoke of at gborg.

    Would you say this is a limitation of Postgres compared to one of the expensive proprietary alternatives?

    Are they working on an auto-vacuum now? Is there any discussion of when it might be done?

  13. #13
    Join Date
    Sep 2001
    Location
    Vienna
    Posts
    400

    vacuum

    You have huge inserts going into the database, and that all day long. One thing you can do is switch to another transaction mode.

    I assume this is a 24/7 system.

    In the 7.4 version there is something called read-only transactions, for work that only does SELECTs.

    You could also try switching to serializable mode. It should be faster then, but I don't know exactly what you are doing with your transactions.
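
    A minimal sketch of both suggestions, again using the hypothetical call_log table; whether either actually helps depends on the workload (and, per the post above, READ ONLY transactions arrive with 7.4):

        -- run long reporting queries in a read-only transaction
        BEGIN;
        SET TRANSACTION READ ONLY;
        SELECT count(*) FROM call_log;  -- stand-in for the real report
        COMMIT;

        -- try serializable isolation for the updating process
        BEGIN;
        SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
        UPDATE call_log SET duration_s = 0
            WHERE duration_s IS NULL;   -- stand-in for the real batch
        COMMIT;

    Note that SET TRANSACTION must come right after BEGIN, before the first query of the transaction.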



    You can also try this; it is the vacuuming daemon I meant:
    http://gborg.postgresql.org/project/...rojdisplay.php


    There will be built-in autovacuuming in one of the next releases, maybe 7.5, but don't count on that.

    For now, use this vacuuming daemon, which I think works on a per-table basis.

  14. #14
    Join Date
    Sep 2001
    Location
    Vienna
    Posts
    400

    that's just come in

    maybe helpful to you:

    ANNOUNCEMENT: Availability of TelegraphCQ v0.2 (BETA release)
    -------------------------------------------------------------

    The Telegraph Team at UC Berkeley is pleased to announce the immediate
    availability of TelegraphCQ v0.2 (a BETA release). TelegraphCQ is a
    system for processing long-running continuous queries over data
    streams. TelegraphCQ is implemented as a significant modification to
    the PostgreSQL open-source database system. Specifically, TelegraphCQ
    v0.2 is based on the PostgreSQL 7.3.2 code-base.

    For more information on the Telegraph project please visit:

    http://telegraph.cs.berkeley.edu

    To download TelegraphCQ and for more information, please visit:

    http://telegraph.cs.berkeley.edu/telegraphcq

    Here is the README.TelegraphCQ from the source distribution:

    ----------------------------------------------------------------------
    This file contains information on TelegraphCQ 0.2 (a Beta release).

    What's new (since TelegraphCQ 0.1)
    ---------------------------------

    - Better stability: Lots of bugs have been squashed

    - CSV Wrapper: An easy-to-use wrapper that accepts simple comma-separated
    text as input and supports all PostgreSQL and user-defined types. There
    is now no need to write C code to develop a new wrapper if your data
    source produces textual data.

    - Pull Wrappers: If you want to pull (lookup) data from external
    sources (such as web pages) based on query predicates, you can use the
    new infrastructure for the Pull Wrapper.

    - Introspective Query Processing: The dynamic catalog is a new feature
    in TelegraphCQ. It lets users query three special purpose internal
    data streams to get a continuous view of how the TelegraphCQ system is
    behaving.

    - Applications: Included with this distribution are a few applications
    that we demonstrated at the ACM FCRC/SIGMOD 2003 Conference in San
    Diego. We have integrated data from CalTrans (California Department of
    Transportation), CHP (California Highway Patrol) and a local network
    interface (using the tcpdump utility). We have also included graphical
    visualizers that run some interesting queries over data streams from
    these sources.

    Request
    -------

    If you are using TelegraphCQ for any project (research or otherwise),
    we'd like to know!

    - If you are making any performance measurements of TelegraphCQ and
    are publishing it or disseminating it in some way, please do let us
    know.

    - We welcome bug fixes and code contributions of any kind. Please
    send all patches to tcqfeedback@db.cs.berkeley.edu

    Caveat Emptor
    -------------

    The current version of TelegraphCQ (0.2) is still a Beta release and
    is very much research code. We haven't optimized it for
    performance. Also, you will need to create a fresh database to use it
    as it will not work against an existing PostgreSQL database. You are
    free to use pg_dump to export and then import your data from an
    existing PostgreSQL database.

    - Archived streams: Note that although TelegraphCQ 0.2 supports
    archived streams, we currently do not support access of historical
    data. This will be fixed in our next release.

    For more information
    --------------------

    Getting Started:

    A TelegraphCQ Getting Started guide can be found in
    docs/TelegraphCQ/gettingstarted.html

    The Getting Started guide also has links to other information such
    as DDL/DML syntax, examples etc. Note that the documentation
    installed in your TelegraphCQ installation directory contains
    documentation for the new syntax changes (CREATE/ALTER/DROP STREAM,
    the WINDOW clause in DML) integrated into the existing PostgreSQL
    documentation.

    The Telegraph Project:

    http://telegraph.cs.berkeley.edu

    Supported Platforms
    -------------------

    TelegraphCQ is supported on Linux x86 (various distributions, including
    Red Hat Linux 8 and 9 and Debian) and MacOS X 10.2, the platforms on
    which it was developed. In addition, it has been minimally tested on the
    SPARC Solaris platform.

    In general, TelegraphCQ should run anywhere PostgreSQL runs. However,
    there are significant differences in how we use shared memory and
    synchronization primitives such as semaphores. This is likely to cause
    some hiccups on platforms where we have not tested it yet. We welcome
    patches that help in porting TelegraphCQ.


    Please send all comments (and patches) to tcqfeedback@db.cs.berkeley.edu

    ----------------------------------------------------------------------
