  1. #1
    Unanswered: Windows 7 performance - large dataset

    I am working with a pretty large dataset in my PostgreSQL database (largest table = 39 million rows). The database is installed on my Windows 7 desktop, because that's the only real option for me at the moment. Some of the operations seemed exceptionally slow, so I have done my best to read through performance-tuning articles and the like. After a bit of tweaking, the queries execute faster, but the one I'm running now is expected to take more than 5 days to finish, and that is a long time to wait for a result. So I was hoping to get some tips on how I can improve performance.

    My main table is raw data from fish telemetry loggers (I work with fish research at a university). Every time a fish is detected at a logger, I get a row with date/time, fish ID, site name and signal strength; 39 million rows in total.
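
    For reference, here is a minimal sketch of what the rawdata table looks like, based on the temporary copy in my function further down (the index is just something I am considering to support the per-fish, per-site ordered scans, not something I claim to have in place):
    Code:
    -- Sketch of the raw detection table (column names as used in the function below).
    CREATE TABLE Telemetry2009.rawdata
    (
    	date_time	TIMESTAMP WITH TIME ZONE,
    	antenna		VARCHAR(10),
    	tagid		INTEGER,
    	signal		INTEGER
    );

    -- An index matching the cursor's WHERE/ORDER BY would let PostgreSQL read one
    -- tag/antenna combination in date_time order without sorting millions of rows.
    CREATE INDEX rawdata_tag_antenna_time_idx
    	ON Telemetry2009.rawdata (tagid, antenna, date_time);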

    The function I am currently running works through all those rows, one fish at a time and one site at a time. For fish 1 at site 1, I scan rows until there is a gap of at least 5 minutes between detections. When such a gap is found, I store the time span (first and last detection), the min/max/median signal strength, the total number of detections, and the detections per minute.
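
    In plain SQL, the grouping I am describing is the usual gaps-and-islands pattern. Here is an untested sketch of how the same groups could be formed in one set-based pass with window functions (column names as in the table sketch above; the median is omitted and the detections-per-minute rounding from my function is simplified):
    Code:
    -- Untested sketch: start a new group whenever the gap to the previous detection
    -- of the same tag on the same antenna exceeds 5 minutes, then aggregate per group.
    WITH marked AS (
    	SELECT tagid, antenna, date_time, signal,
    		CASE WHEN date_time - LAG(date_time) OVER w > INTERVAL '5 minutes'
    		     THEN 1 ELSE 0 END AS new_group
    	FROM Telemetry2009.rawdata
    	WINDOW w AS (PARTITION BY tagid, antenna ORDER BY date_time)
    ),
    grouped AS (
    	SELECT tagid, antenna, date_time, signal,
    		SUM(new_group) OVER (PARTITION BY tagid, antenna ORDER BY date_time) AS grp
    	FROM marked
    )
    SELECT	tagid, antenna,
    	MIN(date_time)	AS first_detection,
    	MAX(date_time)	AS last_detection,
    	COUNT(*)	AS num_detections,
    	-- clamp the duration to at least 1 minute, mirroring the "DetPerMin >= 1" rule
    	ROUND(COUNT(*) / GREATEST(EXTRACT(EPOCH FROM MAX(date_time) - MIN(date_time)) / 60, 1)) AS det_per_min,
    	MIN(signal)	AS min_signal,
    	MAX(signal)	AS max_signal
    	-- the median would still need a per-group ordered pick, so it is left out here
    FROM grouped
    GROUP BY tagid, antenna, grp
    ORDER BY tagid, antenna, first_detection;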

    For the first few fish, this is what the timing looks like:
    Code:
    Tag-ID  TotalRows ProcessTime(s) RowsPerSec
    1           5520       0.722   7645
    2           2636       0.473   5573
    3        1283671   44682.820     29
    4           1362     121.327     11
    5         577684   51730.610     11
    6            736      73.698     10
    As you can see, the number of rawdata rows processed per second drops radically once the fish with over a million rows is reached, and performance then seems to keep declining for each fish processed after that. Since I have 100 fish to look at, I hope this trend doesn't continue.

    My function uses a temporary table to work out the max/min/median of each 5-minute group. Is that what is causing the performance issues? Any tips on how to get the same performance on the fish with really large row counts as I get on the small ones would be greatly appreciated.
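
    One thing I still want to rule out (this is just a sketch of what I plan to run, using the worst tag from the timing table above) is whether the bare per-fish scan is slow in itself, or whether the time is spent in the temporary-table work inside the loop:
    Code:
    -- Time just the cursor's query for the fish with ~1.3 million rows.
    -- If this is fast, the problem is in the per-row work, not the scan.
    EXPLAIN ANALYZE
    SELECT *
    FROM Telemetry2009.rawdata
    WHERE tagid = 3
      AND antenna = 'A1'
      AND date_time > TIMESTAMP '2009-07-06 12:00:00'
    ORDER BY date_time;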

    I'm pasting my function below. I have searched the net for ways to do this faster, but this is the best I have come up with so far.

    I'm calling the function with:
    SELECT Telemetry2009.fDoGordyFilteringDetails(tagid, 5, 'A1') FROM tFishTags GROUP BY tagid;

    Code:
    CREATE OR REPLACE FUNCTION Telemetry2009.fDoGordyFilteringDetails(thistag INTEGER, gaptime INTEGER, thisantenna VARCHAR(10))
    	RETURNS INTEGER AS
    $BODY$
    DECLARE
    	tm_row		RECORD;
    	tm_cursor	CURSOR FOR SELECT * FROM Telemetry2009.rawdata
    					WHERE tagid = thistag
    					  AND antenna = thisantenna
    					  AND date_time > TIMESTAMP '2009-07-06 12:00:00'
    					ORDER BY date_time ASC;
    	maxtime		TIMESTAMP WITH TIME ZONE;
    	thisgap		DOUBLE PRECISION;
    	save		RECORD;
    	medians		INTEGER;
    	i		INTEGER;
    BEGIN
    	-- Create a temporary table for data in the function.
    	--
    	-- Exception handling is the fastest way to check if the temporary table already exists.
    	--
    	-- Temporary tables are dropped at the end of each transaction;
    	-- i.e. SELECT fDoGordyFiltering(tagid) FROM rawdata GROUP BY tagid; keeps the temporary 
    	-- table alive until the query is completed and returns the tagid's, then the temporary 
    	-- table is dropped. So we need to create the table for the first tagid, and the following
    	-- tagid's will just delete previous rows from the table.
    	BEGIN
    		DELETE FROM tempdata;
    	EXCEPTION 
    		WHEN OTHERS THEN
    			-- This is a copy of the Telemetry2009.rawdata table structure, but will 
    			-- only hold the rows we encounter until 'gaptime' is exceeded in a gap.
    			CREATE TEMP TABLE tempdata
    			(
    				date_time TIMESTAMP WITH TIME ZONE,
    				antenna VARCHAR(10),
    				tagid INTEGER,
    				signal INTEGER
    			);
    			-- Another, more dynamic, way would be 
    			-- CREATE TEMP TABLE tempdata AS SELECT * FROM Telemetry2009.rawdata LIMIT 1;
    			-- DELETE FROM tempdata;
    			-- But I think it would be a lot slower.
    	END;
    
    	i := 0; -- just to keep track of progress
    
    	-- Loop through all the rows in rawdata for this tag and antenna, ordered by date_time.
    	FOR tm_row IN tm_cursor LOOP
    		i := i + 1; -- progress tracker
    		IF i > 100000 THEN
    			RAISE NOTICE '\n\n100,000 rows processed at %\n', CLOCK_TIMESTAMP();
    			i := 0;
    		END IF;
    		SELECT MAX(date_time) INTO maxtime FROM tempdata;
    		-- If "time of latest row + gaptime" > "current row time" then we should start a
    		-- new group. Otherwise, we'll just save current row in the temporary table.
    		--
    		-- Remember that if this is the first time through the loop, maxtime will be NULL.
    		IF maxtime IS NOT NULL THEN
    			thisgap := EXTRACT(MONTH	FROM (tm_row.date_time - maxtime)) * 60 * 24 * 30 + 
    					EXTRACT(DAY 	FROM (tm_row.date_time - maxtime)) * 60 * 24 + 
    					EXTRACT(HOUR 	FROM (tm_row.date_time - maxtime)) * 60 + 
    					EXTRACT(MINUTE	FROM (tm_row.date_time - maxtime)) * 1 + 
    					EXTRACT(SECOND	FROM (tm_row.date_time - maxtime)) / 60; 
    			IF thisgap > gaptime THEN
    				SELECT MIN(date_time) AS fd, MAX(date_time) AS ld, COUNT(*) AS c, 
    					MAX(signal) AS maxs, MIN(signal) AS mins INTO save FROM tempdata;
    				-- We need to calculate a new timegap that is within the gaptime.
    				-- (previous gaptime was including the detection that lies outside
    				--  of gaptime, which could be months away).
    				IF (save.ld - save.fd) < INTERVAL '1 minute' THEN 
    					-- DetPerMin >= 1
    					thisgap := 1; 
    				ELSE 
    					thisgap := ROUND((EXTRACT(MONTH	FROM (save.ld - save.fd)) * 60 * 24 * 30) + 
    							(EXTRACT(DAY 	FROM (save.ld - save.fd)) * 60 * 24) + 
    							(EXTRACT(HOUR 	FROM (save.ld - save.fd)) * 60) + 
    							(EXTRACT(MINUTE	FROM (save.ld - save.fd)) * 1) + 
    							(EXTRACT(SECOND	FROM (save.ld - save.fd)) / 60)); 
    				END IF;
    				-- Store median signal.
    				-- Nifty little trick using window functions. Totally stolen code.
    				SELECT ROUND(AVG(signal)) INTO medians 
    					FROM (SELECT signal, ROW_NUMBER() OVER
    							(ORDER BY SIGNAL), COUNT(*) OVER ()
    						FROM tempdata) t
    						WHERE ROW_NUMBER BETWEEN FLOOR((COUNT::FLOAT8 - 1) / 2 + 1)
    								 AND	 CEIL((COUNT::FLOAT8 - 1) / 2 + 1);
    				/* My first attempt at doing median signal:
    				--
    				-- Loop through tempdata to find the middle, then store that signal
    				-- as median signal. It may be a less than perfect way to calculate
    				-- median signal, but it is good enough I think.
    				DECLARE
    					med_row		RECORD;
    					med_cursor	CURSOR FOR SELECT signal FROM tempdata ORDER BY signal;
    					i		INTEGER := 0;
    				BEGIN
    					FOR med_row IN med_cursor LOOP
    						i := i + 1;
    						medians := med_row.signal;
    						-- count(tempdata) / 2 = middle so exit 
    						EXIT WHEN i = FLOOR(save.c / 2);
    					END LOOP;
    				END;
    				*/
    				-- Save this group into a row in the GordyRAW table.
    				INSERT INTO Telemetry2009.tGordyRAW VALUES
    				(
    					thisantenna,			-- Antenna
    					thistag,			-- TagID
    					save.fd,			-- FirstDetection
    					save.ld,			-- LastDetection
    					save.c,				-- NumDetections
    					ROUND(save.c / thisgap),	-- DetPerMin
    					save.maxs,			-- MaxSignal
    					save.mins,			-- MinSignal
    					medians				-- MedianSignal
    				);
    				-- Clear the temporary table since a new time period is started when
    				-- gaptime is exceeded.
    				DELETE FROM tempdata;
    			END IF;
    		END IF;
    		-- Save data for filtering.
    		INSERT INTO tempdata VALUES (tm_row.date_time, tm_row.antenna, tm_row.tagid, tm_row.signal);
    	END LOOP;
    	RETURN thistag;
    END;
    $BODY$
    LANGUAGE 'plpgsql' VOLATILE
    COST 100;
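
    One simplification I have been meaning to try (just a sketch, not benchmarked): since the cursor already delivers rows in date_time order, the SELECT MAX(date_time) FROM tempdata on every row could be replaced by remembering the previous row's timestamp in a variable, and a plain interval comparison would avoid the EXTRACT arithmetic. Roughly, the loop would become:
    Code:
    -- Sketch of a modified loop body (nested block inside the same function as above).
    DECLARE
    	prevtime	TIMESTAMP WITH TIME ZONE := NULL;
    BEGIN
    	FOR tm_row IN tm_cursor LOOP
    		IF prevtime IS NOT NULL
    		   AND tm_row.date_time - prevtime > gaptime * INTERVAL '1 minute' THEN
    			NULL;	-- placeholder: aggregate tempdata, insert into tGordyRAW,
    				-- then DELETE FROM tempdata, exactly as in the function above
    		END IF;
    		INSERT INTO tempdata VALUES (tm_row.date_time, tm_row.antenna, tm_row.tagid, tm_row.signal);
    		prevtime := tm_row.date_time;
    	END LOOP;
    END;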

  2. #2

    Addition

    Some of the contents of my conf file (I have removed what I didn't think was important for my question):

    Code:
    # -----------------------------
    # PostgreSQL configuration file
    # -----------------------------
    #
    #------------------------------------------------------------------------------
    # RESOURCE USAGE (except WAL)
    #------------------------------------------------------------------------------
    
    # - Memory -
    
    shared_buffers = 1500MB			# min 128kB
    					# (change requires restart)
    #temp_buffers = 8MB			# min 800kB
    #max_prepared_transactions = 0		# zero disables the feature
    					# (change requires restart)
    # Note:  Increasing max_prepared_transactions costs ~600 bytes of shared memory
    # per transaction slot, plus lock space (see max_locks_per_transaction).
    # It is not advisable to set max_prepared_transactions nonzero unless you
    # actively intend to use prepared transactions.
    work_mem = 2MB				# min 64kB
    maintenance_work_mem = 256MB		# min 1MB
    #max_stack_depth = 2MB			# min 100kB
    
    # - Kernel Resource Usage -
    
    #max_files_per_process = 1000		# min 25
    					# (change requires restart)
    #shared_preload_libraries = ''		# (change requires restart)
    
    # - Cost-Based Vacuum Delay -
    
    #vacuum_cost_delay = 0ms		# 0-100 milliseconds
    #vacuum_cost_page_hit = 1		# 0-10000 credits
    #vacuum_cost_page_miss = 10		# 0-10000 credits
    #vacuum_cost_page_dirty = 20		# 0-10000 credits
    #vacuum_cost_limit = 200		# 1-10000 credits
    
    # - Background Writer -
    
    #bgwriter_delay = 200ms			# 10-10000ms between rounds
    #bgwriter_lru_maxpages = 100		# 0-1000 max buffers written/round
    #bgwriter_lru_multiplier = 2.0		# 0-10.0 multiplier on buffers scanned/round
    
    # - Asynchronous Behavior -
    
    #effective_io_concurrency = 1		# 1-1000. 0 disables prefetching
    
    
    #------------------------------------------------------------------------------
    # QUERY TUNING
    #------------------------------------------------------------------------------
    
    # - Planner Method Configuration -
    
    #enable_bitmapscan = on
    #enable_hashagg = on
    #enable_hashjoin = on
    #enable_indexscan = on
    #enable_mergejoin = on
    #enable_nestloop = on
    #enable_seqscan = on
    #enable_sort = on
    #enable_tidscan = on
    
    # - Planner Cost Constants -
    
    #seq_page_cost = 1.0			# measured on an arbitrary scale
    #random_page_cost = 4.0			# same scale as above
    #cpu_tuple_cost = 0.01			# same scale as above
    #cpu_index_tuple_cost = 0.005		# same scale as above
    #cpu_operator_cost = 0.0025		# same scale as above
    effective_cache_size = 3000MB
    
    # - Genetic Query Optimizer -
    
    #geqo = on
    #geqo_threshold = 12
    #geqo_effort = 5			# range 1-10
    #geqo_pool_size = 0			# selects default based on effort
    #geqo_generations = 0			# selects default based on effort
    #geqo_selection_bias = 2.0		# range 1.5-2.0
    
    # - Other Planner Options -
    
    #default_statistics_target = 100	# range 1-10000
    #constraint_exclusion = partition	# on, off, or partition
    #cursor_tuple_fraction = 0.1		# range 0.0-1.0
    #from_collapse_limit = 8
    #join_collapse_limit = 8		# 1 disables collapsing of explicit 
    					# JOIN clauses
    
    
    
    # These settings are initialized by initdb, but they can be changed.
    lc_messages = 'Swedish_Sweden.1252'			# locale for system error message
    					# strings
    lc_monetary = 'Swedish_Sweden.1252'			# locale for monetary formatting
    lc_numeric = 'Swedish_Sweden.1252'			# locale for number formatting
    lc_time = 'Swedish_Sweden.1252'				# locale for time formatting
    
    # default configuration for text search
    default_text_search_config = 'pg_catalog.swedish'
    
    # - Other Defaults -
    
    #dynamic_library_path = '$libdir'
    #local_preload_libraries = ''
    
    
    #------------------------------------------------------------------------------
    # LOCK MANAGEMENT
    #------------------------------------------------------------------------------
    
    deadlock_timeout = 10s
    #max_locks_per_transaction = 64		# min 10
    					# (change requires restart)
    # Note:  Each lock table slot uses ~270 bytes of shared memory, and there are
    # max_locks_per_transaction * (max_connections + max_prepared_transactions)
    # lock table slots.
    
    
    #------------------------------------------------------------------------------
    # VERSION/PLATFORM COMPATIBILITY
    #------------------------------------------------------------------------------
    
    # - Previous PostgreSQL Versions -
    
    #add_missing_from = off
    #array_nulls = on
    #backslash_quote = safe_encoding	# on, off, or safe_encoding
    #default_with_oids = off
    #escape_string_warning = on
    #regex_flavor = advanced		# advanced, extended, or basic
    #sql_inheritance = on
    #standard_conforming_strings = off
    #synchronize_seqscans = on
    
    # - Other Platforms and Clients -
    
    #transform_null_equals = off
    
    
    #------------------------------------------------------------------------------
    # CUSTOMIZED OPTIONS
    #------------------------------------------------------------------------------
    
    #custom_variable_classes = ''		# list of custom variable class names
    My machine is an HP Z800 with 6GB RAM and a quad-core processor (not that the extra cores matter much here, since the PostgreSQL process only uses one of them).

    Typically, while this query is running, about 2GB of RAM and 25% CPU (100% of one core) are in use according to the Task Manager performance monitor.
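
    I also notice that work_mem is only 2MB in the config above, so any sort over the temporary table that is larger than that will spill to disk. One thing I intend to try is raising it for the session only, before running the query (64MB is just a guess for a machine with 6GB of RAM and a single connection):
    Code:
    -- Raise per-sort/per-hash memory for this session only, then run the query.
    SET work_mem = '64MB';

    SELECT Telemetry2009.fDoGordyFilteringDetails(tagid, 5, 'A1')
    FROM tFishTags
    GROUP BY tagid;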

  3. #3

    Further update

    Another addition to this question...

    After a bit of analysis, it seems that the median calculation is extremely slow compared to min/max. Is there a faster way to calculate the median than the one I am using? Are window functions very resource-intensive?
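
    For comparison, the alternative I am thinking of trying (untested) skips the window functions entirely: sort the group once and jump to the middle row, using the row count I already have in save.c. For an even count this takes the lower of the two middle values instead of their average, which is good enough for my purposes:
    Code:
    -- Sketch: pick the middle row of the sorted group, in place of the
    -- window-function query inside the function. save.c holds COUNT(*) for the group.
    SELECT signal INTO medians
    FROM tempdata
    ORDER BY signal
    OFFSET (save.c - 1) / 2
    LIMIT 1;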
