Results 1 to 8 of 8
  1. #1
    Join Date
    Mar 2004
    Posts
    5

    Unanswered: performance problems - need help urgently

    hi ,

    this is driving me nuts!!!!!!. i am running a perl script which takes records from a database, finds unique single words and two, three & four word collocations in them. then for each unique combination found it searches in the Mysql table , if the combination is found an INSERT is made into a frequency table , otherwise if the query returns empty set then the combination is added to the unique words table as well as the frequency table. all of the tables have the combination field declared as Varchar(x) BINARY.it is taking huge amounts of time to do this ? why? imagine this, it took me 8 hours to process 532 single line documents. something is grossly wrong here. can anybody help me with this? should i use INDEXES ? how?

  2. #2
    Join Date
    Oct 2003
    Location
    Slovakia
    Posts
    482

    Re: performance problems - need help urgently

    Originally posted by fadingjava
    hi ,

    this is driving me nuts!!!!!!. i am running a perl script which takes records from a database, finds unique single words and two, three & four word collocations in them. then for each unique combination found it searches in the Mysql table , if the combination is found an INSERT is made into a frequency table , otherwise if the query returns empty set then the combination is added to the unique words table as well as the frequency table. all of the tables have the combination field declared as Varchar(x) BINARY.it is taking huge amounts of time to do this ? why? imagine this, it took me 8 hours to process 532 single line documents. something is grossly wrong here. can anybody help me with this? should i use INDEXES ? how?
    Speed of queries depends on design of database.
    Learn how to create optimal database model.
    Learn when and where you can use indexes.

  3. #3
    Join Date
    Feb 2004
    Posts
    6
    Try outside of perl as well.

  4. #4
    Join Date
    Sep 2002
    Location
    Montreal, Canada
    Posts
    224
    Provided Answers: 1

    Re: performance problems - need help urgently

    Originally posted by fadingjava
    hi ,

    this is driving me nuts!!!!!!. i am running a perl script which takes records from a database, finds unique single words and two, three & four word collocations in them. then for each unique combination found it searches in the Mysql table , if the combination is found an INSERT is made into a frequency table , otherwise if the query returns empty set then the combination is added to the unique words table as well as the frequency table. all of the tables have the combination field declared as Varchar(x) BINARY.it is taking huge amounts of time to do this ? why? imagine this, it took me 8 hours to process 532 single line documents. something is grossly wrong here. can anybody help me with this? should i use INDEXES ? how?
    Can you provide us with the SHOW CREATE TABLE, SHOW INDEX and EXPLAIN of the queries involved as well as an example of what you're trying to do?

  5. #5
    Join Date
    Mar 2004
    Posts
    5

    Re: performance problems - need help urgently

    Originally posted by bstjean
    Can you provide us with the SHOW CREATE TABLE, SHOW INDEX and EXPLAIN of the queries involved as well as an example of what you're trying to do?
    i do not have any indexes on the table right now. but here is the show create table

    Code:
    mysql> show create table single_words;
    
    | single_words | CREATE TABLE `single_words` (
      `word_id` int(10) unsigned NOT NULL auto_increment,
      `word` varchar(30) binary default NULL,
      `in_stoplist` enum('y','n') default 'n',
      PRIMARY KEY  (`word_id`)
    ) TYPE=MyISAM |
    1 row in set (0.00 sec)
    
    mysql> show create table single_frequency;
    
    | single_frequency | CREATE TABLE `single_frequency` (
      `word` varchar(30) binary default NULL,
      `frequency` int(5) unsigned default NULL,
      `word_id` int(10) unsigned NOT NULL default '0',
      `doc_id` int(10) unsigned NOT NULL default '0'
    ) TYPE=MyISAM |
    
    1 row in set (0.00 sec)
    
    mysql> EXPLAIN select doc from doc where doc_id=5;
    +-------+-------+---------------+---------+---------+-------+------+-------+
    | table | type  | possible_keys | key     | key_len | ref   | rows | Extra |
    +-------+-------+---------------+---------+---------+-------+------+-------+
    | doc   | const | PRIMARY       | PRIMARY |       4 | const |    1 |       |
    +-------+-------+---------------+---------+---------+-------+------+-------+
    1 row in set (0.00 sec)
    
    mysql> EXPLAIN select word_id, word from single_words where word = 'I';
    +--------------+------+---------------+------+---------+------+------+-------------+
    | table        | type | possible_keys | key  | key_len | ref  | rows | Extra       |
    +--------------+------+---------------+------+---------+------+------+-------------+
    | single_words | ALL  | NULL          | NULL |    NULL | NULL | 1431 | Using where |
    +--------------+------+---------------+------+---------+------+------+-------------+
    1 row in set (0.01 sec)
    there are two categories of tables the "words" tables are repositories of unique word combinations found in a corpus, while the "frequency" tables contain the number of occurences of a particular combination in the indivdual documents. Each time a new document is processed , all the unique combinations are extracted and counted . if the combination exists in the "words" tables an INSERT is made into the "frequency" table. on the contrary if a combination is not found an INSERT is done into the "words" as well as "frequency" tables. the queries are simple INSERT and SELECT with WHERE option.

    an example would be processing a doc containning text say:

    "I need urgent help"

    that would yield "I","need","urgent","help" as unique single words in the document. i need to find if these already exist in the repositories. the same is done with two word, three word and four word collocations which have an exact table structure.

    hope this my problem is clear enough now.

    thanks in adavance

  6. #6
    Join Date
    Sep 2002
    Location
    Montreal, Canada
    Posts
    224
    Provided Answers: 1

    Re: performance problems - need help urgently

    Originally posted by fadingjava
    i do not have any indexes on the table right now. but here is the show create table

    Code:
    mysql> EXPLAIN select word_id, word from single_words where word = 'I';
    +--------------+------+---------------+------+---------+------+------+-------------+
    | table        | type | possible_keys | key  | key_len | ref  | rows | Extra       |
    +--------------+------+---------------+------+---------+------+------+-------------+
    | single_words | ALL  | NULL          | NULL |    NULL | NULL | 1431 | Using where |
    +--------------+------+---------------+------+---------+------+------+-------------+
    1 row in set (0.01 sec)
    Well, from what I see here an index on word would help this query... Right now it performs a full table scan... Probably a prefix index would suffice... Probably no need to index the full column... Something like the following would help:

    CREATE INDEX word_idx ON single_words (word(7));

    But what you do with your data is still unclear... Could you post a full simple example of data and all queries you execute to process the data?

    Hope this helps.

  7. #7
    Join Date
    Mar 2004
    Posts
    5
    Code:
    	
    #connect to the databases	
    	my $dbh1 = DBI->connect("dbi:mysql:corpus_stats","sid" , $password);
    	my $dbh = DBI->connect("dbi:mysql:concordance","sid" , $password);
    	
    #select max,min record id from document table for using as loop counter 
    	my @id  = $dbh1->selectall_arrayref("select min(rec_id),max(rec_id) from scotts_data");
    	
    	batch : for ($k= ($id[0][0][0]);$k<= ($id[0][0][1]);$k++) {	
    #select the particular records corresponding to loop counter
    	
    my @string = $dbh1->selectall_arrayref("select rec_id,data from scotts_data where rec_id = $k");
    	
    	$doc = "$string[0][0][1]";
    
    #split the doc in indivdual words and pass through regex, build required collocations and find unique combinations	
    	
    @single = split(' ',$doc);
    
    	foreach (@single)
    	{
    		$_=~ s/[!^~#&*_+=|?,'.;:]//g;
    	}
    
    	for ($i = 0; $i <= ($#single-1); $i++)
    	{
    		push(@double,$single[$i]." ".$single[$i+1]);
    	}   
    
    	for ($i = 0; $i <= ($#single-2); $i++)
    	{
    		push(@triple,$single[$i]." ".$single[$i+1]." ".$single[$i+2]);
    	}   
    	
    	for ($i = 0; $i <= ($#single-3); $i++)
    	{
    		push(@quad,$single[$i]." ".$single[$i+1]." ".$single[$i+2]." ".$single[$i+3]);
    	}   
    
    	@single_uniq = grep { ! $seen{$_} ++ } @single;
    	@double_uniq = grep { ! $seendouble{$_} ++ } @double;
    	@triple_uniq = grep { ! $seentriple{$_} ++ } @triple;
    	@quad_uniq = grep { ! $seenquad{$_} ++ } @quad;
    	
    
    
    # for each unique element element select word,word_id from words table
    	
    single:for ($j=0;$j<=$#single_uniq;$j++){
    	
    		$frequency = ($seen{$single_uniq[$j]});
    		my @row = $dbh->selectall_arrayref("select word,word_id from single_words where word = '$single_uniq[$j]'");	
    		
    #if query returns NULL then insert value of unique element and its frequency in words and frequency tables
    
    		if($row[0][0][0] eq "")
    			{
    			$dbh->do ("insert into single_words values('','$single_uniq[$j]','n')");
    			my $max1 = $dbh->selectall_arrayref("select max(word_id) from single_words");
    			$dbh->do ("insert into single_frequency values('$single_uniq[$j]',$frequency,'$max1->[0][0]',$string[0][0][0])"); 			
    			}			
    #if the a value is returned then insert value of unique element and its frequency in only the frequency table
    		else
    			{
    			$dbh->do ("insert into single_frequency values('$row[0][0][0]',$frequency,'$row[0][0][1]',$string[0][0][0])");
    			}
    		
    undef @row;		
    
    	}


    here is the part of perl script which deals with single word combinations. it is basically a part of a concordance application used for corpus analysis.

  8. #8
    Join Date
    Mar 2004
    Posts
    5
    hi,

    bstjean , in one of the threads about a similar questions you suggested:

    simply put a UNIQUE INDEX on the field that represents the records unique value and just perform an INSERT...the record would simply would not update.......seems to save time
    when i do this , it definitely raises an error about duplication of keys. would this way be good to use when i am inserting large volumes of data with possiblity that there might be duplication of key? how can i just ignore this error?

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •