  1. #1
    Join Date
    May 2013
    Posts
    5

    Speeding up static table with 20M rows

    Hi!

    I have an IP location lookup table with 20 million rows. I can perform about 1 sequential lookup per second, and I need to get that up to about 50 per second.

    Code:
    CREATE TABLE `ip_lookup` (
      `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
      `start_ip_int` bigint(20) NOT NULL,
      `end_ip_int` bigint(20) NOT NULL,
      `country_name` varchar(50) CHARACTER SET latin1 DEFAULT NULL,
      `city_name` varchar(50) CHARACTER SET latin1 DEFAULT NULL,
      `custom_field1` varchar(30) CHARACTER SET latin1 DEFAULT NULL,
      `custom_field2` varchar(20) CHARACTER SET latin1 DEFAULT NULL,
      `custom_field3` varchar(30) CHARACTER SET latin1 DEFAULT NULL,
      PRIMARY KEY (`id`),
      KEY `start_ip_int` (`start_ip_int`),
      KEY `end_ip_int` (`end_ip_int`)
    ) ENGINE=InnoDB AUTO_INCREMENT=18245826 DEFAULT CHARSET=utf8;
    I am processing a large number of historical web server logs. I can do queries in parallel (e.g. 10 processes each doing 1 query per second), but parallelising the log processing code is a much bigger task than I can handle right now.

    Maxmind supply a compiled lookup tool and their database in binary format. Lookups using that are lightning fast.

    How can I create something similar for an arbitrary dataset? I'd like to load the 20M rows into some magic binary indexer and get the code to make API calls to some daemon instead of MySQL lookups. I can't use Maxmind - the data I have is customised heavily.

    A lookup is simply:

    Code:
    select country_name, custom_field1, custom_field2 from ip_lookup where start_ip_int <= inet_aton('74.125.28.99') and inet_aton('74.125.28.99') <= end_ip_int limit 1
    Theoretically, I could split up the 20M rows by netblock and store them as text files on disk and have a lookup table in the code to tell it which folder to look in. But there must be a better way!

    I'd be very grateful for any advice. Thanks!

  2. #2
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    I would try changing start_ip_int and end_ip_int from bigint(20) to int unsigned. That provides two benefits: it eliminates a needless type conversion and it reduces the size of the columns in the index. Both should speed up the query significantly.
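    Something along these lines should do it (an untested sketch; the rebuild will take a while on 20M rows, but IPv4 values from INET_ATON always fit in INT UNSIGNED):
    Code:
    ALTER TABLE ip_lookup
      MODIFY start_ip_int INT UNSIGNED NOT NULL,
      MODIFY end_ip_int INT UNSIGNED NOT NULL;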

    -PatP
    In theory, theory and practice are identical. In practice, theory and practice are unrelated.

  3. #3
    Join Date
    May 2013
    Posts
    5
    Hi Pat,

    Thanks for the suggestion. I tried that, and it probably helped, but not noticeably when I'm doing thousands of lookups sequentially at a time.

    I suspect MySQL is not the best fit here, but I was hoping someone would know of a system I can export the MySQL tables into that can do it. I know it's possible because Maxmind and similar software do it, but I guess they heavily customise the software to the data they have.

    Thanks

  4. #4
    Join Date
    Nov 2004
    Location
    out on a limb
    Posts
    13,692
    Provided Answers: 59
    Quote Originally Posted by zpool View Post
    Hi Pat,

    Thanks for the suggestion. I tried that, and it probably helped, but not noticeably when I'm doing thousands of lookups sequentially at a time.

    I suspect MySQL is not the best fit here, but I was hoping someone would know of a system I can export the MySQL tables into that can do it. I know it's possible because Maxmind and similar software do it, but I guess they heavily customise the software to the data they have.

    Thanks
    MySQL's MyISAM tables are supposed to be very, very fast on static data.
    Can you explain your process...
    It's the thousands of sequential lookups that concern me.
    I'd rather be riding on the Tiger 800 or the Norton

  5. #5
    Join Date
    May 2013
    Posts
    5

    Speeding up static table with 20M rows

    Quote Originally Posted by healdem View Post
    MySQL's MyISAM tables are supposed to be very, very fast on static data.
    Can you explain your process...
    It's the thousands of sequential lookups that concern me.
    I have a script which opens a connection to the database, then works through web-server log files. For each new IP it sees, it performs the query:
    Code:
    select country_name, custom_field1, custom_field2 from ip_lookup where start_ip_int <= inet_aton('74.125.28.99') and inet_aton('74.125.28.99') <= end_ip_int limit 1
    If I swap the MySQL query for an API call to the binary Maxmind database, everything is very fast. Unfortunately, I need some custom fields assigned to each IP netblock, so I need to query my table, not Maxmind. But they have proved it's possible to speed this up.

    It's actually a lot more than thousands (more like hundreds of thousands), as I'm back-processing a lot of data.

    Any ideas?
    Thanks !

  6. #6
    Join Date
    Nov 2004
    Location
    out on a limb
    Posts
    13,692
    Provided Answers: 59
    As it's historical data, presumably you don't NEED the results right now.
    I'd be tempted to stuff the log files into a table (or tables) and then do your processing as an update query: a single query doing a mass update rather than a series of sequential selects, as sketched below.
    It's a one-time cost rather than a series of individual queries.
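    Something like this, as a rough sketch (the staging table and its column names are made up for illustration, and the log IPs would need converting to integers on the way in):
    Code:
    -- hypothetical staging table holding the distinct IPs pulled from the logs
    CREATE TABLE log_ips (
      ip_int        INT UNSIGNED NOT NULL PRIMARY KEY,
      country_name  VARCHAR(50),
      custom_field1 VARCHAR(30),
      custom_field2 VARCHAR(20)
    );

    -- one mass update instead of hundreds of thousands of single-row selects
    UPDATE log_ips l
    JOIN ip_lookup r
      ON l.ip_int BETWEEN r.start_ip_int AND r.end_ip_int
    SET l.country_name  = r.country_name,
        l.custom_field1 = r.custom_field1,
        l.custom_field2 = r.custom_field2;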
    I'd rather be riding on the Tiger 800 or the Norton

  7. #7
    Join Date
    May 2013
    Posts
    5
    Thanks for the suggestion. The end result needs to be in files, so the process of loading it all into MySQL just to get the geo fields added would be more trouble than it's worth, I think.

    Using an API call to a compiled daemon like Maxmind's is perfect, except I don't get my custom fields out of it. I think I am looking for some non-MySQL solution here. I posted to this forum because the data is presently in MySQL format; I just need faster sequential lookups than I think MySQL is ever going to give me.

  8. #8
    Join Date
    Nov 2004
    Location
    out on a limb
    Posts
    13,692
    Provided Answers: 59
    I'm not convinced it's a database issue. It has the feel of an algorithm issue to me.
    I'd rather be riding on the Tiger 800 or the Norton

  9. #9
    Join Date
    May 2013
    Posts
    5
    Hi Healdem,

    The algorithm is simple: perform hundreds of thousands of IP lookups (using that select) one after another. I don't look up the same IP twice. If I switch to an API call to a binary DB, the problem is solved.

    There are probably some efficiencies to be made in the way I process logs, but that's not really my question. I want to either massively speed up sequential queries in MySQL or look for some way to build a binary DB customised to search the data I have.

    Thanks!

  10. #10
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    Can you post your IP table contents and a representative sample of your query data? I'm pretty sure that I can process a hundred thousand rows in a few minutes or less, but there may be something funky in your data that makes that abnormally slow.
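    For what it's worth, lookups against start/end ranges are usually made index-friendly by anchoring on the start of the block and then checking the end, roughly like this (just a sketch, and it assumes the netblocks don't overlap):
    Code:
    SELECT country_name, custom_field1, custom_field2, end_ip_int
    FROM ip_lookup
    WHERE start_ip_int <= INET_ATON('74.125.28.99')
    ORDER BY start_ip_int DESC
    LIMIT 1;
    -- the caller then verifies end_ip_int >= INET_ATON('74.125.28.99');
    -- if that check fails, the address falls in a gap between blocks
    That form lets MySQL do a single backwards seek on the start_ip_int index and stop after one row, instead of range-scanning a large slice of one index and filtering the other condition row by row.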

    -PatP
    In theory, theory and practice are identical. In practice, theory and practice are unrelated.

  11. #11
    Join Date
    Nov 2004
    Location
    out on a limb
    Posts
    13,692
    Provided Answers: 59
    I would expect that a server with adequate memory would gradually load the table into memory if that table is being hammered by something.

    I'm surprised you are pulling the country name plus two custom columns from a table linked to an IP address. Not knowing what the custom fields are, it's hard to say if the design is 'right'.

    I'd also change the table type from InnoDB to MyISAM; MyISAM is claimed to be faster than InnoDB on static data.
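    If you want to try that, the conversion is a single statement, though it rebuilds the whole 20M-row table (untested sketch):
    Code:
    ALTER TABLE ip_lookup ENGINE=MyISAM;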
    I'd rather be riding on the Tiger 800 or the Norton
