  1. #1

    Unanswered: Absolute DB newbie has to switch from CSV database to SQL (performance-critical)

    Hi everyone, first post

    Quick introduction: I've developed an app for someone that outputs data to a CSV file. The input data is huge and so is the CSV; the number of lines (records) may (and will) grow to tens of millions or more. New feature requests keep arriving that make it hard to stay with the CSV approach, and the whole thing begs for a real DB. BUT I have no DB experience: my first and only formal DB training was at university 15 years ago, and I've had NO contact with (or interest in) databases since.
    Each CSV line consists of a constant number of tokens (9), of which 3 are large fixed-length hex integers and the rest are variable-length strings (relatively small).
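
    For concreteness, here is a rough sketch of how I imagine such a table might look in MySQL. Every column name and width below is a placeholder, and I'm assuming each hex value fits in 16 bytes:

        CREATE TABLE records (
            key_hash BINARY(16) NOT NULL,   -- the hex integer used as the duplicate key
            hex_a    BINARY(16) NOT NULL,   -- second fixed-length hex value
            hex_b    BINARY(16) NOT NULL,   -- third fixed-length hex value
            str_1    VARCHAR(64),           -- the six small variable-length strings
            str_2    VARCHAR(64),
            str_3    VARCHAR(64),
            str_4    VARCHAR(64),
            str_5    VARCHAR(64),
            str_6    VARCHAR(64),
            PRIMARY KEY (key_hash)          -- the unique index doubles as the duplicate check
        ) ENGINE=InnoDB;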

    The questions:

    1. The app must be able to find duplicates (key is one of the hex integers) in the existing DB VERY fast. My app is written in C and uses binary search trees and other custom optimized methods, so you can guess it's fast. Can I get similar speed with a DB? If so, how difficult is it?

    2. Question 1 makes it obvious that the insertion of items must be fast too (right now I'm limited by the speed of the hard drive; with buffering, hundreds of thousands of records/s are obviously possible, but a realistic case is in the hundreds/s; see the batching sketch after this list). I guess that's not an issue, or is it?

    3. How difficult do you estimate the task to be? I'm an IT professional with formal training and I consider myself quite versatile, but I need to know whether there are devils lurking.

    4. Is there a quick and dirty route to this? Can you recommend some tutorial for the DB-challenged? Note that I'm on a bit of a schedule.
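
    For question 2, this is the kind of batched insert I have in mind, reusing the placeholder table sketched above (the literal values are made up):

        START TRANSACTION;
        INSERT INTO records
            (key_hash, hex_a, hex_b, str_1, str_2, str_3, str_4, str_5, str_6)
        VALUES
            (UNHEX('0123456789ABCDEF0123456789ABCDEF'),
             UNHEX('00000000000000000000000000000001'),
             UNHEX('00000000000000000000000000000002'),
             'alpha', 'beta', 'gamma', 'delta', 'epsilon', 'zeta'),
            (UNHEX('FEDCBA9876543210FEDCBA9876543210'),
             UNHEX('00000000000000000000000000000003'),
             UNHEX('00000000000000000000000000000004'),
             'eta', 'theta', 'iota', 'kappa', 'lambda', 'mu');
        COMMIT;   -- one commit (one fsync) per batch instead of per record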


    Please, please, if you intend to give advice along the lines of "just hire someone to do it", just don't. It doesn't answer the question, I will not do it, and such answers tend to attract similar posts, quickly derailing the topic.

  2. #2
    Without knowing more specific details I can't give you iron-clad estimates, but what you want is quite possible.

    A database engine that supports hash tables ought to be able to give you 10-15 times faster query response than a binary search of ten million rows. The exact details will depend on the database engine, the hardware, and the schema.
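
    As a minimal sketch, assuming a table along the lines of the one in post #1 with a unique index on the key column, the duplicate test becomes a single point lookup:

        -- Point lookup on an indexed key: the engine touches a handful of pages
        -- instead of walking tens of millions of rows.
        SELECT EXISTS(
            SELECT 1 FROM records
            WHERE key_hash = UNHEX('0123456789ABCDEF0123456789ABCDEF')
        ) AS already_seen;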

    The constraints you've put in your request remove several of the quickest options from consideration, but you can still get good performance from databases using the choices that are still available.

    There are quick and dirty routes, but particularly if you haven't got any experience those are likely to be expensive. If you have the budget to permit them, there are many choices.

    -PatP
    In theory, theory and practice are identical. In practice, theory and practice are unrelated.

  3. #3
    Hi

    I'm willing to give any details needed. I'm planning to use MySQL.
    Could you elaborate a bit on "the constraints you've put in your request remove several of the quickest options from consideration"?

  4. #4
    MySQL has good but somewhat limited hashing abilities. These can significantly improve performance and may well be adequate for your needs. Other database engines (particularly DB2 and Microsoft SQL Server) offer better support for very high access speed requirements.
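
    As one example (a sketch only, reusing the placeholder table from post #1): MySQL's MEMORY engine supports true HASH indexes, so a key-only lookaside table held in RAM can serve the membership test while the full rows stay on disk.

        -- Key-only lookaside table, kept entirely in RAM with a hash index.
        CREATE TABLE seen_keys (
            key_hash BINARY(16) NOT NULL,
            PRIMARY KEY (key_hash) USING HASH
        ) ENGINE=MEMORY;

        -- Populate it once from the main table, then probe it per incoming item.
        INSERT INTO seen_keys SELECT key_hash FROM records;

        SELECT 1 FROM seen_keys
        WHERE key_hash = UNHEX('0123456789ABCDEF0123456789ABCDEF');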

    You need to test with your own data to see what level of performance you get. A simple index may be sufficient. If indexing isn't enough, you may need to investigate hardware solutions like SSDs, or even an in-memory database like Hekaton in Microsoft SQL Server.
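
    To see whether a plain index is enough, build one and check the plan with EXPLAIN before reaching for anything fancier (again assuming the placeholder table from post #1):

        -- A secondary index on another column, if lookups are needed there too.
        CREATE INDEX idx_hex_a ON records (hex_a);

        -- EXPLAIN shows whether the lookup uses the index ("const"/"ref")
        -- or falls back to a full table scan ("ALL").
        EXPLAIN SELECT 1 FROM records
        WHERE hex_a = UNHEX('00000000000000000000000000000001')
        LIMIT 1;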

    Without knowing a great deal about your problem, I can't give you specific answers. A database driven by an application architected to use the tools the database makes available ought to be able to solve most problems in 10-25% of the time needed to solve the same problem using a CSV and brute force computing. In some cases, the database solution may be much faster than that.

    -PatP
    In theory, theory and practice are identical. In practice, theory and practice are unrelated.

  5. #5
    OK, I'll try to give all the details I find relevant.

    The app constantly receives items, which it processes to generate the data I mentioned (the hex integers and strings). It gets a CSV as input and updates it as more items arrive. It needs to detect duplicates against the already existing CSV, but for that I build an index and some other data structures (Bloom filters, so yes, I'm using hashes too) as one-time operations at startup. RAM is not an issue (the client's machines have 32 GB), so I'm not constantly seeking through the file, nothing like that.

    Secondly, I need to detect duplicates among the items as they arrive. That's why I'm using the binary search trees: I can't prebuild an index over data that hasn't arrived yet. I also keep binary trees for items already written to the CSV; there's no way I'm rereading them from the file.

    So what I need to do is take an already existing DB as input, update it with more data as input items arrive, and flag items that are duplicates of DB records and/or of items from the current run. The rate is normally a few hundred items/s at most on my mediocre machine, but I need to write for performance, as it may be orders of magnitude higher on the client's machines (SSDs, network drives, etc.). My duplicate search and DB update times need to be negligible compared to the rate at which items arrive.
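
    If I understand the replies correctly, with a unique key on the hex value the duplicate test can ride along with the insert itself. A sketch of what I would try, again with the placeholder table and made-up values from my first post:

        -- INSERT IGNORE skips a row whose key already exists (a duplicate of the
        -- DB or of an earlier item in this run) instead of raising an error.
        INSERT IGNORE INTO records
            (key_hash, hex_a, hex_b, str_1, str_2, str_3, str_4, str_5, str_6)
        VALUES
            (UNHEX('0123456789ABCDEF0123456789ABCDEF'),
             UNHEX('00000000000000000000000000000005'),
             UNHEX('00000000000000000000000000000006'),
             'nu', 'xi', 'omicron', 'pi', 'rho', 'sigma');

        -- ROW_COUNT() is 1 if the row was new and 0 if it was a duplicate that got
        -- skipped, so the app can still tell the two cases apart.
        SELECT ROW_COUNT() AS inserted;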
