  1. #1
    Join Date: Aug 2011
    Posts: 2

    Building a Financial Database - need some guidance

    Hi,

    I am building a database that will be populated with financial data. Given what the data looks like, how it will be organised, what I want to do with it, my hardware, my budget and my competency, I am looking for the most efficient design for very fast querying of this data (I am not concerned with write performance). I am hoping for some guidance on how to proceed.

    What the Data Looks Like
    I have collected monthly stock data covering a 25-year period: for every stock in the world, its associated financial data (approximately 50 mostly numeric fields) at a monthly frequency.

    e.g.

    Microsoft, May 2000 -> associated financial data (50 fields)
    Cisco, May 2000 -> associated financial data (50 fields)
    Microsoft, June 2000 -> associated financial data (50 fields)
    Cisco, June 2000 -> associated financial data (50 fields)
    Microsoft, July 2000 -> associated financial data (50 fields)

    How the Data will be organised
    The data will be organised into one large table. Initially there will be approximately 3 million rows and 50 columns. The database will grow by approximately 15,000 rows per month.

    What I am trying to do with this data
    The data will be used in a few different ways:

    1. Time series analysis, where I retrieve some or all of the financial data for a stock over the entire 25-year period at monthly intervals, e.g. all of Microsoft's financial data over 25 years. I may run this sort of query quite frequently, especially when requesting these data arrays for a group of stocks, e.g. some or all of the financial data, at monthly intervals over the 25-year period, for stocks A, B, C, D and E.

    2. Aggregation of columns, where I aggregate financial data for a group of stocks over their entire 25-year history. For example, I may want to aggregate the financial data, at monthly intervals over the 25-year period, for all stocks that meet particular criteria (located in a specific region, above a certain size, in similar types of business, highly profitable, etc.). These queries will be quite dynamic, so I will never know in advance how the data will be aggregated.

    3. The same as 1 and 2, but at a single point in time, e.g. give me all the financial data for stock A as of May 2001, or aggregate the financial data of all Japanese companies in June 2011. (Rough SQL sketches of all three query shapes follow this list.)
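
    To make these concrete, here is roughly what the three query shapes look like in plain SQL. The table and column names (stock_data, ticker, period_month, region, market_cap and the f1..f50 financial fields) are illustrative placeholders, not my actual schema:

        -- 1. Time series: one stock's full monthly history
        SELECT *
        FROM stock_data
        WHERE ticker = 'MSFT'
        ORDER BY period_month;

        -- 2. Dynamic aggregation: monthly averages over a filtered group of stocks
        SELECT period_month, AVG(f1), AVG(f2) /* ... through f50 ... */
        FROM stock_data
        WHERE region = 'Japan' AND market_cap > 1000000000
        GROUP BY period_month
        ORDER BY period_month;

        -- 3. Point in time: all fields for one stock in one month
        SELECT *
        FROM stock_data
        WHERE ticker = 'MSFT' AND period_month = '2001-05-01';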

    I would like to be able to run these queries very, very fast, e.g. in milliseconds.

    Hardware
    Dual-core Xeon server, 3.5 GHz, 3 GB RAM.

    Budget
    Free open source technology.

    Competency
    I am a good programmer (Java, C, C++, .NET and PHP), but I know little about database design beyond basic MySQL/SQL Server. My experience is limited to SQL-style databases, but I am willing to try something new if it serves my purpose.

    I have tried to do this before using one large table in MySQL, with both the MyISAM and Infobright storage engines. Infobright is good, but when pulling back time series data across 50 columns it provides little benefit over MyISAM; when aggregating across 50 columns it is significantly faster, but still takes a few seconds. I have also been reading about Map/Reduce solutions, but I am not sure I would get blazing fast aggregate results, especially since I have a single machine rather than a cluster of nodes. Finally, I have been reading about in-memory solutions, but I have no idea how to implement them, or how much memory a table 3 million rows long and 50 columns wide would need. Does anyone have any advice?
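
    For what it's worth, my back-of-envelope estimate for the in-memory question: 3,000,000 rows × 50 columns × 8 bytes per numeric value is roughly 1.2 GB of raw data, before indexes and per-row overhead, so the table might just fit in my 3 GB of RAM. From what I have read, a first experiment with MySQL's MEMORY engine might look like the sketch below (the table name is a placeholder, and I gather MEMORY tables are capped by max_heap_table_size and are lost on restart):

        -- raise the per-table in-memory cap for this session (placeholder: 2 GB)
        SET max_heap_table_size = 2147483648;

        -- build a read-only in-memory copy of the main table for querying
        CREATE TABLE stock_data_mem ENGINE=MEMORY
            SELECT * FROM stock_data;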

    Rameez
    Last edited by rameezsadikot; 08-08-11 at 06:15.

  2. #2
    Join Date: Dec 2007
    Location: London, UK
    Posts: 741
    Your data set appears to be well within the capabilities of all the major DBMSs. Your data is only at a monthly level, so why are millisecond response times important to you? If the issue is really one of handling a heavy query workload (thousands of simultaneous queries), then the choice of DBMS is probably less important than the middle-tier components and the way you cache those results. Have you considered using an OLAP or reporting engine?

    Quote Originally Posted by rameezsadikot View Post
    Free open source technology.
    There is no such thing. A no-cost licence isn't always the cheapest option.

  3. #3
    Join Date: Aug 2011
    Posts: 2
    Quote Originally Posted by dportas View Post
    Your data set appears to be well within the capabilities of all the major DBMSs. Your data is only at a monthly level, so why are millisecond response times important to you? If the issue is really one of handling a heavy query workload (thousands of simultaneous queries), then the choice of DBMS is probably less important than the middle-tier components and the way you cache those results. Have you considered using an OLAP or reporting engine?

    There is no such thing. A no-cost licence isn't always the cheapest option.
    Thanks, mate. Millisecond response times are important because the results of these queries will feed into one of our trading systems.

    There won't be thousands of simultaneous queries, more likely one every second or so. I am planning on using something like memcached to cache commonly requested results, keyed on the query parameters.

    The problem I am having with ordinary DBMS solutions is the aggregation of multiple columns. Aggregating the data in 50 columns for a group of stocks, every month over 25 years, takes quite a while. I can't pre-compute these aggregates because the conditions they are based on will be very dynamic.
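
    To illustrate, the slow case looks roughly like the query below (table, column and index names are placeholders again). A composite index on the filter and grouping columns lets MySQL find the qualifying rows quickly, but it still has to read all 50 fields of every matching row to compute the aggregates, which I suspect is where the time goes:

        -- composite index on the dynamic filter column plus the grouping column
        ALTER TABLE stock_data ADD INDEX ix_region_month (region, period_month);

        -- monthly aggregation of every financial field for one group of stocks
        SELECT period_month, AVG(f1), AVG(f2) /* ... through f50 ... */
        FROM stock_data
        WHERE region = 'Japan'
        GROUP BY period_month;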

    I haven't considered OLAP. Can you suggest a few popular names? Are OLAP systems ideal for my aggregation problem?

    Thanks.

  4. #4
    Join Date: Feb 2004
    Location: In front of the computer
    Posts: 15,579
    I'm hazarding guesses based on the little information you've provided. As you provide more, I expect I can make better judgments and may give you different advice.

    The kind of problem you are describing sounds a lot like trade diagnostics looking for leading indicators, the kind of information that fiduciary actuaries provide for portfolio managers and other investors.

    There are a number of products that are specifically designed to help with managing and manipulating that kind of data to produce those kinds of results. These packages are all proprietary, pretty expensive (approaching $500,000 USD for software), and require significant expertise to use.

    You can use commonly available software and a bit of ingenuity to produce almost exactly the same results. The package I'm most familiar with is the combination of Microsoft SQL Server, Analysis Services, and Reporting Services. Applications built with this combination of tools can produce, in weeks, results that match or exceed those of packages costing ten times as much and taking years to develop.

    It is probably possible to achieve similar results using an open source tool set, but it would be more difficult and certainly more time-consuming.

    -PatP
    In theory, theory and practice are identical. In practice, theory and practice are unrelated.
