  1. #1
    Join Date
    Sep 2015
    Posts
    3

    Unanswered: redesigning tables

    Hi,

    I'm about to optimize my tables, and I have one table where texts are stored. The texts range from very small (less than 100 chars) to very large (nvarchar(max)). My idea for reducing database size is to split this text column off from the main table and add a char-count column maintained by a trigger.

    The char count would then decide which of n child tables the text is saved to, each referenced by an id from the parent table: e.g. one table for texts with fewer than 250 chars, another for up to 500, one for up to 750 (or another step size like 500, 1000, 1500...), instead of having thousands of rows in nvarchar(max).

    My question now is: is such a split advisable, or would the extra strain on the query side outweigh the savings from the reduced database size?
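
    Roughly, the layout I have in mind would look like this (placeholder names, using the 500-char step variant):
    Code:
    -- Parent table: everything except the text itself, plus the char count.
    create table dbo.Document
    (   DocumentId  int identity primary key,
        TextLength  int not null            -- set on insert/update (by the app or a trigger)
    );
    
    -- One child table per length band, all keyed by the parent id.
    create table dbo.DocumentText500
    (   DocumentId  int not null primary key references dbo.Document (DocumentId),
        Body        nvarchar(500) not null
    );
    
    create table dbo.DocumentText1000
    (   DocumentId  int not null primary key references dbo.Document (DocumentId),
        Body        nvarchar(1000) not null
    );
    
    -- ... further bands in 500-char steps ...
    
    create table dbo.DocumentTextMax
    (   DocumentId  int not null primary key references dbo.Document (DocumentId),
        Body        nvarchar(max) not null
    );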

    Thanks,

    Pascal

  2. #2
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    For the Microsoft SQL Server NVARCHAR data type, there is a clear break in storage/features that comes at 3999 characters.

    I would have two tables. Any string that will fit in 3999 characters or less ought to be stored in a table using NVARCHAR(3999). Any string longer than 3999 characters ought to be stored in a table using NVARCHAR(MAX). I can't see any benefit in any other column sizes.
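
    As a rough sketch, with placeholder names:
    Code:
    -- Strings of up to 3999 characters, intended to stay on the standard data page.
    CREATE TABLE dbo.ShortText
    (   TextId  INT IDENTITY PRIMARY KEY,
        Body    NVARCHAR(3999) NOT NULL
    )
    
    -- Everything longer goes here.
    CREATE TABLE dbo.LongText
    (   TextId  INT IDENTITY PRIMARY KEY,
        Body    NVARCHAR(MAX) NOT NULL
    )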

    -PatP
    In theory, theory and practice are identical. In practice, theory and practice are unrelated.

  3. #3
    Join Date
    Sep 2015
    Posts
    3
    As I already wrote, I am already using nvarchar(max). The idea of splitting into several tables with smaller nvarchar sizes came from the fact that it is often useful to choose smallmoney instead of money, or tinyint, smallint, int or bigint depending on the maximum values you store: a tinyint uses 1 byte where an int uses 4. Choosing the smaller type reduces database size, which is generally recommended.
    So I thought that having, say, 1 million entries in nvarchar(500) and 1 million in nvarchar(max) would be better than storing all 2 million as nvarchar(max), because the byte usage should (or is it?) be far lower.
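
    The fixed-width difference is easy to see with DATALENGTH, for example:
    Code:
    -- Fixed-width types always occupy their full width, so the smaller type saves space on every row.
    DECLARE @t tinyint = 200, @i int = 200, @sm smallmoney = 200, @m money = 200;
    
    SELECT DATALENGTH(@t)  AS tinyint_bytes,     -- 1
           DATALENGTH(@i)  AS int_bytes,         -- 4
           DATALENGTH(@sm) AS smallmoney_bytes,  -- 4
           DATALENGTH(@m)  AS money_bytes;       -- 8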

  4. #4
    Join Date
    Jan 2007
    Location
    UK
    Posts
    11,434
    Provided Answers: 10
    Correct me if I'm wrong, but when you use an [n]varchar(max) column in SQL Server, it is quite clever in terms of storage "under the hood".

    That is to say that strings that are up to the 8,000 byte limit (varchar(8000), nvarchar(4000)) are stored on the "standard" data page. However, if they exceed this threshold the data is stored out of page, with the data page containing a pointer to this "object". SQL Server handles this automagically for you.

    I can't really see a reason why you'd go against the DB engine on this and decide yourself how to store it. What is the justification?
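
    A quick way to convince yourself: a short string costs the same number of bytes however the column is declared (throwaway example):
    Code:
    -- A 100-character string occupies 200 bytes of data either way;
    -- the declared maximum acts as a constraint, not a storage allocation.
    DECLARE @demo TABLE
    (   small_col  nvarchar(500),
        max_col    nvarchar(max)
    );
    
    INSERT INTO @demo (small_col, max_col)
    VALUES (REPLICATE(N'x', 100), REPLICATE(N'x', 100));
    
    SELECT DATALENGTH(small_col) AS small_bytes,  -- 200
           DATALENGTH(max_col)   AS max_bytes     -- 200
      FROM @demo;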
    George
    Home | Blog

  5. #5
    Join Date
    Sep 2015
    Posts
    3
    Hi,

    Nobody said anything about splitting up strings myself or messing with strings above nvarchar(4000); those definitely belong in nvarchar(max), which is also the column type I currently use.

    As written above, I mean strings that only contain 100 chars would consume space as nvarchar(max) when they could be stored in a much smaller column, which would reduce db size. The decision about which table to store them in (the one with the matching size, e.g. a 100-char string would go into the nvarchar(500) table) would be made on the aspx page with an if/then statement.

    The question is whether it makes sense to split into 9 tables in 500-char steps, or whether the extra strain on the db would eat up the gain from the reduced size.
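
    On the database side the routing would boil down to something like this (the table names are just the placeholders from my first post; in reality the if/then would sit in the aspx code):
    Code:
    -- Hypothetical routing: pick the child table by length.
    CREATE PROCEDURE dbo.SaveDocumentText
        @DocumentId int,
        @Body       nvarchar(max)
    AS
    BEGIN
        SET NOCOUNT ON;
    
        IF LEN(@Body) <= 500
            INSERT INTO dbo.DocumentText500 (DocumentId, Body) VALUES (@DocumentId, @Body);
        ELSE IF LEN(@Body) <= 1000
            INSERT INTO dbo.DocumentText1000 (DocumentId, Body) VALUES (@DocumentId, @Body);
        -- ... seven more branches ...
        ELSE
            INSERT INTO dbo.DocumentTextMax (DocumentId, Body) VALUES (@DocumentId, @Body);
    END;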

  6. #6
    Join Date
    Jan 2003
    Location
    Massachusetts
    Posts
    5,799
    Provided Answers: 11
    GVee is correct on this matter. If you have an nvarchar(max) column in your table, the storage engine will attempt to store the data "inline" with the row (the old Text datatype, by contrast, always stored the data on a separate 8 KB LOB storage page). The amount of space used by nvarchar(max) varies with how the data arrived: whether the rows were inserted without the nvarchar(max) values and updated later, or inserted complete from the beginning. For example, if you inserted the rows with, say, 1000 characters each and later updated them all to 2000 characters each, you would see many of the nvarchar(max) values get migrated to LOB storage.

    The multiple tables will become very hard to stitch back together, if these values are normally queried together (i.e. the data all describe the same things). If they are not queried together, then this brings up the question of why they were all stuffed in there in the first place. I am assuming of course that full-text indexing is not being used here. If it were, that would add a whole new layer of complexity that you may not want to deal with.
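
    Just to make the "stitching" point concrete: to read the text back as one logical column you end up maintaining something like this, with one branch per child table (names borrowed from the sketches earlier in the thread):
    Code:
    -- A view to present the nine-plus child tables as one logical text column again.
    CREATE VIEW dbo.AllDocumentText
    AS
    SELECT DocumentId, CAST(Body AS nvarchar(max)) AS Body FROM dbo.DocumentText500
    UNION ALL
    SELECT DocumentId, CAST(Body AS nvarchar(max)) AS Body FROM dbo.DocumentText1000
    -- ... one branch per child table ...
    UNION ALL
    SELECT DocumentId, Body FROM dbo.DocumentTextMax;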

  7. #7
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    Quote Originally Posted by gvee View Post
    Correct me if I'm wrong, but when you use an [n]varchar(max) column in SQL Server, it is quite clever in terms of storage "under the hood".
    You are correct, but I went a step further (which may or may not have been a good thing). This is all based on my understanding of things, and measurements that I made long ago on SQL 2008 R2, so these observations may not apply to newer storage engines.

    NVARCHAR(MAX) does use the data page for storage if it can, so in that respect it is quite efficient.

    However, in order to allow NVARCHAR(MAX) "page spills" when the data in a row won't fit on one page, extra bookkeeping is needed. There is a performance hit (which can be material in some cases) associated with this bookkeeping, as well as a storage hit (which is small). By splitting the data at the 3999 character mark, I found that you could avoid the NVARCHAR(MAX) overhead associated with having some data on page and some data off page.

    -PatP
    In theory, theory and practice are identical. In practice, theory and practice are unrelated.

  8. #8
    Join Date
    Jan 2003
    Location
    Massachusetts
    Posts
    5,799
    Provided Answers: 11
    I did a few (admittedly quick) experiments, and I see a couple of things going on (not all of which may be related to any topic previously mentioned in the thread, so apologies for my rambling up front).

    There are two kinds of "overflow" storage (the query just below shows how to check which of them a table has):

    LOB_DATA: where (n)varchar(max) data goes when it gets too big for the page. This is an all-or-nothing move.
    ROW_OVERFLOW_DATA: where data in finite variable-length columns is moved when the row is updated to a total of more than 8060 bytes.
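
    Something along these lines, with a placeholder table name (the join is the same one used in the buffer query further down):
    Code:
    select object_name(p.object_id) as table_name,
           a.type_desc,        -- IN_ROW_DATA, LOB_DATA or ROW_OVERFLOW_DATA
           a.total_pages,
           a.used_pages
    from sys.partitions p
    join sys.allocation_units a on p.partition_id = a.container_id
    where p.object_id = object_id('your_table')   -- placeholder table name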

    I created these tables:
    Code:
    create table lobdata
    (col1 varchar(max))
    
    insert into lobdata values (replicate ('a', 8000))  -- * see the note at the bottom of this post
    
    create table overflowdata
    (col1 varchar(8000),
     col2 varchar(8000))
    
    insert into overflowdata (col1, col2) values (replicate ('a', 8000), null)
    Checking for these tables in memory shows that they both (as everyone expected) have exactly one data page each:
    Code:
    select *
    from sys.dm_os_buffer_descriptors d
    join (select a.allocation_unit_id,
                 object_name(p.object_id) as name,
                 p.index_id
          from sys.partitions p
          join sys.allocation_units a on p.partition_id = a.container_id
          where p.object_id in (object_id('lobdata'), object_id('overflowdata'))) a
      on d.allocation_unit_id = a.allocation_unit_id
    where d.database_id = db_id()
    order by name
    Now we update both tables. This is where we start to see differences:
    Code:
    update lobdata
    set col1 = col1 + replicate ('b', 70)
    
    update overflowdata
    set col2 = replicate ('b', 70)
    Re-run the query on sys.dm_os_buffer_descriptors, and we see that both tables still have a single DATA_PAGE each, and both of them are largely empty (free_space_in_bytes column). Both tables have also gained one or more TEXT_MIX_PAGEs.

    Curiously, if you examine the pages under DBCC PAGE, the overflow example has copied the original page's contents (the 'a's) to a new page, and left the new data (the 'b's) in the data page. This seems to be more work than absolutely necessary, but it could be due to my contrived example; if there were more columns in the table (e.g. various dates, IDs, and other stuff), it might make more sense.
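
    For anyone who wants to repeat the page inspection, the usual incantation is roughly this (undocumented commands; the database name, file id and page id are placeholders you would take from the DBCC IND output):
    Code:
    -- List every page belonging to the table (undocumented but long-standing):
    dbcc ind ('YourTestDb', 'overflowdata', -1)
    
    -- Route DBCC output to the client instead of the error log, then dump a page.
    dbcc traceon (3604)
    dbcc page ('YourTestDb', 1, 312, 3)   -- database, file_id, page_id (from dbcc ind), print option 3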

    There are additional IAM_PAGEs, but these are just allocation maps; they are allocated per partition rather than per row.

    Now, the point...I had a point in here, somewhere....Ahh, yes. Pat. Were you experimenting with nvarchar(3999) fields, or nvarchar(max) fields with the amount of data in them capped by the application/stored proc? And is the overhead you were talking about applicable to both ROW_OVERFLOW and LOB_DATA storage methods?



    * Interesting to note, REPLICATE returns only varchar(8000), so
    Code:
    select len (replicate('a', 8000) + replicate ('b', 8000))
    returns 8000

  9. #9
    Join Date
    Dec 2007
    Location
    Richmond, VA
    Posts
    1,328
    Provided Answers: 5
    I did something a little similar. In my case it was a comment table: I went with a primary key, a sequence number, and a comment column of char(100). The application never had to worry; as far as it knew, it was dealing with a string of never-ending length. I created a view of the table which strung all the comment rows back into a single string, then used an INSTEAD OF INSERT trigger on the view which took whatever string was passed in and broke it into 100-byte chunks to be stored in the table. I found it worked everywhere but DB2 on z/OS.
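
    In SQL Server terms the pattern looks roughly like this (a sketch only; the names, the FOR XML reassembly and the 1000-chunk cap are arbitrary choices on my part):
    Code:
    -- Storage: fixed-width chunks keyed by comment id + sequence number.
    CREATE TABLE dbo.CommentChunk
    (   CommentId  int       NOT NULL,
        SeqNo      int       NOT NULL,
        ChunkText  char(100) NOT NULL,
        CONSTRAINT PK_CommentChunk PRIMARY KEY (CommentId, SeqNo)
    );
    GO
    
    -- View that glues the chunks back into one string per comment.
    -- RTRIM drops the padding that char(100) adds to the final chunk.
    CREATE VIEW dbo.CommentText
    AS
    SELECT c.CommentId,
           RTRIM((SELECT '' + x.ChunkText
                    FROM dbo.CommentChunk x
                   WHERE x.CommentId = c.CommentId
                   ORDER BY x.SeqNo
                     FOR XML PATH(''), TYPE).value('.', 'varchar(max)')) AS FullText
      FROM (SELECT DISTINCT CommentId FROM dbo.CommentChunk) c;
    GO
    
    -- INSTEAD OF INSERT trigger that chops the incoming string into 100-char chunks.
    CREATE TRIGGER dbo.trg_CommentText_ins ON dbo.CommentText
    INSTEAD OF INSERT
    AS
    BEGIN
        SET NOCOUNT ON;
    
        WITH n AS (SELECT 1 AS SeqNo
                   UNION ALL
                   SELECT SeqNo + 1 FROM n WHERE SeqNo < 1000)   -- cap at 1000 chunks per comment
        INSERT INTO dbo.CommentChunk (CommentId, SeqNo, ChunkText)
        SELECT i.CommentId,
               n.SeqNo,
               SUBSTRING(i.FullText, (n.SeqNo - 1) * 100 + 1, 100)
          FROM inserted i
          JOIN n ON (n.SeqNo - 1) * 100 < LEN(i.FullText)
        OPTION (MAXRECURSION 1000);
    END;
    An insert through the view, e.g. INSERT INTO dbo.CommentText (CommentId, FullText) VALUES (1, REPLICATE('x', 250)), then lands as three chunk rows.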

    http://www.idug.org/p/fo/et/thread=43488
    Dave

  10. #10
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    The test was done on a VM for a former client, years ago. I don't have access to either the machine or the scripts any more.

    The use case was a hierarchical data store something like the Windows Registry, but implemented in SQL Server. Both the key and the value associated with the pairs could be huge (ranging from 1 to 30 Kb each). When we went to production, we capped the keys at 2010 characters, but the values were NVARCHAR(MAX).

    We loaded about 50 million rows into the test tables for several different limits on both key and value size, using both fixed-length and (MAX) columns in the various test databases. There were a number of performance anomalies with NVARCHAR(MAX) keys, and a smaller number with NVARCHAR(MAX) values. We eventually traced the performance problems to paging, page faults, etc. incurred while resolving whether the key value fit on the page or not. The ability to index the fixed-length columns was a huge advantage, but for these tests we didn't add indexes until after we had worked out the other problems we discovered.

    -PatP
    In theory, theory and practice are identical. In practice, theory and practice are unrelated.

  11. #11
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    Quote Originally Posted by MCrowley View Post
    * Interesting to note, REPLICATE returns only varchar(8000), so
    Code:
    select len (replicate('a', 8000) + replicate ('b', 8000))
    returns 8000
    I meant to comment last night, but got sidetracked.

    This is a bullet that you can dodge:
    Code:
    DECLARE @foo TABLE
    (  bar    NVARCHAR(MAX)
    )
    
    -- Casting the input string to NVARCHAR(MAX) first lets REPLICATE exceed the 4000-character cap:
    SELECT DataLength(Replicate(Cast(N'C' AS NVARCHAR(MAX)), 10000))   -- 20000 bytes
    
    INSERT INTO @foo(bar)
       SELECT Replicate(Cast(N'D' AS NVARCHAR(MAX)), 10000)
    
    SELECT DataLength(bar), Left(bar, 20)   -- still 20000 bytes once stored
       FROM @foo
    -PatP
    In theory, theory and practice are identical. In practice, theory and practice are unrelated.
