SQL: When to Avoid Index?

This is a discussion on SQL: When to Avoid Index? within the Tech Board forums, part of the Community Boards category; I need help understanding the following. Consider the query: Code: SELECT ... FROM Orders WHERE Unpaid _Flag = 'Y' This ...

  1. #1
    unleashed alphaoide's Avatar
    Join Date
    Sep 2003
    Posts
    696

    SQL: When to Avoid Index?

    I need help understanding the following.

    Consider the query:

    Code:
    SELECT ... FROM Orders WHERE Unpaid _Flag = 'Y'
    This query has (we would hope) high selectivity, being true for a small faction of the complete history of orders. If you can count on finding a value of 'Y' for Unpaid_flag in the query, you might want an index on that column. But if the query is part of a group that just as often searches for the unselective condition Unpaid_Flag='N', you should likely avoid the index. In this example, the meaning of the flag is likely special to the query, driving the very purpose of the query (to find bills to send out), so you can count on finding primarily queries agains 'Y', which is the rare value.
    I still don't understand why I should avoid index on small selectivity query. Thanks for the enlightment beforehand.
    source: compsci textbooks, cboard.cprogramming.com, world wide web, common sense

  2. #2
    (?<!re)tired Mario F.'s Avatar
    Join Date
    May 2006
    Location
    Portugal
    Posts
    7,474
    I would expect that column to have high density and low selectivity. Not because I don't trust humankind, mind you But simply because a column based on only two possible values is necessarily an high density column. What determines the selectivity of a column is its density, not the after-effect of having it queried.

    Choosing or not to place an index on that column though should be made based also on other info. Personally I believe the higher the density, the least of a good candidate that column is for an index. A unique index, for instance, has the lowest of densities (consequently, the highest of selectivities). That is the "perfect" index as many execution_plans will tell you.

    Even if you know your data (you shouldn't!) and expect Y to be very rare, you still shouldn't index the column. You will see that you will still get little, if any improvement. With such a high density (2 possible values), selectivity is very high. The query optimizer will know this (even if it is not of a boolean type as long as you placed any constraints for that column) and most probably disregard the index and make a table scan instead.

    Remember that the optimizer will not do the query to determine selectivity. That would be senseless. It might as well return the results then and move on. It will instead use the column data type, the current number of records, the column constraints and probably even some cached data to determine this and then decide or not for an index usage.

    EDIT: A good test is to execute that query and read the execution_plans with an index and without an index. Chances are you will see on both cases a table scan. No matter, how many 'Y' there are. Meanwhile, with the index, you decreased overall performance, since you slowed down inserts and updates
    Last edited by Mario F.; 12-02-2006 at 09:52 AM.
    The programmer’s wife tells him: “Run to the store and pick up a loaf of bread. If they have eggs, get a dozen.”
    The programmer comes home with 12 loaves of bread.


    Originally Posted by brewbuck:
    Reimplementing a large system in another language to get a 25% performance boost is nonsense. It would be cheaper to just get a computer which is 25% faster.

  3. #3
    unleashed alphaoide's Avatar
    Join Date
    Sep 2003
    Posts
    696
    Code:
    MS SQL 2000
    Rows with unpaid_flag = 'Y' is 1
    Rows with unpaid_flag = 'N' is 98201
    
    With Index
    select * from orders
    where unpaid_flag = 'Y'
    
    Table 'Orders'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0.
    
    SQL Server Execution Times:
       CPU time = 0 ms,  elapsed time = 0 ms.
    
    
    select * from orders
    where unpaid_flag = 'N'
    
    Table 'Orders'. Scan count 1, logical reads 289, physical reads 0, read-ahead reads 0.
    
    SQL Server Execution Times:
       CPU time = 47 ms,  elapsed time = 929 ms.
    
    
    
    Without index
    select * from orders
    where unpaid_flag = 'Y'
    
    Table 'Orders'. Scan count 1, logical reads 141, physical reads 0, read-ahead reads 0.
    
    SQL Server Execution Times:
       CPU time = 31 ms,  elapsed time = 38 ms.
    
    
    select * from orders
    where unpaid_flag = 'N'
    
    
    Table 'Orders'. Scan count 1, logical reads 141, physical reads 0, read-ahead reads 0.
    
    SQL Server Execution Times:
       CPU time = 172 ms,  elapsed time = 652 ms.
    So it seems...
    Searching 'Y' with index is faster
    Searching 'N' without index is faster

    Now, my original question was why searching 'N' without index is faster. I think it's because an index could only cover about 300 rows??

    I was mistakenly thinking that when you indexed that column, you would get an index 'Y' which will cover ALL the 'Y's rows (1 row) and an index 'N' which will cover ALL the 'N's rows (98201 rows).
    source: compsci textbooks, cboard.cprogramming.com, world wide web, common sense

  4. #4
    (?<!re)tired Mario F.'s Avatar
    Join Date
    May 2006
    Location
    Portugal
    Posts
    7,474
    The problem is with the *.
    If you do that same query on just the indexed column then it will be much faster.

    Consider why the SQL optimizer may not choose an index... Have you ever thought of that? Why on earth would it choose not to use one when it seems obvious that, despite any discussion on data density and selectivity, there is always some kind of performance increase by selecting data on an ordered where clause?

    The answer is IO (the most performance hitting aspect of queries) and the fact non-clustered indexes don't store the whole row, just the column value and its row position on the table. The remaining of that row data has to be fetched from the table. So the process for a query like yours is fetch index location index table (uses IO) -> move to that location on table and fetch remaining columns (uses IO). And then repeat this for every other row on your result.

    A table scan will be much faster in this case. The whole operation will occur inside the table. There's less IO involved. Your optimizer, if it detects this, will not select the index and prefer a table scan.

    EDIT: On a side note. Don't use final times to estimate execution speed. It's a false perspective (especially if your server engine uses cached data). You want to run an execution plan instead and look at the operations involved and their relative times.
    Last edited by Mario F.; 12-02-2006 at 12:29 PM.
    The programmer’s wife tells him: “Run to the store and pick up a loaf of bread. If they have eggs, get a dozen.”
    The programmer comes home with 12 loaves of bread.


    Originally Posted by brewbuck:
    Reimplementing a large system in another language to get a 25% performance boost is nonsense. It would be cheaper to just get a computer which is 25% faster.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. 20q game problems
    By Nexus-ZERO in forum C Programming
    Replies: 24
    Last Post: 12-17-2008, 04:48 PM
  2. Reiventing the wheel (again)
    By Elysia in forum C++ Programming
    Replies: 5
    Last Post: 12-02-2007, 02:26 AM
  3. Embedded SQL
    By sarac in forum C Programming
    Replies: 1
    Last Post: 05-04-2006, 09:09 AM
  4. Function to check memory left from malloc and free?
    By Lechx in forum C Programming
    Replies: 4
    Last Post: 04-24-2006, 05:45 AM
  5. file index update error
    By daluu in forum C Programming
    Replies: 1
    Last Post: 04-28-2003, 02:47 AM

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21