SQL: When to Avoid Index?

**alphaoide** · 12-02-2006

I need help understanding the following.

Consider the query:

Code:

SELECT ... FROM Orders WHERE Unpaid _Flag = 'Y'

This query has (we would hope) high selectivity, being true for a small faction of the complete history of orders. If you can count on finding a value of 'Y' for Unpaid_flag in the query, you might want an index on that column. But if the query is part of a group that just as often searches for the unselective condition Unpaid_Flag='N', you should likely avoid the index. In this example, the meaning of the flag is likely special to the query, driving the very purpose of the query (to find bills to send out), so you can count on finding primarily queries agains 'Y', which is the rare value.

I still don't understand why I should avoid index on small selectivity query. Thanks for the enlightment beforehand.

**Mario F.** · 12-02-2006

I would expect that column to have high density and low selectivity. Not because I don't trust humankind, mind you

But simply because a column based on only two possible values is necessarily an high density column. What determines the selectivity of a column is its density, not the after-effect of having it queried.

Choosing or not to place an index on that column though should be made based also on other info. Personally I believe the higher the density, the least of a good candidate that column is for an index. A unique index, for instance, has the lowest of densities (consequently, the highest of selectivities). That is the "perfect" index as many execution_plans will tell you.

Even if you know your data (you shouldn't!) and expect Y to be very rare, you still shouldn't index the column. You will see that you will still get little, if any improvement. With such a high density (2 possible values), selectivity is very high. The query optimizer will know this (even if it is not of a boolean type as long as you placed any constraints for that column) and most probably disregard the index and make a table scan instead.

Remember that the optimizer will not do the query to determine selectivity. That would be senseless. It might as well return the results then and move on. It will instead use the column data type, the current number of records, the column constraints and probably even some cached data to determine this and then decide or not for an index usage.

EDIT: A good test is to execute that query and read the execution_plans with an index and without an index. Chances are you will see on both cases a table scan. No matter, how many 'Y' there are. Meanwhile, with the index, you decreased overall performance, since you slowed down inserts and updates

**alphaoide** · 12-02-2006

Code:

MS SQL 2000
Rows with unpaid_flag = 'Y' is 1
Rows with unpaid_flag = 'N' is 98201

With Index
select * from orders
where unpaid_flag = 'Y'

Table 'Orders'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.


select * from orders
where unpaid_flag = 'N'

Table 'Orders'. Scan count 1, logical reads 289, physical reads 0, read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 47 ms,  elapsed time = 929 ms.



Without index
select * from orders
where unpaid_flag = 'Y'

Table 'Orders'. Scan count 1, logical reads 141, physical reads 0, read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 31 ms,  elapsed time = 38 ms.


select * from orders
where unpaid_flag = 'N'


Table 'Orders'. Scan count 1, logical reads 141, physical reads 0, read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 172 ms,  elapsed time = 652 ms.

So it seems...
Searching 'Y' with index is faster
Searching 'N' without index is faster

Now, my original question was why searching 'N' without index is faster. I think it's because an index could only cover about 300 rows??

I was mistakenly thinking that when you indexed that column, you would get an index 'Y' which will cover ALL the 'Y's rows (1 row) and an index 'N' which will cover ALL the 'N's rows (98201 rows).

**Mario F.** · 12-02-2006

The problem is with the *.
If you do that same query on just the indexed column then it will be much faster.

Consider why the SQL optimizer may not choose an index... Have you ever thought of that? Why on earth would it choose not to use one when it seems obvious that, despite any discussion on data density and selectivity, there is always some kind of performance increase by selecting data on an ordered where clause?

The answer is IO (the most performance hitting aspect of queries) and the fact non-clustered indexes don't store the whole row, just the column value and its row position on the table. The remaining of that row data has to be fetched from the table. So the process for a query like yours is fetch index location index table (uses IO) -> move to that location on table and fetch remaining columns (uses IO). And then repeat this for every other row on your result.

A table scan will be much faster in this case. The whole operation will occur inside the table. There's less IO involved. Your optimizer, if it detects this, will not select the index and prefer a table scan.

EDIT: On a side note. Don't use final times to estimate execution speed. It's a false perspective (especially if your server engine uses cached data). You want to run an execution plan instead and look at the operations involved and their relative times.

Thread: SQL: When to Avoid Index?

Thread Tools

Search Thread

Display

SQL: When to Avoid Index?

Similar Threads

20q game problems

Reiventing the wheel (again)

Embedded SQL

Function to check memory left from malloc and free?

file index update error