SQL Server 2000 Design and Implementation, Part V -
Creating and Maintaining Indexes...
This series of articles shall focus on the data store you use for your .NET applications. In particular it shall consider relational database management systems in general for the first article and SQL Server 2000 in particular for the remaining.
This series of articles is focussing primarily on the physical data store Microsoft developers are most likely to use for their applications, SQLServer. The main reference for this series of articles is MCSE: SQLServer 2000 Design by Israel and Jones.
Here's a quick overview of what weve covered so far as well as what we shall be covering:
Article I: we started with developing a logical data model, before looking at how we go about implementing the physical database in SQLServer 2000.
Article II: creating and maintaining databases, in particular the files that comprise the database and how they can be split into filegroups and why they should be on occasion.
Article III: creating and maintaining tables - looking at what we can store in tables, and how we do so.
Article IV: data integrity - default values, check constraints and rules, primary keys, unique constraints, as well as foreign keys and relationships.
Article V: indexes and the related area of statistics.
Article VI: creating and maintaining the various database objects that exist- views, stored procedures, transactions, user defined functions and triggers. This concludes the look at the physical design per se. The remainder of the articles look at issues relating to data access.
Article VII: accessing data
Article VIII: modifying data
Article IX: importing and exporting data
Article X: locking
Article XI: security
Article XII: analysing and optimising data access
Thus far we've reached article V: creating and maintaining indexes.
Indexes speed data access. To investigate how they achieve this, let's consider how SQLServer searches data without indexes first. We know from a previous article in this series that data is stored in 8KB pages. Initially, while the table has no index, the table is called a heap. Rows are not stored in any particular order. When you need to access data in the heap SQL Server will access the whole table via a table scan the whole table may be scanned for the matching rows of a select, for example.
Indexes are used to:
However, they have their disadvantages. Firstly, they consume a lot of disk and memory space. Each time you create an index it will store all index keys, ordered in ascending or descending order; the larger the key, the bigger the index. Secondly, they cause slower inserts and may cause slower updates and deletes, though not necessarily.
In SQLServer 2000 indexes are stored as B-Trees where B stands for balanced. 'Balanced' as every branch in the tree has the same length. SQLServer creates a number of pages of the data to be indexed, with the data ordered so that if it needs to find a particularly key value it will know which index page it is in. The value in the index points to the actual record and the data can be retrieved.
Some terminology: the root of an index is made of one page containing the first keys referenced in the pages of the following level. This is the non-leaf level or intermediate level. The level below is the leaf level that contains all the sorted key values that reference the real position of the record.
While all indexes have the same tree structure, SQL Server uses two types of indexes: clustered and non-clustered.
A cluster in this sense is an index mixed with a table the table is part of the index. Once a table has a clustered index the data is both stored and sorted on the index key. Thus the leaf level of the index is the actual data of the table, thus meaning data can be retrieved in one less step within the indexing model.
This is why you can only create one clustered index the table is part of the index. A clustered index is a unique index every key should be unique. If duplicates exist in the index they are made unique by the internal addition of a counter that makes every key unique. A record is always located by its row ID or by its clustered key.
Non-clustered indexes have a leaf level that contains all the key values sorted in the same manner as the index is defined, along with the row ID or clustered index key. The actual data can then be retrieved with this information, as per our initial discussion of indexes. You can see that clustered keys should be kept as short as possible as it will be used as a row locator and will be stored at the leaf level of every non-clustered index.
An index can be created based on two or more columns. The only restriction is that the index key has to be less than 900 bytes.
A unique index enforces entity integrity it guarantees that every value is unique in the indexed column.
Let's dig a little deeper into how SQLServer uses these indexes to access data. Firstly let's see how things work when using the heap for comparison: SQLServer uses the Index Allocation Map (IAM). The IAM is a page that contains a map of all the extents that contain the data for the table. The IAM page directs SQLServer to the location of data pages. Any query that cannot be driven by an index uses the following 3 phases:
As soon as a table has a clustered index, the IAM is no longer used for data access, only for maintenance purposes. Data pages are linked together and data is sorted on the clustered index key. The process scans through a table with a clustered index, finds the first page of a table (its address is stored in sysindexes), reads it, then finds the next page using a pointer to the next page stored at the end of each data page.
Note that thus far we are referring to a 'scan' accessing every page of the table or index. This is as opposed to a 'seek', which is direct access to particular data. If you do a seek, say to a particular cust_id in a customers table, SQLServer finds the root page of the index (from sysindexes) and searches the values in the index. It can thus identify in which page the required record is located it then scans this page for the match. In this way for a large database SQLServer needs to scan 3 pages rather than potentially many.
When accessing data with a non clustered index there is a difference in how the access is performed depending on whether the non-clustered index is created on the heap or over a clustered index. Remember there are only two ways to locate a record: either via its row ID in the heap or via its clustered index.
With just a non-clustered index that matches the where clause of a select SQLServer will look up the root of the index in sysindexes, scan the root page and thus identify the page/ pages that reference the right values.
If there is a clustered index as well the clustered key will be at the non-clustered leaf level taking it straight to the record in a step less than otherwise.
If a table has one clustered index and 3 non-clustered indexes how does SQLServer know which index or indexes to use? The distribution statistics help the query optimiser to choose. Each index has a distribution statistics zone, stored in the statblob column of the sysindexes table.
In SQLServer 2000 the distribution statistics are stored in an image column and its size increases proportionally with the index size for better accuracy. Thus the size of the index has no impact on statistics accuracy.
The distribution statistics sample the actual data. Try the DBCC SHOW_STATISTICS (tablename, columnname). The results will indicate:
If every value were unique the density would be 1/(no. rows). If greater than this then some values exists in 2 or more occurrences. The smaller the better as this indicates the index is more selective. If the density is above 10% the index is considered useless.
Each time a query is run, the first operation performed by the system is an evaluation of the distribution statistics. The distribution statistics reflect the distribution of the actual data based on a sample of the real data. Based on these statistics SQLServer decides on the data access strategy it should use.
Try a query, e.g.
SELECT * FROM Orders WHERE OrderDate BETWEEN '1996-15-12' AND '1996-22-12'
against the Northwind database.
You can display the execution plan for this query in Query Analyzer; you can also view the estimated execution plan based on the distribution statistics (Query menu or via the toolbar). You can then hover over the index seek and bookmark lookup operations to see the detailed estimates. You can then compare them to the real figures of the actual execution plan.
The sp_helpstats system stored procedure provides information on the statistics that exist on a table.
Statistics must be up-to-date to be useful or they might indicate an incorrect choice of index for your queries, defeating the objective of the index to optimise data access. However, by default statistics are updated automatically by SQLServer 2000 and this is the best way to leave it. To check whether this automation is active for a particular table use:
SELECT DATABASEPROPERTYEX('dbname','IsAutoUpdateStatistics')
where a returned value of 1 = true
To set this value:
ALTER DATABASE dbname SET AUTO_UPDATE_STATISTICS ON|OFF
When enabled the update of statistics is handled by SQLServer dependent on the number of updates, delete and inserts, and on the number of records in the table.
You can enable or disable automatic computation of statistics with the sp_autostats system stored procedure, e.g. sp_autostats 'Orders', 'OFF' will turn off all automated stats for the table. You can even turn off the automation for one particular index if you specify the index name as the third parameter.
You can update statistics manually and quite precisely in terms of the number of options at your disposal, via the UPDATE STATISTICS command which includes the following options:
| FULLSCAN | use all the rows to compute the distribution |
| SAMPLE number (PERCENT|ROWS) | or use a subset |
| RESAMPLE | use an inherited sampling ratio |
| ALL|COLUMNS|INDEX | whether column, index or all statistics are updated |
| NORECOMPUTE | don't auto recompute stats if they become out of date |
A query is covered if all the columns it uses come from one or more indexes. These columns include the columns you want the query to return as well as columns in any JOIN, WHERE, HAVING, and ORDER BY clause.
A covering index, which is a form of a composite index, includes all of the columns referenced in SELECT, JOIN, and WHERE clauses of a query. Because of this, the index contains the data you are looking for and SQL Server doesn't have to look up the actual data in the table, reducing logical and/or physical I/O.
On the other hand, if the covering index gets too big (too many columns), this can increase I/O and degrade performance. Generally, when creating covering index, follow these guidelines:
The CREATE INDEX statement has many options. These include:
Default indexes allow duplicate keys to enforce uniqueness you use the UNIQUE clause of the CREATE INDEX statement. You can also create composite indexes. Note that using a composite key offers different performance from two indexes on the same columns.
You can also have indexed computed columns. This is particularly advantageous to performance as otherwise the values of the computed columns are computed on the fly during query execution.
Several options should be set during index creation and index value modification of computed columns:
ANSI_NULLS
ANSI_PADDINGS
ANSI_WARNINGS
ARITHABORT
CONCAT_NULL_YIELDS_NULL
QUOTED_IDENTIFIER
NUMERIC_ROUNDABORT
All should be set to ON except the last, which should be set to OFF. All of these options except the last are correctly set by the OLEDB provider for SQLServer
If any of these seven options are not correctly set the optimiser will not consider the index created on the computed column.
It is possible to create statistics without an index, just to help the query optimiser. Statistics can be created on one or multiple columns. As for index creation, the system will use a full scan or a sample, depending on the size of the table.
See Books Online for examples of the involved syntax for the different forms of index creation.
Although SQLServer undertakes a considerable amount of work automatically for you the DBA/ developer may need to consider data fragmentation. There are two types of fragmentation. Internal fragmentation refers to empty spaces inside pages. External fragmentation refers to page links in an ideal situation pages are linked in their storage order.
Inserts, Updates and Deletes can cause fragmentation.
Inserts may cause leaf level fragmentation. A clustered index is particularly susceptible as the leaf level contains data pages. If the page is full when you come to insert the page is split in two to occupy two pages instead. As you don't know from here in the file this space will be allocated external fragmentation could result. As could internal fragmentation as now the pages are only half full.
If you update a record and as a result the record size is bigger it may not be possible to leave the record in the same page. In this situation SQLServer actually leaves a pointer to the new record at its old location in order to avoid the need to updates any indexes pointing to the record. This does lead to external fragmentation however.
Deletes have a negative impact on internal fragmentation, creating gaps in data pages. In fact SQLServer does not delete the records but simply marks them for deletion so the space can be later reclaimed either by a regular housekeeping procedure or by inserts and updates.
DBCC SHOWCONTIG helps determine external and internal fragmentation at the leaf level. The result categories are as follows:
Let's see what is reported for the Northwind Orders table:
DBCC SHOWCONTIG ('Orders') gives me
DBCC SHOWCONTIG scanning 'Orders' table...
Table: 'Orders' (21575115); index ID: 1, database ID: 6
TABLE level scan performed.
- Pages Scanned................................: 20
- Extents Scanned..............................: 5
- Extent Switches..............................: 4
- Avg. Pages per Extent........................: 4.0
- Scan Density [Best Count:Actual Count].......: 60.00% [3:5]
- Logical Scan Fragmentation ..................: 0.00%
- Extent Scan Fragmentation ...................: 40.00%
- Avg. Bytes Free per Page.....................: 146.5
- Avg. Page Density (full).....................: 98.19%
With a page density of 98% there is not significant internal fragmentation. However a scan density of 60% and an extent scan fragmentation of 50% indicates external fragmentation.
SQLServer proposes three ways to defragment data files
So, if we defragment the Orders table:
DBCC DBREINDEX('Orders','')
we then get the following from DBCC SHOWCONTIG ('Orders'):
DBCC SHOWCONTIG scanning 'Orders' table...
Table: 'Orders' (21575115); index ID: 1, database ID: 6
TABLE level scan performed.
- Pages Scanned................................: 20
- Extents Scanned..............................: 3
- Extent Switches..............................: 2
- Avg. Pages per Extent........................: 6.7
- Scan Density [Best Count:Actual Count].......: 100.00% [3:3]
- Logical Scan Fragmentation ..................: 0.00%
- Extent Scan Fragmentation ...................: 0.00%
- Avg. Bytes Free per Page.....................: 146.5
- Avg. Page Density (full).....................: 98.19%
Now scan density is 100% and scan fragmentation is 0. Data pages and extents are now contiguous.
There we conclude our look at creating and maintaining indexes. In the next article we turn to views, stored procedures, transactions, user defined functions and triggers.
MCSE: SQLServer 2000 Design
Israel and Jones
Sybex
MSDN
Books Online