SQL Server 2000 Design and Implementation, Part XII -
Analysing and Optimising Data Access...
This series of articles is focusing primarily on the physical data store Microsoft developers are most likely to use for their applications, SQL Server. The main reference for this series of articles is MCSE: SQL Server 2000 Design by Israel and Jones.
This series of articles is focussing primarily on the physical data store Microsoft developers are most likely to use for their applications, SQLServer. The main reference for this series of articles is MCSE: SQLServer 2000 Design by Israel and Jones.
Here's a quick overview of what we've covered so far as well as what we shall be covering:
Article I: we started with developing a logical data model, before looking at how we go about implementing the physical database in SQLServer 2000.
Article II: creating and maintaining databases, in particular the files that comprise the database and how they can be split into filegroups and why they should be on occasion.
Article III: creating and maintaining tables - looking at what we can store in tables, and how we do so.
Article IV: data integrity - default values, check constraints and rules, primary keys, unique constraints, as well as foreign keys and relationships.
Article V: indexes and the related area of statistics.
Article VI: creating and maintaining the various database objects that exist- views, stored procedures, transactions, user defined functions and triggers. This concludes the look at the physical design per se. The remainder of the articles look at issues relating to data access.
Article VII: accessing data
Article VIII: modifying data
Article IX: importing and exporting data
Article X: locking
Article XI: security
Article XII: analysing and optimising data access
This is article XII, last article in the series.
Over time the database and queries of an application require maintenance to maintain performance levels. Also key is ensuring performance is optimal in the first place. The supplied tools to assist in these activities are options in query analyser, SQLServer profiler and the stored procedure debugger.
Inside the SQLServer query-processing engine, there is a section of code called the query optimiser, the function of which is to find the most efficient means of retrieving data at that particular time. The query optimiser compares different possible methods of retrieving the data (called execution plans) and chooses the one it estimates to be the most efficient for the query engine to use.
This highlights one of the advantages of stored procedures the query plans are usually pre-determined so the time to identify the plan is eliminated. As it can take as much time to find the plan as execute the query this can result in significant timesavings.
SQLServer uses a cost based query optimiser, assigning a cost to each method of retrieving data based on the resources required to process the query. Processor time, disk I/O etc. are all assigned costs based on the number of rows that must be examined in a particular operation. Once the optimiser has assigned costs it sums up the totals for each execution plan that was investigated and chooses one based on the design of the system and sends this to the query engine for processing.
Contrary to what you might expect SQLServer does not always choose the execution plan with the lowest total resource cost. Instead SQLServer is designed to select the execution plan that is reasonably close to the theoretical minimum and will return results to the client as quickly as possible with a reasonable cost. The definition of reasonable will change as conditions within the SQLServer change, e.g. load.
Two types of optimisation can occur: single statement and batch. We'll looking at single statements first, focussing on SELECTS as they're the most commonly executed statements in an application and as INSERTs, UPDATEs and DELETEs follow a similar pattern.
The steps that the relational engine will go through to satisfy a query make up an execution plan. There are two items that all execution plans determine for the relational engine:
The order in which the tables are accessed is important not for which rows will appear in the result set but rather for the speed of the joins between tables. If the first data retrieval from a table can be done quickly and returns a small number of rows, then subsequent join operations have less data to work with and can complete more quickly.
The optimiser will have a few choices for most tables with regards to the method of extraction: table scan or use an index, or indexes. If the table is small it is likely to be more efficient just to use a table scan. Otherwise the index is likely to be quicker.
For queries that include a join there is a third step that determines the type of join that will be used to combine the information from the various tables.
As there may be many execution plans that would satisfy a query, SQLServer will 'guesstimate' at which plans have a cost close to the theoretical minimum cost via various algorithms.
You can view the execution plan for a query in a couple of ways:
Optimising a batch is very similar to optimising a single query. The server will execute and optimise each statement sequentially in the order of presentation in the batch. There are a few extra things to be aware of:
In order to analyze a query you must change the behaviour of query analyser or another client tool to report back information on the query. TSQL SET statements can achieve this. The following options are useful in analyzing and optimizing queries and batches (all can be set to either ON or OFF):
| Option | Description |
|---|---|
| SET FORCEPLAN | the join in a query will be processed in the same order as the tables appear in the FROM clause |
| SET NOEXEC | prevents the execution of the query just the execution plan is calculated |
| SET SHOWPLAN_ALL | instructs SQLServer to not execute the statement in a batch but to return detailed information about how the queries are executed |
| SET SHOWPLAN_TEXT | as SHOWPLAN_ALL but in a more readable format for MSDOS applications |
| SET STATISTICS IO | determines whether statistical information about the various IO operations is returned to the client |
| SET STATISTICS PROFILE | determines whether the profile information for ad hoc queries, views, triggers, and stored procedures is displayed |
| SET STATISTICS TIME | determines whether the server returns the time required to parse, compile and execute each statement |
Obtaining the execution plan for the query is the easy bit; the difficult bit is performing the analysis of the plan and deciding what steps to take to ensure this query performs as optimally as possible. This is a bit of an art but here are some points of advice based on the various operations that can appear in a query plan:
Bookmark Lookup
An index was used but as the index does not 'cover' the query the query processor must use the ROWID from the
index to 'lookup' the actual data row. Unless the index can be alter to cover the columns of the query this
cannot be optimised further.
Clustered Index Scan
Similar to a table scan (its just that the table has a clustered index) and similarly undesirable, particularly
for large tables. If you see these then the columns in the query are either not indexed or the distribution
statistics have led the query optimiser to decide the scan is more/ as efficient. Restructuring the query can
sometimes eliminate these operations; or add an index.
Clustered Index Seek
The clustered index is used to find matching rows. This is optimal behaviour.
Compute Scalar
As it says a scalar is being computed. Unless this is not really needed it can't be optimised. It its not
needed, remove it!
Constant Scan
The query requires a constant value in some (or all) rows
Hash Match
The query processor builds a hash table for each row being processed. As subsequent rows are processed, the
hash is computed and compared to the hash table for matches. Queries with DISTINCT, UNION, or aggregates often
require a hash table to remove duplicates. If such operations are required there is little you can do to optimise.
Index Scan
The non-clustered index is being used to locate a large number of data rows which must then be scanned. Unless
you can restructure the query to return fewer rows this cannot be optimised.
Index Seek
A non-clustered index is being used and only a small part of the index is required. You've chosen your indexes
well as this is one of the most efficient operations.
Index Spool
As rows are scanned tempdb is used to store a 'spool' table that can be used rather than re-reading input rows.
This is an internal optimization for complex queries and cannot be changed.
Merge Join
Occurs when the two inputs contain sorted data and the query processor can merge them together, or when two or
more indexes are used to query a table. Very efficient but to occur the joins must have access to sorted data
achieved by indexing on join or ORDER BY columns.
Nested Loop
In this join, one table is chosen as the inner and scanned for each row of the outer. This is only efficient for
small numbers of rows. Restructuring the query can remove these joins. Note that if available memory is low
these are more likely to occur.
Remote Query
As it says a query occurring on a remote data source so you either need to optimise on the remote data source
or move the necessary data to your local SQLServer.
Sort
Is expensive but tends also to be necessary.
Table Scan
Is performed when the table has no clustered index (its a heap). Probably the least desirable operation to see
chosen as the query processor will read each row into memory in order to decide whether it should be returned by
the query. For tables of more than a few hundred rows you should add an appropriate index.
Some of the gains in performance come from writing well structured, efficient queries that return the minimum amount of information needed, but most gains in performance are made by choosing good indexes.
Hence the basic recommendation is to ensure that there are indexes on all tables and that the statistics for those indexes are up to date. Not coincidentally, these are the two main factors that influence the query optimiser in making decisions about the execution plan.
At this stage I would recommend you run a few multi-table queries from the Northwind database, or your own, in Query Analyzer and assess whether the execution plans generated are optimal based on the above information.
As just stated, indexes and rewriting queries are the most common options for optimisation but you can also use query hints though care should be taken to ensure youre not forcing the server to choose a sub-optimal execution plan. Generally query hints are not recommended SQLServer knows best!
Considering indexes: in order for SQLServer to make use of an index, the first indexed column must be included in the WHERE clause as part of a qualification, or it will not be considered. However, all columns in a WHERE clause do not need to be included in the index for it to be chosen.
There can only be one clustered index per table and, as a result, it should be chosen carefully. By default the primary key is the clustered index but often this is not the best choice unless it is the only index that will exist on the table. Clustered indexes are best used for range queries, e.g. dates or numerical ranges.
Non-clustered indexes work best on large tables when very few rows are being returned. The query optimiser will often choose a non-clustered index that is highly selective when the index columns are included in the join statements. When this occurs SQLServer can find the row needed in the index very quickly and get to the actual data quickly as well. If not selective the process is much less efficient and the index may not be chosen. An exception is if the non-clustered index is a covering index. Foreign keys are usually good choices for non-clustered indexes as usually these columns are used in joins.
Regardless of the type of index there are a few general guidelines for creating indexes:
Turning now to consider query hints briefly. These should only be used if you are sure you can do a better job than the SQLServer, as discussed above. There are four types of hints:
| join | only available to ANSI syntax joins; valid choices are HASH, LOOP, MERGE and REMOTE |
| index | forces the query processor to use a specified index specified by name or ID, with 0 indicating a table scan. Note that IDs can change as indexes are dropped and recreated so are best not used |
| query processing | a variety are available: grouping, union, join, FAST, FORCE ORDER, MAXDOP, ROBUST PLAN, KEEP PLAN |
| lock | controls how SQLServer will apply locks options HOLDLOCK, NOLOCK, PAGLOCK, READCOMMITTED, READPAST, READUNCOMMITTED, REPEATABLEREAD, ROWLOCK, SERILIZABLE, TABLOCK, TABLOCKX, UPDLOCK, XLOCK |
The hints are placed at the end of the SELECT query following the keyword OPTION. The exception is the LOCK hints which are placed in the FROM clause of a query.
See books online for further details of the individual options though you will be familiar with some of them from earlier articles in this series.
The profiler is a flexible tool for capturing the processes that are occurring in real time. Profiler has various uses, key of which is performance tuning. Profiler can provide you with information about which queries need to be optimised.
The profiler can monitor all traffic that occurs between a client and the server and provide details on the queries or statements being executed on the server. This information can be filtered by specific workstations, logins, databases, types of statements, etc. The profiler can also capture nearly every event that occurs inside SQLServer. There are 13 classes with hundreds of events that can be added to a trace.
The profiler can also gather information from the server directly and store that information without requiring the client to be running. The captured information can then be replayed against this, or another server, at a later date.
Profiler activity is based on the concept of traces. Each trace is based on a template trace that includes events, data columns, filters and options for a particular type of trace. You can define your own templates if wanted. Once the trace is defined it can be run at any time in the future.
An issue that crops up with heavily loaded SQLServer systems is contention for data in certain tables. Profiler includes a number of events that can be used to detect blocks and deadlocks: Lock:Acquired, Lock:Cancel, Lock:Deadlock, Lock:Deadlock Chain, Lock:Escalation, Lock:Released, Lock:Timeout. If there are contention problems then adding these events to a trace along with TSQL:SQL:StmtStarting and TSQL:SQL:StmtCompleted can capture activity that shows how locks are acquired an released by various users. If particular tables are suspected a filter can be applied to the trace to target the information captured.
Lock problems can be verified by the profiler by comparing the timestamps for when the statement starts and when the lock is acquired. Once verified you may consider rewriting the involved queries or redesigning the application.
You can also use trace flags to monitor lock activity. The DBCC TRACEON <trace flag> instructs SQLServer to report certain debugging information back to the user. The specific trace flag for lock activity is 1209.
SQLServer 2000 includes a wizard in Profiler to assist DBAs in developing indexes. The tool takes a series of queries and then develops execution plans for the queries based on the tables and indexes that exist on the server. The wizard then considers the different execution plans that would be generated given possible indexes that could be added to the tables referenced in the queries. From this analysis the wizard then provides a list of recommended indexes that would optimise the queries it has analysed.
The index tuning wizard can either be fed a single query to analyse or it take a workload from the Profiler to analyse a series of queries that actually occurred on the server.
If you haven't done so previously I would suggest you try using profiler to capture and replay a trace as well as detect locks and try the Index tuning Wizard. Here are a few suggestions to get you started.
The process is similar to optimising a batch or single query. The first time a stored procedure or a trigger is executed the source is compiled into an execution plan. This occurs each time the server is restarted, as existing execution plans are lost when the server is stopped. The caching of these execution plans is the reason why substantial performance gains can occur if you use stored procedures rather than batched TSQL. SQLSever 2000 stored procedure execution plans are re-entrant more than one user can run the same execution plan at the same time.
One further aspect of optimising performance where stored procedures are concerned is to prevent unnecessary recompilation, as this is an expensive operation. A stored procedure will be recompiled automatically when one of the following occurs:
It is therefore best practice to avoid the last two.
With SQLServer 2000 a debugger for stored procedures has been built into query analyser, quite well hidden but accessible from the object browser or object search windows. You select the stored procedure you wish to debug then select debug. The debugger works like most modern debuggers: the user can set breakpoints, step through code, change variables values, etc.
You also have Visual Studio .NETs debugging facilities which can be use to debug stored procedures. Whichever debugger you use remember they should not be used against production servers as you will severely compromise performance.
That's it ... this series of articles on SQLServer is now complete. Hopefully it's taught you one or two things you did not know before and you feel that you have a more rounded appreciation of what SQLServer can do as well as how it does it.
MCSE: SQLServer 2000 Design
Israel and Jones
Sybex
MSDN
Books Online