SQL Server locking methods or the fight against deadlocks
Apr. 11 2020
After the last index blog post, here is another technical issue I recently thought I had solved. At least maybe for the next year, until I find another solution.
As you know, I dropped all indexes of my data warehouse and therefore also the unique constraints on my tables. The application should make sure that everything is fine, but I have a regular report which checks whether the constraints are still intact.
Lately, after working on a project with a huge amount of data, I occasionally got hit with duplicate rows in my hubs and links.
This was the reason to investigate further why it happened. Although I had all the measures below in place, it still happened. Maybe unique constraint indexes have additional functions when data is loaded into a table. I must admit, I'm not the expert in that area; I'm educated by mistakes and experience. Therefore, this is a little summary of my lessons learned.
Depending on the number of queries running at the same time with SELECT, UPDATE and DELETE, SQL Server decides which data is locked, read, waited for, overwritten or inserted. There are some methods to influence its decision making.
SET TRANSACTION ISOLATION LEVEL
Working with transactions is usually something for applications that have SQL Server as a backend, to ensure that data spanning multiple table manipulations is written to the database correctly. In the data area I neglected transactions for quite a while, as I was usually transforming data from source to target and hardly ever inserting data in parallel into the same table. With Data Vault, however, parallel inserts into the same table are a rather common pattern.
The transaction is written like this:
BEGIN TRANSACTION
INSERT INTO ...
SELECT *
FROM ...
OPTION (RECOMPILE)
COMMIT
By default I add OPTION(RECOMPILE) after fighting against bad execution plans.
This was the layout I had in place when I discovered duplicate entries. My search for a solution led me to learn more about transaction isolation levels. SQL Server has a few of them:
- READ UNCOMMITTED
- READ COMMITTED (default)
- REPEATABLE READ
- SNAPSHOT
- SERIALIZABLE
I don't want to explain all the nifty details of each option; there are plenty of resources to be found on Google.
I suspected that although I had all those table locks (see below) in place, data got changed between an INSERT and its SELECT statement, especially during fact loading where dimension keys are looked up. Sounds weird, but I couldn't help suspecting it.
Many websites state that SERIALIZABLE is the safest option, but it has one serious disadvantage: it leads to deadlocks. A deadlock occurs when two queries each hold a resource the other one needs, so neither can continue and simply queuing won't help. SQL Server then terminates one of the queries, telling it that it has been chosen as the deadlock victim. With restartable jobs this isn't a big issue, but it is still bad.
I added the following statement to my scripts:
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
All subsequent transactions in that session are then executed with that isolation level.
I was hit with a lot of deadlock errors in the following weeks, but the issue with duplicates in hubs and links was solved.
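Since my jobs are restartable, a deadlock mostly costs time rather than correctness. A possible mitigation, sketched here only and not part of my standard scripts, is to catch SQL Server's deadlock error 1205 and retry the transaction a few times:
DECLARE @retries INT = 3;
WHILE @retries > 0
BEGIN
    BEGIN TRY
        BEGIN TRANSACTION;
        INSERT INTO ...            -- the usual load statement
        SELECT *
        FROM ...
        OPTION (RECOMPILE);
        COMMIT;
        SET @retries = 0;          -- success, leave the loop
    END TRY
    BEGIN CATCH
        IF XACT_STATE() <> 0 ROLLBACK;
        IF ERROR_NUMBER() = 1205 AND @retries > 1
            SET @retries = @retries - 1;   -- deadlock victim, try again
        ELSE
        BEGIN
            SET @retries = 0;
            THROW;                         -- anything else fails the job
        END
    END CATCH
END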
Hints
Hints are another way to manipulate SQL Server's decision engine. There are a lot of hints we can add to a query. Some affect the whole query by manipulating the execution plan (query hints) and some affect a queried table (table hints).
Query hints are added at the end with OPTION(something). As you know, by default I add RECOMPILE to it to get the best plan for every data set instead of recycling old plans. Sometimes I add FORCE ORDER or HASH JOIN to it, depending on performance.
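Stacked together, the end of such a query might look like this; it only illustrates the syntax, whether FORCE ORDER or HASH JOIN actually helps depends entirely on the data:
SELECT *
FROM ...
OPTION (RECOMPILE, FORCE ORDER, HASH JOIN)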
To combat my issue, I rather use table hints. My favorite is TABLOCKX. By default, SQL Server tries to lock as little as possible, sometimes only pages or rows of a table. But I want to be sure that no other query sneaks into a running query and manipulates something in between. TABLOCK tries to grab a "shared" lock on the whole table, while the X makes it an exclusive lock.
My statement usually looked like this:
INSERT INTO ... WITH (TABLOCKX)
SELECT *
FROM ...
But somehow, I still got duplicates in my hubs and links. How could that be? It must be something with the SELECT query. My SELECT query also has a part where it checks the very same table I'm inserting into for records that already exist. Could it be that the INSERT locks are not applied to the tables in the SELECT, even though it is the same table?
Reading through other blogs, I have seen the following recipe:
INSERT INTO ... WITH (TABLOCKX)
SELECT *
FROM ... WITH (UPDLOCK)
UPDLOCK acquires an update lock, telling the system that I intend to modify the data, even though in fact I won't modify it. I just want to make sure that no other query modifies those rows while I'm reading them.
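Applied to a hub load, where the SELECT checks the very target table for records that already exist, the recipe might look roughly like this (table and column names are made up for illustration):
INSERT INTO hub_customer WITH (TABLOCKX) (customer_hk, customer_bk, load_dts, record_source)
SELECT s.customer_hk, s.customer_bk, s.load_dts, s.record_source
FROM stage_customer AS s
WHERE NOT EXISTS (
    SELECT 1
    FROM hub_customer AS h WITH (UPDLOCK)
    WHERE h.customer_hk = s.customer_hk
)
OPTION (RECOMPILE)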
Maybe this is the way to go, maybe not. It's hard to find something useful for the data warehousing space, as most of the questions asked out there concern applications and record-level issues.
While we are talking about table hints, I also ran into another issue. In my pattern collection I have 3 different insert methods, each optimized for fast loading. One is for the initial load: when the table is empty, I don't need to retrieve the last record, e.g. in satellite loading, to compare hash values and decide whether a row is a change or not. If the table is empty, just pump the data in.
IF EXISTS (SELECT 1 FROM table)
    QUERY NORMAL INSERT
ELSE
    QUERY INITIAL INSERT
So, what happens? The first query checks whether there is a record in the table and then runs either the normal or the initial insert. But what happens in the split second between the check and the insert? Or what if another INSERT is already running? As the check runs under READ COMMITTED, it may not yet find a record and sends the load down the initial insert path, which does not check for existing data.
We must modify it a little by adding table hints.
IF EXISTS (SELECT 1 FROM table WITH (UPDLOCK, HOLDLOCK))
    QUERY NORMAL INSERT
ELSE
    QUERY INITIAL INSERT
The UPDLOCK will wait until any running insert has finished and then takes an update lock. If the query is run with only UPDLOCK, it checks for the record and immediately releases the lock again. The HOLDLOCK keeps the lock until the transaction is committed. So the first statement takes a lock which is kept until the end.
The final layout looked like this:
IF EXISTS (SELECT 1 FROM table WITH (UPDLOCK, HOLDLOCK))
    INSERT INTO ... WITH (TABLOCKX, HOLDLOCK)
    SELECT *
    FROM ... WITH (UPDLOCK, HOLDLOCK)
ELSE
    QUERY INITIAL INSERT
I believe there are way too many locks. And again, I had a lot of deadlock issues to deal with.
Is there not something easier to work with? Maybe something like a queue?
Application Lock
There is another method in SQL Server to work with: the application lock. It is completely independent of any table locks.
With this lock we can build our own locking mechanism by defining an arbitrary name for a resource. As my procedures load data into a target, I use the target name as the resource. It has nothing to do with the underlying table and doesn't affect it.
How to set a lock:
sp_getapplock @Resource = 'table', @LockMode = 'Exclusive', @LockOwner = 'Session'
With this statement we request a virtual lock for the resource named 'table'. If any other query or stored procedure issues the same statement, it gets into the queue and waits until the first lock is released.
There are 2 ownership levels for these locks. If we define @LockOwner = 'Transaction', the lock is released when the transaction commits.
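With the transaction owner, the call has to happen inside an open transaction, roughly like this sketch:
BEGIN TRANSACTION
EXEC sp_getapplock @Resource = 'table', @LockMode = 'Exclusive', @LockOwner = 'Transaction'
-- ... data manipulation ...
COMMIT -- the application lock is released here as well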
But as I can have multiple transactions in one script, I prefer to specify it at session level. The lock is then held for the whole script, until I release it with the following statement at the end.
sp_releaseapplock @Resource = 'table', @LockOwner = 'Session'
My full script looks like this:
DECLARE @v_lock_code_{table} INT
EXEC @v_lock_code_{table} = sp_getapplock @Resource = '{table}', @LockMode = 'Exclusive', @LockOwner = 'Session'
IF @v_lock_code_{table} NOT IN (0,1)
BEGIN
RAISERROR('Unable to acquire exclusive lock on {table}', 16, 1)
RETURN
END
... a lot of code ...
EXEC @v_lock_code_{table} = sp_releaseapplock @Resource = '{table}', @LockOwner = 'Session'
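One detail to keep in mind with session-owned locks, sketched here on top of the script above but not part of it: if the load fails halfway through, the session keeps the lock until it disconnects, so it can make sense to release it in a CATCH block as well.
EXEC @v_lock_code_{table} = sp_getapplock @Resource = '{table}', @LockMode = 'Exclusive', @LockOwner = 'Session'
BEGIN TRY
    ... a lot of code ...
    EXEC @v_lock_code_{table} = sp_releaseapplock @Resource = '{table}', @LockOwner = 'Session'
END TRY
BEGIN CATCH
    EXEC sp_releaseapplock @Resource = '{table}', @LockOwner = 'Session';   -- give the lock back even on failure
    THROW
END CATCH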
Summary
With this method I create locks on all data manipulation objects. The queue completes query after query without sneaking into other workflows. When I finish an object, I release the lock and everything is good.
The focus is really to have a queue for every object that INSERTs or UPDATEs. SELECT queries or JOINs don't need to issue this lock; the table stays free to read. By the way: among the table hints above there is also one named NOLOCK, which is supposed to ignore any existing locks. But when I use it, I seldom get any data returned until all locks are released.
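For completeness, that hint is simply added to the read, again with a hypothetical table name:
SELECT *
FROM hub_customer WITH (NOLOCK)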
With application locks I could also move back to the READ COMMITTED isolation level.
Another side effect is that other, non-related queries are no longer blocked as well. Even though a TABLOCKX is only for a specific table, I have experienced blocking of other transactions and even in other databases. Lately I made a test: I stopped the whole execution engine on my test machine and then started all jobs at the same time. I have never seen so much traffic on my databases, and CPU was at 100%.
I hope I now have peace of mind and no deadlocks anymore.