MERGE: Make Your UPSERTs Quick!!!

Hi guys... I am back... this time as an Advanced DB2 9.7 DBA... Thanks to Sathy for helping me prepare for the exam and to Beth Flood for giving me a free voucher (otherwise I would never have had the guts to put 5k at stake on my capabilities)... But an all-new DB2 is expected very soon, and I hope it will have a whole lot of new features and I will have to upgrade or take the certification exams from scratch...

Well, recently I have been working on some scenarios where people came to me with UPDATE statements that were running very slowly... When I looked at them, I said, "I need to tell people there is something called MERGE"... There was one thing common to all of them: the statements used correlated sub-queries to set values... I replaced their queries with a MERGE and it worked like a wonder...

Let's see some examples...

UPDATE Table1
SET T1Col1=(select T2Col1 from Table2 where Table1.T1IDPK=Table2.T2IDPK);


So the problem with this query is that the inner query will be executed for every row of Table1...

So I wrote them a MERGE, which will first do a HASH JOIN and then an RID-based update:

MERGE INTO Table1
USING Table2 ON (Table1.T1IDPK=Table2.T2IDPK)
WHEN MATCHED THEN UPDATE
SET Table1.T1Col1=Table2.T2Col1;
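One subtle difference worth noting: the plain UPDATE above will also set T1Col1 to NULL for every row of Table1 that has no match in Table2, unless it is guarded with an EXISTS predicate, while the MERGE touches matched rows only. A sketch of the guard people usually forget (same tables as above):

UPDATE Table1
SET T1Col1=(select T2Col1 from Table2 where Table1.T1IDPK=Table2.T2IDPK)
WHERE EXISTS (select 1 from Table2 where Table1.T1IDPK=Table2.T2IDPK);

So the MERGE here is not just faster; it is often also closer to what people actually meant.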


Another scenario was a process where a scheduled job picks up rows from a table (which has designated page numbers/row numbers) in batches, and after processing a certain number of rows in every batch the job sleeps... The remaining rows then have to be renumbered and partitioned into batches...

UPDATE PagedTab t1
set SEQNUM=(select row_number() over(order by PagedTab_IDPKColumn) from PagedTab t2 where t1.PagedTab_IDPKColumn=t2.PagedTab_IDPKColumn);


So I just changed it to the below:

MERGE INTO PagedTab t1
USING (select PagedTab_IDPKColumn, row_number() over(order by PagedTab_IDPKColumn) as rnum from PagedTab) t2
ON t1.PagedTab_IDPKColumn=t2.PagedTab_IDPKColumn
WHEN MATCHED THEN UPDATE
SET t1.SEQNUM=t2.rnum;
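Just for context, here is a hypothetical sketch of how such a batch worker might consume the renumbered rows (the batch size of 500 and the delete-after-processing step are my own assumptions, not the original system):

-- pick one batch of the current numbering
SELECT PagedTab_IDPKColumn FROM PagedTab
WHERE SEQNUM BETWEEN 1 AND 500
ORDER BY SEQNUM;

-- after processing, remove the batch; the MERGE above then renumbers the rest
DELETE FROM PagedTab WHERE SEQNUM BETWEEN 1 AND 500;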


I have simulated the below scenario to make it easier to understand, as the actual one was more complex...
The third place was a Master-Detail (dependent entity) kind of relationship, where some details were stored in the master table, and for every IDPK in the master table one and only one record existed in the detail table... ACCOUNTID, LASTACCESSDATE and STATUS were some of the details... Someone wanted to first find the LATEST record (based on date) in the detail table for every ACCOUNTID, then find the primary key of that record, and then copy whatever status was found in that detail row to the master... So here goes the query people will write in the first attempt...

UPDATE MasterTable outTab
SET STATUS=
(select STATUS from DetailsTable t2 join
(select ACCOUNTID, MAX(LASTACCESSDATE) DT from DetailsTable group by ACCOUNTID) t1
on t1.ACCOUNTID=t2.ACCOUNTID and t2.LASTACCESSDATE=t1.DT and t2.IDPK=outTab.IDPK
);



Actually the query as originally written used one more IN clause, which I have converted to a JOIN to avoid more confusion...

I modified it...

MERGE into MasterTable t1
USING
(select IDPK, DENSE_RANK() over(PARTITION BY ACCOUNTID order by LASTACCESSDATE desc) rn, STATUS from DetailsTable) t2
ON (t1.IDPK=t2.IDPK and t2.rn=1)
WHEN MATCHED THEN UPDATE
SET t1.STATUS=t2.STATUS;
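One caveat here: the window must be ordered by LASTACCESSDATE descending so that rn=1 picks the latest record. Also, if two detail rows of the same account can share the same LASTACCESSDATE, DENSE_RANK would mark both as rn=1; a ROW_NUMBER with a tiebreaker (here I assume IDPK works as the tiebreaker) guarantees exactly one winner per account:

MERGE into MasterTable t1
USING
(select IDPK, ROW_NUMBER() over(PARTITION BY ACCOUNTID order by LASTACCESSDATE desc, IDPK desc) rn, STATUS from DetailsTable) t2
ON (t1.IDPK=t2.IDPK and t2.rn=1)
WHEN MATCHED THEN UPDATE
SET t1.STATUS=t2.STATUS;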


And this avoided the join and sort which was being done for every row... It did a single sort, a hash join and an update based on RID... Wow!!!
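If you want to see the plan change yourself, db2expln can print the access plan from the command line (substitute your own database and statement; this is just the invocation pattern, not output from the actual system):

db2expln -d MYDB -statement "MERGE into MasterTable t1 USING (select IDPK, DENSE_RANK() over(PARTITION BY ACCOUNTID order by LASTACCESSDATE desc) rn, STATUS from DetailsTable) t2 ON (t1.IDPK=t2.IDPK and t2.rn=1) WHEN MATCHED THEN UPDATE SET t1.STATUS=t2.STATUS" -terminal

Look for the single hash join and single sort in the MERGE plan, versus the sort driven per row by the correlated sub-query in the UPDATE plan.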

One final instance where I saw something like this:

UPDATE TableX X
SET
Col1=(select Col1 from TableY Y where X.IDPK=Y.IDPK),
Col2=(select Col2 from TableY Y where X.IDPK=Y.IDPK),
Col3=(select Col3 from TableY Y where X.IDPK=Y.IDPK),
Col4=(select Col4 from TableY Y where X.IDPK=Y.IDPK)
;


and then

INSERT INTO TableX (Col1, Col2, Col3, Col4)
Select Col1, Col2, Col3, Col4 from TableY Y
where not exists (select 1 from TableX X where X.IDPK=Y.IDPK);


and finally...

DELETE FROM TableX X where not exists (select 1 from TableY Y where X.IDPK=Y.IDPK);

I will leave it up to the readers to interpret the purpose of these statements... I will just give an alternate query...


MERGE INTO TableX X
USING TableY Y ON (X.IDPK=Y.IDPK)
WHEN MATCHED THEN UPDATE
SET
X.Col1=Y.Col1,
X.Col2=Y.Col2,
X.Col3=Y.Col3,
X.Col4=Y.Col4
WHEN NOT MATCHED THEN INSERT (Col1, Col2, Col3, Col4)
VALUES (Y.Col1, Y.Col2, Y.Col3, Y.Col4)
;


This combines the UPDATE and INSERT statements and is much more efficient, because the hash join is executed once, as opposed to executing 4 correlated sub-queries for every row in TableX for the UPDATE and then performing an anti-join for the INSERT...

The DELETE cannot be optimized or combined into the MERGE, which is a little disappointing, and I hope to see this feature in DB2 very soon... SQL Server has a "WHEN NOT MATCHED BY SOURCE" clause, using which I could have just said:


MERGE INTO TableX X
USING TableY Y ON (X.IDPK=Y.IDPK)
WHEN MATCHED THEN UPDATE
SET
X.Col1=Y.Col1,
X.Col2=Y.Col2,
X.Col3=Y.Col3,
X.Col4=Y.Col4
WHEN NOT MATCHED THEN INSERT (Col1, Col2, Col3, Col4)
VALUES (Y.Col1, Y.Col2, Y.Col3, Y.Col4)
WHEN NOT MATCHED BY SOURCE THEN DELETE
;
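Until DB2 gets that clause, one way I know to fold the DELETE in anyway is to build the source as a full outer join and mark which side each row came from. This is only a sketch (untested here, and it assumes IDPK is the key on both sides):

MERGE INTO TableX X
USING
(select Y.IDPK as SRC_IDPK, X2.IDPK as TGT_IDPK, Y.Col1, Y.Col2, Y.Col3, Y.Col4
from TableY Y full outer join TableX X2 on X2.IDPK=Y.IDPK) S
ON (X.IDPK=COALESCE(S.SRC_IDPK, S.TGT_IDPK))
WHEN MATCHED AND S.SRC_IDPK IS NULL THEN DELETE
WHEN MATCHED THEN UPDATE
SET X.Col1=S.Col1, X.Col2=S.Col2, X.Col3=S.Col3, X.Col4=S.Col4
WHEN NOT MATCHED THEN INSERT (Col1, Col2, Col3, Col4)
VALUES (S.Col1, S.Col2, S.Col3, S.Col4)
;

Rows present only in TableX come out of the join with SRC_IDPK as NULL and get deleted; the rest behave exactly like the earlier MERGE.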


Well, this post has bloated too much already and I do not want to bore people, but note that you can have multiple WHEN MATCHED and WHEN NOT MATCHED clauses, combine each with an extra condition (AND col1=col2), and have a different action for every branch; that way you write one single statement instead of one update per condition... I will try to put up some examples sometime...
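For a quick flavour of that multi-branch form, here is a hypothetical sketch (the tables, columns and statuses are made up for illustration):

MERGE INTO Accounts A
USING Staging S ON (A.IDPK=S.IDPK)
WHEN MATCHED AND S.STATUS='CLOSED' THEN DELETE
WHEN MATCHED THEN UPDATE SET A.BALANCE=S.BALANCE
WHEN NOT MATCHED AND S.STATUS='ACTIVE' THEN INSERT (IDPK, BALANCE) VALUES (S.IDPK, S.BALANCE)
ELSE IGNORE
;

One statement covers a delete, an update and a filtered insert, each picked per row by the first branch whose condition matches.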

BTW, thanks to all my readers... I feel lucky and honored to be listed in "My Favorite DB2 Blogs" by Troy Coleman @ db2tutor.

If you are reading this, Mr. Coleman, trust me, db2tutor and db2geek are my first reference points for any query... :) Thanks for putting me on the list...

SQL Anti Logic: Usage of the CASE Statement

Hello all... Reviewing queries during development and fixing performance issues in SQL are part of my job... I get a lot of queries where people have written a really bad query, and thank god we have this process in place to tune such queries... You can take this as a sequel to my post on "Anti SQL Logic"...

So someone had a requirement of getting account statistics from a table storing statistics for every account (I have tweaked the requirement and the actual query, a little for abstraction and a little to highlight the significance of the improvements done to the query)... Stats are stored in a periodical fashion, and a period code can be configured to get statistics for a specific period, say monthly or yearly... Additionally, I want stats for the current period as well as for the previous period...

Probably the naive approach that comes first to mind is below:

1) I get a union of all the possible combinations for the previous period and the current period. Below is the basic query for that...


SELECT AST.* FROM {?schemaName}.ACCSTATSTABLE AST, {?schemaName}.ACCOUNTTABLE ACC
WHERE ACC.ACCOUNTNUMBER = AST.ACCOUNTNUMBER
AND ACC.ACCOUNTNUMBER = '{?inputACCOUNTNUMBER}'
AND 'Q' = (SELECT VALUE FROM CONFIGTABLE WHERE CONFIGNAME='StatsPeriod' )
AND {?FORMONTH} > (AST.STATSPERIOD - 1) * 3 + 1 -- for current period
-- AND {?FORMONTH} <= (AST.STATSPERIOD - 1) * 3 -- for previous period
AND AST.FORYEAR = '{?FORYEAR}'


UNION

SELECT AST.* FROM {?schemaName}.ACCSTATSTABLE AST, {?schemaName}.ACCOUNTTABLE ACC
WHERE ACC.ACCOUNTNUMBER = AST.ACCOUNTNUMBER
AND ACC.ACCOUNTNUMBER = '{?inputACCOUNTNUMBER}'
AND 'H' = (SELECT VALUE FROM CONFIGTABLE WHERE CONFIGNAME='StatsPeriod' )
AND {?FORMONTH} > (AST.STATSPERIOD - 1) * 6 + 1 -- for current period
AND {?FORMONTH} <= AST.STATSPERIOD * 6 -- for previous period
AND AST.FORYEAR = {?FORYEAR}


UNION

SELECT AST.* FROM {?schemaName}.ACCSTATSTABLE AST, {?schemaName}.ACCOUNTTABLE ACC
WHERE ACC.ACCOUNTNUMBER = AST.ACCOUNTNUMBER
AND ACC.ACCOUNTNUMBER = '{?inputACCOUNTNUMBER}'
AND 'Y' = (SELECT VALUE FROM CONFIGTABLE WHERE CONFIGNAME='StatsPeriod' )
AND AST.FORYEAR = {?FORYEAR} -- for current period
-- AND AST.FORYEAR = {?FORYEAR}-1 -- for previous period


2) I also get the result sets for the last month/quarter and the 2nd half of the previous year (so that if I pass the 1st month as my input, I still get the previous year's last period ;-o)

3) I join these result sets

4) I use an inner query to get the configured period code, and then I get stats only for that period code...

Well, this is quite procedural and algorithmic... SQL is not procedural (and anyway I am personally very much against the usage of PL code with an RDBMS, especially the cursor-based things; well, that is a debated topic)...

So I wrote a structured query to get what was required, making the best use I could of CASE:

1) Select ALL the statistics columns twice
i) Once as the actual value if the row matches your period value (e.g. 4th quarter), else select it as 0
ii) Then select the stats column as 0 if the row matches your period value, else select the actual value
This will help you categorize stats column-wise, where the current-period columns hold the actual values and the 2nd set of columns holds the data for the other periods.

2) Now you need to filter the current period and previous period data
i) Filtering the current period data is pretty easy, as you know the period number
ii) For filtering on the period number of the previous period, I used the below CASE logic


(CASE {?PERIODCODE} -- the example values baked in below assume input month 2
WHEN 'M' THEN 2
WHEN 'Q' THEN ((2-1)/3)+1
WHEN 'H' THEN ((2-1)/6)+1
WHEN 'Y' THEN AST.STATSPERIOD
END) in
( {?PeriodNumber}, -- current period number
(CASE {?FORMONTH} WHEN 1 THEN -- logic to arrive at the previous period
case {?PERIODCODE} when 'M' THEN 12 -- if monthly, the previous month is 12
WHEN 'Q' THEN 4 -- if quarterly, the previous quarter is 4
WHEN 'H' THEN 2 -- if half-yearly, the previous period is 2
ELSE {?PeriodNumber} -- if yearly, the previous period number does not matter
END
ELSE {?PeriodNumber}-1 end) -- otherwise it is simply the input period number - 1
)
AND AST.FORYEAR IN ( {?FORYEAR},
(CASE {?PERIODCODE} WHEN 'Y' THEN {?FORYEAR}-1
else
(case when {?PeriodNumber}=1 then {?FORYEAR}-1
-- if my input period number is 1, I need to consider the previous year's stats as well
ELSE {?FORYEAR} END)
END)
)
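To sanity-check the integer arithmetic baked in above (input month 2): in DB2's integer math ((2-1)/3)+1 evaluates to 1 and ((2-1)/6)+1 also evaluates to 1, i.e. month 2 falls in quarter 1 and in half 1. You can try it straight from the CLP:

VALUES ( ((2-1)/3)+1, ((2-1)/6)+1 );
-- returns 1, 1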

3) Now I have 0 as the previous-period stats in rows which belong to the current period, and 0 as the current-period stats in rows which belong to the previous period. All I need to do is sum:
SELECT
SUM(temp.CURR_MAXIMUM_BALANCE),
SUM(temp.CURR_MINIMUM_BALANCE),
SUM(temp.CURR_AVG_BALANCE),
SUM(temp.CURR_CREDIT_COUNTER),
SUM(temp.CURR_INTERESTPAID),
SUM(temp.CURR_CHARGES),
SUM(temp.PREV_MAXIMUM_BALANCE),
SUM(temp.PREV_MINIMUM_BALANCE),
SUM(temp.PREV_AVG_BALANCE),
SUM(temp.PREV_CREDIT_COUNTER),
SUM(temp.PREV_INTERESTPAID),
SUM(temp.PREV_CHARGES)
from
(select
ACCOUNTID,
case when AST.STATSPERIOD=2 then MAXIMUM_BALANCE else 0 END as CURR_MAXIMUM_BALANCE,
case when AST.STATSPERIOD=2 then MINIMUM_BALANCE else 0 END as CURR_MINIMUM_BALANCE,
case when AST.STATSPERIOD=2 then AVG_BALANCE else 0 END as CURR_AVG_BALANCE,
case when AST.STATSPERIOD=2 then CREDIT_COUNTER else 0 END as CURR_CREDIT_COUNTER,
case when AST.STATSPERIOD=2 then INTERESTPAID else 0 END as CURR_INTERESTPAID,
case when AST.STATSPERIOD=2 then CHARGES else 0 END as CURR_CHARGES,
case when AST.STATSPERIOD=2 then 0 ELSE MAXIMUM_BALANCE END as PREV_MAXIMUM_BALANCE,
case when AST.STATSPERIOD=2 then 0 ELSE MINIMUM_BALANCE END as PREV_MINIMUM_BALANCE,
case when AST.STATSPERIOD=2 then 0 ELSE AVG_BALANCE END as PREV_AVG_BALANCE,
case when AST.STATSPERIOD=2 then 0 ELSE CREDIT_COUNTER END as PREV_CREDIT_COUNTER,
case when AST.STATSPERIOD=2 then 0 ELSE INTERESTPAID END as PREV_INTERESTPAID,
case when AST.STATSPERIOD=2 then 0 ELSE CHARGES END as PREV_CHARGES
FROM {?schemaName}.ACCSTATSTABLE AST
[where clause discussed above]
) temp
group by temp.ACCOUNTID



I guess I have made it very boring and complex to understand... But I hope it will help people see how flexible the usage of CASE can be...

Next, I am looking at putting up some content on improving INSERT performance... (this again has come from the processing of rows in temp tables, where data is populated in bulk and deleted in bulk)...

An Alternative to TRUNCATE for pre-v9.7

So, many a time people have to truncate a table at run time, i.e. while processing a batch, or a temp table, or while flushing a table which stores some periodic stats, etc... DB2 9.7 has an amazingly fast command called TRUNCATE which will help you do that in the blink of an eye... But oops, you are not using DB2 v9.7... you upgraded to DB2 9 or DB2 9.5 recently and cannot take on the effort required to port your application to the new version and do all the testing again... No sweat... I will be sharing a few tips which will help you imitate TRUNCATE TABLE.
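For reference, the v9.7 statement I am talking about looks like the below (the IMMEDIATE keyword is required, and TRUNCATE must be the first statement in its unit of work):

TRUNCATE TABLE tablename IMMEDIATE;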


1) One which is generic across all platforms...

ALTER TABLE tablename ACTIVATE NOT LOGGED INITIALLY WITH EMPTY TABLE;
commit;


but I am not very sure whether the space acquired by the table is retained in this case
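One way to check that for yourself, assuming DB2 9.5 or later where the SYSIBMADM.ADMINTABINFO administrative view is available, is to compare the physical size before and after (schema and table names below are placeholders):

SELECT DATA_OBJECT_P_SIZE, INDEX_OBJECT_P_SIZE
FROM SYSIBMADM.ADMINTABINFO
WHERE TABSCHEMA='MYSCHEMA' AND TABNAME='TABLENAME';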

2) Obviously

ALTER TABLE tablename ACTIVATE NOT LOGGED INITIALLY;
delete from tablename;
commit;


3) This one is specific to the OS platform, and I would not suggest using it if you have to code it in your application:

i) *NIX --> import from /dev/null of del replace into tablename
ii) Windows --> import from nul of del replace into tablename
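From the CLP, the *NIX variant would be run as below (MYDB is a placeholder database name):

db2 connect to MYDB
db2 "import from /dev/null of del replace into tablename"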

Though this operation is faster than a normal delete, it is logged and hence will not be as performant as the first option... Some people argue that the second option is not recoverable, but we are looking for an alternative to TRUNCATE (which is not recoverable either), so recovery is not my priority... Moreover, I have committed the transaction as soon as I fire the "NOT LOGGED" command for "tablename", so that should not be an issue...