Tuesday, August 2, 2011

Analytic Functions: LAG( ) -- Grab a value from a previous row.

The LAG function can be used to return a value from the previous row. This becomes usefull if, for example, you want to calculate a difference between a previous measurment and a current one. Of course this implies that you properly define the order in which the rows are to occur. The LAG function, as other analytic functions, can opperate over the whole data-set; or groups (called partitions) can be defined so that for a particular row, the LAG function only looks for values within a specified subset of the entire data set.

Let's take for instance the following data set: measurements taken for two different people at different times.

-----> Example SQL Code Listing-#1:

   WITH
     w_collected_data  as
       ( select  NULL  as nm
              ,  NULL  as dts
              ,  NULL  as measure
           from  dual  where  1=0
         -----------------------------------------------------------------------------------------------
         union all select 'Liz', to_date('2011-05-12 08:12:00','YYYY-MM-DD HH24:MI:SS'), 31.4  from dual 
         union all select 'Liz', to_date('2011-05-12 10:21:00','YYYY-MM-DD HH24:MI:SS'), 31.1  from dual 
         union all select 'Bob', to_date('2011-05-12 10:23:00','YYYY-MM-DD HH24:MI:SS'), 23.5  from dual 
         union all select 'Bob', to_date('2011-05-12 11:32:00','YYYY-MM-DD HH24:MI:SS'), 24.9  from dual 
         union all select 'Liz', to_date('2011-05-12 12:16:00','YYYY-MM-DD HH24:MI:SS'), 30.7  from dual 
         union all select 'Bob', to_date('2011-05-12 12:44:00','YYYY-MM-DD HH24:MI:SS'), 26.1  from dual 
         union all select 'Bob', to_date('2011-05-12 13:18:00','YYYY-MM-DD HH24:MI:SS'), 27.3  from dual 
         union all select 'Bob', to_date('2011-05-12 14:27:00','YYYY-MM-DD HH24:MI:SS'), 27.1  from dual 
         union all select 'Liz', to_date('2011-05-12 14:36:00','YYYY-MM-DD HH24:MI:SS'), 29.5  from dual 
         union all select 'Bob', to_date('2011-05-12 15:13:00','YYYY-MM-DD HH24:MI:SS'), 26.8  from dual 
         union all select 'Liz', to_date('2011-05-12 17:01:00','YYYY-MM-DD HH24:MI:SS'), 29.2  from dual
         -----------------------------------------------------------------------------------------------
       )
   select  *
     from  w_collected_data
    order  by  nm, dts
   ;

-------------------------
-- Results
-------------------------

   NM   DTS                  MEASURE
   ---  -------------------  -------
   Bob  2011-05-12 10:23:00  23.5
   Bob  2011-05-12 11:32:00  24.9
   Bob  2011-05-12 12:44:00  26.1
   Bob  2011-05-12 13:18:00  27.3
   Bob  2011-05-12 14:27:00  27.1
   Bob  2011-05-12 15:13:00  26.8
   Liz  2011-05-12 08:12:00  31.4
   Liz  2011-05-12 10:21:00  31.1
   Liz  2011-05-12 12:16:00  30.7
   Liz  2011-05-12 14:36:00  29.5
   Liz  2011-05-12 17:01:00  29.2



Now let's suppose that we want to find the difference between one measurement to the next.

-----> Example SQL Code Listing-#2:

   WITH
     w_collected_data  as
       ( select  NULL  as nm
              ,  NULL  as dts
              ,  NULL  as measure
           from  dual  where  1=0
         -----------------------------------------------------------------------------------------------
         union all select 'Liz', to_date('2011-05-12 08:12:00','YYYY-MM-DD HH24:MI:SS'), 31.4  from dual 
         union all select 'Liz', to_date('2011-05-12 10:21:00','YYYY-MM-DD HH24:MI:SS'), 31.1  from dual 
         union all select 'Bob', to_date('2011-05-12 10:23:00','YYYY-MM-DD HH24:MI:SS'), 23.5  from dual 
         union all select 'Bob', to_date('2011-05-12 11:32:00','YYYY-MM-DD HH24:MI:SS'), 24.9  from dual 
         union all select 'Liz', to_date('2011-05-12 12:16:00','YYYY-MM-DD HH24:MI:SS'), 30.7  from dual 
         union all select 'Bob', to_date('2011-05-12 12:44:00','YYYY-MM-DD HH24:MI:SS'), 26.1  from dual 
         union all select 'Bob', to_date('2011-05-12 13:18:00','YYYY-MM-DD HH24:MI:SS'), 27.3  from dual 
         union all select 'Bob', to_date('2011-05-12 14:27:00','YYYY-MM-DD HH24:MI:SS'), 27.1  from dual 
         union all select 'Liz', to_date('2011-05-12 14:36:00','YYYY-MM-DD HH24:MI:SS'), 29.5  from dual 
         union all select 'Bob', to_date('2011-05-12 15:13:00','YYYY-MM-DD HH24:MI:SS'), 26.8  from dual 
         union all select 'Liz', to_date('2011-05-12 17:01:00','YYYY-MM-DD HH24:MI:SS'), 29.2  from dual
         -----------------------------------------------------------------------------------------------
       )
   select  nm
        ,  dts
        ,  measure
        ,            lag( measure ) over ( order by dts )  as previous
        ,  measure - lag( measure ) over ( order by dts )  as msr_increase
     from  w_collected_data
    order  by  dts
   ;

-------------------------
-- Results
-------------------------

   NM   DTS                  MEASURE  PREVIOUS  MSR_INCREASE
   ---  -------------------  -------  --------  ------------
   Liz  2011-05-12 08:12:00  31.4
   Liz  2011-05-12 10:21:00  31.1     31.4      -0.3
   Bob  2011-05-12 10:23:00  23.5     31.1      -7.6
   Bob  2011-05-12 11:32:00  24.9     23.5       1.4
   Liz  2011-05-12 12:16:00  30.7     24.9       5.8
   Bob  2011-05-12 12:44:00  26.1     30.7      -4.6
   Bob  2011-05-12 13:18:00  27.3     26.1       1.2
   Bob  2011-05-12 14:27:00  27.1     27.3      -0.2
   Liz  2011-05-12 14:36:00  29.5     27.1       2.4
   Bob  2011-05-12 15:13:00  26.8     29.5      -2.7
   Liz  2011-05-12 17:01:00  29.2     26.8       2.4


Notice that the "lag(measure)" gives us the previous measure where the "over (order by dts)" defines how the rows are to be sorted when finding the previous row. The entire analytic function given by "lag(measure) over (order by dts)" returns the previous row's measure value if the rows were to be sorted by "dts". The first row has no previous value because there was no previous row to pull it from -- it's value is set to NULL.

In an analytic function the "over (order by *)" is only used to determine which row the value will be taken from -- it doesn't determine how the final output data-set will have its rows sorted; that is always determined in the "Order By" clause of the query.

Also, notice that this new analytic function, "lag(measure) over (order by dts)", can now be used in a calculation (the measure minus the previous measure) to show how much the measure changed from one row to the next.



Next we might realize that the Bob's measurements don't relate to the measurements of Liz. We would like to look at the two people's measurements seperately -- not inner-mixed. This means we will need to split-up the data by the "NM" column -- in analytic functions we do this via the "partition by" operator within the "over" clause.

-----> Example SQL Code Listing-#3:

   WITH
     w_collected_data  as
       ( select  NULL  as nm
              ,  NULL  as dts
              ,  NULL  as measure
           from  dual  where  1=0
         -----------------------------------------------------------------------------------------------
         union all select 'Liz', to_date('2011-05-12 08:12:00','YYYY-MM-DD HH24:MI:SS'), 31.4  from dual 
         union all select 'Liz', to_date('2011-05-12 10:21:00','YYYY-MM-DD HH24:MI:SS'), 31.1  from dual 
         union all select 'Bob', to_date('2011-05-12 10:23:00','YYYY-MM-DD HH24:MI:SS'), 23.5  from dual 
         union all select 'Bob', to_date('2011-05-12 11:32:00','YYYY-MM-DD HH24:MI:SS'), 24.9  from dual 
         union all select 'Liz', to_date('2011-05-12 12:16:00','YYYY-MM-DD HH24:MI:SS'), 30.7  from dual 
         union all select 'Bob', to_date('2011-05-12 12:44:00','YYYY-MM-DD HH24:MI:SS'), 26.1  from dual 
         union all select 'Bob', to_date('2011-05-12 13:18:00','YYYY-MM-DD HH24:MI:SS'), 27.3  from dual 
         union all select 'Bob', to_date('2011-05-12 14:27:00','YYYY-MM-DD HH24:MI:SS'), 27.1  from dual 
         union all select 'Liz', to_date('2011-05-12 14:36:00','YYYY-MM-DD HH24:MI:SS'), 29.5  from dual 
         union all select 'Bob', to_date('2011-05-12 15:13:00','YYYY-MM-DD HH24:MI:SS'), 26.8  from dual 
         union all select 'Liz', to_date('2011-05-12 17:01:00','YYYY-MM-DD HH24:MI:SS'), 29.2  from dual
         -----------------------------------------------------------------------------------------------
       )
   select  nm
        ,  dts
        ,  measure
        ,               lag( measure ) over (partition by nm order by dts)  as previous
        ,     measure - lag( measure ) over (partition by nm order by dts)  as msr_increase
        ,  round((dts - lag( dts     ) over (partition by nm order by dts))*24*60)  as min_between
     from  w_collected_data
    order  by  nm, dts
   ;

-------------------------
-- Results
-------------------------

   NM   DTS                  MEASURE  PREVIOUS  MSR_INCREASE  MIN_BETWEEN
   ---  -------------------  -------  --------  ------------  -----------
   Bob  2011-05-12 10:23:00  23.5
   Bob  2011-05-12 11:32:00  24.9     23.5       1.4           69
   Bob  2011-05-12 12:44:00  26.1     24.9       1.2           72
   Bob  2011-05-12 13:18:00  27.3     26.1       1.2           34
   Bob  2011-05-12 14:27:00  27.1     27.3      -0.2           69
   Bob  2011-05-12 15:13:00  26.8     27.1      -0.3           46
   Liz  2011-05-12 08:12:00  31.4
   Liz  2011-05-12 10:21:00  31.1     31.4      -0.3          129
   Liz  2011-05-12 12:16:00  30.7     31.1      -0.4          115
   Liz  2011-05-12 14:36:00  29.5     30.7      -1.2          140
   Liz  2011-05-12 17:01:00  29.2     29.5      -0.3          145


Now we see two seperate groups of data, one for Bob and one for Liz. In an analytic function we can 'partition by' mutliple values/columns and sort by multiple values/columns. This makes the analytic functions very powerful. For instance we might have a LAG function that looks like this: "lag(measure) over (partition by region, store, clerk order by product, sales_date)".

Additionally, the previous example added a column to show how many minutes were between the measurments using this LAG function: "lag(dts) over (partition by nm order by dts)".



Another feature of the LAG function is that you can specify how many rows to 'look-back'. The default is 1 -- or the previous row; but you can change that to 2 as shown in the following example. A third parameter can also be supplied as a defauld value (instead of NULL) for rows that don't have a previous value.

-----> Example SQL Code Listing-#4:

   WITH
     w_collected_data  as
       ( select  NULL  as nm
              ,  NULL  as dts
              ,  NULL  as measure
           from  dual  where  1=0
         -----------------------------------------------------------------------------------------------
         union all select 'Liz', to_date('2011-05-12 08:12:00','YYYY-MM-DD HH24:MI:SS'), 31.4  from dual 
         union all select 'Liz', to_date('2011-05-12 10:21:00','YYYY-MM-DD HH24:MI:SS'), 31.1  from dual 
         union all select 'Bob', to_date('2011-05-12 10:23:00','YYYY-MM-DD HH24:MI:SS'), 23.5  from dual 
         union all select 'Bob', to_date('2011-05-12 11:32:00','YYYY-MM-DD HH24:MI:SS'), 24.9  from dual 
         union all select 'Liz', to_date('2011-05-12 12:16:00','YYYY-MM-DD HH24:MI:SS'), 30.7  from dual 
         union all select 'Bob', to_date('2011-05-12 12:44:00','YYYY-MM-DD HH24:MI:SS'), 26.1  from dual 
         union all select 'Bob', to_date('2011-05-12 13:18:00','YYYY-MM-DD HH24:MI:SS'), 27.3  from dual 
         union all select 'Bob', to_date('2011-05-12 14:27:00','YYYY-MM-DD HH24:MI:SS'), 27.1  from dual 
         union all select 'Liz', to_date('2011-05-12 14:36:00','YYYY-MM-DD HH24:MI:SS'), 29.5  from dual 
         union all select 'Bob', to_date('2011-05-12 15:13:00','YYYY-MM-DD HH24:MI:SS'), 26.8  from dual 
         union all select 'Liz', to_date('2011-05-12 17:01:00','YYYY-MM-DD HH24:MI:SS'), 29.2  from dual
         -----------------------------------------------------------------------------------------------
       )
   select  nm
        ,  dts
        ,  measure
        ,  lag( measure, 2, 0 ) over (partition by nm order by dts)  as previous
     from  w_collected_data
    order  by  nm, dts
   ;

-------------------------
-- Results
-------------------------

   NM   DTS                  MEASURE  PREVIOUS
   ---  -------------------  -------  --------
   Bob  2011-05-12 10:23:00  23.5     0
   Bob  2011-05-12 11:32:00  24.9     0
   Bob  2011-05-12 12:44:00  26.1     23.5
   Bob  2011-05-12 13:18:00  27.3     24.9
   Bob  2011-05-12 14:27:00  27.1     26.1
   Bob  2011-05-12 15:13:00  26.8     27.3
   Liz  2011-05-12 08:12:00  31.4     0
   Liz  2011-05-12 10:21:00  31.1     0
   Liz  2011-05-12 12:16:00  30.7     31.4
   Liz  2011-05-12 14:36:00  29.5     31.1
   Liz  2011-05-12 17:01:00  29.2     30.7




One last example. Here's a data-set that shows the ID numbers for 3 different runners passing through checkpoints 0-3. A LAG function is used to obtain the previous "clock" value, and then a calculation is made with that function to show the minutes between each check-point.

-----> Example SQL Code Listing-#5:

   WITH
     w_check_points  as
       ( select  NULL  as id
              ,  NULL  as ckpt
              ,  NULL  as clock
           from  dual  where 1=0
         ----------------------------------------------
         union all select  12, 0, '00:00:00'  from dual 
         union all select  47, 0, '00:00:00'  from dual 
         union all select  83, 0, '00:00:00'  from dual 
         union all select  12, 1, '01:00:16'  from dual 
         union all select  47, 1, '00:43:27'  from dual 
         union all select  83, 1, '00:40:18'  from dual 
         union all select  12, 2, '02:12:47'  from dual 
         union all select  47, 2, '01:32:53'  from dual 
         union all select  83, 2, '01:49:27'  from dual 
         union all select  12, 3, '03:27:14'  from dual 
         union all select  47, 3, '02:25:04'  from dual 
         union all select  83, 3, '03:31:38'  from dual
         ----------------------------------------------
       )
   select  id
        ,  ckpt
        ,  clock
        ,  lag(clock) over (partition by id  order by ckpt)  as prev_clock
        ,  round
             ( ( to_date(clock, 'HH24:MI:SS') 
               - to_date(lag(clock) over (partition by id order by ckpt), 'HH24:MI:SS')
               ) * 24 * 60
             , 3 
             )  as min
     from  w_check_points
    order  by  id, ckpt
   ;

-------------------------
-- Results
-------------------------

   ID   CKPT  CLOCK     PREV_CLOCK  MIN
   ---  ----  --------  ----------  ---------
   12   0     00:00:00
   12   1     01:00:16  00:00:00     60.267
   12   2     02:12:47  01:00:16     72.517
   12   3     03:27:14  02:12:47     74.45
   47   0     00:00:00
   47   1     00:43:27  00:00:00     43.45
   47   2     01:32:53  00:43:27     49.433
   47   3     02:25:04  01:32:53     52.183
   83   0     00:00:00
   83   1     00:40:18  00:00:00     40.3
   83   2     01:49:27  00:40:18     69.15
   83   3     03:31:38  01:49:27    102.183

1 comment:

  1. very nice option , this articular is very user for best example.

    ReplyDelete