Friday, September 30, 2011

Oracle PIVOT and UNPIVOT to Invert Data

Note: Prior to going through this example, you may wish to make sure you understand how to use Oracle's basic PIVOT and UNPIVOT clauses.

Let's suppose we have some data that we would like to "Invert" -- the rows become the columns and the columns become the rows. Here's a simple example that has account numbers for rows and quarters as columns. Note that the "Inversion" is shown with the quarters as rows and the account numbers as the columns, and the measures have been retained to correspond with the same quarters and accounts:





So, here is our sample data and the first part of the solution. We'll use an UNPIVOT to bring all the measures into a single column:

/* ===== code ===== */     
   
   WITH
     w_data  as
       ( select  11 acct, 23.29 q_1, 81.72 q_2, 16.72 q_3, 60.00 q_4  from dual union all
         select  35     , null     , 11.53    , 51.27    , 76.14      from dual union all
         select  44     , 53.63    , 38.58    , 29.76    , 47.12      from dual
       )
   select  *
     from  w_data
  UNPIVOT  include nulls
           (   m1   FOR   qtr  IN
             ( q_1   as   1
             , q_2   as   2
             , q_3   as   3
             , q_4   as   4
             )
           )
    order  by 1 ;


Notice that we created the QTR column and assigned the proper identifying values to the previous column names. We now have the two identifiers (ACCT and QTR) and one column that contains all the measures:
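
Worked out from the sample data above, the un-pivoted result should look like this:

/* --- Expected Results --- */

     --  ACCT  QTR  M1
     --  ----  ---  -----
     --    11    1  23.29
     --    11    2  81.72
     --    11    3  16.72
     --    11    4  60.00
     --    35    1
     --    35    2  11.53
     --    35    3  51.27
     --    35    4  76.14
     --    44    1  53.63
     --    44    2  38.58
     --    44    3  29.76
     --    44    4  47.12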





For the next step we could create a sub-query with the previous code and then modify that result with a PIVOT clause. However, we can have both a PIVOT and an UNPIVOT clause in the same query.

/* ===== code ===== */     
   
   WITH
     w_data  as
       ( select  11 acct, 23.29 q_1, 81.72 q_2, 16.72 q_3, 60.00 q_4  from dual union all
         select  35     , null     , 11.53    , 51.27    , 76.14      from dual union all
         select  44     , 53.63    , 38.58    , 29.76    , 47.12      from dual
       )
   select  *
     from  w_data
  UNPIVOT  include nulls
           (   m1   FOR   qtr  IN
             ( q_1   as   1
             , q_2   as   2
             , q_3   as   3
             , q_4   as   4
             )
           )
    PIVOT  ( sum(m1) 
           FOR  acct IN
             (  11   as  acct_11
             ,  35   as  acct_35
             ,  44   as  acct_44
             )
           ) 
    order  by 1 ;


Here in the added PIVOT clause we create the three new columns for the three account values. We must use an aggregate function on the measure (sum(m1), max(m1), ...) because the PIVOT clause turns the query into an aggregate (group-by) query. The QTR column remains, but the ACCT column is replaced by the three more-specific account columns. The data-set has now been "Inverted":
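
Worked out from the sample data, the inverted result should look like this:

/* --- Expected Results --- */

     --  QTR  ACCT_11  ACCT_35  ACCT_44
     --  ---  -------  -------  -------
     --    1    23.29             53.63
     --    2    81.72    11.53    38.58
     --    3    16.72    51.27    29.76
     --    4    60.00    76.14    47.12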



Oracle's UNPIVOT Clause -- Going Further

Let's start with a nice little data set that has some un-pivot potential:

/* ===== code ===== */     
   
   WITH
     w_data  as
       ( select  'NY' st, 1 rgn, 13 q1cnt, 245.25 q1amt, 13 q2cnt, 245.25 q2amt, 13 q3cnt, 245.25 q3amt, 13 q4cnt, 245.25 q4amt  from dual union all
         select  'NY'   , 2    , 43      , 760.34      , 38      , 842.11      , 39      , 425.98      , 54      , 571.53        from dual union all
         select  'NY'   , 3    , 34      , 743.21      , 38      , 627.24      , null    , 463.74      , 41      , 154.96        from dual union all
         select  'HI'   , 1    , 14      , 256.13      , 11      , 532.48      , 10      , 246.37      , 18      , 261.37        from dual union all
         select  'OR'   , 1    , 73      , 758.14      , 71      , 132.84      , 63      , 623.74      , 82      , 851.92        from dual union all
         select  'OR'   , 2    , null    , null        , 62      , 275.11      , 73      , 846.32      , 70      , 253.19        from dual union all
         select  'UT'   , 1    , 27      , 156.38      , 32      , 637.19      , 31      , 256.48      , 39      , 146.75        from dual union all
         select  'UT'   , 2    , 24      , 264.82      , 20      , 423.61      , 21      , null        , 18      , 367.88        from dual union all
         select  'MT'   , 1    ,  3      , 374.28      ,  8      , 125.35      ,  5      , 423.66      ,  7      , 536.14        from dual 
       )
   select * from w_data ;
   

Here we have a couple of identifying columns (State and Region) that don't really have anything to do with the Un-Pivot -- they will just need to be associated with any measure that appears in their row. We also have eight columns of measures, two measures for each quarter of the year (Count and Amount). Un-Pivoting the data could mean that we take all these measures and un-pivot them into one single measures column, or it could mean that we un-pivot the four "Count" columns into one measures column and the four "Amount" columns into another measures column.





For the first example, let's start with that last one mentioned -- we'll un-pivot the data into two measure columns. We'll have a single "CNT" column and a separate "AMT" column for the measures. We'll also create a new "QTR" column that will contain an identifier that tells us which quarter the two measures came from.

Here is the solution. See the explanation and results below:

/* ===== code ===== */     
   
   WITH
     w_data  as
       ( select  'NY' st, 1 rgn, 13 q1cnt, 245.25 q1amt, 13 q2cnt, 245.25 q2amt, 13 q3cnt, 245.25 q3amt, 13 q4cnt, 245.25 q4amt  from dual union all
         select  'NY'   , 2    , 43      , 760.34      , 38      , 842.11      , 39      , 425.98      , 54      , 571.53        from dual union all
         select  'NY'   , 3    , 34      , 743.21      , 38      , 627.24      , null    , 463.74      , 41      , 154.96        from dual union all
         select  'HI'   , 1    , 14      , 256.13      , 11      , 532.48      , 10      , 246.37      , 18      , 261.37        from dual union all
         select  'OR'   , 1    , 73      , 758.14      , 71      , 132.84      , 63      , 623.74      , 82      , 851.92        from dual union all
         select  'OR'   , 2    , null    , null        , 62      , 275.11      , 73      , 846.32      , 70      , 253.19        from dual union all
         select  'UT'   , 1    , 27      , 156.38      , 32      , 637.19      , 31      , 256.48      , 39      , 146.75        from dual union all
         select  'UT'   , 2    , 24      , 264.82      , 20      , 423.61      , 21      , null        , 18      , 367.88        from dual union all
         select  'MT'   , 1    ,  3      , 374.28      ,  8      , 125.35      ,  5      , 423.66      ,  7      , 536.14        from dual 
       )
   select  st,  rgn
        ,  qtr
        ,  cnt,  amt
     from  w_data
  UNPIVOT  include nulls
             (   (   cnt,   amt ) FOR  qtr  IN
               ( ( q1cnt, q1amt )  as   1 
               , ( q2cnt, q2amt )  as   2 
               , ( q3cnt, q3amt )  as   3 
               , ( q4cnt, q4amt )  as   4 
               )
             )
  --where  nvl( cnt, 0 ) < 30
    order  by 1, 2, 3 ;


Explanation:

The "UNPIVOT" clause is placed immediately after the "FROM" clause. If you have a "WHERE" clause it will follow the "UNPIVOT" clause as shown.

The "Include Nulls" phrase will include an output row even if all the newly created measure columns are NULL. Take these two words out of the query and observe how the result-set changes.

Inside the "UNPIVOT" clause we start by creating the set of new measure columns (CNT and AMT). Notice that because we are creating a set (more than one) of new measures we put them in parentesis. These columns will contain all the un-pivoted measures. After the keyword "FOR" we create a new column (QTR) that will contain an identifier that tells us which quarter the measures came from (which original set of columns the specific count and amount measures came from).

After the "IN" keyword we include a parenthesis enclosed list that associates each set of original measure columns with an identifying value. For example, the original columns (Q1CNT and Q1AMT) are assigned the identifier of "1" -- this is the value that will appear in the new QTR column allong side the measures from those two columns.

And, one last item: if you un-comment the WHERE clause (remove the "--" from the beginning of the line), the result-set will filter out any records that have a CNT of 30 or greater.
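
For reference, the first few rows worked out from the sample data should look like this:

/* --- Partial Results --- */

     --  ST  RGN  QTR  CNT  AMT
     --  --  ---  ---  ---  -------
     --  HI  1    1     14  256.13
     --  HI  1    2     11  532.48
     --  HI  1    3     10  246.37
     --  HI  1    4     18  261.37
     --  MT  1    1      3  374.28
     --  ...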





Notice that in the following example, all eight measure columns are un-pivoted into one single measure column (VAL). The new "VAL_ID" column will contain a text identifier that not only tells which quarter the measure comes from, but also tells us whether it is a "Count" or an "Amount" value.

/* ===== code ===== */     
   
   WITH
     w_data  as
       ( select  'NY' st, 1 rgn, 13 q1cnt, 245.25 q1amt, 13 q2cnt, 245.25 q2amt, 13 q3cnt, 245.25 q3amt, 13 q4cnt, 245.25 q4amt  from dual union all
         select  'NY'   , 2    , 43      , 760.34      , 38      , 842.11      , 39      , 425.98      , 54      , 571.53        from dual union all
         select  'NY'   , 3    , 34      , 743.21      , 38      , 627.24      , null    , 463.74      , 41      , 154.96        from dual union all
         select  'HI'   , 1    , 14      , 256.13      , 11      , 532.48      , 10      , 246.37      , 18      , 261.37        from dual union all
         select  'OR'   , 1    , 73      , 758.14      , 71      , 132.84      , 63      , 623.74      , 82      , 851.92        from dual union all
         select  'OR'   , 2    , null    , null        , 62      , 275.11      , 73      , 846.32      , 70      , 253.19        from dual union all
         select  'UT'   , 1    , 27      , 156.38      , 32      , 637.19      , 31      , 256.48      , 39      , 146.75        from dual union all
         select  'UT'   , 2    , 24      , 264.82      , 20      , 423.61      , 21      , null        , 18      , 367.88        from dual union all
         select  'MT'   , 1    ,  3      , 374.28      ,  8      , 125.35      ,  5      , 423.66      ,  7      , 536.14        from dual 
       )
   select  st,  rgn
        ,  val_id,  val
     from  w_data
  UNPIVOT  include nulls
             (   val    FOR   val_id        IN
               ( q1cnt   as  'Qtr-1 Count'
               , q2cnt   as  'Qtr-2 Count'
               , q3cnt   as  'Qtr-3 Count'
               , q4cnt   as  'Qtr-4 Count'
               , q1amt   as  'Qtr-1 Amount'
               , q2amt   as  'Qtr-2 Amount'
               , q3amt   as  'Qtr-3 Amount'
               , q4amt   as  'Qtr-4 Amount'
               )
             )
    order  by 1, 2, 3 ;


/* --- Partial Results --- */

     --  ST  RGN  VAL_ID        VAL
     --  --  ---  ------------  -------
     --  HI  1    Qtr-1 Amount  256.13
     --  HI  1    Qtr-1 Count   14
     --  HI  1    Qtr-2 Amount  532.48
     --  HI  1    Qtr-2 Count   11
     --  HI  1    Qtr-3 Amount  246.37
     --  HI  1    Qtr-3 Count   10
     --  HI  1    Qtr-4 Amount  261.37
     --  HI  1    Qtr-4 Count   18
     --  MT  1    Qtr-1 Amount  374.28
     --  MT  1    Qtr-1 Count   3
     --  ...




This next example is very similar to the last one. In this query, there are two identifying columns created (VAL_QTR and VAL_TYP) -- the identifying information here is split between these two columns. The first tells us which quarter the measure came from, and the second column tells us whether it was a "Count" or an "Amount" value.

/* ===== code ===== */     
   
   WITH
     w_data  as
       ( select  'NY' st, 1 rgn, 13 q1cnt, 245.25 q1amt, 13 q2cnt, 245.25 q2amt, 13 q3cnt, 245.25 q3amt, 13 q4cnt, 245.25 q4amt  from dual union all
         select  'NY'   , 2    , 43      , 760.34      , 38      , 842.11      , 39      , 425.98      , 54      , 571.53        from dual union all
         select  'NY'   , 3    , 34      , 743.21      , 38      , 627.24      , null    , 463.74      , 41      , 154.96        from dual union all
         select  'HI'   , 1    , 14      , 256.13      , 11      , 532.48      , 10      , 246.37      , 18      , 261.37        from dual union all
         select  'OR'   , 1    , 73      , 758.14      , 71      , 132.84      , 63      , 623.74      , 82      , 851.92        from dual union all
         select  'OR'   , 2    , null    , null        , 62      , 275.11      , 73      , 846.32      , 70      , 253.19        from dual union all
         select  'UT'   , 1    , 27      , 156.38      , 32      , 637.19      , 31      , 256.48      , 39      , 146.75        from dual union all
         select  'UT'   , 2    , 24      , 264.82      , 20      , 423.61      , 21      , null        , 18      , 367.88        from dual union all
         select  'MT'   , 1    ,  3      , 374.28      ,  8      , 125.35      ,  5      , 423.66      ,  7      , 536.14        from dual 
       )
   select  st,  rgn
        ,  val_qtr, val_typ, val
     from  w_data
  UNPIVOT  include nulls
             (   val    FOR  ( val_qtr, val_typ  ) IN
               ( q1cnt   as  ( 'Qtr-1', 'Count'  )
               , q2cnt   as  ( 'Qtr-2', 'Count'  )
               , q3cnt   as  ( 'Qtr-3', 'Count'  )
               , q4cnt   as  ( 'Qtr-4', 'Count'  )
               , q1amt   as  ( 'Qtr-1', 'Amount' )
               , q2amt   as  ( 'Qtr-2', 'Amount' )
               , q3amt   as  ( 'Qtr-3', 'Amount' )
               , q4amt   as  ( 'Qtr-4', 'Amount' )
               )
             )
    order  by 1, 2, 3, 4 ;


/* --- Partial Results --- */

     --  ST  RGN  VAL_QTR  VAL_TYP  VAL
     --  --  ---  -------  -------  -------
     --  HI  1    Qtr-1    Amount   256.13
     --  HI  1    Qtr-1    Count    14
     --  HI  1    Qtr-2    Amount   532.48
     --  HI  1    Qtr-2    Count    11
     --  HI  1    Qtr-3    Amount   246.37
     --  HI  1    Qtr-3    Count    10
     --  HI  1    Qtr-4    Amount   261.37
     --  HI  1    Qtr-4    Count    18
     --  MT  1    Qtr-1    Amount   374.28
     --  MT  1    Qtr-1    Count    3
     --  ...




This final example shows that we can create as many identifying columns as we wish to include, and that we can define their values as just about anything we would need.

/* ===== code ===== */     
   
   WITH
     w_data  as
       ( select  'NY' st, 1 rgn, 13 q1cnt, 245.25 q1amt, 13 q2cnt, 245.25 q2amt, 13 q3cnt, 245.25 q3amt, 13 q4cnt, 245.25 q4amt  from dual union all
         select  'NY'   , 2    , 43      , 760.34      , 38      , 842.11      , 39      , 425.98      , 54      , 571.53        from dual union all
         select  'NY'   , 3    , 34      , 743.21      , 38      , 627.24      , null    , 463.74      , 41      , 154.96        from dual union all
         select  'HI'   , 1    , 14      , 256.13      , 11      , 532.48      , 10      , 246.37      , 18      , 261.37        from dual union all
         select  'OR'   , 1    , 73      , 758.14      , 71      , 132.84      , 63      , 623.74      , 82      , 851.92        from dual union all
         select  'OR'   , 2    , null    , null        , 62      , 275.11      , 73      , 846.32      , 70      , 253.19        from dual union all
         select  'UT'   , 1    , 27      , 156.38      , 32      , 637.19      , 31      , 256.48      , 39      , 146.75        from dual union all
         select  'UT'   , 2    , 24      , 264.82      , 20      , 423.61      , 21      , null        , 18      , 367.88        from dual union all
         select  'MT'   , 1    ,  3      , 374.28      ,  8      , 125.35      ,  5      , 423.66      ,  7      , 536.14        from dual 
       )
   select  st,   rgn
        ,  qtr,  qnm,  qrn,  qwrd,  qna,  qnb
        ,  cnt,  amt
     from  w_data
  UNPIVOT  include nulls
             (   (   cnt,   amt )  FOR  ( qtr ,        qnm  ,  qrn  ,   qwrd  ,  qna ,  qnb ) IN
               ( ( q1cnt, q1amt )   as  (   1 , 'Quarter-1' ,   'I' ,   'One' , 0.00 , 0.25 )
               , ( q2cnt, q2amt )   as  (   2 , 'Quarter-2' ,  'II' ,   'Two' , 0.25 , 0.50 )
               , ( q3cnt, q3amt )   as  (   3 , 'Quarter-3' , 'III' , 'Three' , 0.50 , 0.75 )
               , ( q4cnt, q4amt )   as  (   4 , 'Quarter-4' ,  'IV' ,  'Four' , 0.75 , 1.00 )
               )
             )
    order  by 1, 2, 3 ;


/* --- Partial Results --- */

     --  ST  RGN  QTR  QNM        QRN  QWRD   QNA    QNB    CNT  AMT
     --  --  ---  ---  ---------  ---  -----  -----  -----  ---  -------
     --  HI  1    1    Quarter-1  I    One    0      0.25    14  256.13
     --  HI  1    2    Quarter-2  II   Two    0.25   0.5     11  532.48
     --  HI  1    3    Quarter-3  III  Three  0.5    0.75    10  246.37
     --  HI  1    4    Quarter-4  IV   Four   0.75   1       18  261.37
     --  MT  1    1    Quarter-1  I    One    0      0.25     3  374.28
     --  MT  1    2    Quarter-2  II   Two    0.25   0.5      8  125.35
     --  ...


Thursday, September 22, 2011

Oracle's UNPIVOT Clause -- A Simple Example


To "Un-Pivot" data generally means to take the values from multiple columns and place them in a single column.

Here we have a table that contains measures for each quarter of the year for various accounts. We would like to 'un-pivot' the data so that all the measures that are currently contained in the four columns ( Q1, Q2, Q3, and Q4 ) will be located in one single column. In order to not lose any information we will also need a new column that will contain an indicator that tells us which column (which quarter, in this case) the specific measure came from.




Here's the "UNPIVOT" query (it also contains the sample data above) that will do the trick:

/* ===== code ===== */     
   

   WITH
     w_data  as
       ( select  11 acct, 23.29 q1, 81.72 q2, 16.72 q3, 60.00 q4  from dual union all
         select  35     , null    , 11.53   , 51.27   , 76.14     from dual union all
         select  44     , 53.63   , 38.58   , 29.76   , 47.12     from dual
       )
   select  acct,  qtr,  m1
     from  w_data
  UNPIVOT  include nulls
           (   m1  FOR  qtr  IN
             ( q1   as   1
             , q2   as   2
             , q3   as   3
             , q4   as   4
             )
           )
    order  by 1, 2 ;



Explanation: 

We add the "UNPIVOT" clause immediately after the "FROM" clause. If we want a place-holder for each account and quarter (even if the measure is NULL) then we can include the phrase "Include Nulls" as shown. Inside the main parenthesis of the UNPIVOT clause we first create two new columns. Prior to the keyword "FOR" we create the column that will contain the measures (m1). After the keyword "FOR" we create the column that will contain the identifying values that tell us which original column the measures came from (qtr). Then, following the keyword "IN" we list the columns and idetifying values as shown above. The following is an English translation of what is being said in this "IN" clause:

          Values in col. Q1 will be placed in the new col. M1 and identified by a 1 in col. QTR.
          Values in col. Q2 will be placed in the new col. M1 and identified by a 2 in col. QTR.
          Values in col. Q3 will be placed in the new col. M1 and identified by a 3 in col. QTR.
          Values in col. Q4 will be placed in the new col. M1 and identified by a 4 in col. QTR.

And, of course, the value in the ACCT column will stay the same for each measure brought into the result-set. The result of this query is shown below:
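
Worked out from the sample data, the result should look like this:

/* --- Expected Results --- */

     --  ACCT  QTR  M1
     --  ----  ---  -----
     --    11    1  23.29
     --    11    2  81.72
     --    11    3  16.72
     --    11    4  60.00
     --    35    1
     --    35    2  11.53
     --    35    3  51.27
     --    35    4  76.14
     --    44    1  53.63
     --    44    2  38.58
     --    44    3  29.76
     --    44    4  47.12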





The previous example shows how a simple un-pivot statement works. In more complex un-pivot queries, you can have as many columns as you wish in order to show the desired identifying values, and you can un-pivot multiple columns in your source data into multiple measure columns in the result set. (More to come.)

Thursday, August 18, 2011

Use Regular Expressions to Extract CSV-Type Values

Here are a couple of quick examples of how to extract the individual elements of a CSV-like string using Oracle's regular expressions.



The first example uses a rather simple regular expression, but it will fail if there is a comma within one of the values (for example: "Jones, Joe"). It will account for the values being enclosed in quotes or not, but a comma can only be there to separate values.

The second example corrects this issue, but the regular expression has to be just a bit more complicated. An explanation of each of the two regular expressions is included at the end.



The first example below shows the sample data in the first "WITH" sub-query. Notice that the data is intentionally messy; it has extra spaces and sometimes includes double-quote marks around the values.

The second "WITH" sub-query uses the "REGEXP_REPLACE" function to extract one item at a time from the entire string. The regular expression must match with each record of the source data. The internal sets of parenthesis in the expression identify the four seperate data elements. In the function, the last parameter (ie: '\3') tells it to replace the entire CSV record with that one particular value (identified by one of the four sets of parenthesis within the regular expression). So, when this subquery finishes, we now have the CSV-type records broken down into four text columns (c1, c2, c3, c4).

The third "WITH" sub-query trims off extra spaces and quote marks, and then casts each data element to its proper data type. Notice that the DATE strings all need to match the format pattern given here to be able to successfully convert them into date data-types.

/* ===== code ===== */
 
   WITH /* Extract and Cast CSV Data */
     w_csv  as
       ( select  '        ,        ,        ,        '                          x from dual union all
         select  '    8  , 37.251  , "Joe Jones"   , "2009-08-18 13:27:33" '      from dual union all
         select  '   ""  ,  6.471  ,  Sam Smith    , "2010-12-30 16:13:39" '      from dual union all
         select  '   10  ,  0      ,               ,                       '      from dual union all
         select  '   19  , "0.25"  ,  Bob Benson   , " 2003-10-24 23:59:01 " '    from dual union all
         select  '  "47" , 13.23   ,  " "          ,  2007-05-02 22:58:02  '      from dual union all
         select  '  106  , 44.732  ,  "Liz Loyd"   , ""                    '      from dual union all
         select  ' 2249  ,   ""    , "Nelson Ned"  ,  1997-06-05 04:42:26  '      from dual union all
         select  '" 32154 ",0.001,"Fizwig Ray M","1913-02-04 18:01:00"'           from dual union all
         select  '"","42.1234",Sam B Smith,"2010-12-30 16:13:39"'                 from dual union all
         select  '"  12  ",2.4,"",2007-05-02 22:58:02'                            from dual union all
         select  '98,".9732","Jan James",""'                                      from dual union all
         select  ',,,'                                                            from dual
       )
   , w_elements  as
       ( select  regexp_replace(x,'^(.*),(.*),(.*),(.*)$','\1')  as c1
              ,  regexp_replace(x,'^(.*),(.*),(.*),(.*)$','\2')  as c2
              ,  regexp_replace(x,'^(.*),(.*),(.*),(.*)$','\3')  as c3
              ,  regexp_replace(x,'^(.*),(.*),(.*),(.*)$','\4')  as c4
           from  w_csv
       )
   , w_data  as
       ( select  to_number(trim('"' from trim(c1)))  as id
              ,  to_number(trim('"' from trim(c2)))  as val
              ,    to_char(trim('"' from trim(c3)))  as nam
              ,    to_date(trim('"' from trim(c4)), 'YYYY-MM-DD HH24:MI:SS')  as dts
           from  w_elements
       )
   SELECT  *
     from  w_data ;


The Results:
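
(Worked out from the sample rows; row order may vary, and the DTS display depends on your NLS date settings -- it is shown here as YYYY-MM-DD HH24:MI:SS.)

/* --- Partial Results --- */

     --  ID  VAL     NAM        DTS
     --  --  ------  ---------  -------------------
     --                                               (first row: all NULLs)
     --   8  37.251  Joe Jones  2009-08-18 13:27:33
     --      6.471   Sam Smith  2010-12-30 16:13:39
     --  10  0
     --  ...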




In this next example, the regular expression will allow you to put a comma within a string if you enclose that string in double-quotes.

/* ===== code ===== */
 
   WITH /* Extract and Cast CSV Data */
     w_csv  as
       ( select  '        ,        ,        ,        '                          x from dual union all
         select  '    8  , 37.251  , "Joe Jones"   , "2009-08-18 13:27:33" '      from dual union all
         select  '   ""  ,  6.471  ,  Sam Smith    , "2010-12-30 16:13:39" '      from dual union all
         select  '   10  ,  0      ,               ,                       '      from dual union all
         select  '   19  , "0.25"  ,  Bob Benson   , " 2003-10-24 23:59:01 " '    from dual union all
         select  '  "47" , 13.23   ,  ","          ,  2007-05-02 22:58:02  '      from dual union all
         select  '  106  , 44.732  ,  "Liz Loyd"   , ""                    '      from dual union all
         select  ' 2249  ,   ""    , "Nelson, Ned"  ,  1997-06-05 04:42:26  '     from dual union all
         select  '" 32154 ",0.001,"Fizwig, Ray, M","1913-02-04 18:01:00"'         from dual union all
         select  '"","42.1234",Sam B Smith,"2010-12-30 16:13:39"'                 from dual union all
         select  '"  12  ",2.4,"",2007-05-02 22:58:02'                            from dual union all
         select  '98,".9732","Jan James,",""'                                     from dual union all
         select  ',,,'                                                            from dual
       )
   , w_elements  as
       ( select  regexp_replace(x,'^ *(".*"|[^,]*) *, *(".*"|[^,]*) *, *(".*"|[^,]*) *, *(".*"|[^,]*) *$','\1')  as c1
              ,  regexp_replace(x,'^ *(".*"|[^,]*) *, *(".*"|[^,]*) *, *(".*"|[^,]*) *, *(".*"|[^,]*) *$','\2')  as c2
              ,  regexp_replace(x,'^ *(".*"|[^,]*) *, *(".*"|[^,]*) *, *(".*"|[^,]*) *, *(".*"|[^,]*) *$','\3')  as c3
              ,  regexp_replace(x,'^ *(".*"|[^,]*) *, *(".*"|[^,]*) *, *(".*"|[^,]*) *, *(".*"|[^,]*) *$','\4')  as c4
           from  w_csv
       )
   , w_data  as
       ( select  to_number(trim('"' from trim(c1)))  as id
              ,  to_number(trim('"' from trim(c2)))  as val
              ,    to_char(trim('"' from trim(c3)))  as nam
              ,    to_date(trim('"' from trim(c4)), 'YYYY-MM-DD HH24:MI:SS')  as dts
           from  w_elements
       )
   SELECT  *
     from  w_data ;


The Results:





Here is an explanation of the two regular expressions used above in the "REGEXP_REPLACE" function.

 
 
Note that in these explanations a space is shown as an underscore (_).  
In the query's code, the space is a normal space character.

Regular Expression:  '^(.*),(.*),(.*),(.*)$'

a)  ^       The string begins,  followed by ...
b)  (       an open parenthesis (the start of a column's value),  followed by ...
c)  .*      zero or more (*) characters of anything (.)
d)  )       a closing parenthesis (the end of a column's value),  followed by ...
              (Note: the above {b through d} is repeated, separated by commas, for each column of CSV data.)
e)  ,       a comma (,) if there are more columns,  followed by ...
f)  $       the end of the string ($).



Regular Expression:  '^_*(".*"|[^,]*)_*,_*(".*"|[^,]*)_*,_*(".*"|[^,]*)_*,_*(".*"|[^,]*)_*$'

a)  ^       The string begins,  followed by ...
b)  _*      zero or more (*) spaces (_),  followed by ...
c)  (       an open parenthesis (the start of a column's value),  followed by ...
d)  ".*"    zero or more (*) characters of anything (.) enclosed by quotes (")
e)  |       OR
f)  [^,]*   zero or more (*) characters of anything except a comma ([^,]),  followed by ...
g)  )       a closing parenthesis (the end of a column's value),  followed by ...
h)  _*      zero or more (*) spaces (_),  followed by ...
              (Note: the above {b through h} is repeated, separated by commas, for each column of CSV data.)
i)  ,       a comma (,) if there are more columns,  followed by ...
j)  $       the end of the string ($).
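

To see the difference between the two expressions in action, here is a minimal sketch (using a made-up test string) that applies the quote-aware expression; the '\3' back-reference pulls out the third element, quotes still attached:

/* ===== code ===== */

   select  regexp_replace(
             '8, "37.251", "Jones, Joe", 2009-08-18',
             '^ *(".*"|[^,]*) *, *(".*"|[^,]*) *, *(".*"|[^,]*) *, *(".*"|[^,]*) *$',
             '\3' )  as third_element
     from  dual ;

     --  THIRD_ELEMENT
     --  -------------
     --  "Jones, Joe"

With the simpler expression, the embedded comma in "Jones, Joe" would throw off the element boundaries (its '\3' would return ' Joe"' instead).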



Tuesday, August 16, 2011

Using SQL-Loader to Load CSV Data Into a Table.

Given a CSV (comma separated values) file of data, it's actually not that hard to use Oracle's "SQL-Loader" to insert it all into a table. The use of "SQL-Loader" requires that you have an Oracle Client properly installed on your machine.

Here's a small set of CSV data -- intentionally made just a little bit messy. Note that some of the fields are enclosed by quote marks and others are not; if a string has a comma within it (such as some of the names below) then it MUST be enclosed by quotes. Notice that the first row of data is actually the list of column names. And, take note of the date format.

If you want to run this demo yourself, copy the data and other files below into one single folder on your machine as indicated. You can supply directory paths along with the file-names, but putting everything in one folder eliminates the need for that and makes things a little simpler. This demo was executed on a Windows machine, but it works very much the same on a Unix machine (of course, you would use a shell-script file instead of a batch file).

Create this file as "SourceData.csv":
ID_VAL, JUNK, NUM_VAL, PERSON_NM, THE_DATE
    8  ,yy, 37.251  , "Joe Jones"   , "2009-08-18 13:27:33" 
   ""  ,  zzz  ,  6.471  ,  Sam Smith    , "2010-12-30 16:13:39" 
      ,        ,        ,        ,        
   10  ,1,  0      ,               ,                       
   19  ,2, "0.25"  ,  Bob Benson   , "9999-99-99 99:99:99" 
  "47" , 3 , 13.23   ,  ","          ,  2007-05-02 22:58:02  
  106  , , 44.732  , "Liz Loyd"    , ""                    
 2249  ,"",   ""    , "Nelson, Ned" ,  1997-06-05 04:42:26  
"32154", ,0.001,"fizwig, Ray, M","1913-02-04 18:01:00"
"",abc,"42.1234",Sam b smiTh,"2010-12-30 16:13:39"
"12",~,2.4,",",2007-05-02 22:58:02
,,,,
98, ",",".9732","jAmes, jAn",""



We will need a table in which to load the data. Note that the column names do not need to match the names given in the CSV file. Execute this command in a SQL session -- you don't need to put this in a file on your machine.
/* ===== code ===== */
 
     Create table scott.loader_demo
       ( id_num      integer
       , some_dts    date
       , name_txt    varchar2(20)
       , extra_col   varchar2(5)
       , val_num     number
       ) ;



Next we will need a control file -- this is the heart of using "SQL-Loader". It defines where the data is coming from and where it is going; it contains all the details of how the load should proceed. In this example we see how to do a basic load of simple data-types. The following control file contains lots of comments -- all these, of course, can be stripped out.

Create this file as "ControlFile.ctl":
------------------------------------------------------------
-- SQL-Loader Control File
------------------------------------------------------------
OPTIONS
   ( skip=1                                    -- Note {op1}
   , errors=10                                 -- Note {op2}
-- , load=100                                  -- Note {op3}
   , direct=true                               -- Note {op4}
   , rows=10000                                -- Note {op5}
-- , bindsize=67108864                         -- Note {op6}
-- , readsize=67108864                         -- Note {op7}
   )
------------------------------------------------------------
UNRECOVERABLE                                  -- Note {L1}
LOAD DATA                                      -- Note {L2}
   INFILE        'SourceData.csv'              -- Note {L3}
   BADFILE       'LoadResults_BAD.log'         -- Note {L4}
   DISCARDFILE   'LoadResults_DISCARDED.log'   -- Note {L5}
   APPEND                                      -- Note {L6}
   INTO TABLE     scott.loader_demo            -- Note {L7}
-- WHEN           id_num <> ''                 -- Note {L8}
-------------------------------------------------------
Fields Terminated    by ","                    -- Note {x1}
Optionally Enclosed  by '"'                    -- Note {x2}
Trailing Nullcols                              -- Note {x3}
------------------------------------------------------------
                                               -- Note {c1}
   ( id_num
   , junk_01     FILLER                        -- Note {c2}
   , val_num
   , name_txt   "initcap(trim(:name_txt))"     -- Note {c3}
   , some_dts   DATE "YYYY-MM-DD HH24:MI:SS"   -- Note {c4}
   )
------------------------------------------------------------
-- {op1}: Skip the first <n> header rows (column names).
-- {op2}: Allow <n> errors before aborting the load.
-- {op3}: Load only this many rows, then stop.
-- {op4}: Direct=true can be much faster; data-blocks
--          are directly built onto the table.
--        Direct=false does conventional inserts;
--          Indexes updated as each row is inserted.
--        Indexes and Constraints affect the
--          performance of either load type.
-- {op5}: Load <n> rows between each commit.
-- {op6}: For conventional loads, setting this to a
--          large value can increase performance.
-- {op7}: Works in conjunction with 'bindsize'.
---------
-- {L1}:  Can speed up the load. If the database
--          fails during the load, loaded data may be lost.
-- {L2}:  Section to specify files and tables
-- {L3}:  The name of the CSV source data file.
-- {L4}:  Error Records will be written to this file.
-- {L5}:  Valid rows not loaded, written to this file.
--          (All-NULLs or rows filtered out by a WHEN)
-- {L6}:  Table load method, choose one of the following:
--          INSERT, APPEND, REPLACE, or TRUNCATE.
-- {L7}:  The name of the destination Oracle table.
-- {L8}:  Load only records that meet a condition.
---------
-- {x1}:  The field terminator ( usually <,> or <|> ).
-- {x2}:  The quote character ( usually <"> ).
-- {x3}:  Missing fields at end of record = NULL.
---------
-- {c1}:  This section is the column list. The number
--          and order of the columns should match your
--          CSV data, not necessarily the target table.
--        The column names in this list should be the
--          column names of your target table.
-- {c2}:  Use key word FILLER to ignore a column of CSV
--          data. The name in this list does not matter.
-- {c3}:  Example using functions on CSV values.
--          Use quotes around functions as shown.
-- {c4}:  Use key word DATE to convert CSV date values.
--          Format string in ("") must match CSV data.
------------------------------------------------------------


A further explanation of the control file is given at the end of this article.

The last thing we need to do is to execute the load -- the command can be issued from a command prompt; but in this example we put the command in a batch file. In Unix you could put the command in a shell-script.

In single quotes, you'll need to supply the proper 'user-name / password @ database-identifier'. The last two elements point the load at your newly created control file and define the results log file. Now, if you have a properly installed Oracle Client, double-clicking on this batch file will execute the load.

If you don't want to put the password in the batch file just leave it out (as well as the preceding slash) and it will ask you for the password at run-time (i.e. 'scott @ DBSID' ).

Create this file as "GoLoad.bat": (Edit the Log-In information)
@echo off
echo.
sqlldr  'scott / tiger @ DBSID'  control='ControlFile.ctl'   log='LoadResults.log'
pause



When the load finishes, you can observe that the specified data has been inserted into your table:
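
A quick query shows what made it into the table (the exact rows depend on which records were rejected during the load):

/* ===== code ===== */

     Select  *
       from  scott.loader_demo
      order  by id_num ;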



Don't forget to examine the three 'LoadResults' log files. They contain information about the load process. You will be able to see which records didn't load and why.





Another interesting detail is that you can combine your data and the control file as shown in the following example. Here the "INFILE" is specified as "*" -- and you need to include the "BEGINDATA" just prior to the lines of CSV data. This would be executed in the same way the previous example was executed.

----------------------------------------------------------------------
-- SQL-Loader: A Control File that Includes the DATA.
----------------------------------------------------------------------
OPTIONS 
   ( skip=1
   , rows=10000
   , errors=10 
   )
----------------------------------------------------------------------
LOAD DATA
   INFILE        *
   BADFILE       'LoadResults_BAD.log'            
   DISCARDFILE   'LoadResults_DISCARDED.log'      
   APPEND                                         
   INTO TABLE     scott.loader_demo                               
----------------------------------------------------------------------
Fields Terminated    by ","                               
Optionally Enclosed  by '"'                               
Trailing Nullcols                                       
----------------------------------------------------------------------
   ( id_num
   , junk_01    FILLER
   , val_num  
   , name_txt   "initcap(trim(:name_txt))" 
   , some_dts   DATE "YYYY-Mon-DD HH24:MI" 
   )
----------------------------------------------------------------------
BEGINDATA
"ID_NUM","junk","A_VALUE","FULL_NAME","THE_DATE"
8,"a",37.251,"Joe jones","2009-Aug-18 13:27"
6,b,23.9,Ray Rulon,1984-Jan-23 02:37 
13  ,  c  ,  0.2  ,  Elie Ellison  , 1998-Feb-13 13:28
,d,6.471,sam SmItH,2010-Dec-30 16:13
47,"e",13.23,"","2007-May-02 22:58"    
993,"bad-date",274.51,"Bad, Bob","01/01/2001"
421,"f",24.5,"   Sanders, slY    ","2003-Sep-12 23:59" 
2,"g",1432.8,"Jan Jepson","2005-Jul-05 16:02" 
106,"h","44.732","Liz Loyd",""               
,,,,
1,,,,
27,"i",1.567,"bOb bEnnie bEnson","2000-Mar-30 10:44" 
7,,22.21,"Mel morris","1978-Apr-01 04:10" 
2249,"j",,"Nelson, ned","1997-Jun-05 04:42"
0,"",0,"Empty",






Additional notes on the control file:

The first section of the control-file is the "OPTIONS" -- here you set a few details of the load process. The "skip" option allows you to skip as many rows as you wish at the beginning of the CSV file. Normally it is used to skip one row of column names if they exist. If there are no column names, just specify "skip=0".

The "rows" option allows you to specify how many rows will be loaded between each "Commit". For most circumstances, something like "rows=10000" would work just fine.

The "errors" option allows you to specify how many errors will be tolerated before the whole load process is aborted. If your data is very bad and you just want to load the rows that will load, you could set this to a high value.

The "load" parameter is usually included while testing a load; it tells the process to load a specific number of rows and then quit -- even if the CSV file has many more rows of data.

The next section "LOAD DATA" is where you define the file-names (in single quotes) and the destination table name (not in quotes).

The "INFILE" is your CSV data file that contains the records you wish to load.

The "BADFILE" will contain all the records from the load process that 'errored-off' for some reason.

The "DISCARDFILE" will contain all the records that didn't load because all the columns were NULL or were disqualified by the "WHEN" clause.

The keyword "APPEND" will insert the new records into the table leaving existing records in the table. If you specify "TRUNCATE" instead of "APPEND" then the table will be truncated prior to loading the new records.

You can include an optional "WHEN" clause with a condition that controls which CSV records will be loaded. Some possible examples include: id_num <> '' , id_num = '8' . Remember that this is not a "WHERE" clause and you can't use Oracle functions here -- it is actually quite limited.
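
For example, un-commenting the WHEN line in the demo control file would load only the records whose first field is non-empty -- a sketch of that fragment in context:

   APPEND
   INTO TABLE     scott.loader_demo
   WHEN           id_num <> ''      -- skip records with an empty ID_NUM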

The next section defines the delimiter (usually a comma) and the quote marks that can be used to enclose a value. "Trailing Nullcols" tells the loader to interpret missing values at the end of records as NULLs.

The last and very important section of the control file is the column list. Make sure that the order and number of columns match your CSV data -- and that the column names match your destination table (except for "FILLER" columns). So, the position in your control-list tells SQL-Loader which column in the CSV file to get the value from, and its name in your control-list tells SQL-Loader which column in your destination table the value will be placed in.

If you have a CSV column that you wish to ignore, give it any name (junk, dummy, etc.) and put the keyword "FILLER" immediately after. This tells SQL-Loader to ignore that CSV column of data.

Notice that you do not need to insert data into every column in your destination table. Table columns not listed in the control-list will be left NULL.

Date values should be followed with the "DATE" keyword and the appropriate date format string. CSV records that have dates not matching this string will be rejected.

It is also possible to manipulate the data during the load process. Notice that you can use Oracle functions on the values by enclosing the expression in double quotes as shown. Make sure you repeat the column name preceded by a colon inside the function expression.


A few more possibilities in manipulating the data:

 
 Review & Going further... 
 
 In your column list, remember that the number and position of the 
 entries should match your CSV data file.
                                                      
    ( CSV_Col_1    
    , CSV_Col_2    
    , CSV_Col_3  
    , CSV_Col_4  
    , CSV_Col_5  
    )  
 
 Mark the CSV columns that you won't use with the "FILLER" keyword.
 The column names of unused filler data do not matter.
    
    ( CSV_Col_1   FILLER 
    , CSV_Col_2   FILLER 
    , CSV_Col_3  
    , CSV_Col_4  
    , CSV_Col_5  
    )
 
 The columns that you will use need to be labeled with the name of the
 table column where they will be loaded. 
 
    ( CSV_Col_1   FILLER 
    , CSV_Col_2   FILLER 
    , val_num  
    , name_txt  
    , some_dts  
    )
 
 
 Manipulating the CSV Values! 
 
 You can then add modifiers to alter the data. DATE values will
 need to have the format string included after the "DATE" keyword.
 To use incoming CSV values in an expression with Oracle functions,
 you refer to them with the name of their destination column as a 
 bind variable (ex ":name_txt"). With that as the column-name you
 then use Oracle functions and enclose the whole expression in 
 double-quotes as shown on the name_txt line.
 
    ( CSV_Col_1   FILLER 
    , CSV_Col_2   FILLER 
    , val_num  
    , name_txt    "initcap(trim( :name_txt ))"
    , some_dts    DATE "YYYY-MM-DD HH24:MI:SS" 
    )
 
 
 Date Manipulation!
 
 There are a couple of extra things you can do with a date column.
 Instead of using the DATE keyword, you can use the "to_date" function.
 You can even add more functions to manipulate the date as shown on
 the third row below (this would truncate the date and set it to the 
 last day of the previous month). Notice the fourth row shows how to
 assign the SYSDATE to a date column -- incoming CSV values are ignored.
    
    , some_dts    DATE "YYYY-MM-DD HH24:MI:SS" 
    , some_dts    "to_date( :some_dts ,'YYYY-MM-DD HH24:MI:SS')" 
    , some_dts    "trunc(to_date( :some_dts ,'YYYY-MM-DD HH24:MI:SS'),'MM')-1" 
    , some_dts    "sysdate"
   

 Using Constants!

 You can also put a constant number in a numeric field or a constant
 string value in a Varchar2 field -- ignoring the CSV values.   
 Enclose numeric constants in double-quotes; enclose string constants
 in single-quotes then double-quotes.   
 
    ( CSV_Col_1   FILLER 
    , CSV_Col_2   FILLER 
    , val_num     "1234"
    , name_txt    "'Bill Bagnnn'"
    , some_dts    DATE "YYYY-MM-DD HH24:MI:SS" 
    )

 Adding a Sequence!
 
 If you have an extra column in your table where you would like to put
 a sequence number, then put that column name at the end of your list
 and the keyword "SEQUENCE". Actually it appears that you can put it 
 anywhere in your list: first, in-between, or last. It also does not have
 to be a numeric column. The two values in parenthesis are the starting
 sequence number and then the increment value. If you are appending to
 a table instead of truncate-and-loading, then you can use "SEQUENCE(MAX,1)".
 That will find the maximum sequence value already in your table and then
 increment from there.
 
    ( id_num      
    , CSV_Col_2   FILLER 
    , val_num  
    , name_txt    "initcap(trim( :name_txt ))"
    , some_dts    DATE "YYYY-MM-DD HH24:MI:SS" 
    , extra_col   SEQUENCE(1,1)
    )

 That's it!


Thursday, August 11, 2011

An Example of a Slightly More Complex Pivot Query

In this example we're going to look at a pivot query that uses the information in two columns to decide in which new pivot column the data should be placed. The pivot in this example works in exactly the same way as in the simple-pivot example; there is just a slight difference in syntax.

The data below represents financial information from different months and states. Three different products are shown, with type 1 being a profit and type 2 being a loss. There are two measures: a product count and the total profit or loss.


-----> Example SQL Code Listing-#1:


   WITH
     w_data  as
       ( select  NULL  as mn  /* Month */
              ,  NULL  as st  /* State */             
              ,  NULL  as pd  /* Product */             
              ,  NULL  as tp  /* Type */             
              ,  NULL  as ct  /* Count */             
              ,  NULL  as pf  /* Profit */             
           from  dual  where  1=0
         ----------------------------------------------------------------------
         union all select  '1994-10', 'IL', 'DoDad', 1,  37,  697.45  from dual
         union all select  '1994-10', 'IL', 'DoDad', 2,  13, -303.03  from dual
         union all select  '1994-10', 'IL', 'Gizmo', 2,  23, -558.90  from dual
         union all select  '1994-10', 'IL', 'Gizmo', 2,  21, -468.51  from dual
         union all select  '1994-10', 'IL', 'Wigit', 1,  69,  867.33  from dual
         union all select  '1994-10', 'IL', 'Wigit', 2,  24, -494.64  from dual
         union all select  '1994-10', 'IL', 'Wigit', 2,   4,  -65.92  from dual
         union all select  '1994-10', 'NY', 'DoDad', 1,  71, 1615.96  from dual
         union all select  '1994-10', 'NY', 'DoDad', 1,  45,  840.60  from dual
         union all select  '1994-10', 'NY', 'DoDad', 2,  19, -352.64  from dual
         union all select  '1994-10', 'NY', 'DoDad', 2,  21, -533.19  from dual
         union all select  '1994-10', 'NY', 'DoDad', 2,   4,  -94.28  from dual
         union all select  '1994-10', 'NY', 'Gizmo', 1, 282, 6426.78  from dual
         union all select  '1994-10', 'NY', 'Gizmo', 1, 105, 2272.20  from dual
         union all select  '1994-10', 'NY', 'Gizmo', 1, 221, 4274.14  from dual
         union all select  '1994-10', 'NY', 'Gizmo', 2,   2,  -14.68  from dual
         union all select  '1994-10', 'NY', 'Wigit', 1,  40,  879.60  from dual
         union all select  '1994-10', 'NY', 'Wigit', 1, 285, 6047.70  from dual
         union all select  '1994-11', 'IL', 'DoDad', 1, 196, 2887.08  from dual
         union all select  '1994-11', 'IL', 'DoDad', 1, 101, 2228.06  from dual
         union all select  '1994-11', 'IL', 'DoDad', 1, 333, 5471.19  from dual
         union all select  '1994-11', 'IL', 'DoDad', 2,   7,  -75.95  from dual
         union all select  '1994-11', 'IL', 'DoDad', 2,  25, -419.50  from dual
         union all select  '1994-11', 'IL', 'Gizmo', 1, 255, 4931.70  from dual
         union all select  '1994-11', 'IL', 'Gizmo', 2,  23, -423.43  from dual
         union all select  '1994-11', 'IL', 'Wigit', 1,  36,  700.92  from dual
         union all select  '1994-11', 'IL', 'Wigit', 1, 329, 6385.89  from dual
         union all select  '1994-11', 'IL', 'Wigit', 2,  19, -463.79  from dual
         union all select  '1994-11', 'NY', 'DoDad', 1, 373, 8642.41  from dual
         union all select  '1994-11', 'NY', 'Gizmo', 1,  21,  476.91  from dual
         union all select  '1994-11', 'NY', 'Gizmo', 2,  14,  -99.40  from dual
         union all select  '1994-11', 'NY', 'Wigit', 1, 394, 9550.56  from dual
         union all select  '1994-11', 'NY', 'Wigit', 1,  57, 1046.52  from dual
         union all select  '1994-11', 'NY', 'Wigit', 1, 133, 2386.02  from dual
         union all select  '1994-11', 'NY', 'Wigit', 2,  23, -543.49  from dual
         union all select  '1994-12', 'IL', 'DoDad', 1, 280, 6134.80  from dual
         union all select  '1994-12', 'IL', 'DoDad', 1,  68,  948.60  from dual
         union all select  '1994-12', 'IL', 'DoDad', 2,   6, -148.98  from dual
         union all select  '1994-12', 'IL', 'Gizmo', 1, 252, 5503.68  from dual
         union all select  '1994-12', 'IL', 'Gizmo', 1, 156, 2322.84  from dual
         union all select  '1994-12', 'IL', 'Gizmo', 2,  14, -236.60  from dual
         union all select  '1994-12', 'IL', 'Wigit', 1, 132, 2109.36  from dual
         union all select  '1994-12', 'IL', 'Wigit', 1, 339, 8041.08  from dual
         union all select  '1994-12', 'IL', 'Wigit', 1, 197, 3520.39  from dual
         union all select  '1994-12', 'IL', 'Wigit', 2,  20, -341.60  from dual
         union all select  '1994-12', 'IL', 'Wigit', 2,   4,  -32.20  from dual
         union all select  '1994-12', 'NY', 'DoDad', 1, 308, 5528.60  from dual
         union all select  '1994-12', 'NY', 'DoDad', 1, 335, 6673.20  from dual
         union all select  '1994-12', 'NY', 'DoDad', 1, 109, 2237.77  from dual
         union all select  '1994-12', 'NY', 'DoDad', 2,  17, -326.91  from dual
         union all select  '1994-12', 'NY', 'DoDad', 2,   4,  -83.32  from dual
         union all select  '1994-12', 'NY', 'DoDad', 2,   4,  -51.04  from dual
         union all select  '1994-12', 'NY', 'Wigit', 1, 258, 3397.86  from dual
         union all select  '1994-12', 'NY', 'Wigit', 1,  23,  302.91  from dual
         union all select  '1994-12', 'NY', 'Wigit', 2,   9, -170.73  from dual
         union all select  '1994-12', 'NY', 'Gizmo', 1, 296, 3797.68  from dual
         union all select  '1994-12', 'NY', 'Gizmo', 2,  17, -131.41  from dual
         union all select  '1994-12', 'NY', 'Gizmo', 2,  17, -158.78  from dual
         union all select  '1994-12', 'NY', 'Gizmo', 2,   1,  -13.38  from dual
         ----------------------------------------------------------------------
       )
   select *
     from  w_data ;



So, here's the pivot query. The "FOR" keyword precedes the two columns that contain the values that will be used to determine the pivot groups. Then, following the "IN" keyword, is the list of those values along with the name of each pivot-group. Notice that there are two measures and that each pivot-group will include both measures, creating twelve (6 x 2 = 12) new columns. This query displays the month, state, and then all twelve new pivot-columns.


-----> Example SQL Code Listing-#2:
   
     
   WITH
     w_data  as
       ( select  NULL  as mn  /* Month */
              ,  NULL  as st  /* State */             
              ,  NULL  as pd  /* Product */             
              ,  NULL  as tp  /* Type */             
              ,  NULL  as ct  /* Count */             
              ,  NULL  as pf  /* Profit */             
           from  dual  where  1=0
         ----------------------------------------------------------------------
         union all select  '1994-10', 'IL', 'DoDad', 1,  37,  697.45  from dual
         union all select  '1994-10', 'IL', 'DoDad', 2,  13, -303.03  from dual
         union all select  '1994-10', 'IL', 'Gizmo', 2,  23, -558.90  from dual
         union all select  '1994-10', 'IL', 'Gizmo', 2,  21, -468.51  from dual
         union all select  '1994-10', 'IL', 'Wigit', 1,  69,  867.33  from dual
         union all select  '1994-10', 'IL', 'Wigit', 2,  24, -494.64  from dual
         union all select  '1994-10', 'IL', 'Wigit', 2,   4,  -65.92  from dual
         union all select  '1994-10', 'NY', 'DoDad', 1,  71, 1615.96  from dual
         union all select  '1994-10', 'NY', 'DoDad', 1,  45,  840.60  from dual
         union all select  '1994-10', 'NY', 'DoDad', 2,  19, -352.64  from dual
         union all select  '1994-10', 'NY', 'DoDad', 2,  21, -533.19  from dual
         union all select  '1994-10', 'NY', 'DoDad', 2,   4,  -94.28  from dual
         union all select  '1994-10', 'NY', 'Gizmo', 1, 282, 6426.78  from dual
         union all select  '1994-10', 'NY', 'Gizmo', 1, 105, 2272.20  from dual
         union all select  '1994-10', 'NY', 'Gizmo', 1, 221, 4274.14  from dual
         union all select  '1994-10', 'NY', 'Gizmo', 2,   2,  -14.68  from dual
         union all select  '1994-10', 'NY', 'Wigit', 1,  40,  879.60  from dual
         union all select  '1994-10', 'NY', 'Wigit', 1, 285, 6047.70  from dual
         union all select  '1994-11', 'IL', 'DoDad', 1, 196, 2887.08  from dual
         union all select  '1994-11', 'IL', 'DoDad', 1, 101, 2228.06  from dual
         union all select  '1994-11', 'IL', 'DoDad', 1, 333, 5471.19  from dual
         union all select  '1994-11', 'IL', 'DoDad', 2,   7,  -75.95  from dual
         union all select  '1994-11', 'IL', 'DoDad', 2,  25, -419.50  from dual
         union all select  '1994-11', 'IL', 'Gizmo', 1, 255, 4931.70  from dual
         union all select  '1994-11', 'IL', 'Gizmo', 2,  23, -423.43  from dual
         union all select  '1994-11', 'IL', 'Wigit', 1,  36,  700.92  from dual
         union all select  '1994-11', 'IL', 'Wigit', 1, 329, 6385.89  from dual
         union all select  '1994-11', 'IL', 'Wigit', 2,  19, -463.79  from dual
         union all select  '1994-11', 'NY', 'DoDad', 1, 373, 8642.41  from dual
         union all select  '1994-11', 'NY', 'Gizmo', 1,  21,  476.91  from dual
         union all select  '1994-11', 'NY', 'Gizmo', 2,  14,  -99.40  from dual
         union all select  '1994-11', 'NY', 'Wigit', 1, 394, 9550.56  from dual
         union all select  '1994-11', 'NY', 'Wigit', 1,  57, 1046.52  from dual
         union all select  '1994-11', 'NY', 'Wigit', 1, 133, 2386.02  from dual
         union all select  '1994-11', 'NY', 'Wigit', 2,  23, -543.49  from dual
         union all select  '1994-12', 'IL', 'DoDad', 1, 280, 6134.80  from dual
         union all select  '1994-12', 'IL', 'DoDad', 1,  68,  948.60  from dual
         union all select  '1994-12', 'IL', 'DoDad', 2,   6, -148.98  from dual
         union all select  '1994-12', 'IL', 'Gizmo', 1, 252, 5503.68  from dual
         union all select  '1994-12', 'IL', 'Gizmo', 1, 156, 2322.84  from dual
         union all select  '1994-12', 'IL', 'Gizmo', 2,  14, -236.60  from dual
         union all select  '1994-12', 'IL', 'Wigit', 1, 132, 2109.36  from dual
         union all select  '1994-12', 'IL', 'Wigit', 1, 339, 8041.08  from dual
         union all select  '1994-12', 'IL', 'Wigit', 1, 197, 3520.39  from dual
         union all select  '1994-12', 'IL', 'Wigit', 2,  20, -341.60  from dual
         union all select  '1994-12', 'IL', 'Wigit', 2,   4,  -32.20  from dual
         union all select  '1994-12', 'NY', 'DoDad', 1, 308, 5528.60  from dual
         union all select  '1994-12', 'NY', 'DoDad', 1, 335, 6673.20  from dual
         union all select  '1994-12', 'NY', 'DoDad', 1, 109, 2237.77  from dual
         union all select  '1994-12', 'NY', 'DoDad', 2,  17, -326.91  from dual
         union all select  '1994-12', 'NY', 'DoDad', 2,   4,  -83.32  from dual
         union all select  '1994-12', 'NY', 'DoDad', 2,   4,  -51.04  from dual
         union all select  '1994-12', 'NY', 'Wigit', 1, 258, 3397.86  from dual
         union all select  '1994-12', 'NY', 'Wigit', 1,  23,  302.91  from dual
         union all select  '1994-12', 'NY', 'Wigit', 2,   9, -170.73  from dual
         union all select  '1994-12', 'NY', 'Gizmo', 1, 296, 3797.68  from dual
         union all select  '1994-12', 'NY', 'Gizmo', 2,  17, -131.41  from dual
         union all select  '1994-12', 'NY', 'Gizmo', 2,  17, -158.78  from dual
         union all select  '1994-12', 'NY', 'Gizmo', 2,   1,  -13.38  from dual
         ----------------------------------------------------------------------
       )
     select  mn,  st
          ,  dodad_p_cnt,  gizmo_p_cnt,  wigit_p_cnt
          ,  dodad_p_amt,  gizmo_p_amt,  wigit_p_amt
          ,  dodad_l_cnt,  gizmo_l_cnt,  wigit_l_cnt
          ,  dodad_l_amt,  gizmo_l_amt,  wigit_l_amt
       from  w_data
      PIVOT  ( sum(ct) as cnt
             , sum(pf) as amt
             FOR ( pd, tp )
             IN  ( ( 'DoDad', 1 )  as dodad_p
                 , ( 'DoDad', 2 )  as dodad_l
                 , ( 'Gizmo', 1 )  as gizmo_p
                 , ( 'Gizmo', 2 )  as gizmo_l
                 , ( 'Wigit', 1 )  as wigit_p
                 , ( 'Wigit', 2 )  as wigit_l
                 )
             )
      order  by mn, st
     ;



And here is the result:



For each month and state's aggregated output row, all the input rows that match the values for a particular pivot-group are grouped together to find the measures. For example, if you look at the last 3 lines of input data, they will all aggregate into the December and New York output line (the last one in the result). Actually, the last 13 lines of input data will aggregate into the last row of the result; however, the last 3 lines are also part of the "('Gizmo', 2)" pivot-group, so they will show up only in the Gizmo-Loss columns. The counts and amounts of these 3 last lines are summed up and show up on the last line of the result: the counts total 35 and appear in the GIZMO_L_CNT column, and the amounts total a 303.57 loss (-303.57) and appear in the GIZMO_L_AMT column.
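
Since the PIVOT clause is really an implicit group-by, the same Gizmo-Loss numbers can be verified the old-fashioned way. Here is a minimal sketch (meant to be run against the same w_data sub-query as above) that reproduces just those two columns with conditional aggregation:

-----> Example SQL Code Listing-#3:

      select  mn,  st
           ,  sum( case when pd = 'Gizmo' and tp = 2 then ct end )  as gizmo_l_cnt
           ,  sum( case when pd = 'Gizmo' and tp = 2 then pf end )  as gizmo_l_amt
        from  w_data
       group  by mn, st
       order  by mn, st ;

For the December / New York row this returns a count of 35 and an amount of -303.57, matching the pivot output described above.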

Monday, August 8, 2011

Oracle's PIVOT Clause -- A Simple Example

Oracle's PIVOT clause allows us to take information that is 'stored in rows' and 'pivot-it-out' so that it is 'stored in columns'.

We need a column (or set of columns) that contains a finite number of identifying values -- these values will determine what new columns are created. In the following example the "qtr" column contains the yearly quarter. Instead of showing our data 'vertically' as shown below, we wish to take the information in the "qtr" column and spread our data out 'horizontally' so we have a column (or set of columns) for each quarter.




Here is our sample data and a simple PIVOT query that accomplishes the pivot by quarter.


   WITH  
     w_data  as
       ( select   11 acct,  1 qtr,  23.29 m1   from dual union all
         select   11     ,  2    ,  81.72      from dual union all
         select   11     ,  3    ,  16.72      from dual union all
         select   11     ,  4    ,  20.00      from dual union all
         select   11     ,  4    ,  20.00      from dual union all
         select   11     ,  4    ,  20.00      from dual union all
         select   35     ,  2    ,  11.53      from dual union all
         select   35     ,  3    ,  51.27      from dual union all
         select   35     ,  4    ,  76.14      from dual union all
         select   44     ,  1    ,  53.63      from dual union all
         select   44     ,  2    ,  38.58      from dual union all
         select   44     ,  3    ,  29.76      from dual union all
         select   44     ,  4    ,  47.12      from dual 
       )
   select  *
     from  w_data
    PIVOT  ( sum(m1)   as sm
           , count(*)  as ct
           FOR  qtr  IN 
              ( 1 as q1
              , 2 as q2
              , 3 as q3
              , 4 as q4 
              )
           ) 
    order  by acct ;


And here is the result from running the above query:


Explanation:

First off -- the PIVOT clause is an aggregate operation. It's like having a "Group By" clause in your query without spelling one out. Any column that is not used inside the "PIVOT" clause is treated as a "Group By" attribute. In this example the "acct" column is the only one not used in the pivot clause, so the query will "Group By" the "acct" values -- we get one final row for each "acct" value.
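
To see that implicit grouping, here's a sketch (mine, not from the original post) of the pre-PIVOT equivalent using CASE expressions inside aggregates; only the quarter-1 columns are shown to keep it short:

   /* Include the WITH clause sample data from above here. */
   select  acct
        ,  sum(   case when qtr = 1 then m1 end )  as q1_sm
        ,  count( case when qtr = 1 then 1  end )  as q1_ct
     from  w_data
    group  by acct      -- the GROUP BY that PIVOT performs implicitly
    order  by acct ;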

The first part of the PIVOT clause contains the definitions of one or more measures (comma separated). In this case we define "sum(m1)" as a measure called "sm" and a count of the rows in each group as a measure called "ct". If you define only one measure, there is no need to specify an alias name; with two or more measures, alias names must be defined.
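
For instance, a single-measure version can drop the measure alias entirely; the generated columns then take just the pivot-group aliases. A quick sketch (my addition), again reusing the sample data:

   /* Include the WITH clause sample data from above here. */
   select  *
     from  w_data
    PIVOT  ( sum(m1)              -- one measure: no alias needed
           FOR  qtr  IN
              ( 1 as q1
              , 2 as q2
              , 3 as q3
              , 4 as q4
              )
           )
    order  by acct ;              -- columns come back as Q1, Q2, Q3, Q4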

After the "FOR" key-word, the column (or set of columns) that contains the identifiers that will determine our newly generated 'pivot' columns must be named. In this case the "qtr" column will contain the pivot-column identifiers.

After the "IN" key-word a list of all the identifiers that are of interest is included in parenthesis with their accompanying alias names. Depending on the data and number of identifying columns listed in the "FOR" section, alias names may or may not be required. When the query is executed a "pivot-group" is created for each distinct value in this list.

A new PIVOT column is generated for each defined measure in each "pivot-group". In this case we have 4 "pivot-groups" and 2 measures -- so 8 new columns are generated (4 x 2 = 8). The newly generated column names are created by joining the "pivot-group" aliases with the measures' aliases (joined with an underscore): Q1_SM, Q1_CT, Q2_SM, Q2_CT, Q3_SM, Q3_CT, Q4_SM, Q4_CT.

The "m1" measure is aggregated -- the values are added up if there is more than one value for each "acct" and "qtr". The "ct" columns show how many values were aggregated for each "acct" and "qtr". In the select clause of the query, you may choose to display as many or as few of the newly generated pivot columns as you wish to display -- in any order.



Here's one final example showing that you can display the columns in any order, and that you can even use the columns as normal columns (i.e. using them as part of a calculation).

 
   /* Include the WITH clause sample data from above here. */ 
   select  acct
        ,  q1_sm , q2_sm , q3_sm , q4_sm
        ,  q1_ct + q2_ct + q3_ct + q4_ct  as ttl_ct
     from  w_data
    PIVOT  ( sum(m1)   as sm
           , count(*)  as ct
           FOR  qtr  in 
              ( 1 as q1
              , 2 as q2
              , 3 as q3
              , 4 as q4 
              )
           ) 
    order  by acct ;




Thursday, August 4, 2011

A basic Hierarchical (Connect-By) Query

Here is a sample data-set that shows a list of people with their associated "ID" numbers. The "prnt" column defines a hierarchical relationship by giving the "ID" number of the person directly above them in the hierarchy (i.e. their boss or parent). Notice that the first two records don't have a 'parent' -- they are top records in the hierarchy.

/* --- code --- */


     WITH  
       w_data  as 
         ( select   1 id, 'Albert'    nm, null prnt  from dual union all
           select   2   , 'Rhonda'      , null       from dual union all
           select   3   , 'Ralph'       ,    1       from dual union all
           select   4   , 'Erma'        ,    1       from dual union all
           select   5   , 'Leopold'     ,    2       from dual union all
           select   6   , 'Zanthia'     ,    3       from dual union all
           select   7   , 'Reginald'    ,    3       from dual union all
           select   8   , 'Hillary'     ,    4       from dual union all
           select   9   , 'Orpheus'     ,    4       from dual union all
           select  10   , 'Linda'       ,    4       from dual union all
           select  11   , 'Ronald'      ,    5       from dual union all
           select  12   , 'Deborah'     ,    5       from dual union all
           select  13   , 'Andrew'      ,    6       from dual union all
           select  14   , 'Maggie'      ,    6       from dual union all
           select  15   , 'Fletcher'    ,    6       from dual union all
           select  16   , 'Wanda'       ,    7       from dual union all
           select  17   , 'Dexter'      ,    8       from dual union all
           select  18   , 'Thelma'      ,    9       from dual union all
           select  19   , 'Wilson'      ,    9       from dual union all
           select  20   , 'Amanda'      ,   10       from dual union all
           select  21   , 'Thor'        ,   10       from dual union all
           select  22   , 'Glenda'      ,   11       from dual union all
           select  23   , 'Norman'      ,   11       from dual union all
           select  24   , 'Edith'       ,   11       from dual union all
           select  25   , 'Arvid'       ,   12       from dual union all
           select  26   , 'Wilma'       ,   12       from dual union all
           select  27   , 'Frederic'    ,   12       from dual union all
           select  28   , 'Unice'       ,   12       from dual 
         ) 
     select  *  from  w_data ;




Now, here's a simple query that uses the prior data and constructs a hierarchical list that clearly shows the parent-child connections. An explanation and the sample output follows. To run this query, combine the sample data from the WITH clause above with the query below.

/* --- code --- */
        
  
    /* Include the WITH clause sample data from above here. */      
    select  lpad( level, level*2 ) as "Level"
         ,  level 
         ,  id
         ,  nm
         ,  prnt
         ,  sys_connect_by_path( nm , ', ' ) as "Path"
      from  w_data
    START WITH ( id in (1,2) )
    CONNECT BY ( prnt = PRIOR id ) 
    ;



The components of the query are described below:

"START WITH"  The query needs to know how to locate the top record or records in the hierarchy. In this case it specifically states that the ID number must be in the list: (1,2). Any properly formed condition would work:
  • START WITH (prnt is NULL)                               -- those who don't have a parent listed
  • START WITH (nm='Albert' or nm='Rhonda')         -- a list of specific names
  • START WITH  (id in ( select id from w_data where prnt is NULL ))      -- even a query can work
You don't even have to 'start with' the top records in the source data. You can specify any record, and the results will show that record as the top record and then trace the hierarchy down from there. Try the following and observe that you only get back the specified records and the ones below them (a runnable sketch follows this list).
  • START WITH (id in (3,9))
  • START WITH (nm in ('Linda','Reginald'))
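Here is the first of those spelled out as a runnable sketch (my addition, reusing the sample data above):

/* --- code --- */

     /* Include the WITH clause sample data from above here. */
     select  lpad( level, level*2 ) as "Level"
          ,  id
          ,  nm
          ,  prnt
       from  w_data
     START WITH ( id in (3,9) )         -- start mid-hierarchy
     CONNECT BY ( prnt = PRIOR id )
     ;
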
"CONNECT BY"  There needs to be a clear definition of how a parent record connects to a child record. The key word "PRIOR" is used to refer to the previous or 'higher' record in the hierarchy. In this case the "prnt" column of a record points to the PRIOR or higher record's "id" column. It doesn't matter which of the following you use, they both say the same thing.
  • CONNECT BY (prnt = PRIOR id)
  • CONNECT BY (PRIOR id = prnt)
However, if you put the "PRIOR" in front of the "prnt" column, then you are walking the hierarchy backwards (we'll get to an example of that).

"LEVEL" is a key word that indicates how deep that particular row will occur in the hierarchy. Note that the second column simply displays the "LEVEL" value where the first column uses the "LEVEL" value to print itself with additional spacing as it goes deeper & deeper into the hierarchy. This gives a nice visual component to the hierarchy that allows us to easily see its structure.

"SYS_CONNECT_BY_PATH"  is a hierarchical function that returns the complete path from the top record to the current record. In this case we are looking at a list of comma separated names. If we would like to see the path as "ID" numbers, we could use the following:
  • sys_connect_by_path(id,'/')
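Spelled out as a runnable sketch (my addition, combined with the sample data above):

/* --- code --- */

     /* Include the WITH clause sample data from above here. */
     select  id
          ,  nm
          ,  sys_connect_by_path( id, '/' )  as "IdPath"
       from  w_data
     START WITH ( id in (1,2) )
     CONNECT BY ( prnt = PRIOR id )
     ;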




Here are a few other things you can do with simple Connect-By queries:

/* --- code --- */
    
      
     WITH  
       w_data  as 
         ( select   1 id, 'Albert'    nm, null prnt  from dual union all
           select   2   , 'Rhonda'      , null       from dual union all
           select   3   , 'Ralph'       ,    1       from dual union all
           select   4   , 'Erma'        ,    1       from dual union all
           select   5   , 'Leopold'     ,    2       from dual union all
           select   6   , 'Zanthia'     ,    3       from dual union all
           select   7   , 'Reginald'    ,    3       from dual union all
           select   8   , 'Hillary'     ,    4       from dual union all
           select   9   , 'Ronald'      ,    5       from dual union all
           select  10   , 'Deborah'     ,    5       from dual 
         ) 
     SELECT  lpad( level, level*3 ) as "Level"
          ,  id
          ,  nm
          ,  prnt
          ,  substr(sys_connect_by_path(nm,', '),3)  as "Path"
          ,  connect_by_root nm  as top_boss
          ,  connect_by_isleaf   as lowest
       from  w_data
     START WITH ( id in (1,2) )
     CONNECT BY ( prnt = PRIOR id  ) 
      order  siblings by id
     ;
   


First of all, we've fixed the leading comma in the "Path" column by using a substring function.

"CONNECT_BY_ROOT"  This opperator simply gives us the name, id number, or whatever it is we specify for the top-level record of the current hierarchy.

"CONNECT_BY_ISLEAF"  is a 0/1 flag that simply tells us if the current row is the lowest one on this branch of the hierarchy.

"ORDER SIBLINGS BY"  is just a modified version of the normal "Order By" clause. This allows us to order the sibling records of any parent by one of the attributes. In this example we may want to order the siblings by either id number or name. Notice below that Ralph and Erma are siblings and we have specified ordering them by id (3, 4). If we changed to ordering them by name then Erma's entire branch would come before Ralph's.




And finally, in the following example, we've selected all the "ID" numbers that are never listed as parents and used this as our "START WITH" values. We're going to walk-the-tree backwards, or turn the hierarchy upside-down. The only other thing we need to change is the "PRIOR" key word -- we'll now place it in front of the "prnt" column. This effectively reverses our hierarchy.

/* --- code --- */
          

     WITH  
       w_data  as 
         ( select   1 id, 'Albert'    nm, null prnt  from dual union all
           select   2   , 'Rhonda'      , null       from dual union all
           select   3   , 'Ralph'       ,    1       from dual union all
           select   4   , 'Erma'        ,    1       from dual union all
           select   5   , 'Leopold'     ,    2       from dual union all
           select   6   , 'Zanthia'     ,    3       from dual union all
           select   7   , 'Reginald'    ,    3       from dual union all
           select   8   , 'Hillary'     ,    4       from dual union all
           select   9   , 'Ronald'      ,    5       from dual union all
           select  10   , 'Deborah'     ,    5       from dual 
         ) 
     SELECT  lpad( level, level*3 ) as "Level"
          ,  id
          ,  nm
          ,  prnt
          ,  substr(sys_connect_by_path(nm,', '),3)  as "Path"
          ,  connect_by_root nm  as top_boss
          ,  connect_by_isleaf   as lowest
       from  w_data
     START WITH (id in (select id from w_data minus select prnt from w_data))
     CONNECT BY ( PRIOR prnt = id ) 
      order  siblings by id
     ;

   

Note the results: the lowest members are now listed at level 1, and the path walks its way down to the original parents at the bottom.

Tuesday, August 2, 2011

Analytic Functions: LAG( ) -- Grab a value from a previous row.

The LAG function can be used to return a value from the previous row. This becomes useful if, for example, you want to calculate the difference between a previous measurement and the current one. Of course, this implies that you properly define the order in which the rows are to occur. The LAG function, like other analytic functions, can operate over the whole data-set; or groups (called partitions) can be defined so that, for a particular row, the LAG function only looks for values within a specified subset of the entire data-set.

Let's take for instance the following data set: measurements taken for two different people at different times.

-----> Example SQL Code Listing-#1:

   WITH
     w_collected_data  as
       ( select  NULL  as nm
              ,  NULL  as dts
              ,  NULL  as measure
           from  dual  where  1=0
         -----------------------------------------------------------------------------------------------
         union all select 'Liz', to_date('2011-05-12 08:12:00','YYYY-MM-DD HH24:MI:SS'), 31.4  from dual 
         union all select 'Liz', to_date('2011-05-12 10:21:00','YYYY-MM-DD HH24:MI:SS'), 31.1  from dual 
         union all select 'Bob', to_date('2011-05-12 10:23:00','YYYY-MM-DD HH24:MI:SS'), 23.5  from dual 
         union all select 'Bob', to_date('2011-05-12 11:32:00','YYYY-MM-DD HH24:MI:SS'), 24.9  from dual 
         union all select 'Liz', to_date('2011-05-12 12:16:00','YYYY-MM-DD HH24:MI:SS'), 30.7  from dual 
         union all select 'Bob', to_date('2011-05-12 12:44:00','YYYY-MM-DD HH24:MI:SS'), 26.1  from dual 
         union all select 'Bob', to_date('2011-05-12 13:18:00','YYYY-MM-DD HH24:MI:SS'), 27.3  from dual 
         union all select 'Bob', to_date('2011-05-12 14:27:00','YYYY-MM-DD HH24:MI:SS'), 27.1  from dual 
         union all select 'Liz', to_date('2011-05-12 14:36:00','YYYY-MM-DD HH24:MI:SS'), 29.5  from dual 
         union all select 'Bob', to_date('2011-05-12 15:13:00','YYYY-MM-DD HH24:MI:SS'), 26.8  from dual 
         union all select 'Liz', to_date('2011-05-12 17:01:00','YYYY-MM-DD HH24:MI:SS'), 29.2  from dual
         -----------------------------------------------------------------------------------------------
       )
   select  *
     from  w_collected_data
    order  by  nm, dts
   ;

-------------------------
-- Results
-------------------------

   NM   DTS                  MEASURE
   ---  -------------------  -------
   Bob  2011-05-12 10:23:00  23.5
   Bob  2011-05-12 11:32:00  24.9
   Bob  2011-05-12 12:44:00  26.1
   Bob  2011-05-12 13:18:00  27.3
   Bob  2011-05-12 14:27:00  27.1
   Bob  2011-05-12 15:13:00  26.8
   Liz  2011-05-12 08:12:00  31.4
   Liz  2011-05-12 10:21:00  31.1
   Liz  2011-05-12 12:16:00  30.7
   Liz  2011-05-12 14:36:00  29.5
   Liz  2011-05-12 17:01:00  29.2



Now let's suppose that we want to find the difference from one measurement to the next.

-----> Example SQL Code Listing-#2:

   WITH
     w_collected_data  as
       ( select  NULL  as nm
              ,  NULL  as dts
              ,  NULL  as measure
           from  dual  where  1=0
         -----------------------------------------------------------------------------------------------
         union all select 'Liz', to_date('2011-05-12 08:12:00','YYYY-MM-DD HH24:MI:SS'), 31.4  from dual 
         union all select 'Liz', to_date('2011-05-12 10:21:00','YYYY-MM-DD HH24:MI:SS'), 31.1  from dual 
         union all select 'Bob', to_date('2011-05-12 10:23:00','YYYY-MM-DD HH24:MI:SS'), 23.5  from dual 
         union all select 'Bob', to_date('2011-05-12 11:32:00','YYYY-MM-DD HH24:MI:SS'), 24.9  from dual 
         union all select 'Liz', to_date('2011-05-12 12:16:00','YYYY-MM-DD HH24:MI:SS'), 30.7  from dual 
         union all select 'Bob', to_date('2011-05-12 12:44:00','YYYY-MM-DD HH24:MI:SS'), 26.1  from dual 
         union all select 'Bob', to_date('2011-05-12 13:18:00','YYYY-MM-DD HH24:MI:SS'), 27.3  from dual 
         union all select 'Bob', to_date('2011-05-12 14:27:00','YYYY-MM-DD HH24:MI:SS'), 27.1  from dual 
         union all select 'Liz', to_date('2011-05-12 14:36:00','YYYY-MM-DD HH24:MI:SS'), 29.5  from dual 
         union all select 'Bob', to_date('2011-05-12 15:13:00','YYYY-MM-DD HH24:MI:SS'), 26.8  from dual 
         union all select 'Liz', to_date('2011-05-12 17:01:00','YYYY-MM-DD HH24:MI:SS'), 29.2  from dual
         -----------------------------------------------------------------------------------------------
       )
   select  nm
        ,  dts
        ,  measure
        ,            lag( measure ) over ( order by dts )  as previous
        ,  measure - lag( measure ) over ( order by dts )  as msr_increase
     from  w_collected_data
    order  by  dts
   ;

-------------------------
-- Results
-------------------------

   NM   DTS                  MEASURE  PREVIOUS  MSR_INCREASE
   ---  -------------------  -------  --------  ------------
   Liz  2011-05-12 08:12:00  31.4
   Liz  2011-05-12 10:21:00  31.1     31.4      -0.3
   Bob  2011-05-12 10:23:00  23.5     31.1      -7.6
   Bob  2011-05-12 11:32:00  24.9     23.5       1.4
   Liz  2011-05-12 12:16:00  30.7     24.9       5.8
   Bob  2011-05-12 12:44:00  26.1     30.7      -4.6
   Bob  2011-05-12 13:18:00  27.3     26.1       1.2
   Bob  2011-05-12 14:27:00  27.1     27.3      -0.2
   Liz  2011-05-12 14:36:00  29.5     27.1       2.4
   Bob  2011-05-12 15:13:00  26.8     29.5      -2.7
   Liz  2011-05-12 17:01:00  29.2     26.8       2.4


Notice that the "lag(measure)" gives us the previous measure, while the "over (order by dts)" defines how the rows are to be sorted when finding the previous row. The entire analytic function, "lag(measure) over (order by dts)", returns the previous row's measure value as if the rows were sorted by "dts". The first row has no previous value because there was no previous row to pull it from -- its value is set to NULL.

In an analytic function, the "over (order by ...)" is only used to determine which row the value will be taken from -- it doesn't determine how the final output data-set has its rows sorted; that is always determined by the query's own "Order By" clause.
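
To make that concrete, here's a quick sketch (not one of the numbered listings above): the LAG still walks the rows by "dts", while the final output is sorted by measure.

   /* Include the w_collected_data WITH clause from Listing-#2 above here. */
   select  nm
        ,  dts
        ,  measure
        ,  lag( measure ) over ( order by dts )  as previous
     from  w_collected_data
    order  by  measure    -- output order: measure; LAG look-back order: dts
   ;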

Also, notice that this new analytic function, "lag(measure) over (order by dts)", can now be used in a calculation (the measure minus the previous measure) to show how much the measure changed from one row to the next.



Next we might realize that Bob's measurements don't relate to the measurements of Liz. We would like to look at the two people's measurements separately -- not intermixed. This means we need to split up the data by the "NM" column -- in analytic functions we do this via the "partition by" operator within the "over" clause.

-----> Example SQL Code Listing-#3:

   WITH
     w_collected_data  as
       ( select  NULL  as nm
              ,  NULL  as dts
              ,  NULL  as measure
           from  dual  where  1=0
         -----------------------------------------------------------------------------------------------
         union all select 'Liz', to_date('2011-05-12 08:12:00','YYYY-MM-DD HH24:MI:SS'), 31.4  from dual 
         union all select 'Liz', to_date('2011-05-12 10:21:00','YYYY-MM-DD HH24:MI:SS'), 31.1  from dual 
         union all select 'Bob', to_date('2011-05-12 10:23:00','YYYY-MM-DD HH24:MI:SS'), 23.5  from dual 
         union all select 'Bob', to_date('2011-05-12 11:32:00','YYYY-MM-DD HH24:MI:SS'), 24.9  from dual 
         union all select 'Liz', to_date('2011-05-12 12:16:00','YYYY-MM-DD HH24:MI:SS'), 30.7  from dual 
         union all select 'Bob', to_date('2011-05-12 12:44:00','YYYY-MM-DD HH24:MI:SS'), 26.1  from dual 
         union all select 'Bob', to_date('2011-05-12 13:18:00','YYYY-MM-DD HH24:MI:SS'), 27.3  from dual 
         union all select 'Bob', to_date('2011-05-12 14:27:00','YYYY-MM-DD HH24:MI:SS'), 27.1  from dual 
         union all select 'Liz', to_date('2011-05-12 14:36:00','YYYY-MM-DD HH24:MI:SS'), 29.5  from dual 
         union all select 'Bob', to_date('2011-05-12 15:13:00','YYYY-MM-DD HH24:MI:SS'), 26.8  from dual 
         union all select 'Liz', to_date('2011-05-12 17:01:00','YYYY-MM-DD HH24:MI:SS'), 29.2  from dual
         -----------------------------------------------------------------------------------------------
       )
   select  nm
        ,  dts
        ,  measure
        ,               lag( measure ) over (partition by nm order by dts)  as previous
        ,     measure - lag( measure ) over (partition by nm order by dts)  as msr_increase
        ,  round((dts - lag( dts     ) over (partition by nm order by dts))*24*60)  as min_between
     from  w_collected_data
    order  by  nm, dts
   ;

-------------------------
-- Results
-------------------------

   NM   DTS                  MEASURE  PREVIOUS  MSR_INCREASE  MIN_BETWEEN
   ---  -------------------  -------  --------  ------------  -----------
   Bob  2011-05-12 10:23:00  23.5
   Bob  2011-05-12 11:32:00  24.9     23.5       1.4           69
   Bob  2011-05-12 12:44:00  26.1     24.9       1.2           72
   Bob  2011-05-12 13:18:00  27.3     26.1       1.2           34
   Bob  2011-05-12 14:27:00  27.1     27.3      -0.2           69
   Bob  2011-05-12 15:13:00  26.8     27.1      -0.3           46
   Liz  2011-05-12 08:12:00  31.4
   Liz  2011-05-12 10:21:00  31.1     31.4      -0.3          129
   Liz  2011-05-12 12:16:00  30.7     31.1      -0.4          115
   Liz  2011-05-12 14:36:00  29.5     30.7      -1.2          140
   Liz  2011-05-12 17:01:00  29.2     29.5      -0.3          145


Now we see two separate groups of data, one for Bob and one for Liz. In an analytic function we can 'partition by' multiple values/columns and sort by multiple values/columns. This makes analytic functions very powerful. For instance, we might have a LAG function that looks like this: "lag(measure) over (partition by region, store, clerk order by product, sales_date)".

Additionally, the previous example added a column to show how many minutes elapsed between the measurements, using this LAG function: "lag(dts) over (partition by nm order by dts)".



Another feature of the LAG function is that you can specify how many rows to 'look back'. The default is 1 -- the previous row; but you can change that to 2, as shown in the following example. A third parameter can also be supplied as a default value (used instead of NULL) for rows that don't have a previous value.

-----> Example SQL Code Listing-#4:

   WITH
     w_collected_data  as
       ( select  NULL  as nm
              ,  NULL  as dts
              ,  NULL  as measure
           from  dual  where  1=0
         -----------------------------------------------------------------------------------------------
         union all select 'Liz', to_date('2011-05-12 08:12:00','YYYY-MM-DD HH24:MI:SS'), 31.4  from dual 
         union all select 'Liz', to_date('2011-05-12 10:21:00','YYYY-MM-DD HH24:MI:SS'), 31.1  from dual 
         union all select 'Bob', to_date('2011-05-12 10:23:00','YYYY-MM-DD HH24:MI:SS'), 23.5  from dual 
         union all select 'Bob', to_date('2011-05-12 11:32:00','YYYY-MM-DD HH24:MI:SS'), 24.9  from dual 
         union all select 'Liz', to_date('2011-05-12 12:16:00','YYYY-MM-DD HH24:MI:SS'), 30.7  from dual 
         union all select 'Bob', to_date('2011-05-12 12:44:00','YYYY-MM-DD HH24:MI:SS'), 26.1  from dual 
         union all select 'Bob', to_date('2011-05-12 13:18:00','YYYY-MM-DD HH24:MI:SS'), 27.3  from dual 
         union all select 'Bob', to_date('2011-05-12 14:27:00','YYYY-MM-DD HH24:MI:SS'), 27.1  from dual 
         union all select 'Liz', to_date('2011-05-12 14:36:00','YYYY-MM-DD HH24:MI:SS'), 29.5  from dual 
         union all select 'Bob', to_date('2011-05-12 15:13:00','YYYY-MM-DD HH24:MI:SS'), 26.8  from dual 
         union all select 'Liz', to_date('2011-05-12 17:01:00','YYYY-MM-DD HH24:MI:SS'), 29.2  from dual
         -----------------------------------------------------------------------------------------------
       )
   select  nm
        ,  dts
        ,  measure
        ,  lag( measure, 2, 0 ) over (partition by nm order by dts)  as previous
     from  w_collected_data
    order  by  nm, dts
   ;

-------------------------
-- Results
-------------------------

   NM   DTS                  MEASURE  PREVIOUS
   ---  -------------------  -------  --------
   Bob  2011-05-12 10:23:00  23.5     0
   Bob  2011-05-12 11:32:00  24.9     0
   Bob  2011-05-12 12:44:00  26.1     23.5
   Bob  2011-05-12 13:18:00  27.3     24.9
   Bob  2011-05-12 14:27:00  27.1     26.1
   Bob  2011-05-12 15:13:00  26.8     27.3
   Liz  2011-05-12 08:12:00  31.4     0
   Liz  2011-05-12 10:21:00  31.1     0
   Liz  2011-05-12 12:16:00  30.7     31.4
   Liz  2011-05-12 14:36:00  29.5     31.1
   Liz  2011-05-12 17:01:00  29.2     30.7




One last example: here's a data-set that shows the ID numbers for 3 different runners passing through checkpoints 0-3. A LAG function is used to obtain the previous "clock" value, and then a calculation based on that function shows the minutes between each checkpoint.

-----> Example SQL Code Listing-#5:

   WITH
     w_check_points  as
       ( select  NULL  as id
              ,  NULL  as ckpt
              ,  NULL  as clock
           from  dual  where 1=0
         ----------------------------------------------
         union all select  12, 0, '00:00:00'  from dual 
         union all select  47, 0, '00:00:00'  from dual 
         union all select  83, 0, '00:00:00'  from dual 
         union all select  12, 1, '01:00:16'  from dual 
         union all select  47, 1, '00:43:27'  from dual 
         union all select  83, 1, '00:40:18'  from dual 
         union all select  12, 2, '02:12:47'  from dual 
         union all select  47, 2, '01:32:53'  from dual 
         union all select  83, 2, '01:49:27'  from dual 
         union all select  12, 3, '03:27:14'  from dual 
         union all select  47, 3, '02:25:04'  from dual 
         union all select  83, 3, '03:31:38'  from dual
         ----------------------------------------------
       )
   select  id
        ,  ckpt
        ,  clock
        ,  lag(clock) over (partition by id  order by ckpt)  as prev_clock
        ,  round
             ( ( to_date(clock, 'HH24:MI:SS') 
               - to_date(lag(clock) over (partition by id order by ckpt), 'HH24:MI:SS')
               ) * 24 * 60
             , 3 
             )  as min
     from  w_check_points
    order  by  id, ckpt
   ;

-------------------------
-- Results
-------------------------

   ID   CKPT  CLOCK     PREV_CLOCK  MIN
   ---  ----  --------  ----------  ---------
   12   0     00:00:00
   12   1     01:00:16  00:00:00     60.267
   12   2     02:12:47  01:00:16     72.517
   12   3     03:27:14  02:12:47     74.45
   47   0     00:00:00
   47   1     00:43:27  00:00:00     43.45
   47   2     01:32:53  00:43:27     49.433
   47   3     02:25:04  01:32:53     52.183
   83   0     00:00:00
   83   1     00:40:18  00:00:00     40.3
   83   2     01:49:27  00:40:18     69.15
   83   3     03:31:38  01:49:27    102.183