Thursday, August 18, 2011

Use Regular Expressions to Extract CSV-Type Values

Here are a couple of quick examples of how to extract the individual elements of a CSV-like string using Oracle's regular expressions.



The first example uses a rather simple regular expression, but it will fail if there is a comma within one of the values (for example: "Jones, Joe"). It handles values whether or not they are enclosed in quotes, but a comma can only appear as a separator between values.

The second example corrects this issue, but the regular expression has to be just a bit more complicated. An explanation of each of the two regular expressions is included at the end.



The first example below shows the sample data in the first "WITH" sub-query. Notice that the data is intentionally messy; it has extra spaces and sometimes includes double-quote marks around the values.

The second "WITH" sub-query uses the "REGEXP_REPLACE" function to extract one item at a time from the entire string. The regular expression must match with each record of the source data. The internal sets of parenthesis in the expression identify the four seperate data elements. In the function, the last parameter (ie: '\3') tells it to replace the entire CSV record with that one particular value (identified by one of the four sets of parenthesis within the regular expression). So, when this subquery finishes, we now have the CSV-type records broken down into four text columns (c1, c2, c3, c4).

The third "WITH" sub-query trims off extra spaces and quote marks, and then casts each data element to its proper data type. Notice that the DATE strings all need to match the format pattern given here to be able to successfully convert them into date data-types.

/* ===== code ===== */
 
   WITH /* Extract and Cast CSV Data */
     w_csv  as
       ( select  '        ,        ,        ,        '                          x from dual union all
         select  '    8  , 37.251  , "Joe Jones"   , "2009-08-18 13:27:33" '      from dual union all
         select  '   ""  ,  6.471  ,  Sam Smith    , "2010-12-30 16:13:39" '      from dual union all
         select  '   10  ,  0      ,               ,                       '      from dual union all
         select  '   19  , "0.25"  ,  Bob Benson   , " 2003-10-24 23:59:01 " '    from dual union all
         select  '  "47" , 13.23   ,  " "          ,  2007-05-02 22:58:02  '      from dual union all
         select  '  106  , 44.732  ,  "Liz Loyd"   , ""                    '      from dual union all
         select  ' 2249  ,   ""    , "Nelson Ned"  ,  1997-06-05 04:42:26  '      from dual union all
         select  '" 32154 ",0.001,"Fizwig Ray M","1913-02-04 18:01:00"'           from dual union all
         select  '"","42.1234",Sam B Smith,"2010-12-30 16:13:39"'                 from dual union all
         select  '"  12  ",2.4,"",2007-05-02 22:58:02'                            from dual union all
         select  '98,".9732","Jan James",""'                                      from dual union all
         select  ',,,'                                                            from dual
       )
   , w_elements  as
       ( select  regexp_replace(x,'^(.*),(.*),(.*),(.*)$','\1')  as c1
              ,  regexp_replace(x,'^(.*),(.*),(.*),(.*)$','\2')  as c2
              ,  regexp_replace(x,'^(.*),(.*),(.*),(.*)$','\3')  as c3
              ,  regexp_replace(x,'^(.*),(.*),(.*),(.*)$','\4')  as c4
           from  w_csv
       )
   , w_data  as
       ( select  to_number(trim('"' from trim(c1)))  as id
              ,  to_number(trim('"' from trim(c2)))  as val
              ,    to_char(trim('"' from trim(c3)))  as nam
              ,    to_date(trim('"' from trim(c4)), 'YYYY-MM-DD HH24:MI:SS')  as dts
           from  w_elements
       )
   SELECT  *
     from  w_data ;


The Results:




In this next example, the regular expression allows a comma within a string as long as that string is enclosed in double-quotes.

/* ===== code ===== */
 
   WITH /* Extract and Cast CSV Data */
     w_csv  as
       ( select  '        ,        ,        ,        '                          x from dual union all
         select  '    8  , 37.251  , "Joe Jones"   , "2009-08-18 13:27:33" '      from dual union all
         select  '   ""  ,  6.471  ,  Sam Smith    , "2010-12-30 16:13:39" '      from dual union all
         select  '   10  ,  0      ,               ,                       '      from dual union all
         select  '   19  , "0.25"  ,  Bob Benson   , " 2003-10-24 23:59:01 " '    from dual union all
         select  '  "47" , 13.23   ,  ","          ,  2007-05-02 22:58:02  '      from dual union all
         select  '  106  , 44.732  ,  "Liz Loyd"   , ""                    '      from dual union all
         select  ' 2249  ,   ""    , "Nelson, Ned"  ,  1997-06-05 04:42:26  '     from dual union all
         select  '" 32154 ",0.001,"Fizwig, Ray, M","1913-02-04 18:01:00"'         from dual union all
         select  '"","42.1234",Sam B Smith,"2010-12-30 16:13:39"'                 from dual union all
         select  '"  12  ",2.4,"",2007-05-02 22:58:02'                            from dual union all
         select  '98,".9732","Jan James,",""'                                     from dual union all
         select  ',,,'                                                            from dual
       )
   , w_elements  as
       ( select  regexp_replace(x,'^ *(".*"|[^,]*) *, *(".*"|[^,]*) *, *(".*"|[^,]*) *, *(".*"|[^,]*) *$','\1')  as c1
              ,  regexp_replace(x,'^ *(".*"|[^,]*) *, *(".*"|[^,]*) *, *(".*"|[^,]*) *, *(".*"|[^,]*) *$','\2')  as c2
              ,  regexp_replace(x,'^ *(".*"|[^,]*) *, *(".*"|[^,]*) *, *(".*"|[^,]*) *, *(".*"|[^,]*) *$','\3')  as c3
              ,  regexp_replace(x,'^ *(".*"|[^,]*) *, *(".*"|[^,]*) *, *(".*"|[^,]*) *, *(".*"|[^,]*) *$','\4')  as c4
           from  w_csv
       )
   , w_data  as
       ( select  to_number(trim('"' from trim(c1)))  as id
              ,  to_number(trim('"' from trim(c2)))  as val
              ,    to_char(trim('"' from trim(c3)))  as nam
              ,    to_date(trim('"' from trim(c4)), 'YYYY-MM-DD HH24:MI:SS')  as dts
           from  w_elements
       )
   SELECT  *
     from  w_data ;


The Results:





Here is an explanation of the two regular expressions used above in the "REGEXP_REPLACE" function.

 
 
Note that in these explanations a space is shown as an underscore (_).  
In the query's code, the space is a normal space character.

Regular Expression:  '^(.*),(.*),(.*),(.*)$'

a)  ^       The string begins,  followed by ...
b)  (       an open parenthesis (the start of a column's value),  followed by ...
c)  .*      zero or more (*) characters of anything (.)
d)  )       a closing parenthesis (the end of a column's value),  followed by ...
              (Note: the above {b through d} is repeated, separated by commas, for each column of CSV data.)
e)  ,       a comma (,) if there are more columns,  followed by ...
f)  $       the end of the string ($).
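
To see where this simpler pattern breaks down, here is a quick one-off check (the literal record below is made up just for this illustration). Because of the comma inside the quoted name, the capture groups land on the wrong boundaries and '\3' returns only part of the name:

/* ===== code ===== */

   select  regexp_replace( '2249,"","Nelson, Ned",1997-06-05 04:42:26'
                          , '^(.*),(.*),(.*),(.*)$'
                          , '\3' )  as c3
     from  dual ;

   -- returns:  Ned"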



Regular Expression:  '^_*(".*"|[^,]*)_*,_*(".*"|[^,]*)_*,_*(".*"|[^,]*)_*$'

a)  ^       The string begins,  followed by ...
b)  _*      zero or more (*) spaces (_),  followed by ...
c)  (       an open parenthesis (the start of a column's value),  followed by ...
d)  ".*"    zero or more (*) characters of anything (.) enclosed by quotes (")
e)  |       OR
f)  [^,]*   zero or more (*) characters of anything except a comma ([^,]),  followed by ...
g)  )       a closing parenthesis (the end of a column's value),  followed by ...
h)  _*      zero or more (*) spaces (_),  followed by ...
              (Note: the above {b through h} is repeated, separated by commas, for each column of CSV data.)
i)  ,       a comma (,) if there are more columns,  followed by ...
j)  $       the end of the string ($).
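
Running the same made-up record through this second pattern keeps the quoted name intact (the surrounding quotes are stripped later by the TRIM step):

/* ===== code ===== */

   select  regexp_replace( '2249,"","Nelson, Ned",1997-06-05 04:42:26'
                          , '^ *(".*"|[^,]*) *, *(".*"|[^,]*) *, *(".*"|[^,]*) *, *(".*"|[^,]*) *$'
                          , '\3' )  as c3
     from  dual ;

   -- returns:  "Nelson, Ned"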



Tuesday, August 16, 2011

Using SQL-Loader to Load CSV Data Into a Table.

Given a CSV (comma separated values) file of data, it's actually not that hard to use Oracle's "SQL-Loader" to insert it all into a table. The use of "SQL-Loader" requires that you have an Oracle Client properly installed on your machine.

Here's a small set of CSV data -- intentionally made just a little bit messy. Note that some of the fields are enclosed by quote marks and others are not; if a string has a comma within it (such as some of the names below) then it MUST be enclosed by quotes. Notice that the first row of data is actually the list of column names. And, take note of the date format.

If you want to run this demo yourself copy the data and other files below into one single folder on your machine as indicated. You can supply directory paths along with the file-names, but putting everything in one folder eliminates the need for that & makes things a little simpler. This demo was executed on a Windows machine, but it works very much the same on a Unix machine (of course you would use a shell script file instead of a batch file).

Create this file as "SourceData.csv":
ID_VAL, JUNK, NUM_VAL, PERSON_NM, THE_DATE
    8  ,yy, 37.251  , "Joe Jones"   , "2009-08-18 13:27:33" 
   ""  ,  zzz  ,  6.471  ,  Sam Smith    , "2010-12-30 16:13:39" 
      ,        ,        ,        ,        
   10  ,1,  0      ,               ,                       
   19  ,2, "0.25"  ,  Bob Benson   , "9999-99-99 99:99:99" 
  "47" , 3 , 13.23   ,  ","          ,  2007-05-02 22:58:02  
  106  , , 44.732  , "Liz Loyd"    , ""                    
 2249  ,"",   ""    , "Nelson, Ned" ,  1997-06-05 04:42:26  
"32154", ,0.001,"fizwig, Ray, M","1913-02-04 18:01:00"
"",abc,"42.1234",Sam b smiTh,"2010-12-30 16:13:39"
"12",~,2.4,",",2007-05-02 22:58:02
,,,,
98, ",",".9732","jAmes, jAn",""



We will need a table in which to load the data. Note that the column names do not need to match the names given in the CSV file. Execute this command in a SQL session -- you don't need to put this in a file on your machine.
/* ===== code ===== */
 
     Create table scott.loader_demo
       ( id_num      integer
       , some_dts    date
       , name_txt    varchar2(20)
       , extra_col   varchar2(5)
       , val_num     number
       ) ;



Next we will need a control file -- this is the heart of using "SQL-Loader". It defines where the data is coming from and where it is going; it contains all the details of how the load should proceed. In this example we see how to do a basic load of simple data-types. The following control file contains lots of comments -- all these, of course, can be stripped out.

Create this file as "ControlFile.ctl":
------------------------------------------------------------
-- SQL-Loader Control File
------------------------------------------------------------
OPTIONS
   ( skip=1                                    -- Note {op1}
   , errors=10                                 -- Note {op2}
-- , load=100                                  -- Note {op3}
   , direct=true                               -- Note {op4}
   , rows=10000                                -- Note {op5}
-- , bindsize=67108864                         -- Note {op6}
-- , readsize=67108864                         -- Note {op7}
   )
------------------------------------------------------------
UNRECOVERABLE                                  -- Note {L1}
LOAD DATA                                      -- Note {L2}
   INFILE        'SourceData.csv'              -- Note {L3}
   BADFILE       'LoadResults_BAD.log'         -- Note {L4}
   DISCARDFILE   'LoadResults_DISCARDED.log'   -- Note {L5}
   APPEND                                      -- Note {L6}
   INTO TABLE     scott.loader_demo            -- Note {L7}
-- WHEN           id_num <> ''                 -- Note {L8}
-------------------------------------------------------
Fields Terminated    by ","                    -- Note {x1}
Optionally Enclosed  by '"'                    -- Note {x2}
Trailing Nullcols                              -- Note {x3}
------------------------------------------------------------
                                               -- Note {c1}
   ( id_num
   , junk_01     FILLER                        -- Note {c2}
   , val_num
   , name_txt   "initcap(trim(:name_txt))"     -- Note {c3}
   , some_dts   DATE "YYYY-MM-DD HH24:MI:SS"   -- Note {c4}
   )
------------------------------------------------------------
-- {op1}: Skip the first <n> header rows (column names).
-- {op2}: Allow <n> errors before aborting the load.
-- {op3}: Load only this many rows, then stop.
-- {op4}: Direct=true can be much faster; data-blocks
--          are directly built onto the table.
--        Direct=false does conventional inserts;
--          Indexes updated as each row is inserted.
--        Indexes and Constraints affect the
--          performance of either load type.
-- {op5}: Load <n> rows between each commit.
-- {op6}: For conventional loads, setting this to a
--          large value can increase performance.
-- {op7}: Works in conjunction with 'bindsize'.
---------
-- {L1}:  Can speed up the load. If the database
--          fails during the load, loaded data may be lost.
-- {L2}:  Section to specify files and tables
-- {L3}:  The name of the CSV source data file.
-- {L4}:  Error Records will be written to this file.
-- {L5}:  Valid rows not loaded, written to this file.
--          (All-NULLs or rows filtered out by a WHEN)
-- {L6}:  Table load method, choose one of the following:
--          INSERT, APPEND, REPLACE, or TRUNCATE.
-- {L7}:  The name of the destination Oracle table.
-- {L8}:  Load only records that meet a condition.
---------
-- {x1}:  The field terminator ( usually <,> or <|> ).
-- {x2}:  The quote character ( usually <"> ).
-- {x3}:  Missing fields at end of record = NULL.
---------
-- {c1}:  This section is the column list. The number
--          and order of the columns should match your
--          CSV data, not necessarily the target table.
--        The column names in this list should be the
--          column names of your target table.
-- {c2}:  Use key word FILLER to ignore a column of CSV
--          data. The name in this list does not matter.
-- {c3}:  Example using functions on CSV values.
--          Use quotes around functions as shown.
-- {c4}:  Use key word DATE to convert CSV date values.
--          Format string in ("") must match CSV data.
------------------------------------------------------------


A further explanation of the control file is given at the end of this article.

The last thing we need to do is to execute the load -- the command can be issued from a command prompt; but in this example we put the command in a batch file. In Unix you could put the command in a shell-script.

In single quotes, you'll need to supply the proper 'user-name / password @ database-identifier'; the last two parameters point the load at your newly created control file and define the results log file. Now, if you have a properly installed Oracle Client, double-clicking on this batch file will execute the load.

If you don't want to put the password in the batch file, just leave it out (as well as the preceding slash) and you will be asked for the password at run-time (i.e. 'scott @ DBSID').

Create this file as "GoLoad.bat": (Edit the Log-In information)
@echo off
echo.
sqlldr  'scott / tiger @ DBSID'  control='ControlFile.ctl'   log='LoadResults.log'
pause



When the load finishes, you can observe that the specified data has been inserted into your table:



Don't forget to examine the three 'LoadResults' log files. They contain information about the load process. You will be able to see which records didn't load & why.





Another interesting detail is that you can combine your data and the control file as shown in the following example. Here the "INFILE" is specified as "*" -- and you need to include the "BEGINDATA" just prior to the lines of CSV data. This would be executed in the same way the previous example was executed.

----------------------------------------------------------------------
-- SQL-Loader: A Control File that Includes the DATA.
----------------------------------------------------------------------
OPTIONS 
   ( skip=1
   , rows=10000
   , errors=10 
   )
----------------------------------------------------------------------
LOAD DATA
   INFILE        *
   BADFILE       'LoadResults_BAD.log'            
   DISCARDFILE   'LoadResults_DISCARDED.log'      
   APPEND                                         
   INTO TABLE     scott.loader_demo                               
----------------------------------------------------------------------
Fields Terminated    by ","                               
Optionally Enclosed  by '"'                               
Trailing Nullcols                                       
----------------------------------------------------------------------
   ( id_num
   , junk_01    FILLER
   , val_num  
   , name_txt   "initcap(trim(:name_txt))" 
   , some_dts   DATE "YYYY-Mon-DD HH24:MI" 
   )
----------------------------------------------------------------------
BEGINDATA
"ID_NUM","junk","A_VALUE","FULL_NAME","THE_DATE"
8,"a",37.251,"Joe jones","2009-Aug-18 13:27"
6,b,23.9,Ray Rulon,1984-Jan-23 02:37 
13  ,  c  ,  0.2  ,  Elie Ellison  , 1998-Feb-13 13:28
,d,6.471,sam SmItH,2010-Dec-30 16:13
47,"e",13.23,"","2007-May-02 22:58"    
993,"bad-date",274.51,"Bad, Bob","01/01/2001"
421,"f",24.5,"   Sanders, slY    ","2003-Sep-12 23:59" 
2,"g",1432.8,"Jan Jepson","2005-Jul-05 16:02" 
106,"h","44.732","Liz Loyd",""               
,,,,
1,,,,
27,"i",1.567,"bOb bEnnie bEnson","2000-Mar-30 10:44" 
7,,22.21,"Mel morris","1978-Apr-01 04:10" 
2249,"j",,"Nelson, ned","1997-Jun-05 04:42"
0,"",0,"Empty",






Additional notes on the control file:

The first section of the control-file is the "OPTIONS" -- here you set a few details of the load process. The "skip" option allows you to skip as many rows as you wish at the beginning of the CSV file. Normally it is used to skip one row of column names if they exist. If there are no column names, just specify "skip=0".

The "rows" option allows you to specify how many rows will be loaded between each "Commit". For most circumstances, something like "rows=10000" would work just fine.

The "errors" option allows you to specify how many errors will be tolerated before the whole load process is aborted. If your data is very bad and you just want to load the rows that will load, you could set this to a high value.

The "load" parameter is usually included while testing a load; it tells the process to load a specific number of rows and then quit -- even if the CSV file has many more rows of data.

The next section "LOAD DATA" is where you define the file-names (in single quotes) and the destination table name (not in quotes).

The "INFILE" is your CSV data file that contains the records you wish to load.

The "BADFILE" will contain all the records from the load process that 'errored-off' for some reason.

The "DISCARDFILE" will contain all the records that didn't load because all the columns were NULL or were disqualified by the "WHEN" clause.

The keyword "APPEND" will insert the new records into the table leaving existing records in the table. If you specify "TRUNCATE" instead of "APPEND" then the table will be truncated prior to loading the new records.

You can include an optional "WHEN" clause with a condition that controls which CSV records will be loaded. Some possible examples include: id_num <> '' , id_num = '8' . Remember that this is not a "WHERE" clause and you can't use Oracle functions here -- it is actually quite limited.

The next section defines the delimiter (usually a comma) and the quote marks that can be used to enclose a value. "Trailing Nullcols" tells the loader to interpret missing values at the end of records as NULLs.

The last and very important section of the control file is the column list. Make sure that the order and number of columns match your CSV data -- and that the column names match your destination table (except for "FILLER" columns). So, the position in your control-list tells SQL-Loader which column in the CSV file to get the value from, and its name in your control-list tells SQL-Loader which column in your destination table the value will be placed in.

If you have a CSV column that you wish to ignore, give it any name (junk, dummy, etc.) and put the keyword "FILLER" immediately after. This tells SQL-Loader to ignore that CSV column of data.

Notice that you do not need to insert data into every column in your destination table. Table columns not listed in the control-list will be left NULL.

Date values should be followed with the "DATE" keyword and the appropriate date format string. CSV records that have dates not matching this string will be rejected.

It is also possible to manipulate the data during the load process. Notice that you can use Oracle functions on the values by enclosing the expression in double quotes as shown. Make sure you repeat the column name preceded by a colon inside the function expression.


A few more possibilities in manipulating the data:

 
 Review & Going further... 
 
 In your column list, remember that the number and position of the 
 entries should match your CSV data file.
                                                      
    ( CSV_Col_1    
    , CSV_Col_2    
    , CSV_Col_3  
    , CSV_Col_4  
    , CSV_Col_5  
    )  
 
 Mark the CSV columns that you won't use with the "FILLER" keyword.
 The column names of unused filler data do not matter.
    
    ( CSV_Col_1   FILLER 
    , CSV_Col_2   FILLER 
    , CSV_Col_3  
    , CSV_Col_4  
    , CSV_Col_5  
    )
 
 The columns that you will use need to be labeled with the name of the
 table column where they will be loaded. 
 
    ( CSV_Col_1   FILLER 
    , CSV_Col_2   FILLER 
    , val_num  
    , name_txt  
    , some_dts  
    )
 
 
 Manipulating the CSV Values! 
 
 You can then add modifiers to alter the data. DATE values will
 need to have the format string included after the "DATE" keyword.
 To use incoming CSV values in an expression with Oracle functions,
 you refer to them with the name of their destination column as a 
 bind variable (ex ":name_txt"). With that as the column-name you
 then use Oracle functions and enclose the whole expression in 
 double-quotes as shown on the name_txt line.
 
    ( CSV_Col_1   FILLER 
    , CSV_Col_2   FILLER 
    , val_num  
    , name_txt    "initcap(trim( :name_txt ))"
    , some_dts    DATE "YYYY-MM-DD HH24:MI:SS" 
    )
 
 
 Date Manipulation!
 
 There are a couple of extra things you can do with a date column.
 Instead of using the DATE keyword, you can use the "to_date" function.
 You can even add more functions to manipulate the date as shown on
 the third row below (this would truncate the date and set it to the 
 last day of the previous month). Notice the fourth row shows how to
 assign the SYSDATE to a date column -- incoming CSV values are ignored.
    
    , some_dts    DATE "YYYY-MM-DD HH24:MI:SS" 
    , some_dts    "to_date( :some_dts ,'YYYY-MM-DD HH24:MI:SS')" 
    , some_dts    "trunc(to_date( :some_dts ,'YYYY-MM-DD HH24:MI:SS'),'MM')-1" 
    , some_dts    "sysdate"
   

 Using Constants!

 You can also put a constant number in a numeric field or a constant
 string value in a Varchar2 field -- ignoring the CSV values.   
 Enclose numeric constants in double-quotes; enclose string constants
 in single-quotes then double-quotes.   
 
    ( CSV_Col_1   FILLER 
    , CSV_Col_2   FILLER 
    , val_num     "1234"
    , name_txt    "'Bill Bagnnn'"
    , some_dts    DATE "YYYY-MM-DD HH24:MI:SS" 
    )

 Adding a Sequence!
 
 If you have an extra column in your table where you would like to put
 a sequence number, then put that column name at the end of your list
 and the keyword "SEQUENCE". Actually it appears that you can put it 
 anywhere in your list: first, in-between, or last. It also does not have
 to be a numeric column. The two values in parentheses are the starting
 sequence number and then the increment value. If you are appending to
 a table instead of truncate-and-loading, then you can use "SEQUENCE(MAX,1)".
 That will find the maximum sequence value already in your table and then
 increment from there.
 
    ( id_num      
    , CSV_Col_2   FILLER 
    , val_num  
    , name_txt    "initcap(trim( :name_txt ))"
    , some_dts    DATE "YYYY-MM-DD HH24:MI:SS" 
    , extra_col   SEQUENCE(1,1)
    )
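
 If you are appending to a table that already holds sequence values, that
 last line could instead read (per the note above):

     , extra_col   SEQUENCE(MAX,1)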

 That's it!


Thursday, August 11, 2011

An Example of a Slightly More Complex Pivot Query

In this example we're going to look at a pivot query that uses the information in two columns to decide in which new pivot column the data should be placed. The pivot in this example works in exactly the same way as in the simple-pivot example; there is just a slight difference in syntax.

The data below represents financial information from different months and states. Three different products are shown, with type 1 being a profit and type 2 being a loss. There are two measures: a product count and the total profit or loss.


-----> Example SQL Code Listing-#1:


   WITH
     w_data  as
       ( select  NULL  as mn  /* Month */
              ,  NULL  as st  /* State */             
              ,  NULL  as pd  /* Product */             
              ,  NULL  as tp  /* Type */             
              ,  NULL  as ct  /* Count */             
              ,  NULL  as pf  /* Profit */             
           from  dual  where  1=0
         ----------------------------------------------------------------------
         union all select  '1994-10', 'IL', 'DoDad', 1,  37,  697.45  from dual
         union all select  '1994-10', 'IL', 'DoDad', 2,  13, -303.03  from dual
         union all select  '1994-10', 'IL', 'Gizmo', 2,  23, -558.90  from dual
         union all select  '1994-10', 'IL', 'Gizmo', 2,  21, -468.51  from dual
         union all select  '1994-10', 'IL', 'Wigit', 1,  69,  867.33  from dual
         union all select  '1994-10', 'IL', 'Wigit', 2,  24, -494.64  from dual
         union all select  '1994-10', 'IL', 'Wigit', 2,   4,  -65.92  from dual
         union all select  '1994-10', 'NY', 'DoDad', 1,  71, 1615.96  from dual
         union all select  '1994-10', 'NY', 'DoDad', 1,  45,  840.60  from dual
         union all select  '1994-10', 'NY', 'DoDad', 2,  19, -352.64  from dual
         union all select  '1994-10', 'NY', 'DoDad', 2,  21, -533.19  from dual
         union all select  '1994-10', 'NY', 'DoDad', 2,   4,  -94.28  from dual
         union all select  '1994-10', 'NY', 'Gizmo', 1, 282, 6426.78  from dual
         union all select  '1994-10', 'NY', 'Gizmo', 1, 105, 2272.20  from dual
         union all select  '1994-10', 'NY', 'Gizmo', 1, 221, 4274.14  from dual
         union all select  '1994-10', 'NY', 'Gizmo', 2,   2,  -14.68  from dual
         union all select  '1994-10', 'NY', 'Wigit', 1,  40,  879.60  from dual
         union all select  '1994-10', 'NY', 'Wigit', 1, 285, 6047.70  from dual
         union all select  '1994-11', 'IL', 'DoDad', 1, 196, 2887.08  from dual
         union all select  '1994-11', 'IL', 'DoDad', 1, 101, 2228.06  from dual
         union all select  '1994-11', 'IL', 'DoDad', 1, 333, 5471.19  from dual
         union all select  '1994-11', 'IL', 'DoDad', 2,   7,  -75.95  from dual
         union all select  '1994-11', 'IL', 'DoDad', 2,  25, -419.50  from dual
         union all select  '1994-11', 'IL', 'Gizmo', 1, 255, 4931.70  from dual
         union all select  '1994-11', 'IL', 'Gizmo', 2,  23, -423.43  from dual
         union all select  '1994-11', 'IL', 'Wigit', 1,  36,  700.92  from dual
         union all select  '1994-11', 'IL', 'Wigit', 1, 329, 6385.89  from dual
         union all select  '1994-11', 'IL', 'Wigit', 2,  19, -463.79  from dual
         union all select  '1994-11', 'NY', 'DoDad', 1, 373, 8642.41  from dual
         union all select  '1994-11', 'NY', 'Gizmo', 1,  21,  476.91  from dual
         union all select  '1994-11', 'NY', 'Gizmo', 2,  14,  -99.40  from dual
         union all select  '1994-11', 'NY', 'Wigit', 1, 394, 9550.56  from dual
         union all select  '1994-11', 'NY', 'Wigit', 1,  57, 1046.52  from dual
         union all select  '1994-11', 'NY', 'Wigit', 1, 133, 2386.02  from dual
         union all select  '1994-11', 'NY', 'Wigit', 2,  23, -543.49  from dual
         union all select  '1994-12', 'IL', 'DoDad', 1, 280, 6134.80  from dual
         union all select  '1994-12', 'IL', 'DoDad', 1,  68,  948.60  from dual
         union all select  '1994-12', 'IL', 'DoDad', 2,   6, -148.98  from dual
         union all select  '1994-12', 'IL', 'Gizmo', 1, 252, 5503.68  from dual
         union all select  '1994-12', 'IL', 'Gizmo', 1, 156, 2322.84  from dual
         union all select  '1994-12', 'IL', 'Gizmo', 2,  14, -236.60  from dual
         union all select  '1994-12', 'IL', 'Wigit', 1, 132, 2109.36  from dual
         union all select  '1994-12', 'IL', 'Wigit', 1, 339, 8041.08  from dual
         union all select  '1994-12', 'IL', 'Wigit', 1, 197, 3520.39  from dual
         union all select  '1994-12', 'IL', 'Wigit', 2,  20, -341.60  from dual
         union all select  '1994-12', 'IL', 'Wigit', 2,   4,  -32.20  from dual
         union all select  '1994-12', 'NY', 'DoDad', 1, 308, 5528.60  from dual
         union all select  '1994-12', 'NY', 'DoDad', 1, 335, 6673.20  from dual
         union all select  '1994-12', 'NY', 'DoDad', 1, 109, 2237.77  from dual
         union all select  '1994-12', 'NY', 'DoDad', 2,  17, -326.91  from dual
         union all select  '1994-12', 'NY', 'DoDad', 2,   4,  -83.32  from dual
         union all select  '1994-12', 'NY', 'DoDad', 2,   4,  -51.04  from dual
         union all select  '1994-12', 'NY', 'Wigit', 1, 258, 3397.86  from dual
         union all select  '1994-12', 'NY', 'Wigit', 1,  23,  302.91  from dual
         union all select  '1994-12', 'NY', 'Wigit', 2,   9, -170.73  from dual
         union all select  '1994-12', 'NY', 'Gizmo', 1, 296, 3797.68  from dual
         union all select  '1994-12', 'NY', 'Gizmo', 2,  17, -131.41  from dual
         union all select  '1994-12', 'NY', 'Gizmo', 2,  17, -158.78  from dual
         union all select  '1994-12', 'NY', 'Gizmo', 2,   1,  -13.38  from dual
         ----------------------------------------------------------------------
       )
   select *
     from  w_data ;



So, here's the pivot query. The "FOR" keyword precedes the two columns that contain the values that will be used to determine the pivot groups. Then, following the "IN" keyword, is the list of those values and the name of each pivot-group. Notice that there are two measures and that each pivot-group will include both measures, creating twelve (6 x 2 = 12) new columns. This query displays the month, state, and then all twelve new pivot-columns.


-----> Example SQL Code Listing-#2:
   
     
   WITH
     w_data  as
       ( select  NULL  as mn  /* Month */
              ,  NULL  as st  /* State */             
              ,  NULL  as pd  /* Product */             
              ,  NULL  as tp  /* Type */             
              ,  NULL  as ct  /* Count */             
              ,  NULL  as pf  /* Profit */             
           from  dual  where  1=0
         ----------------------------------------------------------------------
         union all select  '1994-10', 'IL', 'DoDad', 1,  37,  697.45  from dual
         union all select  '1994-10', 'IL', 'DoDad', 2,  13, -303.03  from dual
         union all select  '1994-10', 'IL', 'Gizmo', 2,  23, -558.90  from dual
         union all select  '1994-10', 'IL', 'Gizmo', 2,  21, -468.51  from dual
         union all select  '1994-10', 'IL', 'Wigit', 1,  69,  867.33  from dual
         union all select  '1994-10', 'IL', 'Wigit', 2,  24, -494.64  from dual
         union all select  '1994-10', 'IL', 'Wigit', 2,   4,  -65.92  from dual
         union all select  '1994-10', 'NY', 'DoDad', 1,  71, 1615.96  from dual
         union all select  '1994-10', 'NY', 'DoDad', 1,  45,  840.60  from dual
         union all select  '1994-10', 'NY', 'DoDad', 2,  19, -352.64  from dual
         union all select  '1994-10', 'NY', 'DoDad', 2,  21, -533.19  from dual
         union all select  '1994-10', 'NY', 'DoDad', 2,   4,  -94.28  from dual
         union all select  '1994-10', 'NY', 'Gizmo', 1, 282, 6426.78  from dual
         union all select  '1994-10', 'NY', 'Gizmo', 1, 105, 2272.20  from dual
         union all select  '1994-10', 'NY', 'Gizmo', 1, 221, 4274.14  from dual
         union all select  '1994-10', 'NY', 'Gizmo', 2,   2,  -14.68  from dual
         union all select  '1994-10', 'NY', 'Wigit', 1,  40,  879.60  from dual
         union all select  '1994-10', 'NY', 'Wigit', 1, 285, 6047.70  from dual
         union all select  '1994-11', 'IL', 'DoDad', 1, 196, 2887.08  from dual
         union all select  '1994-11', 'IL', 'DoDad', 1, 101, 2228.06  from dual
         union all select  '1994-11', 'IL', 'DoDad', 1, 333, 5471.19  from dual
         union all select  '1994-11', 'IL', 'DoDad', 2,   7,  -75.95  from dual
         union all select  '1994-11', 'IL', 'DoDad', 2,  25, -419.50  from dual
         union all select  '1994-11', 'IL', 'Gizmo', 1, 255, 4931.70  from dual
         union all select  '1994-11', 'IL', 'Gizmo', 2,  23, -423.43  from dual
         union all select  '1994-11', 'IL', 'Wigit', 1,  36,  700.92  from dual
         union all select  '1994-11', 'IL', 'Wigit', 1, 329, 6385.89  from dual
         union all select  '1994-11', 'IL', 'Wigit', 2,  19, -463.79  from dual
         union all select  '1994-11', 'NY', 'DoDad', 1, 373, 8642.41  from dual
         union all select  '1994-11', 'NY', 'Gizmo', 1,  21,  476.91  from dual
         union all select  '1994-11', 'NY', 'Gizmo', 2,  14,  -99.40  from dual
         union all select  '1994-11', 'NY', 'Wigit', 1, 394, 9550.56  from dual
         union all select  '1994-11', 'NY', 'Wigit', 1,  57, 1046.52  from dual
         union all select  '1994-11', 'NY', 'Wigit', 1, 133, 2386.02  from dual
         union all select  '1994-11', 'NY', 'Wigit', 2,  23, -543.49  from dual
         union all select  '1994-12', 'IL', 'DoDad', 1, 280, 6134.80  from dual
         union all select  '1994-12', 'IL', 'DoDad', 1,  68,  948.60  from dual
         union all select  '1994-12', 'IL', 'DoDad', 2,   6, -148.98  from dual
         union all select  '1994-12', 'IL', 'Gizmo', 1, 252, 5503.68  from dual
         union all select  '1994-12', 'IL', 'Gizmo', 1, 156, 2322.84  from dual
         union all select  '1994-12', 'IL', 'Gizmo', 2,  14, -236.60  from dual
         union all select  '1994-12', 'IL', 'Wigit', 1, 132, 2109.36  from dual
         union all select  '1994-12', 'IL', 'Wigit', 1, 339, 8041.08  from dual
         union all select  '1994-12', 'IL', 'Wigit', 1, 197, 3520.39  from dual
         union all select  '1994-12', 'IL', 'Wigit', 2,  20, -341.60  from dual
         union all select  '1994-12', 'IL', 'Wigit', 2,   4,  -32.20  from dual
         union all select  '1994-12', 'NY', 'DoDad', 1, 308, 5528.60  from dual
         union all select  '1994-12', 'NY', 'DoDad', 1, 335, 6673.20  from dual
         union all select  '1994-12', 'NY', 'DoDad', 1, 109, 2237.77  from dual
         union all select  '1994-12', 'NY', 'DoDad', 2,  17, -326.91  from dual
         union all select  '1994-12', 'NY', 'DoDad', 2,   4,  -83.32  from dual
         union all select  '1994-12', 'NY', 'DoDad', 2,   4,  -51.04  from dual
         union all select  '1994-12', 'NY', 'Wigit', 1, 258, 3397.86  from dual
         union all select  '1994-12', 'NY', 'Wigit', 1,  23,  302.91  from dual
         union all select  '1994-12', 'NY', 'Wigit', 2,   9, -170.73  from dual
         union all select  '1994-12', 'NY', 'Gizmo', 1, 296, 3797.68  from dual
         union all select  '1994-12', 'NY', 'Gizmo', 2,  17, -131.41  from dual
         union all select  '1994-12', 'NY', 'Gizmo', 2,  17, -158.78  from dual
         union all select  '1994-12', 'NY', 'Gizmo', 2,   1,  -13.38  from dual
         ----------------------------------------------------------------------
       )
     select  mn,  st
          ,  dodad_p_cnt,  gizmo_p_cnt,  wigit_p_cnt
          ,  dodad_p_amt,  gizmo_p_amt,  wigit_p_amt
          ,  dodad_l_cnt,  gizmo_l_cnt,  wigit_l_cnt
          ,  dodad_l_amt,  gizmo_l_amt,  wigit_l_amt
       from  w_data
      PIVOT  ( sum(ct) as cnt
             , sum(pf) as amt
             FOR ( pd, tp )
             IN  ( ( 'DoDad', 1 )  as dodad_p
                 , ( 'DoDad', 2 )  as dodad_l
                 , ( 'Gizmo', 1 )  as gizmo_p
                 , ( 'Gizmo', 2 )  as gizmo_l
                 , ( 'Wigit', 1 )  as wigit_p
                 , ( 'Wigit', 2 )  as wigit_l
                 )
             )
      order  by mn, st
     ;



And here is the result:



For each month and state's aggregated output row, all the input rows that match the values for a particular pivot-group are grouped together to find the measures. For example, if you look at the last 3 lines of input data, they will all aggregate into the December and New York output line (the last one in the result). Actually, the last 13 lines of input data will aggregate into the last row of the result; however, the last 3 lines are also part of the "('Gizmo', 2)" pivot-group, so they will show up only in the Gizmo-Loss columns. The counts and amounts of these last 3 lines are summed up and appear on the last line of the result. The counts total up to 35 and appear as shown by the red arrow; the amounts total up to a 303.57 loss and appear as shown by the green arrow.

Monday, August 8, 2011

Oracle's PIVOT Clause -- A Simple Example

Oracle's PIVOT clause allows us to take information that is 'stored in rows' and 'pivot-it-out' so that it is 'stored in columns'.

We need a column (or set of columns) that contains a finite number of identifying values -- these values will determine what new columns are created. In the following example the "qtr" column contains the yearly quarter. Instead of showing our data 'vertically' as shown below, we wish to take the information in the "qtr" column and spread our data out 'horizontally' so we have a column (or set of columns) for each quarter.




Here is our sample data and a simple PIVOT query that accomplishes this pivot by quarter.


   WITH  
     w_data  as
       ( select   11 acct,  1 qtr,  23.29 m1   from dual union all
         select   11     ,  2    ,  81.72      from dual union all
         select   11     ,  3    ,  16.72      from dual union all
         select   11     ,  4    ,  20.00      from dual union all
         select   11     ,  4    ,  20.00      from dual union all
         select   11     ,  4    ,  20.00      from dual union all
         select   35     ,  2    ,  11.53      from dual union all
         select   35     ,  3    ,  51.27      from dual union all
         select   35     ,  4    ,  76.14      from dual union all
         select   44     ,  1    ,  53.63      from dual union all
         select   44     ,  2    ,  38.58      from dual union all
         select   44     ,  3    ,  29.76      from dual union all
         select   44     ,  4    ,  47.12      from dual 
       )
   select  *
     from  w_data
    PIVOT  ( sum(m1)   as sm
           , count(*)  as ct
           FOR  qtr  IN 
              ( 1 as q1
              , 2 as q2
              , 3 as q3
              , 4 as q4 
              )
           ) 
    order  by acct ;


And here is the result from running the above query:


Explanation:

First off -- the PIVOT clause is an aggregate operation. It's like having a "Group By" clause in your query without having one spelled out. Any column that is not used inside the "PIVOT" clause is treated as a "Group By" attribute. In this example the "acct" column is the only one not used in the pivot clause, and thus the query will "Group By" the "acct" values -- we see that we get one final row for each "acct" value.

The first part of the PIVOT clause contains the definitions of one or more measures (comma separated). In this case we define the "sum(m1)" as a measure called "sm" and a count of the rows in each group as a measure called "ct". If you define only one measure, there is no need to specify an alias name; with two or more measures, alias names must be defined.

After the "FOR" key-word, the column (or set of columns) that contains the identifiers that will determine our newly generated 'pivot' columns must be named. In this case the "qtr" column will contain the pivot-column identifiers.

After the "IN" key-word a list of all the identifiers that are of interest is included in parenthesis with their accompanying alias names. Depending on the data and number of identifying columns listed in the "FOR" section, alias names may or may not be required. When the query is executed a "pivot-group" is created for each distinct value in this list.

A new PIVOT column is generated for each defined measure in each "pivot-group". In this case we have 4 "pivot-groups" and 2 measures -- so 8 new columns are generated (4 x 2 = 8). The newly generated column names are created by joining the "pivot-group" aliases to the measure aliases with an underscore.

The "m1" measure is aggregated -- the values are added up if there is more than one value for each "acct" and "qtr". The "ct" columns show how many values were aggregated for each "acct" and "qtr". In the select clause of the query, you may choose to display as many or as few of the newly generated pivot columns as you wish to display -- in any order.



Here's one final example showing that you can display the columns in any order, and that you can even use the columns as normal columns (i.e. using them as part of a calculation).

 
   /* Include the WITH clause sample data from above here. */ 
   select  acct
        ,  q1_sm , q2_sm , q3_sm , q4_sm
        ,  q1_ct + q2_ct + q3_ct + q4_ct  as ttl_ct
     from  w_data
    PIVOT  ( sum(m1)   as sm
           , count(*)  as ct
           FOR  qtr  in 
              ( 1 as q1
              , 2 as q2
              , 3 as q3
              , 4 as q4 
              )
           ) 
    order  by acct ;




Thursday, August 4, 2011

A basic Hierarchical (Connect-By) Query

Here is a sample data-set that shows a list of people with their associated "ID" numbers. The "prnt" column defines a hierarchical relationship by giving the "ID" number of the person directly above them in the hierarchy (i.e. their boss or parent). Notice that the first two records don't have a 'parent' -- they are top records in the hierarchy.

/* --- code --- */


     WITH  
       w_data  as 
         ( select   1 id, 'Albert'    nm, null prnt  from dual union all
           select   2   , 'Rhonda'      , null       from dual union all
           select   3   , 'Ralph'       ,    1       from dual union all
           select   4   , 'Erma'        ,    1       from dual union all
           select   5   , 'Leopold'     ,    2       from dual union all
           select   6   , 'Zanthia'     ,    3       from dual union all
           select   7   , 'Reginald'    ,    3       from dual union all
           select   8   , 'Hillary'     ,    4       from dual union all
           select   9   , 'Orpheus'     ,    4       from dual union all
           select  10   , 'Linda'       ,    4       from dual union all
           select  11   , 'Ronald'      ,    5       from dual union all
           select  12   , 'Deborah'     ,    5       from dual union all
           select  13   , 'Andrew'      ,    6       from dual union all
           select  14   , 'Maggie'      ,    6       from dual union all
           select  15   , 'Fletcher'    ,    6       from dual union all
           select  16   , 'Wanda'       ,    7       from dual union all
           select  17   , 'Dexter'      ,    8       from dual union all
           select  18   , 'Thelma'      ,    9       from dual union all
           select  19   , 'Wilson'      ,    9       from dual union all
           select  20   , 'Amanda'      ,   10       from dual union all
           select  21   , 'Thor'        ,   10       from dual union all
           select  22   , 'Glenda'      ,   11       from dual union all
           select  23   , 'Norman'      ,   11       from dual union all
           select  24   , 'Edith'       ,   11       from dual union all
           select  25   , 'Arvid'       ,   12       from dual union all
           select  26   , 'Wilma'       ,   12       from dual union all
           select  27   , 'Frederic'    ,   12       from dual union all
           select  28   , 'Unice'       ,   12       from dual 
         ) 
     select  *  from  w_data ;




Now, here's a simple query that uses the prior data and constructs a hierarchical list that clearly shows the parent-child connections. An explanation and the sample output follows. To run this query, combine the sample data from the WITH clause above with the query below.

/* --- code --- */
        
  
    /* Include the WITH clause sample data from above here. */      
    select  lpad( level, level*2 ) as "Level"
         ,  level 
         ,  id
         ,  nm
         ,  prnt
         ,  sys_connect_by_path( nm , ', ' ) as "Path"
      from  w_data
    START WITH ( id in (1,2) )
    CONNECT BY ( prnt = PRIOR id ) 
    ;



The components of the query are described below:

"START WITH"  The query needs to know how to locate the top record or records in the hierarchy. In this case it specifically states that the ID number must be in the list: (1,2). Any properly formed condition would work:
  • START WITH (prnt is NULL)                               -- those who don't have a parent listed
  • START WITH (nm='Albert' or nm='Rhonda')         -- a list of specific names
  • START WITH  (id in ( select id from w_data where prnt is NULL ))      -- even a query can work
You don't even have to 'start with' the top records in the source data. You can specify any record and the results will show that record as the top record and then trace the hierarchy down from there. Try the following and observe that you only get back the specified records and the ones below them.
  • START WITH (id in (3,9))
  • START WITH (nm in ('Linda','Reginald'))
"CONNECT BY"  There needs to be a clear definition of how a parent record connects to a child record. The key word "PRIOR" is used to refer to the previous or 'higher' record in the hierarchy. In this case the "prnt" column of a record points to the PRIOR or higher record's "id" column. It doesn't matter which of the following you use, they both say the same thing.
  • CONNECT BY (prnt = PRIOR id)
  • CONNECT BY (PRIOR id = prnt)
However if you put the "PRIOR" in front of the "prnt" then you are trying to 'walk' the hierarchy backwards. (we'll get to an example of that).

"LEVEL" is a key word that indicates how deep that particular row will occur in the hierarchy. Note that the second column simply displays the "LEVEL" value where the first column uses the "LEVEL" value to print itself with additional spacing as it goes deeper & deeper into the hierarchy. This gives a nice visual component to the hierarchy that allows us to easily see its structure.

"SYS_CONNECT_BY_PATH"  is a hierarchical function that returns the complete path from the top record to the current record. In this case we are looking at a list of comma separated names. If we would like to see the path as "ID" numbers, we could use the following:
  • sys_connect_by_path(id,'/')
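
For instance, here is a quick variation of the query above that shows the path as ID numbers (just a sketch; the "id_path" alias is my own):

/* --- code --- */

     /* Include the WITH clause sample data from above here. */
     select  id
          ,  nm
          ,  sys_connect_by_path( id, '/' )  as id_path
       from  w_data
     START WITH ( id in (1,2) )
     CONNECT BY ( prnt = PRIOR id )
     ;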




Here are a few other things you can do with simple Connect-By queries:

/* --- code --- */
    
      
     WITH  
       w_data  as 
         ( select   1 id, 'Albert'    nm, null prnt  from dual union all
           select   2   , 'Rhonda'      , null       from dual union all
           select   3   , 'Ralph'       ,    1       from dual union all
           select   4   , 'Erma'        ,    1       from dual union all
           select   5   , 'Leopold'     ,    2       from dual union all
           select   6   , 'Zanthia'     ,    3       from dual union all
           select   7   , 'Reginald'    ,    3       from dual union all
           select   8   , 'Hillary'     ,    4       from dual union all
           select   9   , 'Ronald'      ,    5       from dual union all
           select  10   , 'Deborah'     ,    5       from dual 
         ) 
     SELECT  lpad( level, level*3 ) as "Level"
          ,  id
          ,  nm
          ,  prnt
          ,  substr(sys_connect_by_path(nm,', '),3)  as "Path"
          ,  connect_by_root nm  as top_boss
          ,  connect_by_isleaf   as lowest
       from  w_data
     START WITH ( id in (1,2) )
     CONNECT BY ( prnt = PRIOR id  ) 
      order  siblings by id
     ;
   


First of all, we've fixed the leading comma in the "Path" column by using a substring function.

"CONNECT_BY_ROOT"  This opperator simply gives us the name, id number, or whatever it is we specify for the top-level record of the current hierarchy.

"CONNECT_BY_ISLEAF"  is a 0/1 flag that simply tells us if the current row is the lowest one on this branch of the hierarchy.

"ORDER SIBLINGS BY"  is just a modified version of the normal "Order By" clause. This allows us to order the sibling records of any parent by one of the attributes. In this example we may want to order the siblings by either id number or name. Notice below that Ralph and Erma are siblings and we have specified ordering them by id (3, 4). If we changed to ordering them by name then Erma's entire branch would come before Ralph's.




And finally, in the following example, we've selected all the "ID" numbers that are never listed as parents and used this as our "START WITH" values. We're going to walk-the-tree backwards, or turn the hierarchy upside-down. The only other thing we need to change is the "PRIOR" key word -- we'll now place it in front of the "prnt" column. This effectively reverses our hierarchy.

/* --- code --- */
          

     WITH  
       w_data  as 
         ( select   1 id, 'Albert'    nm, null prnt  from dual union all
           select   2   , 'Rhonda'      , null       from dual union all
           select   3   , 'Ralph'       ,    1       from dual union all
           select   4   , 'Erma'        ,    1       from dual union all
           select   5   , 'Leopold'     ,    2       from dual union all
           select   6   , 'Zanthia'     ,    3       from dual union all
           select   7   , 'Reginald'    ,    3       from dual union all
           select   8   , 'Hillary'     ,    4       from dual union all
           select   9   , 'Ronald'      ,    5       from dual union all
           select  10   , 'Deborah'     ,    5       from dual 
         ) 
     SELECT  lpad( level, level*3 ) as "Level"
          ,  id
          ,  nm
          ,  prnt
          ,  substr(sys_connect_by_path(nm,', '),3)  as "Path"
          ,  connect_by_root nm  as top_boss
          ,  connect_by_isleaf   as lowest
       from  w_data
     START WITH (id in (select id from w_data minus select prnt from w_data))
     CONNECT BY ( PRIOR prnt = id ) 
      order  siblings by id
     ;

   

Note the results. The lowest members are now listed at level 1, and the path walks its way down to the original parents at the bottom.

Tuesday, August 2, 2011

Analytic Functions: LAG( ) -- Grab a value from a previous row.

The LAG function can be used to return a value from the previous row. This becomes useful if, for example, you want to calculate a difference between a previous measurement and a current one. Of course this implies that you properly define the order in which the rows are to occur. The LAG function, like other analytic functions, can operate over the whole data-set; or groups (called partitions) can be defined so that, for a particular row, the LAG function only looks for values within a specified subset of the entire data set.

Let's take for instance the following data set: measurements taken for two different people at different times.

-----> Example SQL Code Listing-#1:

   WITH
     w_collected_data  as
       ( select  NULL  as nm
              ,  NULL  as dts
              ,  NULL  as measure
           from  dual  where  1=0
         -----------------------------------------------------------------------------------------------
         union all select 'Liz', to_date('2011-05-12 08:12:00','YYYY-MM-DD HH24:MI:SS'), 31.4  from dual 
         union all select 'Liz', to_date('2011-05-12 10:21:00','YYYY-MM-DD HH24:MI:SS'), 31.1  from dual 
         union all select 'Bob', to_date('2011-05-12 10:23:00','YYYY-MM-DD HH24:MI:SS'), 23.5  from dual 
         union all select 'Bob', to_date('2011-05-12 11:32:00','YYYY-MM-DD HH24:MI:SS'), 24.9  from dual 
         union all select 'Liz', to_date('2011-05-12 12:16:00','YYYY-MM-DD HH24:MI:SS'), 30.7  from dual 
         union all select 'Bob', to_date('2011-05-12 12:44:00','YYYY-MM-DD HH24:MI:SS'), 26.1  from dual 
         union all select 'Bob', to_date('2011-05-12 13:18:00','YYYY-MM-DD HH24:MI:SS'), 27.3  from dual 
         union all select 'Bob', to_date('2011-05-12 14:27:00','YYYY-MM-DD HH24:MI:SS'), 27.1  from dual 
         union all select 'Liz', to_date('2011-05-12 14:36:00','YYYY-MM-DD HH24:MI:SS'), 29.5  from dual 
         union all select 'Bob', to_date('2011-05-12 15:13:00','YYYY-MM-DD HH24:MI:SS'), 26.8  from dual 
         union all select 'Liz', to_date('2011-05-12 17:01:00','YYYY-MM-DD HH24:MI:SS'), 29.2  from dual
         -----------------------------------------------------------------------------------------------
       )
   select  *
     from  w_collected_data
    order  by  nm, dts
   ;

-------------------------
-- Results
-------------------------

   NM   DTS                  MEASURE
   ---  -------------------  -------
   Bob  2011-05-12 10:23:00  23.5
   Bob  2011-05-12 11:32:00  24.9
   Bob  2011-05-12 12:44:00  26.1
   Bob  2011-05-12 13:18:00  27.3
   Bob  2011-05-12 14:27:00  27.1
   Bob  2011-05-12 15:13:00  26.8
   Liz  2011-05-12 08:12:00  31.4
   Liz  2011-05-12 10:21:00  31.1
   Liz  2011-05-12 12:16:00  30.7
   Liz  2011-05-12 14:36:00  29.5
   Liz  2011-05-12 17:01:00  29.2



Now let's suppose that we want to find the difference between one measurement to the next.

-----> Example SQL Code Listing-#2:

   WITH
     w_collected_data  as
       ( select  NULL  as nm
              ,  NULL  as dts
              ,  NULL  as measure
           from  dual  where  1=0
         -----------------------------------------------------------------------------------------------
         union all select 'Liz', to_date('2011-05-12 08:12:00','YYYY-MM-DD HH24:MI:SS'), 31.4  from dual 
         union all select 'Liz', to_date('2011-05-12 10:21:00','YYYY-MM-DD HH24:MI:SS'), 31.1  from dual 
         union all select 'Bob', to_date('2011-05-12 10:23:00','YYYY-MM-DD HH24:MI:SS'), 23.5  from dual 
         union all select 'Bob', to_date('2011-05-12 11:32:00','YYYY-MM-DD HH24:MI:SS'), 24.9  from dual 
         union all select 'Liz', to_date('2011-05-12 12:16:00','YYYY-MM-DD HH24:MI:SS'), 30.7  from dual 
         union all select 'Bob', to_date('2011-05-12 12:44:00','YYYY-MM-DD HH24:MI:SS'), 26.1  from dual 
         union all select 'Bob', to_date('2011-05-12 13:18:00','YYYY-MM-DD HH24:MI:SS'), 27.3  from dual 
         union all select 'Bob', to_date('2011-05-12 14:27:00','YYYY-MM-DD HH24:MI:SS'), 27.1  from dual 
         union all select 'Liz', to_date('2011-05-12 14:36:00','YYYY-MM-DD HH24:MI:SS'), 29.5  from dual 
         union all select 'Bob', to_date('2011-05-12 15:13:00','YYYY-MM-DD HH24:MI:SS'), 26.8  from dual 
         union all select 'Liz', to_date('2011-05-12 17:01:00','YYYY-MM-DD HH24:MI:SS'), 29.2  from dual
         -----------------------------------------------------------------------------------------------
       )
   select  nm
        ,  dts
        ,  measure
        ,            lag( measure ) over ( order by dts )  as previous
        ,  measure - lag( measure ) over ( order by dts )  as msr_increase
     from  w_collected_data
    order  by  dts
   ;

-------------------------
-- Results
-------------------------

   NM   DTS                  MEASURE  PREVIOUS  MSR_INCREASE
   ---  -------------------  -------  --------  ------------
   Liz  2011-05-12 08:12:00  31.4
   Liz  2011-05-12 10:21:00  31.1     31.4      -0.3
   Bob  2011-05-12 10:23:00  23.5     31.1      -7.6
   Bob  2011-05-12 11:32:00  24.9     23.5       1.4
   Liz  2011-05-12 12:16:00  30.7     24.9       5.8
   Bob  2011-05-12 12:44:00  26.1     30.7      -4.6
   Bob  2011-05-12 13:18:00  27.3     26.1       1.2
   Bob  2011-05-12 14:27:00  27.1     27.3      -0.2
   Liz  2011-05-12 14:36:00  29.5     27.1       2.4
   Bob  2011-05-12 15:13:00  26.8     29.5      -2.7
   Liz  2011-05-12 17:01:00  29.2     26.8       2.4


Notice that the "lag(measure)" gives us the previous measure where the "over (order by dts)" defines how the rows are to be sorted when finding the previous row. The entire analytic function given by "lag(measure) over (order by dts)" returns the previous row's measure value if the rows were to be sorted by "dts". The first row has no previous value because there was no previous row to pull it from -- it's value is set to NULL.

In an analytic function, the "over (order by ...)" clause only determines which row the value is taken from -- it does not determine how the final output data-set is sorted; that is always controlled by the query's own "ORDER BY" clause.

Also, notice that this new analytic function, "lag(measure) over (order by dts)", can now be used in a calculation (the measure minus the previous measure) to show how much the measure changed from one row to the next.
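
To see that the two sorts are independent, here is a minimal sketch (it re-uses the same "w_collected_data" sub-query from Listing-#2): the LAG still looks back by "dts", even though the final result set is sorted by "measure".

/* --- code --- */

   /* Include the "w_collected_data" sub-query from Listing-#2 here */
   select  nm
        ,  dts
        ,  measure
        ,  lag( measure ) over ( order by dts )  as previous  /* previous row by "dts" */
     from  w_collected_data
    order  by  measure  /* final output sorted by "measure", not by "dts" */
   ;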



Next we might realize that Bob's measurements don't relate to Liz's measurements. We would like to look at the two people's measurements separately -- not intermixed. This means we need to split up the data by the "NM" column -- in analytic functions we do this with the "partition by" keyword inside the "over" clause.

-----> Example SQL Code Listing-#3:

   WITH
     w_collected_data  as
       ( select  NULL  as nm
              ,  NULL  as dts
              ,  NULL  as measure
           from  dual  where  1=0
         -----------------------------------------------------------------------------------------------
         union all select 'Liz', to_date('2011-05-12 08:12:00','YYYY-MM-DD HH24:MI:SS'), 31.4  from dual 
         union all select 'Liz', to_date('2011-05-12 10:21:00','YYYY-MM-DD HH24:MI:SS'), 31.1  from dual 
         union all select 'Bob', to_date('2011-05-12 10:23:00','YYYY-MM-DD HH24:MI:SS'), 23.5  from dual 
         union all select 'Bob', to_date('2011-05-12 11:32:00','YYYY-MM-DD HH24:MI:SS'), 24.9  from dual 
         union all select 'Liz', to_date('2011-05-12 12:16:00','YYYY-MM-DD HH24:MI:SS'), 30.7  from dual 
         union all select 'Bob', to_date('2011-05-12 12:44:00','YYYY-MM-DD HH24:MI:SS'), 26.1  from dual 
         union all select 'Bob', to_date('2011-05-12 13:18:00','YYYY-MM-DD HH24:MI:SS'), 27.3  from dual 
         union all select 'Bob', to_date('2011-05-12 14:27:00','YYYY-MM-DD HH24:MI:SS'), 27.1  from dual 
         union all select 'Liz', to_date('2011-05-12 14:36:00','YYYY-MM-DD HH24:MI:SS'), 29.5  from dual 
         union all select 'Bob', to_date('2011-05-12 15:13:00','YYYY-MM-DD HH24:MI:SS'), 26.8  from dual 
         union all select 'Liz', to_date('2011-05-12 17:01:00','YYYY-MM-DD HH24:MI:SS'), 29.2  from dual
         -----------------------------------------------------------------------------------------------
       )
   select  nm
        ,  dts
        ,  measure
        ,               lag( measure ) over (partition by nm order by dts)  as previous
        ,     measure - lag( measure ) over (partition by nm order by dts)  as msr_increase
        ,  round((dts - lag( dts     ) over (partition by nm order by dts))*24*60)  as min_between
     from  w_collected_data
    order  by  nm, dts
   ;

-------------------------
-- Results
-------------------------

   NM   DTS                  MEASURE  PREVIOUS  MSR_INCREASE  MIN_BETWEEN
   ---  -------------------  -------  --------  ------------  -----------
   Bob  2011-05-12 10:23:00  23.5
   Bob  2011-05-12 11:32:00  24.9     23.5       1.4           69
   Bob  2011-05-12 12:44:00  26.1     24.9       1.2           72
   Bob  2011-05-12 13:18:00  27.3     26.1       1.2           34
   Bob  2011-05-12 14:27:00  27.1     27.3      -0.2           69
   Bob  2011-05-12 15:13:00  26.8     27.1      -0.3           46
   Liz  2011-05-12 08:12:00  31.4
   Liz  2011-05-12 10:21:00  31.1     31.4      -0.3          129
   Liz  2011-05-12 12:16:00  30.7     31.1      -0.4          115
   Liz  2011-05-12 14:36:00  29.5     30.7      -1.2          140
   Liz  2011-05-12 17:01:00  29.2     29.5      -0.3          145


Now we see two separate groups of data, one for Bob and one for Liz. In an analytic function we can 'partition by' multiple values/columns and sort by multiple values/columns. This makes analytic functions very powerful. For instance, we might have a LAG function that looks like this: "lag(measure) over (partition by region, store, clerk order by product, sales_date)".

Additionally, the previous example added a column showing how many minutes elapsed between measurements, using this LAG function: "lag(dts) over (partition by nm order by dts)".
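
As a rough sketch of that multi-column form (the table and column names here -- SALES, REGION, STORE, CLERK, PRODUCT, SALES_DATE, MEASURE -- are hypothetical, used only to illustrate the shape of the "partition by" and "order by" lists):

/* --- code --- */

   /* Hypothetical SALES table -- names are for illustration only */
   select  region, store, clerk, product, sales_date, measure
        ,  lag( measure ) over ( partition by  region, store, clerk
                                 order by      product, sales_date )  as previous
     from  sales
    order  by  region, store, clerk, product, sales_date
   ;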



Another feature of the LAG function is that you can specify how many rows to 'look back'. The default is 1 -- the previous row; but you can change that to 2, as shown in the following example. A third parameter can also be supplied as a default value (instead of NULL) for rows that don't have a previous value.

-----> Example SQL Code Listing-#4:

   WITH
     w_collected_data  as
       ( select  NULL  as nm
              ,  NULL  as dts
              ,  NULL  as measure
           from  dual  where  1=0
         -----------------------------------------------------------------------------------------------
         union all select 'Liz', to_date('2011-05-12 08:12:00','YYYY-MM-DD HH24:MI:SS'), 31.4  from dual 
         union all select 'Liz', to_date('2011-05-12 10:21:00','YYYY-MM-DD HH24:MI:SS'), 31.1  from dual 
         union all select 'Bob', to_date('2011-05-12 10:23:00','YYYY-MM-DD HH24:MI:SS'), 23.5  from dual 
         union all select 'Bob', to_date('2011-05-12 11:32:00','YYYY-MM-DD HH24:MI:SS'), 24.9  from dual 
         union all select 'Liz', to_date('2011-05-12 12:16:00','YYYY-MM-DD HH24:MI:SS'), 30.7  from dual 
         union all select 'Bob', to_date('2011-05-12 12:44:00','YYYY-MM-DD HH24:MI:SS'), 26.1  from dual 
         union all select 'Bob', to_date('2011-05-12 13:18:00','YYYY-MM-DD HH24:MI:SS'), 27.3  from dual 
         union all select 'Bob', to_date('2011-05-12 14:27:00','YYYY-MM-DD HH24:MI:SS'), 27.1  from dual 
         union all select 'Liz', to_date('2011-05-12 14:36:00','YYYY-MM-DD HH24:MI:SS'), 29.5  from dual 
         union all select 'Bob', to_date('2011-05-12 15:13:00','YYYY-MM-DD HH24:MI:SS'), 26.8  from dual 
         union all select 'Liz', to_date('2011-05-12 17:01:00','YYYY-MM-DD HH24:MI:SS'), 29.2  from dual
         -----------------------------------------------------------------------------------------------
       )
   select  nm
        ,  dts
        ,  measure
        ,  lag( measure, 2, 0 ) over (partition by nm order by dts)  as previous
     from  w_collected_data
    order  by  nm, dts
   ;

-------------------------
-- Results
-------------------------

   NM   DTS                  MEASURE  PREVIOUS
   ---  -------------------  -------  --------
   Bob  2011-05-12 10:23:00  23.5     0
   Bob  2011-05-12 11:32:00  24.9     0
   Bob  2011-05-12 12:44:00  26.1     23.5
   Bob  2011-05-12 13:18:00  27.3     24.9
   Bob  2011-05-12 14:27:00  27.1     26.1
   Bob  2011-05-12 15:13:00  26.8     27.3
   Liz  2011-05-12 08:12:00  31.4     0
   Liz  2011-05-12 10:21:00  31.1     0
   Liz  2011-05-12 12:16:00  30.7     31.4
   Liz  2011-05-12 14:36:00  29.5     31.1
   Liz  2011-05-12 17:01:00  29.2     30.7




One last example. Here's a data-set showing the ID numbers of 3 different runners passing through checkpoints 0-3. A LAG function is used to obtain the previous "clock" value, and a calculation on that value shows the minutes between each checkpoint.

-----> Example SQL Code Listing-#5:

   WITH
     w_check_points  as
       ( select  NULL  as id
              ,  NULL  as ckpt
              ,  NULL  as clock
           from  dual  where 1=0
         ----------------------------------------------
         union all select  12, 0, '00:00:00'  from dual 
         union all select  47, 0, '00:00:00'  from dual 
         union all select  83, 0, '00:00:00'  from dual 
         union all select  12, 1, '01:00:16'  from dual 
         union all select  47, 1, '00:43:27'  from dual 
         union all select  83, 1, '00:40:18'  from dual 
         union all select  12, 2, '02:12:47'  from dual 
         union all select  47, 2, '01:32:53'  from dual 
         union all select  83, 2, '01:49:27'  from dual 
         union all select  12, 3, '03:27:14'  from dual 
         union all select  47, 3, '02:25:04'  from dual 
         union all select  83, 3, '03:31:38'  from dual
         ----------------------------------------------
       )
   select  id
        ,  ckpt
        ,  clock
        ,  lag(clock) over (partition by id  order by ckpt)  as prev_clock
        ,  round
             ( ( to_date(clock, 'HH24:MI:SS') 
               - to_date(lag(clock) over (partition by id order by ckpt), 'HH24:MI:SS')
               ) * 24 * 60
             , 3 
             )  as min
     from  w_check_points
    order  by  id, ckpt
   ;

-------------------------
-- Results
-------------------------

   ID   CKPT  CLOCK     PREV_CLOCK  MIN
   ---  ----  --------  ----------  ---------
   12   0     00:00:00
   12   1     01:00:16  00:00:00     60.267
   12   2     02:12:47  01:00:16     72.517
   12   3     03:27:14  02:12:47     74.45
   47   0     00:00:00
   47   1     00:43:27  00:00:00     43.45
   47   2     01:32:53  00:43:27     49.433
   47   3     02:25:04  01:32:53     52.183
   83   0     00:00:00
   83   1     00:40:18  00:00:00     40.3
   83   2     01:49:27  00:40:18     69.15
   83   3     03:31:38  01:49:27    102.183

Monday, August 1, 2011

Dates: Grouping date-time values into {n} minute buckets.

Sometimes it is desirable, in a query, to group your date-time values into buckets of a given length. When creating charts or doing statistical analysis, you might want, for instance, all the date-times stripped of their minutes and seconds -- in other words, you want to bucket your values by hour. All the values from 3:00:00pm to 3:59:59pm should simply be treated as 3:00pm values.

For the simple cases, the functions ROUND( ) and TRUNC( ) are very effective. For example, if you only want to look at your date-time values to the nearest hour, you could group by ROUND( dts, 'HH' ). Or if you want to eliminate the time component and keep just the date part, you could group by TRUNC( dts ). The following are some of the common variations (a quick demonstration query follows the list):

Assuming "dts" is your date column, 
the following functions return date values as described:


   TRUNC( dts )                Time component set to 00:00:00.

   ROUND( dts )                Rounded to the nearest day.

   TRUNC( dts, 'MI' )          All seconds set to 00.

   ROUND( dts, 'MI' )          Rounded to the nearest minute.

   TRUNC( dts, 'HH' )          All minutes and seconds set to 00:00.

   ROUND( dts, 'HH' )          Rounded to the nearest hour.

   TRUNC( dts, 'MM' )          Dates adjusted to 1st day of its month.

   TRUNC( dts, 'Q' )           Dates adjusted to 1st day of its quarter.

   TRUNC( dts, 'YYYY' )        Dates adjusted to 1st day of its year.
   

   Note: this is not a complete list of all the ways you can use these functions.
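
For a quick look at what these return, here's a small sketch run against SYSDATE that displays a few of the variations side by side:

/* --- code --- */

   select  to_char( sysdate,               'YYYY-MM-DD HH24:MI:SS' )  as raw_dts
        ,  to_char( trunc( sysdate ),       'YYYY-MM-DD HH24:MI:SS' )  as trunc_day
        ,  to_char( trunc( sysdate, 'HH' ), 'YYYY-MM-DD HH24:MI:SS' )  as trunc_hour
        ,  to_char( round( sysdate, 'MI' ), 'YYYY-MM-DD HH24:MI:SS' )  as round_minute
        ,  to_char( trunc( sysdate, 'MM' ), 'YYYY-MM-DD HH24:MI:SS' )  as first_of_month
     from  dual
   ;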




As an example, you could use code similar to the following to count up the rows in your table grouping them by the hour in which the "dts" value fell:

/* --- code --- */

   select  trunc( dts, 'HH') as hour_buckets
        ,  count(*)
     from  sometable
    group  by trunc( dts, 'HH' )
   ;   



The Problem & The Solution:

The previous functions work quite well until you need to group your date-time values into something like 10-minute buckets or 6-hour buckets. For these requirements it's not quite so easy; but by using the formula given below, you can bucket your data by any number of minutes: 30 minutes, 2,880 minutes (2 days), or even 10,080 minutes (1 week).

Here's the basic formula.
 

   (date '1900-01-01' + floor(round(( dts - date '1900-01-01' )*1440/ m, 12))* m /1440)


  * Replace the "dts" with your date column or value.

  * Replace the two "m" values with the number of minutes 
    by which you wish to bucket the data.

  * The date (date '1900-01-01') is an arbitrary date; it
    just needs to be the same in both places in the formula.
    To keep things neat, it should also have a time
    component of 00:00:00. "Trunc(sysdate)" would work.

  * Note: 1440 is the number of minutes in a day, and the
    12 is just used to round the result to 12 decimal places.


The above formula will then give you the "dts" 
date values bucketed to "m" minute intervals.
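
As a quick worked check (using an assumed sample timestamp of 2011-08-01 14:37:00 and m = 45): 14:37 is 877 minutes past midnight; 877/45 is about 19.49, so FLOOR gives 19 complete 45-minute intervals; and 19*45 = 855 minutes, which is 14:15 -- so the value lands in the 14:15 bucket. The same thing in SQL:

/* --- code --- */

   select  to_char( (date '1900-01-01'
                     + floor(round(( to_date('2011-08-01 14:37:00','YYYY-MM-DD HH24:MI:SS')
                                   - date '1900-01-01' )*1440/ 45, 12))* 45 /1440)
                  , 'YYYY-MM-DD HH24:MI:SS' )  as bucketed_dts
     from  dual
   ;
   /* Returns: 2011-08-01 14:15:00 */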



Example:

Here's a WITH clause that will generate 10,000 random date-time values for the current week:

/* --- code --- */


   WITH
     w_random_dates  as
       ( select  trunc(sysdate,'IW') + 
                 (mod(abs(dbms_random.random),7)) + 
                 (mod(abs(dbms_random.random),86400)/60/60/24) as dts
           from  dual connect by level <= 10000
       )
   select  *
     from  w_random_dates
   ;




Now we can use the formula to modify the "dts" values into bucketed values. Run the following and observe how the dates are recalculated into 45 minute increments.

/* --- code --- */


   /* Include the WITH clause from the previous example here */
   select  dts
        ,  grpd_dts
     from  ( select  dts
                  ,  (date '1900-01-01' + floor(round(( dts - date '1900-01-01' )*1440/ 45, 12))* 45 /1440)  as grpd_dts  
               from  w_random_dates
           )
    order  by  1
   ;



Now observe how you can group by the bucketed date-time values and count up the number that fall into each bucket.

/* --- code --- */


   /* Include the WITH clause from the previous example here */
   select  grpd_dts
        ,  count(*)
     from  ( select  dts
                  ,  (date '1900-01-01' + floor(round(( dts - date '1900-01-01' )*1440/ 45, 12))* 45 /1440)  as grpd_dts  
               from  w_random_dates
           )
    group  by  grpd_dts
    order  by  1
   ;



Explanation:


Here's an example that attempts to explain the logic of the formula. The WITH clause generates a date-time value for every minute of the current date, and it uses last night at midnight (time = 00:00:00) as the arbitrary date "arbdt". This example buckets the date values into 8-minute intervals. The columns show each step of the formula's calculation, with a brief description in the comment. Run this & examine the output.

/* --- code --- */


   WITH 
     w_each_minute_of_day  as
       ( select  trunc(sysdate)+((rownum-1)/1440)  as dts 
              ,  trunc(sysdate)  as arbdt
           from  dual connect by level <= 1440
       )
   select  to_char(dts, 'YYYY-Mon-DD HH24:MI') as "Date"
        ,                     (dts-arbdt)                      as a  /* Decimal part-of-day since midnight (arbitrary day).  */
        ,                     (dts-arbdt)*1440                 as b  /* # of minutes since midnight.  */
        ,                     (dts-arbdt)*1440/8               as c  /* # of 8-minute intervals since midnight.  */
        ,               round((dts-arbdt)*1440/8,12)           as d  /* fix the precision issue, round to 12 decimal places.  */
        ,         floor(round((dts-arbdt)*1440/8,12))          as e  /* Integral # of 8-minute intervals since midnight -- decimal part ignored.  */
        ,         floor(round((dts-arbdt)*1440/8,12))*8        as f  /* convert # of 8-minute intervals back to total minutes.  */
        ,         floor(round((dts-arbdt)*1440/8,12))*8/1440   as g  /* convert back to decimal part-of-day since midnight.  */
        ,  (arbdt+floor(round((dts-arbdt)*1440/8,12))*8/1440)  as h  /* add to midnight (arbitrary day) to get bucketed current dts.  */
     from  w_each_minute_of_day
   ;