Monday, August 1, 2011

Dates: Grouping date-time values into {n} minute buckets.

Sometimes it becomes desirable, in a query, to group up your date values into buckets of a given time-length. When creating charts or doing statistical analysis, you sometimes want, for instance, all the date-times to not include the minutes and seconds -- in other words, you want to bucket-up your values by hour. All the values from 3:00pm to 3:59:59pm should just be considered as 3:00pm values.

For some of the simple ones, the functions ROUND( ) and TRUNC( ) are very effective. For example, if you only want to look at your date-time values to the nearest hour you could group-by the ROUND(dts, 'HH' ). Or if you want to eliminate the time component of the dates and just use the date part you could group-by TRUNC(dts). The following are some of the common examples:

Assuming "dts" is your date column, 
the following functions return date values as described:


   TRUNC( dts )                Time component set to 00:00:00.

   ROUND( dts )                Rounded to the nearest day.

   TRUNC( dts, 'MI' )          All seconds set to 00.

   ROUND( dts, 'MI' )          Rounded to the nearest minute.

   TRUNC( dts, 'HH' )          All minutes and seconds set to 00:00.

   ROUND( dts, 'HH' )          Rounded to the nearest hour.

   TRUNC( dts, 'MM' )          Dates adjusted to 1st day of its month.

   TRUNC( dts, 'Q' )           Dates adjusted to 1st day of its quarter.

   TRUNC( dts, 'YYYY' )        Dates adjusted to 1st day of its year.
   

   Note: this is not a complete list of all the ways you can use these functions.




As an example, you could use code similar to the following to count up the rows in your table grouping them by the hour in which the "dts" value fell:

/* --- code --- */

   select  trunc( dts, 'HH') as hour_buckets
        ,  count(*)
     from  sometable
    group  by trunc( dts, 'HH' )
   ;   



The Problem & The Solution:

The previous functions work quite well until you need to group up your date-time values by something like 10 minute buckets, or 6 hour buckets. For these requirements it's not quite so easy; but by using the formula given below, you can bucket your data by any number of minutes: 30 minutes, 2880 minutes (2days), or even 604,800 minutes (1 week).

Here's the basic formula.
 

   (date '1900-01-01' + floor(round(( dts - date '1900-01-01' )*1440/ m, 12))* m /1440)


  * Replace the "dts" with your date column or value.

  * Replace the two "m" values with the number of minutes 
    by which you wish to bucket the data.

  * The date (date '1900-01-01') is any arbitrary date, it
    just needs to be the same in both places in the formula.
    To keep things neat, it should also have a time
    component of 00:00:00. "Trunc(sysdate)" would work.

  * Note: 1440 is the number of minutes in a day, and the
    12 is just used to round the result to 12 decimal places.


The above formula will then give you the "dts" 
date values bucketed to "m" minute intervals.



Example:

Here's a WITH clause that will generate 10,000 random date-time values for the current week:

/* --- code --- */


   WITH
     w_random_dates  as
       ( select  trunc(sysdate,'IW') + 
                 (mod(abs(dbms_random.random),7)) + 
                 (mod(abs(dbms_random.random),86400)/60/60/24) as dts
           from  dual connect by level <= 10000
       )
   select  *
     from  w_random_dates
   ;




Now we can use the formula to modify the "dts" values into bucketed values. Run the following and observe how the dates are recalculated into 45 minute increments.

/* --- code --- */


   /* Include the WITH clause from the previous example here */
   select  dts
        ,  grpd_dts
     from  ( select  dts
                  ,  (date '1900-01-01' + floor(round(( dts - date '1900-01-01' )*1440/ 45, 12))* 45 /1440)  as grpd_dts  
               from  w_random_dates
           )
    order  by  1
   ;



Now observe how you can group by the bucketed date-time values and count up the number that fall into each bucket.

/* --- code --- */


   /* Include the WITH clause from the previous example here */
   select  grpd_dts
        ,  count(*)
     from  ( select  dts
                  ,  (date '1900-01-01' + floor(round(( dts - date '1900-01-01' )*1440/ 45, 12))* 45 /1440)  as grpd_dts  
               from  w_random_dates
           )
    group  by  grpd_dts
    order  by  1
   ;



Explanation:


Here's an example that attemptes to explain the logic of the formula. The WITH clause generates a date-time value for every minute of the current date; and it uses last night at midnight (time = 00:00:00) for the arbitrary date "arbdt". This example buckets up the date values into 8 minute intervals. The columns show each step in the formula's calculation with a brief description in the comment. Run this & examine the output.

/* --- code --- */


   WITH 
     w_each_minute_of_day  as
       ( select  trunc(sysdate)+((rownum-1)/1440)  as dts 
              ,  trunc(sysdate)  as arbdt
           from  dual connect by level <= 1440
       )
   select  to_char(dts, 'YYYY-Mon-DD HH24:MI') as "Date"
        ,                     (dts-arbdt)                      as a  /* Decimal part-of-day since midnight (arbitrary day).  */
        ,                     (dts-arbdt)*1440                 as b  /* # of minutes since midnight.  */
        ,                     (dts-arbdt)*1440/8               as c  /* # of 8-minute intervals since midnight.  */
        ,               round((dts-arbdt)*1440/8,12)           as d  /* fix the precision issue, round to 12 decimal places.  */
        ,         floor(round((dts-arbdt)*1440/8,12))          as e  /* Integral # of 8-minute intervals since midnight -- decimal part ignored.  */
        ,         floor(round((dts-arbdt)*1440/8,12))*8        as f  /* convert # of 8-minute intervals back to total minutes.  */
        ,         floor(round((dts-arbdt)*1440/8,12))*8/1440   as g  /* convert back to decimal part-of-day since midnight.  */
        ,  (arbdt+floor(round((dts-arbdt)*1440/8,12))*8/1440)  as h  /* add to midnight (arbitrary day) to get bucketed current dts.  */
     from  w_each_minute_of_day
   ;
   

2 comments:

  1. Here's what I think is simpler "bucketizer"

    TRUNC( myDate, 'HH' ) + ( ( Round( ( TO_NUMBER( to_char( MyDate, 'MI' ) ) / mb ), 0 ) * mb ) / 1440 )

    mtDate is your DATE value or column. mb is the number of minutes that you want in each bucket.

    Moreover, this method would implement "Rounding" meaning that values will go into a lower or higher bucket depending on where they pass the rounding threshold. For example, if I want 15 minute buckets, 1:07 would go into the 1:00 bucket but, 1:08 would go into the 1:15 bucket.

    ReplyDelete
    Replies
    1. Interesting! I used the "floor" function -- by changing it to "round" in my solution, you can get the same rounding effect as in your solution. I guess it depends on how you want to do the bucketing. Thanks for pointing that out.

      The main problem with your solution (as I see it) is that if you choose to bucket by something like 8 minutes, you get mostly 8-minute-wide buckets with a 4-minute-wide bucket at the top of each hour. By only bucketing by values that divide evenly into an hour (2, 4, 6, 10, 12, 15, 20, 30) you can avoid that issue.

      Try the following -- it compares my original solution and yours. You can see that by bucketing by 8 minute intervals, your solution produces the hourly 4-minute-wide intervals. I've also changed my solution to "round" for comparison with yours.


      WITH
      w_param as ( select 8 as m from dual )
      , w_random_dates as
      ( select date '2000-01-01' + ((level-1)/1440 /* 1440 86400 */ ) as dts
      from dual connect by level <= 1000000
      )
      , w_compare as
      ( select dts
      , date'2000-01-01' + round(round((dts - date'2000-01-01')*1440/m, 12))/1440*m as mine
      , trunc(dts, 'HH') + round(to_number(to_char(dts, 'MI'))/m)*m/1440 as yours
      from w_random_dates join w_param on 1=1
      )
      select dts
      , mine
      , yours
      , decode( mine, yours, 'ok', null ) as flg
      from w_compare
      where 1=1
      --and mine != yours
      ;

      My solution buckets to a consistent 8-minute-wide buckets across all hours and days. This is also true even with buckets like 7 or 11 minutes.

      Thanks for the input.

      Delete