Thursday, January 10, 2013

CSV Data: Using a SQL query to dump data directly to a CSV file.

Why do so many people want to dump data into Excel?

Common tools like TOAD or SQL-Developer have an option to dump their query results directly into Excel or a CSV file (or various other formats). This works well for relatively small data sets, but when it gets to hundreds of thousands or millions of rows, people regularly have problems making it work. The following is a solution that I've given out and trusted in these situations.

Basically, we kick off a script via SQL*Plus that runs the query and spools the output directly to a CSV file. The script below is an example of how to make this work. Once you have a query that returns the data you desire, paste it in as the source query at the indicated place in the script.

Notice that this script spools (writes) the results into a file named "Unloaded-Data.csv" -- change this to any file name you like.

You have the option of writing a "Headings" row at the top of the file -- edit that part to include all the column names, and make sure that you enclose each one in both single and double quotes as shown. If you don't want headings, that piece can be deleted.

In your source query it's a good idea to explicitly format dates with the "TO_CHAR( )" function. Use a format that Excel will readily recognize (the one I prefer is 'YYYY-MM-DD HH24:MI:SS', even though different ones are used in the example below). Also, for testing you may wish to include a row limiter in your where clause, such as "and rownum < 50" -- this way you can test the CSV creation process without waiting for millions of rows to be written to the file.
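
For instance, a quick test run against the example script's (made-up) train.xyz_procedure table might look like the sketch below -- remove the rownum line once you're ready for the full unload:

   select  to_char(c.procedure_dts,'YYYY-MM-DD HH24:MI:SS')  as procedure_date_str
        ,  c.procedure_id                                    as procedure_id_no
     from  train.xyz_procedure  c
    where  rownum < 50   -- testing only: stop after 49 rows
   ;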

As you can see at the end of the script, the main query produces each CSV record by concatenating double quotes around, and commas between, the column values of your source query. Edit this part to include all the columns, in the order you prefer, from the source query that you just pasted in.
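
One caveat: standard CSV expects any double quote inside a quoted value to itself be doubled. If one of your text columns can contain double quotes, you could wrap it in a "REPLACE( )" -- here's a tiny sketch (the name value is invented):

   select  '"'|| replace('John "Jack" Smith', '"', '""') ||'"'  as csv_field
     from  dual ;

That returns "John ""Jack"" Smith", which Excel reads back as the single field: John "Jack" Smith.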

prompt ********** UnLoading Data **********

WHENEVER SQLERROR EXIT SQL.SQLCODE

-------------------------------------------------------------------------------------
--  pagesize     0  =  Turns off the default of splitting the results into sections.
--  heading    off  =  Turns off column headings.
--  feedback   off  =  Turns off any feedback. ( eg: "20 rows selected." )
--  trimspool   on  =  Trims off extra spaces at the end of each line.
--  linesize 32767  =  Makes sure that a line (row of data) can be really long.
--  termout    off  =  Do not show the data in the SQL-Plus window (much faster).
--  verify     off  =  Do not show the substitution variable replacement values.
-------------------------------------------------------------------------------------
   set pagesize        0
   set heading       off
   set feedback      off
   set trimspool      on
   set linesize    32767
   set termout       off
   set verify        off



--------------------------------------------------
-- Unload the data to the following CSV file.
--------------------------------------------------
   spool  Unloaded-Data.csv

--------------------------------------------------
-- Write a column "Headers" row to the CSV file.
--------------------------------------------------
   
   ----------------------------------------------------------------------------
   -- List all the column headers that you wish to put in the CSV file. 
   -- They do NOT need to be the same as the column values from the query. 
   -- Make sure each one is enclosed by single and double quotes.
   ----------------------------------------------------------------------------
   select   '"FULL_NAME"'
   ||','||  '"PROCEDURE_DATE"'
   ||','||  '"PROCEDURE_ID"'  
   ||','||  '"ENCOUNTER_ID"'  
   ||','||  '"PROCEDURE_DSC"' 
   ||','||  '"ROOM_ID"'       
   ||','||  '"HEIGHT"'        
   ||','||  '"CHARGE_AMT"'    
   ||','||  '"BIRTHDAY"'      
   ||','||  '"BLOOD_TYPE"'    
   ||','||  '"PHONE"'         
   ||','||  '"INSURANCE_ID"'  
   -------
     as  optional_header_row
   from  dual ;

--------------------------------------------------
-- The source Query in a WITH clause and
-- one CSV formatted column of all the values.
--------------------------------------------------
   with  source_query  as
   ( ----------------------------------------------------------------------------
     -- Source Query goes here. 
     -- This selects and formats the data as you want in the CSV file.
     ----------------------------------------------------------------------------
     select  initcap(p.last_nm)||', '||initcap(p.first_nm)  as full_name_str
          ,  to_char(c.procedure_dts,'YYYY-MM-DD HH:MI am') as procedure_date_str
          ,  c.procedure_id                                 as procedure_id_no
          ,  c.encounter_id                                 as encounter_id_no
          ,  c.procedure_dsc                                as procedure_dsc_str      
          ,  e.room_id                                      as room_id_str
          ,  e.pt_height_inches_val                         as height_no
          ,  e.total_charge_amt                             as charge_amt_no
          ,  to_char(p.birth_dt,'MM/DD/YYYY')               as birthday_str
          ,  p.blood_type_cd                                as blood_type_str
          ,  p.phone_txt                                    as phone_str
          ,  p.insurance_id                                 as insurance_id_no 
       from  train.xyz_procedure  c
             left join
             train.xyz_encounter  e  on ( c.encounter_id = e.encounter_id )
             left join
             train.xyz_patient    p  on ( e.patient_id = p.patient_id )
      where  nvl( e.total_charge_amt, 0 ) < 1000
      order  by  p.last_nm, p.first_nm, c.procedure_dts
     ----------------------------------------------------------------------------
       )
   ----------------------------------------------------------------
   -- List all the columns that you want to include. It encloses 
   -- each value in double quotes and separates them with commas.
   ----------------------------------------------------------------
   select '"'||    full_name_str             ||'"'
   ||','||'"'||    procedure_date_str        ||'"'
   ||','||'"'||    procedure_id_no           ||'"'
   ||','||'"'||    encounter_id_no           ||'"'
   ||','||'"'||    procedure_dsc_str         ||'"'
   ||','||'"'||    room_id_str               ||'"'
   ||','||'"'||    height_no                 ||'"'
   ||','||'"'||    charge_amt_no             ||'"'
   ||','||'"'||    birthday_str              ||'"'
   ||','||'"'||    blood_type_str            ||'"'
   ||','||'"'||    phone_str                 ||'"'
   ||','||'"'||    insurance_id_no           ||'"'  
   ------------ 
     as  csv_data_row
   from  source_query ;



--------------------------------------------------
-- Close the CSV file and terminate the script.
--------------------------------------------------
spool off

set termout on
prompt *********** Finished CSV ***********

exit


Now all you need to do is run the script. One easy way to do this is to create a folder and put two scripts there. For this example, place the previous script in the folder and name it "CSVunload.sql". Create another script called "Go.bat" and put the following contents into it. The one important line of this script is the one that starts with "sqlplus". Edit this line and put in your appropriate 'username' and 'dbsid' (database identifier). Make sure the script name after the second "@" matches your previous SQL script name. Now just double-click on the "Go.bat" file and enter your password (it pauses to ask you for that).

@echo off
echo =============================
echo ==  Unloading Data to CSV  ==
echo =============================
echo.

sqlplus -s  "username@dbsid"  @CSVunload.sql

SET myerror=%errorlevel%
IF /I '%myerror%'=='0' GOTO TagOK
echo.
echo *=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
echo *=*  ERROR !  The Dump to CSV Failed !  *=*
echo *=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
echo.
GOTO TagAllDone
:TagOK
echo.
echo Dump to CSV completed OK.
echo.
:TagAllDone

pause


Remember that millions of rows may take a while to query and write to the file. When it finishes, it should say that it completed OK. Excel will open a CSV file by default (it will look just like a normal Excel file) -- then, if you desire, you can 'Save As' an Excel file type, even though I like CSV files much better.

-----

Here is one more example that will stop and prompt the user for three parameters to use in the query. It asks for an owner search string, a date, and a number. The query then shows you how to use those three parameters to limit the result set. Note that when you prompt for a parameter named "s_date", you use it in the query with two ampersands and a trailing period: "&&s_date."

These substitution variables work by replacing the "&&s_date." in the query with the exact text they contain before the query is submitted to the Oracle engine.
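
For instance, if you entered 2012-06-01 at the date prompt (an invented value, just for illustration), this line from the script below:

   and  last_analyzed > to_date('&&s_anlz_dt.', 'YYYY-MM-DD')

would be submitted to Oracle as:

   and  last_analyzed > to_date('2012-06-01', 'YYYY-MM-DD')

The trailing period simply marks the end of the variable name so that other text can follow it immediately.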

This script is provided without all the comments -- also note that the main query doesn't use a "WITH" clause -- either way should work fine.

prompt
accept  s_own_srch   prompt 'Enter which owners (search string): '

prompt
accept  s_anlz_dt    prompt 'Enter begin date for Last-Analyzed YYYY-MM-DD: '

prompt 
accept  s_row_ct     prompt 'Enter the MIN row count: '

prompt
prompt  ********** UnLoading Data **********
prompt

WHENEVER SQLERROR EXIT SQL.SQLCODE

   set pagesize        0
   set heading       off
   set feedback      off
   set trimspool      on
   set linesize    32767
   set termout       off
   set verify        off
   spool  Unloaded-Data.csv


   select   '"owner"'
   ||','||  '"table_name"'
   ||','||  '"tablespace_name"'
   ||','||  '"last_analyzed"'
   ||','||  '"num_rows"'
   -------
     as  optional_header_row
   from  dual ;


   select '"'||    owner             ||'"'
   ||','||'"'||    table_name        ||'"'
   ||','||'"'||    tablespace_name   ||'"'
   ||','||'"'||    last_analyzed     ||'"'
   ||','||'"'||    num_rows          ||'"'
   ------------ 
     as  csv_data_row
   from  ( -----------------------------------------
           -- Source Query Here
           -----------------------------------------
           select  owner
                ,  table_name
                ,  tablespace_name
                ,  to_char
                     ( last_analyzed
                     , 'YYYY-MM-DD HH24:MI:SS'
                     )  as last_analyzed
                ,  num_rows
             from  all_tables
            where  owner like upper('%'|| '&&s_own_srch.' ||'%')
              and  last_analyzed > to_date('&&s_anlz_dt.', 'YYYY-MM-DD')
              and  num_rows      > &&s_row_ct.
            order  by  owner
                    ,  table_name   
           -----------------------------------------
         )
   ;
   

spool off

set termout on

prompt *********** Finished CSV ***********

exit


Enjoy your day(ta)!

Friday, January 4, 2013

WITH Clause: Sub-queries and Sub-query Factoring (Great for sample or test data!)

The Sub-Query


One of the features of SQL that really makes it sing is its ability to query a query -- or in other words: to make use of sub-queries. I've seen lots of queries with nested sub-queries that could actually be written as a single query with no sub-queries -- here's a quick example:

-----> Example SQL Code Listing-#1:


   select  patient_id, order_id
     from  ( select  patient_id, order_id
               from  ( select  patient_id, admit_dt, order_id
                         from  ( select  patient_id, admit_dt, facility, order_id
                                   from  inpatient_admit
                                  where  diagnosis_cd in ( 257, 386, 1074, 2793 )
                               ) 
                        where  facility = 1389       
                     )
              where  admit_dt >= date '2008-01-01'
                and  admit_dt <  date '2009-01-01'
           )
    where  order_id is not NULL
      and  patient_id between 580000 
                          and 589999
    order  by  patient_id, order_id ;
   
------------------------
-- Written better as:
------------------------
   
   select  patient_id, order_id
     from  inpatient_admit
    where  diagnosis_cd in ( 257, 386, 1074, 2793 )
      and  facility = 1389
      and  admit_dt >= date '2008-01-01'
      and  admit_dt <  date '2009-01-01'
      and  order_id is not NULL
      and  patient_id between 580000 
                          and 589999
    order  by  patient_id, order_id ;



In cases like this, there's no need for a sub-query (let alone nested sub-queries three levels deep). But in many queries they become very useful. They can be used just to make the query more understandable, or they can be completely necessary to the functionality of a query.
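
One classic case where a sub-query is genuinely necessary: Oracle assigns "rownum" before the ORDER BY is applied, so a "top-N" query has to do its ordering inside a sub-query. Here's a quick sketch, reusing the made-up inpatient_admit table from above:

   select  patient_id, admit_dt
     from  ( select  patient_id, admit_dt
               from  inpatient_admit
              order  by  admit_dt desc   -- order first, inside the sub-query
           )
    where  rownum <= 10 ;                -- then keep the 10 most recent admits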

The WITH Clause


One of the nicest uses of sub-queries is in the WITH clause. This is also referred to as "Sub-Query Factoring". The WITH clause allows you to take a sub-query and move it, out of the way, to the top of your query. There you give it a name and refer to it in the main query just as you would refer to a table. This can greatly improve the readability of a query; and where a sub-query is used multiple times in the main query, it can also improve performance -- Oracle can run the WITH-clause query once, store the results in a temporary table, and then reuse them in the main query as needed.

Here's an example of a query that uses the same sub-query twice. It is first used to pull a cohort of records from the "visit" table. It is then used a second time to ensure that the "recommended_by" person of each selected record is not a member of the cohort. In the second version of the query, the sub-query is "factored out" into the WITH clause and given the name "cohort". That WITH-clause sub-query is then used twice in the main query. The result is a simpler, neater main query that should only need to run the sub-query once.

-----> Example SQL Code Listing-#2:


   select  v.start_dt
        ,  v.person_id
        ,  v.person_name
        ,  v.description
        ,  v.recommended_by
     from  visit  v
     join  ( select  person_id
               from  population
              where  category_flg = 3
                and  birth_dt between date '1970-01-01'
                                  and date '1970-12-31'
                and  gender = 'M'
           )  c  on ( v.person_id = c.person_id )
    where  v.recommended_by not in
             ( select  person_id
                 from  population
                where  category_flg = 3
                  and  birth_dt between date '1970-01-01'
                                    and date '1970-12-31'
                  and  gender = 'M'
             )
      and  v.active = 'Y' 
    order  by  v.person_id ;
   
   
------------------------
-- Using the WITH Clause
------------------------
   
   WITH  cohort  as
     ( select  person_id
         from  population
        where  category_flg = 3
          and  birth_dt between date '1970-01-01'
                            and date '1970-12-31'
          and  gender = 'M'
     )
   select  v.start_dt
        ,  v.person_id
        ,  v.person_name
        ,  v.description
        ,  v.recommended_by
     from  visit   v
     join  cohort  c  on ( v.person_id = c.person_id )
    where  v.recommended_by not in
             ( select person_id from cohort )
      and  v.active = 'Y' 
    order  by  v.person_id ;
   



Sample or Test Data


One nice use of sub-queries is that you can quickly create a sample data set that allows you to try out or learn a particular function or SQL technique. In this case, I've created a 5-row "table" so that I can figure out how to use a simple "group by" aggregate query and the "sum( )" function.

Notice that in the FROM clause I select each row of sample data from dual and use a "UNION" to tack them all together. I need to have "column names" assigned to the invented columns, but I only need to include them on the first row. These 5 rows of data, enclosed by parentheses, act just like a table in the query. I can quickly change any value or add new rows to see how it affects my query logic.

-----> Example SQL Code Listing-#3:


   select  usr_type
        ,  sum(credit_amt)  as total_group_credit
     from  (       select  12 as user_id, date '2010-04-28' as start_dt, 'A' as usr_type, 234.76 as credit_amt  from dual
             union select  13           , date '2011-11-11'            , 'B'            ,  78.82                from dual
             union select  15           , date '2004-07-24'            , 'A'            , 138.12                from dual
             union select  16           , date '2012-02-09'            , 'C'            ,  24.47                from dual
             union select  17           , date '2009-12-15'            , 'C'            , 195.08                from dual
           ) 
    group  by  usr_type
    order  by  usr_type ;



In reality, I rarely put sample or test data in the FROM clause. I put it at the beginning, in the WITH clause -- the query then looks like a normal query pulling data from a named table:

-----> Example SQL Code Listing-#4:


   with sample_data as
     (       select  12 as user_id, date '2010-04-28' as start_dt, 'A' as usr_type, 234.76 as credit_amt  from dual
       union select  13           , date '2011-11-11'            , 'B'            ,  78.82                from dual
       union select  15           , date '2004-07-24'            , 'A'            , 138.12                from dual
       union select  16           , date '2012-02-09'            , 'C'            ,  24.47                from dual
       union select  17           , date '2009-12-15'            , 'C'            , 195.08                from dual
     ) 
   select  usr_type
        ,  sum(credit_amt)  as total_group_credit
     from  sample_data
    group  by  usr_type
    order  by  usr_type ;




As a final touch, when I'm creating examples or playing with SQL logic, I often put all the column names at the top, in a first row of data containing NULL values. The "where 1=0" ensures that the NULL-values row doesn't show up in my sample data (but the columns now all have names), and the actual rows of data are all UNIONed in below.

-----> Example SQL Code Listing-#5:


   with  
     sample_data  as
       ( select  NULL  as user_id
              ,  NULL  as start_dt
              ,  NULL  as usr_type
              ,  NULL  as credit_amt    from dual  where 1=0
         ------------
         union select  12, date '2010-04-28', 'A', 234.76  from dual
         union select  13, date '2011-11-11', 'B',  78.82  from dual
         union select  15, date '2004-07-24', 'A', 138.12  from dual
         union select  16, date '2012-02-09', 'C',  24.47  from dual
         union select  17, date '2009-12-15', 'C', 195.08  from dual
         ------------
       ) 
   select  *
     from  sample_data ;



Running the above data sample and query produces output that looks like it came from a table.
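
With SQL*Plus defaults it comes back something like this (the exact date format will follow your NLS settings):

      USER_ID START_DT  U CREDIT_AMT
   ---------- --------- - ----------
           12 28-APR-10 A     234.76
           13 11-NOV-11 B      78.82
           15 24-JUL-04 A     138.12
           16 09-FEB-12 C      24.47
           17 15-DEC-09 C     195.08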


Sub-queries are great. If you can put them in the WITH clause they usually look much nicer. And the WITH clause is a great way to create simple test data or share SQL examples with others.