Tuesday, August 16, 2011

Using SQL-Loader to Load CSV Data Into a Table.

Given a CSV (comma separated values) file of data, it's actually not that hard to use Oracle's "SQL-Loader" to insert it all into a table. The use of "SQL-Loader" requires that you have an Oracle Client properly installed on your machine.

Here's a small set of CSV data -- intentionally made just a little bit messy. Note that some of the fields are enclosed by quote marks and others are not; if a string has a comma within it (such as some of the names below), then it MUST be enclosed by quotes. Notice that the first row of data is actually the list of column names. And take note of the date format.

If you want to run this demo yourself, copy the data and other files below into one single folder on your machine as indicated. You can supply directory paths along with the file-names, but putting everything in one folder eliminates the need for that and makes things a little simpler. This demo was executed on a Windows machine, but it works very much the same on a Unix machine (of course you would use a shell script instead of a batch file).

Create this file as "SourceData.csv":
ID_VAL, JUNK, NUM_VAL, PERSON_NM, THE_DATE
    8  ,yy, 37.251  , "Joe Jones"   , "2009-08-18 13:27:33" 
   ""  ,  zzz  ,  6.471  ,  Sam Smith    , "2010-12-30 16:13:39" 
      ,        ,        ,        ,        
   10  ,1,  0      ,               ,                       
   19  ,2, "0.25"  ,  Bob Benson   , "9999-99-99 99:99:99" 
  "47" , 3 , 13.23   ,  ","          ,  2007-05-02 22:58:02  
  106  , , 44.732  , "Liz Loyd"    , ""                    
 2249  ,"",   ""    , "Nelson, Ned" ,  1997-06-05 04:42:26  
"32154", ,0.001,"fizwig, Ray, M","1913-02-04 18:01:00"
"",abc,"42.1234",Sam b smiTh,"2010-12-30 16:13:39"
"12",~,2.4,",",2007-05-02 22:58:02
,,,,
98, ",",".9732","jAmes, jAn",""



We will need a table in which to load the data. Note that the column names do not need to match the names given in the CSV file. Execute this command in a SQL session -- you don't need to put this in a file on your machine.
 
     Create table scott.loader_demo
       ( id_num      integer
       , some_dts    date
       , name_txt    varchar2(20)
       , extra_col   varchar2(5)
       , val_num     number
       ) ;



Next we will need a control file -- this is the heart of using "SQL-Loader". It defines where the data is coming from and where it is going; it contains all the details of how the load should proceed. In this example we see how to do a basic load of simple data-types. The following control file contains lots of comments -- all these, of course, can be stripped out.

Create this file as "ControlFile.ctl":
------------------------------------------------------------
-- SQL-Loader Control File
------------------------------------------------------------
OPTIONS
   ( skip=1                                    -- Note {op1}
   , errors=10                                 -- Note {op2}
-- , load=100                                  -- Note {op3}
   , direct=true                               -- Note {op4}
   , rows=10000                                -- Note {op5}
-- , bindsize=67108864                         -- Note {op6}
-- , readsize=67108864                         -- Note {op7}
   )
------------------------------------------------------------
UNRECOVERABLE                                  -- Note {L1}
LOAD DATA                                      -- Note {L2}
   INFILE        'SourceData.csv'              -- Note {L3}
   BADFILE       'LoadResults_BAD.log'         -- Note {L4}
   DISCARDFILE   'LoadResults_DISCARDED.log'   -- Note {L5}
   APPEND                                      -- Note {L6}
   INTO TABLE     scott.loader_demo            -- Note {L7}
-- WHEN           id_num <> ''                 -- Note {L8}
-------------------------------------------------------
Fields Terminated    by ","                    -- Note {x1}
Optionally Enclosed  by '"'                    -- Note {x2}
Trailing Nullcols                              -- Note {x3}
------------------------------------------------------------
                                               -- Note {c1}
   ( id_num
   , junk_01     FILLER                        -- Note {c2}
   , val_num
   , name_txt   "initcap(trim(:name_txt))"     -- Note {c3}
   , some_dts   DATE "YYYY-MM-DD HH24:MI:SS"   -- Note {c4}
   )
------------------------------------------------------------
-- {op1}: Skip the first <n> header rows (column names).
-- {op2}: Allow <n> errors before aborting the load.
-- {op3}: Load only this many rows, then stop.
-- {op4}: Direct=true can be much faster; data-blocks
--          are directly built onto the table.
--        Direct=false does conventional inserts;
--          Indexes updated as each row is inserted.
--        Indexes and Constraints affect the
--          performance of either load type.
-- {op5}: Load <n> rows between each commit.
-- {op6}: For conventional loads, setting this to a
--          large value can increase performance.
-- {op7}: Works in conjunction with 'bindsize'.
---------
-- {L1}:  Can speed up the load. If the database fails
--          during the load, loaded data may be lost.
-- {L2}:  Section to specify files and tables
-- {L3}:  The name of the CSV source data file.
-- {L4}:  Error Records will be written to this file.
-- {L5}:  Valid rows not loaded, written to this file.
--          (All-NULLs or rows filtered out by a WHEN)
-- {L6}:  Table load method, choose one of the following:
--          INSERT, APPEND, REPLACE, or TRUNCATE.
-- {L7}:  The name of the destination Oracle table.
-- {L8}:  Load only records that meet a condition.
---------
-- {x1}:  The field terminator ( usually <,> or <|> ).
-- {x2}:  The quote character ( usually <"> ).
-- {x3}:  Missing fields at end of record = NULL.
---------
-- {c1}:  This section is the column list. The number
--          and order of the columns should match your
--          CSV data, not necessarily the target table.
--        The column names in this list should be the
--          column names of your target table.
-- {c2}:  Use key word FILLER to ignore a column of CSV
--          data. The name in this list does not matter.
-- {c3}:  Example using functions on CSV values.
--          Use quotes around functions as shown.
-- {c4}:  Use key word DATE to convert CSV date values.
--          Format string in ("") must match CSV data.
------------------------------------------------------------


A further explanation of the control file is given at the end of this article.

The last thing we need to do is execute the load. The command can be issued from a command prompt, but in this example we put it in a batch file; in Unix you could put the command in a shell-script (a sketch of one follows the batch file below).

In single quotes, you'll need to supply the proper 'user-name / password @ database-identifier'. The last two arguments point the load at your newly created control file and name the results log file. If you have a properly installed Oracle Client, double-clicking on this batch file will execute the load.

If you don't want to put the password in the batch file, just leave it out (along with the preceding slash) and you will be prompted for it at run-time (e.g. 'scott @ DBSID').

Create this file as "GoLoad.bat": (Edit the Log-In information)
@echo off
echo.
sqlldr  'scott / tiger @ DBSID'  control='ControlFile.ctl'   log='LoadResults.log'
pause
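
For the Unix case mentioned above, a minimal shell-script equivalent might look like this (a sketch; it assumes sqlldr is on your PATH and that the control and data files sit in the current directory):

#!/bin/sh
# Same load as GoLoad.bat, run from a Unix shell.
sqlldr scott/tiger@DBSID control=ControlFile.ctl log=LoadResults.log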



When the load finishes, you can query the table to confirm that the specified data has been inserted:
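
For example, using the table and columns created above:

     Select id_num, val_num, name_txt, some_dts, extra_col
       From scott.loader_demo
      Order by id_num ;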



Don't forget to examine the three 'LoadResults' log files. They contain information about the load process, including which records didn't load and why.





Another interesting detail is that you can combine your data and the control file, as shown in the following example. Here the "INFILE" is specified as "*", and you need to include "BEGINDATA" just prior to the lines of CSV data. This is executed in the same way as the previous example.

----------------------------------------------------------------------
-- SQL-Loader: A Control File that Includes the DATA.
----------------------------------------------------------------------
OPTIONS 
   ( skip=1
   , rows=10000
   , errors=10 
   )
----------------------------------------------------------------------
LOAD DATA
   INFILE        *
   BADFILE       'LoadResults_BAD.log'            
   DISCARDFILE   'LoadResults_DISCARDED.log'      
   APPEND                                         
   INTO TABLE     scott.loader_demo                               
----------------------------------------------------------------------
Fields Terminated    by ","                               
Optionally Enclosed  by '"'                               
Trailing Nullcols                                       
----------------------------------------------------------------------
   ( id_num
   , junk_01    FILLER
   , val_num  
   , name_txt   "initcap(trim(:name_txt))" 
   , some_dts   DATE "YYYY-Mon-DD HH24:MI" 
   )
----------------------------------------------------------------------
BEGINDATA
"ID_NUM","junk","A_VALUE","FULL_NAME","THE_DATE"
8,"a",37.251,"Joe jones","2009-Aug-18 13:27"
6,b,23.9,Ray Rulon,1984-Jan-23 02:37 
13  ,  c  ,  0.2  ,  Elie Ellison  , 1998-Feb-13 13:28
,d,6.471,sam SmItH,2010-Dec-30 16:13
47,"e",13.23,"","2007-May-02 22:58"    
993,"bad-date",274.51,"Bad, Bob","01/01/2001"
421,"f",24.5,"   Sanders, slY    ","2003-Sep-12 23:59" 
2,"g",1432.8,"Jan Jepson","2005-Jul-05 16:02" 
106,"h","44.732","Liz Loyd",""               
,,,,
1,,,,
27,"i",1.567,"bOb bEnnie bEnson","2000-Mar-30 10:44" 
7,,22.21,"Mel morris","1978-Apr-01 04:10" 
2249,"j",,"Nelson, ned","1997-Jun-05 04:42"
0,"",0,"Empty",






Additional notes on the control file:

The first section of the control-file is the "OPTIONS" -- here you set a few details of the load process. The "skip" option allows you to skip as many rows as you wish at the beginning of the CSV file. Normally it is used to skip one row of column names if they exist. If there are no column names, just specify "skip=0".

The "rows" option allows you to specify how many rows will be loaded between each "Commit". For most circumstances, something like "rows=10000" would work just fine.

The "errors" option allows you to specify how many errors will be tolerated before the whole load process is aborted. If your data is very bad and you just want to load the rows that will load, you could set this to a high value.

The "load" parameter is usually included while testing a load; it tells the process to load a specific number of rows and then quit -- even if the CSV file has many more rows of data.

The next section "LOAD DATA" is where you define the file-names (in single quotes) and the destination table name (not in quotes).

The "INFILE" is your CSV data file that contains the records you wish to load.

The "BADFILE" will contain all the records from the load process that 'errored-off' for some reason.

The "DISCARDFILE" will contain all the records that didn't load because all the columns were NULL or were disqualified by the "WHEN" clause.

The keyword "APPEND" will insert the new records into the table leaving existing records in the table. If you specify "TRUNCATE" instead of "APPEND" then the table will be truncated prior to loading the new records.

You can include an optional "WHEN" clause with a condition that controls which CSV records will be loaded. Some possible examples include: id_num <> '' , id_num = '8' . Remember that this is not a "WHERE" clause and you can't use Oracle functions here -- it is actually quite limited.
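
For example, un-commenting the WHEN line in the control file above loads only the records whose first field is not blank; anything filtered out is written to the discard file:

   INTO TABLE     scott.loader_demo
   WHEN           id_num <> ''        -- load only records with a non-blank id_num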

The next section defines the delimiter (usually a comma) and the quote marks that can be used to enclose a value. "Trailing Nullcols" tells the loader to interpret missing values at the end of records as NULLs.
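
If, for instance, your source file were pipe-delimited instead of comma-delimited, only the terminator line would change:

   Fields Terminated    by "|"
   Optionally Enclosed  by '"'
   Trailing Nullcols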

The last and very important section of the control file is the column list. Make sure that the order and number of columns match your CSV data -- and that the column names match your destination table (except for "FILLER" columns). So, the position in your control-list tells SQL-Loader which column in the CSV file to get the value from, and its name in your control-list tells SQL-Loader which column in your destination table the value will be placed in.

If you have a CSV column that you wish to ignore, give it any name (junk, dummy, etc.) and put the keyword "FILLER" immediately after. This tells SQL-Loader to ignore that CSV column of data.

Notice that you do not need to insert data into every column in your destination table. Table columns not listed in the control-list will be left NULL.

Date values should be followed with the "DATE" keyword and the appropriate date format string. CSV records that have dates not matching this string will be rejected.

It is also possible to manipulate the data during the load process. Notice that you can use Oracle functions on the values by enclosing the expression in double quotes as shown. Make sure you repeat the column name preceded by a colon inside the function expression.


A few more possibilities in manipulating the data:

 
 Review & Going further... 
 
 In your column list, remember that the number and position of the 
 entries should match your CSV data file.
                                                      
    ( CSV_Col_1    
    , CSV_Col_2    
    , CSV_Col_3  
    , CSV_Col_4  
    , CSV_Col_5  
    )  
 
 Mark the CSV columns that you won't use with the "FILLER" keyword.
 The column names of unused filler data do not matter.
    
    ( CSV_Col_1   FILLER 
    , CSV_Col_2   FILLER 
    , CSV_Col_3  
    , CSV_Col_4  
    , CSV_Col_5  
    )
 
 The columns that you will use need to be labeled with the name of the
 table column where they will be loaded. 
 
    ( CSV_Col_1   FILLER 
    , CSV_Col_2   FILLER 
    , val_num  
    , name_txt  
    , some_dts  
    )
 
 
 Manipulating the CSV Values! 
 
 You can then add modifiers to alter the data. DATE values will
 need to have the format string included after the "DATE" keyword.
 To use incoming CSV values in an expression with Oracle functions,
 you refer to them with the name of their destination column as a 
 bind variable (ex ":name_txt"). With that as the column-name you
 then use Oracle functions and enclose the whole expression in 
 double-quotes as shown on the name_txt line.
 
    ( CSV_Col_1   FILLER 
    , CSV_Col_2   FILLER 
    , val_num  
    , name_txt    "initcap(trim( :name_txt ))"
    , some_dts    DATE "YYYY-MM-DD HH24:MI:SS" 
    )
 
 
 Date Manipulation!
 
 There are a couple of extra things you can do with a date column.
 Instead of using the DATE keyword, you can use the "to_date" function.
 You can even add more functions to manipulate the date as shown on
 the third row below (this would truncate the date and set it to the 
 last day of the previous month). Notice the fourth row shows how to
 assign the SYSDATE to a date column -- incoming CSV values are ignored.
    
    , some_dts    DATE "YYYY-MM-DD HH24:MI:SS" 
    , some_dts    "to_date( :some_dts ,'YYYY-MM-DD HH24:MI:SS')" 
    , some_dts    "trunc(to_date( :some_dts ,'YYYY-MM-DD HH24:MI:SS'),'MM')-1" 
    , some_dts    "sysdate"
   

 Using Constants!

 You can also put a constant number in a numeric field or a constant
 string value in a Varchar2 field -- ignoring the CSV values.   
 Enclose numeric constants in double-quotes; enclose string constants
 in single-quotes then double-quotes.   
 
    ( CSV_Col_1   FILLER 
    , CSV_Col_2   FILLER 
    , val_num     "1234"
    , name_txt    "'Bill Bagnnn'"
    , some_dts    DATE "YYYY-MM-DD HH24:MI:SS" 
    )

 Adding a Sequence!
 
 If you have an extra column in your table where you would like to put
 a sequence number, then put that column name at the end of your list
 and the keyword "SEQUENCE". Actually it appears that you can put it 
 anywhere in your list: first, in-between, or last. It also does not have
 to be a numeric column. The two values in parentheses are the starting
 sequence number and the increment value. If you are appending to
 a table instead of truncate-and-loading, then you can use "SEQUENCE(MAX,1)".
 That will find the maximum sequence value already in your table and then
 increment from there.
 
    ( id_num      
    , CSV_Col_2   FILLER 
    , val_num  
    , name_txt    "initcap(trim( :name_txt ))"
    , some_dts    DATE "YYYY-MM-DD HH24:MI:SS" 
    , extra_col   SEQUENCE(1,1)
    )
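
 And the appending variant mentioned above would simply swap that last line:

    , extra_col   SEQUENCE(MAX,1)   -- continue from the current maximum value in the table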

 That's it!


5 comments:

  1. Thank you. This page is a thing of beauty.

  2. Hi Folks,

    I have a data file with fields separated by comma (,) and enclosed by quotations. But inside the quoted
    string there are again commas.

    Example

    "A Street","T/A, slow","123 4567 8901"

    How to load such data? Please help.

    control file looks like this.

    LOAD DATA
    APPEND
    INTO TABLE CM_A
    FIELDS TERMINATED BY ","
    OPTIONALLY ENCLOSED BY '"'
    TRAILING NULLCOLS
    (
    STREET CHAR "RTRIM(:STREET)",
    NAME CHAR "RTRIM(:NAME)",
    PHONE CHAR "RTRIM(:PHONE)"
    )

    Please help

  3. Notice in the example in my post... there is a comma in a couple of the name values. The way I've shown it deals with these commas appropriately.

  4. This was a great help, thanks a lot!

  5. This is a great example with a complete description. Thanks a lot Steve. Very happy to learn from this blog. Please post many topics like this with complete examples. It's awesome. Thanks a lot.
