Sqoop is a tool for moving bulk data between an RDBMS such as MySQL, Oracle, or Postgres and HDFS (or AWS S3).
A couple of issues I faced with Sqoop export are summarized below.
Timestamp Format Issue
The HDFS/S3 records carry a date-time value in the format ‘2015-03-03T08:28:47.484Z’. This value is exported to an Oracle column transaction_datetime with data type TIMESTAMP(9).
Sqoop export fails with:
Caused by: java.lang.RuntimeException: Can't parse input data: '2015-03-03T08:28:47.484Z'
Caused by: java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
I could not find a direct way to handle this in Sqoop itself, so I rewrote the data into the Hive table, converting the date-time to the format Oracle expects.
INSERT OVERWRITE TABLE hiveTable
SELECT from_unixtime(unix_timestamp(regexp_replace(transaction_datetime, '[TZ]', ' ')),
                     'yyyy-MM-dd HH:mm:ss') AS transaction_datetime
       -- plus the remaining columns
FROM source_table;  -- source_table is a placeholder for the staging table holding the raw values
Now the value stored in the HDFS files is ‘2015-03-03 08:28:47’, a format Oracle accepts.
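The conversion can be sanity-checked outside Hive with a quick shell equivalent of the regexp_replace/from_unixtime logic (the sed pipeline below is my own sketch, not part of the Sqoop job):

```shell
# Turn an ISO-8601 'T'/'Z' timestamp into Oracle-friendly 'yyyy-MM-dd HH:mm:ss':
# replace T/Z with spaces, drop the milliseconds, trim trailing whitespace.
ts='2015-03-03T08:28:47.484Z'
converted=$(echo "$ts" | sed -e 's/[TZ]/ /g' -e 's/\.[0-9]*//' -e 's/ *$//')
echo "$converted"   # 2015-03-03 08:28:47
```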
Null Columns Issue
Hive stores NULL values on HDFS as the literal string ‘\N’. By default Sqoop exports that literal to Oracle as-is, which is not what we want on the Oracle side.
Adding a couple of options to the Sqoop command makes it recognize NULLs correctly. The full command is shown below.
sqoop export --connect jdbc:oracle:thin:@dbhost:1521/ORCL --username user --password password \
--table Table_In_Oracle \
--columns "col1,col2" \
--export-dir 'src-directory' \
--input-fields-terminated-by ',' --input-lines-terminated-by "\n" \
--input-null-string '\\N' --input-null-non-string '\\N'
After running that command, the column now shows as ‘(null)’ (a real NULL) in Oracle.
That mostly solves the problem, but I was hoping to see an empty value (”) in Oracle instead. Note that Oracle treats empty strings in VARCHAR2 columns as NULL anyway, so the two may turn out to be equivalent. Still trying to find a better option…
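One workaround sketch (the file paths here are made up for illustration) is to blank out the ‘\N’ tokens in a copy of the export files before running Sqoop. Given the point above about empty VARCHAR2 values, this may not change the end result:

```shell
# Hive writes NULL as the two literal characters: backslash + N.
printf 'a,\\N,c\n' > /tmp/part-00000            # sample export file
# Blank out the \N tokens (naive: this would also hit \N inside real data).
sed 's/\\N//g' /tmp/part-00000 > /tmp/part-00000.clean
cat /tmp/part-00000.clean                       # a,,c
```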
Kill a Running Sqoop Job
Sometimes I start a Sqoop export job but want to kill it midway for some reason.
I put the export command in a shell script and run it with nohup:
nohup sqoop_export.sh > /tmp/job.out 2>&1 &
To kill it I get the process id with ‘ps -ef | grep sqoop_export’ (grepping for ‘nohup’ won’t match, since the shell runs the script itself) and then run ‘kill <processid>’.
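That lookup-and-kill can be sketched as below (sqoop_export.sh is the script name assumed above; the bracket trick keeps grep from matching its own process):

```shell
# Find the PID of the wrapper script; '[s]qoop' prevents grep matching itself.
pid=$(ps -ef | grep '[s]qoop_export.sh' | awk '{print $2}' | head -1)
# Kill it only if found. Note: this stops the local client process,
# NOT the MapReduce job already submitted to the cluster.
[ -n "$pid" ] && kill "$pid"
```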
That kills the local process, but here is the problem: if you later try to truncate the table on Oracle, you get an error like the one below, because the MapReduce job on the cluster is still running and holding a lock on the table. I could not find which Oracle session held the lock, and I had to drop the table to solve the problem quickly.
ERROR at line 1:
ORA-00054: resource busy and acquire with NOWAIT specified or timeout expired
The best way to kill the underlying Hadoop job is the ‘hadoop job’ command (on newer releases, ‘mapred job’ or ‘yarn application -kill’ does the same):
$ hadoop job -list
JobId State StartTime UserName Queue Priority UsedContainers RsvdContainers UsedMem RsvdMem NeededMem AM info
job_1452010418661_0082 RUNNING 1452198342153 userid default NORMAL 9 0 15360M 0M 15360M ...
$ hadoop job -kill job_1452010418661_0082
Killed job job_1452010418661_0082
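To avoid copy-pasting the job id by hand, it can be pulled out of the -list output with awk (a sketch; the sample output is hard-coded here so the parsing is visible, and it assumes a single RUNNING job):

```shell
# Sample `hadoop job -list` output, hard-coded for illustration.
list_output='JobId State StartTime UserName
job_1452010418661_0082 RUNNING 1452198342153 userid'
# Grab the id from the first job line.
jobid=$(echo "$list_output" | awk '/^job_/ {print $1; exit}')
echo "$jobid"                      # job_1452010418661_0082
# hadoop job -kill "$jobid"        # run this on the cluster
```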