Linux | My Big Data Blog

I came across a scenario where in I had to parse a file that has many INSERT hql commands in a single file and I had to create a separate file for each INSERT command.

My input file insert_all.hql is as below.

INSERT INTO TABLE table1
SELECT
 col1,
 col2,
 col3
FROM table11
WHERE date = '2017-07-12'
;

INSERT INTO TABLE table2
SELECT
 col1,
 col2,
 col3
FROM table22
WHERE date = '2017-07-12'
;

INSERT INTO TABLE table3
SELECT
 col1,
 col2,
 col3
FROM table33
WHERE date = '2017-07-12'

You can do this manually by creating each file and copy the required code from original file to target file for each table. But it’s error-prone and tedious.

I wrote a little script using Linux/Unix tool called awk. awk is very good text processing tool that can be used for this kind of text parsing tasks.

awk '
> BEGIN {RS=";"}
> { content=$0;
> if(content ~ /INSERT INTO TABLE ([A-Za-z].*)/) filename=$4;
> print content > filename;
> close filename;
> }
>' insert_all.hql

Explanation

In the BEGIN section, RS (Record separator) is initialized as semicolon. The default is newline, but this doesn’t work for us here.
Next section, I am copying the content of that record to a variable.
Next, I am extracting table name using REGEX. filename is table name
Next, I am copying the content into the file
This will generate files like table1, table2, etc.

You can run this code from a terminal on Mac or a Linux machine.

	Hadoop training in H… on Cloudera Hadoop Developer…
	putnam120 on S3 and Data Transfer issue fro…
	Test Bank Mathematic… on AWS Solutions Architect…
	yeskay on Some Hadoop Questions
	mathivanan on Some Hadoop Questions

My Big Data Blog

My notes on Hadoop, Cloud, and other BigData technologies

Tag Archives: Linux

Text Processing using AWK

Explanation