Hive UDF is a User Defined Function that can be applied to any input field on Hive table. It’s generally used to write a customized function.
Here I am trying to replace newline character (\n) in values of a column.
High Level Steps using Eclipse IDE (Mars 4.5 version)
- Write a Java class and add a method called evaluate()
package com.mycompany;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
public class StripNewline extends UDF {
private Text result = new Text();
public Text evaluate(Text s) {
try {
String rep = s.toString().replaceAll("\\n", "");
result.set(rep);
} catch (Exception e) {
result.set("Exception");
}
return result;
}
}
- Add Dependent external jars
- To compile the class, add the required JARS:
- Download hadoop-core.1.1.2.jar from Maven Repository
- Get hive-exec-0.13.1-amzn-1.jar from /home/hadoop/hive/lib on EMR EC2 machine
- Add these jars on Eclipse IDE using: Right click Project -> Configure Build Path ->Libraries -> Add External JARs (Navigate to where you downloaded)
- Compile and Export as jar
- Right click project -> Export as Java JAR -> Choose a file name for jar
- Copy the jar to EMR EC2 master node
- Using ‘scp’ command from a Linux Terminal or FTP to master node
- Add jar to Hive shell and use the function in Hive query
hive> ADD JAR /home/hadoop/myjar.jar;
hive> create temporary function repl as 'com.mycompany.StripNewline';
hive> select repl(colName) from tableName;
Possible Errors
While adding JAR on Hive CLI prompt, you may get the following error.
hive> create temporary function repl as 'com.mycompany.StripNewline';
java.lang.UnsupportedClassVersionError: com/mycompany/StripNewline : Unsupported major.minor version 52.0
Why?
Hive expected this code to be compiled with Java 6, but you compiled with Java 8. To fix, compile the code with Java 6. On Eclipse, right click on Project -> Properties -> Java Compiler -> Compiler compliance level -> Pick 1.6 from dropdown
P.S. There is already a Hive built-in function to replace a character or string:
regexp_replace(colName, ‘\n’, ”)