First Ansible Playbook

Ansible is a configuration management tool used by DevOps teams to automate software installation and configuration on any machine (node).

The following are the steps to run your first playbook (shown for Mac). A playbook is just a sequence of tasks. This example runs the playbook on your local host.

  1. Install Ansible
brew install ansible
  2. Open a file editor, type the following, and save it as myplaybook.yml
---
- name: "First Ansible playbook"
  hosts: localhost
  connection: local
  tasks:

    - name: "ls command"
      shell: "ls"
      register: "output"

    - debug: var=output.stdout_lines
  3. Run the playbook with the following command.
ansible-playbook myplaybook.yml

The output is shown below.

PLAY [First Ansible playbook] **********************************************************************************

TASK [Gathering Facts] *****************************************************************************************
ok: [localhost]

TASK [ls command] **********************************************************************************************
changed: [localhost]

TASK [debug] ***************************************************************************************************
ok: [localhost] => {
    "output.stdout_lines": [
        "myplaybook.yml"
    ]
}

PLAY RECAP *****************************************************************************************************
localhost                  : ok=3    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
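
If you want to validate the playbook before running it, or see more detail while it runs, ansible-playbook has a syntax check and a verbose flag built in. For example:

ansible-playbook myplaybook.yml --syntax-check
ansible-playbook myplaybook.yml -v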

Data Engineer Interview at Amazon

Recently I had an opportunity to interview for a Data Engineer position at Amazon. I'd like to share the process and my experience to help anyone who needs it.

Process

After I applied online, I got a call from a recruiter asking about my availability for a telephonic interview with the Hiring Manager. The telephonic interview had questions and discussion around my past experience, plus a few questions about complex situations I faced and how I solved them.

Once I cleared that round, I had an onsite interview consisting of five 50-minute rounds. Each round was divided into two sections: a problem-solving exercise and a few behavioral questions.

Problem solving focused on SQL, data modeling, and ETL pipeline design.

Behavioral questions were based on the Amazon leadership principles, for example: tell me about a situation where you tried to do something but had to change course midway, and how you communicated that to customers and team members.

My experience

The recruiter was good and guided me through the entire process as needed.

Onsite travel plans and communication were good.

The leadership questions were annoying, as many of them were repeated and I ran out of examples for the same questions asked again and again.

I was surprised to see no questions on AWS or Hadoop technologies even though they are mentioned in the job requirements.

What is training and model in Machine Learning

Machine Learning is about predicting future behavior based on past data. The prediction is done by machines (computers), hence the name Machine Learning. In recent times it is often grouped together with Data Science.

What is a model?

A model is an algorithm designed to draw some conclusions based on past data.

Example

Based on historical payment data, the same behavior can be predicted for a new person.

A person pays monthly rent, credit card, and mortgage payments on time => another person who pays rent and credit card payments on time is likely to pay mortgage payments on time too.

What is training a model?

Training a model is designing (computing) an algorithm based on some training data (sample data used to train).

 

P.S.: There is a lot to learn in Machine Learning. These two terms always confused me, and it took me a long time to understand them.

 

Sqoop Export a Long Text Field

I have a text field on HDFS that can have a very long value, more than 4000 characters in length.

I had to export this to Oracle using Sqoop. In the Oracle table, this field is defined with the data type VARCHAR2(4000 BYTE).

I got the following error when Sqoop’ing.

Caused by: java.sql.SQLException: ORA-01461: can bind a LONG value only for insert into a LONG column

Why?

Oracle treats a value longer than the defined limit for that field as a LONG, hence the error. The error message is not very informative though.

Solution:

Use the CLOB data type for that field in Oracle. A CLOB can store values longer than 4000 characters.
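
For illustration, the target table could be defined along these lines (a minimal sketch; the col1 and col2 types are made-up placeholders, and only col3 as CLOB matters here):

CREATE TABLE Oracle_Table_Name (
   col1 VARCHAR2(100),
   col2 VARCHAR2(100),
   col3 CLOB
);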

Don’t forget to add the --map-column-java option to the Sqoop export command. As there is no CLOB type in Java/Hive, Sqoop has to be told to treat this value as a Java String. The full command is shown below.

sqoop export --connect jdbc:oracle:thin:@hostname:1521/sid \
   --username user --password passwd \ 
   --table Oracle_Table_Name \
   --columns col1,col2,col3 \
   --map-column-java col3=String \
   --export-dir 'location' \
   --input-fields-terminated-by '\001' --input-lines-terminated-by "\n" \
   --input-null-string '\\N' --input-null-non-string '\\N'
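
After the export, one quick way to confirm the long values made it over intact is to check their lengths on the Oracle side, for example (using the same placeholder table and column names as above):

SELECT col1, DBMS_LOB.GETLENGTH(col3) AS col3_length
FROM Oracle_Table_Name
WHERE DBMS_LOB.GETLENGTH(col3) > 4000;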

Hive UDF on AWS EMR

A Hive UDF is a User Defined Function that can be applied to any input field of a Hive table. It’s generally used to implement custom logic that the built-in functions don’t cover.

Here I am trying to strip the newline character (\n) from the values of a column.

High Level Steps using Eclipse IDE (Mars 4.5 version)

  • Write a Java class and add a method called evaluate()
package com.mycompany;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class StripNewline extends UDF {
   private Text result = new Text();

   public Text evaluate(Text s) {
      try {
         String rep = s.toString().replaceAll("\\n", "");
         result.set(rep);
      } catch (Exception e) {
         result.set("Exception");
      }
      return result;
   }
}
  • Add dependent external jars
    • To compile the class, add the required JARs:
    • Download hadoop-core-1.1.2.jar from the Maven Repository
    • Get hive-exec-0.13.1-amzn-1.jar from /home/hadoop/hive/lib on the EMR EC2 machine
    • Add these jars in the Eclipse IDE: Right-click Project -> Configure Build Path -> Libraries -> Add External JARs (navigate to where you downloaded them)
  • Compile and Export as jar
    • Right click project -> Export as Java JAR -> Choose a file name for jar
  • Copy the jar to EMR EC2 master node
    • Use the ‘scp’ command from a Linux terminal, or FTP the jar to the master node
  • Add jar to Hive shell and use the function in Hive query
    • on Hive CLI prompt:
hive> ADD JAR /home/hadoop/myjar.jar;
hive> create temporary function repl as 'com.mycompany.StripNewline';
hive> select repl(colName) from tableName;

Possible Errors

When creating the temporary function at the Hive CLI prompt, you may get the following error.

hive> create temporary function repl as 'com.mycompany.StripNewline';
java.lang.UnsupportedClassVersionError: com/mycompany/StripNewline : Unsupported major.minor version 52.0

Why?

The JVM that Hive runs on is older than Java 8 (class file version 52.0 is Java 8), so it cannot load a class compiled for Java 8; here it expected Java 6, but the code was compiled with Java 8. To fix, compile the code targeting Java 6. In Eclipse: right-click the Project -> Properties -> Java Compiler -> Compiler compliance level -> pick 1.6 from the dropdown.
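
If you prefer to compile outside Eclipse, the command-line equivalent would look roughly like this (a sketch; the jar names assume the versions mentioned above and that both jars are in the current directory):

javac -source 1.6 -target 1.6 \
   -cp hive-exec-0.13.1-amzn-1.jar:hadoop-core-1.1.2.jar \
   com/mycompany/StripNewline.java
jar cf myjar.jar com/mycompany/StripNewline.class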

P.S. There is already a Hive built-in function to replace a character or string:

regexp_replace(colName, '\n', '')
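
With that built-in function, the whole UDF above can be replaced by a plain query (using the same placeholder table and column names as before):

hive> select regexp_replace(colName, '\n', '') from tableName;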

Cloudera Hadoop Developer Certification

How I cleared the exam

  • The test had 52 questions, including some MapReduce coding questions
  • I took about 2 months to prepare
  • I read Hadoop: The Definitive Guide
  • I also kept reading some blogs online
  • There are also some quiz questions on Dattamsha.com.
  • I also practiced some MapReduce programs

Key Points:

  • Read the Tom White book and understand the concepts
  • Understand the MapReduce and some Java code
  • Read Sqoop commands to import and export data