Spark | My Big Data Blog

Spark is a general purpose distributed high performance computation engine that has APIs in many major languages like Java, Scala, Python.

S3 is Amazon Simple Storage Service for storing objects in a highly durable and reliable manner at very low cost.

Spark is used in combination with S3 for reading input and saving output data.

Spark can apply many transformations on input data, and finally store the data in some bucket on S3.

While processing large input data and storing output data on S3, I found that it is very fast in processing the data, but it’s very slow in writing the output data to S3.

Instead, I found that it’s very fast storing the data first on local HDFS (on Hadoop cluster), and then copy the data back to S3 from HDFS using s3-dist-cp (Amazon version of Hadoop’s distcp).

In one of the cases, to process data of 1TB, it took about 1.5 hrs to process, but about 4 hours to copy the output data to S3.
But with the above solution, it just took less than 5 min to copy the data to S3, saving lot of time and money.

s3-dist-cp command can be run from master node using the format below.

s3-dist-cp --src /input/data --dest s3://my-bucket/output

OR

hadoop jar s3-dist-cp.jar --src /input/data --dest s3://my-bucket/output

	Hadoop training in H… on Cloudera Hadoop Developer…
	putnam120 on S3 and Data Transfer issue fro…
	Test Bank Mathematic… on AWS Solutions Architect…
	yeskay on Some Hadoop Questions
	mathivanan on Some Hadoop Questions

My Big Data Blog

My notes on Hadoop, Cloud, and other BigData technologies

Tag Archives: Spark

S3 and Data Transfer issue from Spark