Spark is a general-purpose, distributed, high-performance computation engine with APIs in most major languages, including Java, Scala, and Python.
S3 is Amazon's Simple Storage Service, which stores objects in a highly durable and reliable manner at very low cost.
Spark is often used in combination with S3: it reads input data from S3, applies a series of transformations, and finally stores the results in an S3 bucket.
While processing large input data sets and storing the output on S3, I found that Spark is very fast at processing the data but very slow at writing the output to S3. Instead, it is much faster to store the output first on the cluster's local HDFS, and then copy it from HDFS to S3 using s3-dist-cp (Amazon's version of Hadoop's DistCp).
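A minimal sketch of that two-step workflow, assuming a Spark job packaged as a jar; the class name, jar name, and paths below are placeholders, not names from this post:

```shell
# Step 1: run the Spark job, writing its output to HDFS instead of S3.
# (com.example.MyJob, myjob.jar, and the paths are illustrative.)
spark-submit --class com.example.MyJob myjob.jar hdfs:///tmp/job-output

# Step 2: once the job finishes, bulk-copy the output from HDFS to S3.
s3-dist-cp --src hdfs:///tmp/job-output --dest s3://my-bucket/output
```

The point of the split is that the slow, file-by-file S3 writes are replaced by one parallel copy of already-finished files.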
In one case, processing 1 TB of data took about 1.5 hours, but copying the output directly to S3 took about 4 hours. With the approach above, copying the data to S3 took less than 5 minutes, saving a lot of time and money.
The s3-dist-cp command can be run from the master node in either of the following forms:

s3-dist-cp --src /input/data --dest s3://my-bucket/output

hadoop jar s3-dist-cp.jar --src /input/data --dest s3://my-bucket/output
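If you cannot SSH into the master node, the same copy can be submitted remotely as an EMR step. This is a hedged sketch based on the standard EMR command-runner mechanism, not something from this post; the cluster ID and paths are placeholders:

```shell
# Submit s3-dist-cp as an EMR step from any machine with the AWS CLI.
# j-XXXXXXXXXXXXX and the paths are placeholders for your own values.
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
  Type=CUSTOM_JAR,Name=CopyToS3,Jar=command-runner.jar,\
Args=[s3-dist-cp,--src,hdfs:///tmp/job-output,--dest,s3://my-bucket/output]
```

The step runs on the master node on your behalf, so no interactive login is needed.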
One thought on “S3 and Data Transfer issue from Spark”
Thanks for this post. I have a question: is it possible to do the same programmatically? I am running a Scala Spark job on an EMR cluster where I don't have SSH access to the master node. The writes to S3 are greatly slowing down my job. If it's not possible to accomplish this in Scala, do you know if it is in Python?