Memory error in Spark

Spark is a general-purpose cluster-computing framework used to process large datasets. It runs on Standalone, Mesos, and YARN clusters.

When running Spark jobs, you will often encounter an error like the one below.

No space left on device.

Cause:

Spark uses local disk space on worker nodes to store shuffle data and spill files. If the available disk space is not large enough to hold the data being shuffled, the job fails with this error.
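To confirm that shuffle storage is the problem, you can check free space on the directories Spark writes to on a worker node. This is only a sketch: the /mnt and /mnt/yarn paths below are assumed defaults for EMR/YARN nodes, and your cluster may use different locations configured via spark.local.dir or yarn.nodemanager.local-dirs.

# Overall disk usage on the worker node
df -h

# Space used by the assumed shuffle/spill directories on EMR;
# adjust the paths to match spark.local.dir / yarn.nodemanager.local-dirs
du -sh /mnt/yarn 2>/dev/null
du -sh /mnt/* 2>/dev/null | sort -h | tail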

Solution:

I used to get this error on AWS EMR while running Spark jobs. I was able to fix it by increasing the EBS volume size on the core nodes, using the AWS CLI command below. Another solution is to increase the cluster size so that the shuffle files on each node are smaller.

aws emr create-cluster --release-label emr-5.9.0 --use-default-roles \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=d2.xlarge \
    'InstanceGroupType=CORE,InstanceCount=2,InstanceType=d2.xlarge,EbsConfiguration={EbsOptimized=true,EbsBlockDeviceConfigs=[{VolumeSpecification={VolumeType=gp2,SizeInGB=100}},{VolumeSpecification={VolumeType=io1,SizeInGB=100,Iops=100},VolumesPerInstance=4}]}' \
  --auto-terminate
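If you prefer the second option and want to grow an existing cluster instead of creating a new one, the core instance group can be resized with the AWS CLI. The sketch below assumes a running cluster; the j-XXXXXXXXXXXXX and ig-XXXXXXXXXXXXX IDs are placeholders you would look up yourself.

# Find the instance group ID of the CORE group (cluster ID is a placeholder)
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX \
  --query 'Cluster.InstanceGroups[?InstanceGroupType==`CORE`].Id'

# Resize the core group, e.g. to 4 nodes (instance group ID is a placeholder)
aws emr modify-instance-groups \
  --instance-groups InstanceGroupId=ig-XXXXXXXXXXXXX,InstanceCount=4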

Reference:

https://docs.aws.amazon.com/cli/latest/reference/emr/create-cluster.html
