Fixing the Spark "no space left on device" error
Running code with a lot of join operations threw ERROR: no space left on device.
A Stack Overflow search suggested changing spark.local.dir.
It turned out to be one of the Spark application properties:
spark.local.dir
Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks. NOTE: In Spark 1.0 and later this will be overridden by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) environment variables set by the cluster manager.
Even after reading the definition, I didn't understand why spark.local.dir needed to be changed, so I read the Spark documentation:
"Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files are preserved until the corresponding RDDs are no longer used and are garbage collected. This is done so the shuffle files don't need to be re-created if the lineage is re-computed. Garbage collection may happen only after a long period of time, if the application retains references to these RDDs or if GC does not kick in frequently. This means that long-running Spark jobs may consume a large amount of disk space. The temporary storage directory is specified by the spark.local.dir configuration parameter when configuring the Spark context."
So when a shuffle happens, Spark generates intermediate files, and spark.local.dir is the path where those files get stored.
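As a quick sanity check, the directory the driver is actually using can be read back from the SparkConf. A minimal sketch, assuming a spark-shell session where `spark` is already defined; /tmp is the documented default:

```scala
// Read back the scratch-space setting from the running session.
// spark.local.dir defaults to /tmp when not set explicitly.
val localDir = spark.sparkContext.getConf.get("spark.local.dir", "/tmp")
println(s"Shuffle/scratch files are written under: $localDir")
```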
"Operations which can cause a shuffle include repartition operations like repartition and coalesce, ‘ByKey operations (except for counting) like groupByKey and reduceByKey, and join operations like cogroup and join."
In other words, operations that can cause a shuffle include repartition operations such as repartition and coalesce, ByKey operations such as groupByKey and reduceByKey, and join operations such as cogroup and join.
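For illustration, here is a minimal sketch of those shuffle-inducing operations (made-up data, runnable in spark-shell); each one writes intermediate files under spark.local.dir:

```scala
// Each of these operations triggers a shuffle and therefore
// produces intermediate files under spark.local.dir.
val pairs = spark.sparkContext.parallelize(Seq((1, 10), (2, 20), (1, 30)))
val other = spark.sparkContext.parallelize(Seq((1, "a"), (2, "b")))

val re  = pairs.repartition(8)       // repartition operation
val grp = pairs.groupByKey()         // *ByKey operation
val red = pairs.reduceByKey(_ + _)   // *ByKey operation
val jnd = pairs.join(other)          // join operation
```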
Since my code had many join operations, a large volume of shuffle data must have been written to spark.local.dir. The default location, /tmp, sits on the root disk, which had been provisioned small, and that is what caused the error. Changing spark.local.dir to a path with enough disk space resolved the problem, as sketched below.
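A sketch of the fix (the directory paths here are hypothetical): spark.local.dir must be set before the SparkContext starts, and, per the NOTE quoted above, on a cluster it is overridden by the SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) environment variables:

```scala
import org.apache.spark.sql.SparkSession

// Point scratch space at a disk with enough capacity.
// A comma-separated list spreads shuffle files across multiple disks.
val spark = SparkSession.builder()
  .appName("join-heavy-job")
  .config("spark.local.dir", "/data1/spark-tmp,/data2/spark-tmp")
  .getOrCreate()
```

The same setting can also go in spark-defaults.conf or be passed to spark-submit with --conf spark.local.dir=....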
spark.apache.org/docs/2.3.0/configuration.html