# Some keywords and error messages

Commonly searched keywords, terms, and error messages, organized by job type.
## Spark keywords

| Spark key term | Explanation |
| --- | --- |
| Deploy mode | Specifies where the driver runs. In “cluster” mode the driver runs on the cluster; in “client” mode it runs on the edge node, outside the cluster. |
| Driver | The process that coordinates the application's execution. |
| Executor | A process launched for the application on a worker node. |
| Resilient Distributed Dataset (RDD) | A fault-tolerant, distributed collection of data. |
| spark.default.parallelism | The default number of partitions in RDDs returned by transformations such as join and reduceByKey when not set explicitly. |
| spark.dynamicAllocation.enabled | Enables dynamic allocation, which scales the number of executors up and down with the workload. |
| spark.executor.memory | The amount of memory to use per executor process. |
| spark.io.compression.codec | The codec used to compress RDDs, the event log file, and broadcast variables. |
| spark.shuffle.service.enabled | Enables the external shuffle service, which preserves shuffle files even when executors are removed. It is required by dynamic allocation. |
| spark.shuffle.spill.compress | Specifies whether to compress data spilled during shuffles. |
| spark.sql.shuffle.partitions | The number of partitions to use when shuffling data for joins or aggregations in Spark SQL. |
| spark.yarn.executor.memoryOverhead | The amount of off-heap memory (in MB) to allocate per executor when running on YARN. |
| SparkContext | The main Spark entry point; used to create RDDs, accumulators, and broadcast variables. |
| SparkConf | The Spark configuration object; holds Spark parameters as key-value pairs. |
| SQLContext | The main Spark SQL entry point. |
| StreamingContext | The main Spark Streaming entry point. |
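
The configuration properties above are typically set as key-value pairs on a SparkConf before the SparkContext is created. A minimal sketch, using illustrative values rather than tuned recommendations:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: setting several properties from the table above on a SparkConf.
// All values here are illustrative assumptions, not recommendations.
val conf = new SparkConf()
  .setAppName("keyword-demo")
  .set("spark.executor.memory", "4g")               // memory per executor process
  .set("spark.yarn.executor.memoryOverhead", "512") // off-heap memory per executor, in MB
  .set("spark.sql.shuffle.partitions", "200")       // partitions for Spark SQL shuffles
  .set("spark.dynamicAllocation.enabled", "true")   // scale executors with the workload
  .set("spark.shuffle.service.enabled", "true")     // required by dynamic allocation

val sc = new SparkContext(conf)
```

The same properties can equally be passed on the command line with `--conf key=value` when submitting the application with spark-submit.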
## Spark error messages

| Spark error message | Explanation |
| --- | --- |
| Container killed by YARN for exceeding memory limits. | The amount of off-heap memory was insufficient at the container level. Increase "spark.yarn.executor.memoryOverhead" to a larger value. |
| java.io.IOException: Connection reset by peer | Generally appears in the driver logs when some of the executors fail or are shut down unexpectedly. |
| java.lang.OutOfMemoryError | Out-of-memory error; insufficient Java heap space at the executor or driver level. |
| org.apache.hadoop.mapred.InvalidInputException | The input path does not exist. |
| org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. | The Kryo serialization buffer was too small for a record; increase "spark.kryoserializer.buffer.max". |
| org.apache.spark.sql.catalyst.errors.package$TreeNodeException | Observed when executing a Spark SQL query on nonexistent data. |
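
The two memory-related errors above are usually addressed by raising the corresponding limits. A minimal sketch of those adjustments, with illustrative values only:

```scala
import org.apache.spark.SparkConf

// Sketch of typical remedies for the memory errors in the table above.
// The values are illustrative starting points, not tuned recommendations.
val conf = new SparkConf()
  // "Container killed by YARN for exceeding memory limits":
  // give each executor more off-heap headroom (in MB).
  .set("spark.yarn.executor.memoryOverhead", "1024")
  // "Kryo serialization failed: Buffer overflow":
  // allow the Kryo buffer to grow large enough for the largest record.
  .set("spark.kryoserializer.buffer.max", "256m")
```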
## MapReduce/Hive keywords

| Key term | Explanation |
| --- | --- |
| hive.exec.parallel | Determines whether to execute jobs in parallel. |
| hive.exec.reducers.bytes.per.reducer | The amount of input data per reducer; used to calculate the number of reducers. |
| io.sort.mb | The total amount of buffer memory to use while sorting files, in megabytes. |
| io.sort.record.percent | The percentage of io.sort.mb dedicated to tracking record boundaries. |
| mapreduce.input.fileinputformat.split.maxsize | The maximum chunk size that map input should be split into. |
| mapreduce.input.fileinputformat.split.minsize | The minimum chunk size that map input should be split into. |
| mapreduce.job.reduces | The default number of reduce tasks per job. |
| mapreduce.map.cpu.vcores | The number of virtual cores to request from the scheduler for each map task. |
| mapreduce.map.java.opts | Java options for each map task's JVM, typically used to set the heap size (-Xmx). |
| mapreduce.map.memory.mb | The amount of memory to request from the scheduler for each map task. |
| mapreduce.reduce.cpu.vcores | The number of virtual cores to request from the scheduler for each reduce task. |
| mapreduce.reduce.java.opts | Java options for each reduce task's JVM, typically used to set the heap size (-Xmx). |
| mapreduce.reduce.memory.mb | The amount of memory to request from the scheduler for each reduce task. |
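
The MapReduce properties above can be set programmatically on a job's Configuration, as in the sketch below; Hive properties are instead typically set in a session, e.g. `SET hive.exec.parallel=true;`. All values here are illustrative assumptions:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

// Sketch: setting MapReduce properties from the table above.
// Values are illustrative only; heap sizes are kept below the
// corresponding container memory limits.
val conf = new Configuration()
conf.set("mapreduce.job.reduces", "10")             // default number of reduce tasks
conf.set("mapreduce.map.memory.mb", "2048")         // container memory per map task
conf.set("mapreduce.map.java.opts", "-Xmx1638m")    // map JVM heap, below the container limit
conf.set("mapreduce.reduce.memory.mb", "4096")      // container memory per reduce task
conf.set("mapreduce.reduce.java.opts", "-Xmx3276m") // reduce JVM heap, below the container limit

val job = Job.getInstance(conf, "keyword-demo")
```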