Common keywords and error messages
Commonly searched keywords, terms, and error messages, organized by job type.
Spark keywords
Spark key term  | Explanation  | 
|---|---|
Deploy mode  | Specifies where the driver runs. In “cluster” mode the driver runs on the cluster. In “client” mode the driver runs on the edge node, outside of the cluster.  | 
Driver  | The process that coordinates the application execution.  | 
Executor  | The process launched by the application on a worker node.  | 
Resilient Distributed Dataset (RDD)  | Fault-tolerant distributed dataset.  | 
spark.default.parallelism  | Default number of partitions in RDDs returned by transformations such as join and reduceByKey when not set explicitly.  | 
spark.dynamicAllocation.enabled  | Enables dynamic allocation in Spark.  | 
spark.executor.memory  | Amount of memory to use per executor process.  | 
spark.io.compression.codec  | Codec used to compress RDDs, the event log file, and broadcast variables.  | 
spark.shuffle.service.enabled  | Enables the external shuffle service to preserve shuffle files even when executors are removed. It is required by dynamic allocation.  | 
spark.shuffle.spill.compress  | Specifies whether to compress the shuffle files.  | 
spark.sql.shuffle.partitions  | Number of partitions to use when shuffling data for joins or aggregations in Spark SQL.  | 
spark.yarn.executor.memoryOverhead  | Amount of off-heap memory, in megabytes, allocated per executor when running on YARN.  | 
SparkContext  | Main Spark entry point; used to create RDDs, accumulators, and broadcast variables.  | 
SparkConf  | Spark configuration object.  | 
SQLContext  | Main Spark SQL entry point.  | 
StreamingContext  | Main Spark Streaming entry point.  | 
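
Most of these properties can be set programmatically on SparkConf before the SparkContext is created, or passed with --conf to spark-submit (where --deploy-mode selects cluster or client mode). The Scala sketch below ties several of the keys above together; the application name and all values are placeholders for illustration, not tuning recommendations.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Illustrative configuration only; the values are placeholders, not recommendations.
val conf = new SparkConf()
  .setAppName("keyword-demo")                          // hypothetical application name
  .set("spark.executor.memory", "4g")                  // heap per executor
  .set("spark.sql.shuffle.partitions", "200")          // Spark SQL shuffle partitions
  .set("spark.default.parallelism", "200")             // default RDD partition count
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")        // required by dynamic allocation
  .set("spark.yarn.executor.memoryOverhead", "1024")   // off-heap overhead, in MB

val sc = new SparkContext(conf)      // main entry point; used to create RDDs
val sqlContext = new SQLContext(sc)  // main Spark SQL entry point
```

The same keys can also be supplied on the command line, for example as --conf spark.executor.memory=4g on spark-submit.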
Spark error messages
Spark error messages  | Explanation  | 
|---|---|
Container killed by YARN for exceeding memory limits.  | The amount of off-heap memory was insufficient at the container level. Increase "spark.yarn.executor.memoryOverhead" to a larger value.  | 
java.io.IOException: Connection reset by peer  | Generally appears in the driver logs when some of the executors fail or are shut down unexpectedly.  | 
java.lang.OutOfMemoryError  | Out of memory error; insufficient Java heap space at the executor or driver level.  | 
org.apache.hadoop.mapred.InvalidInputException  | Input path does not exist.  | 
org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow.  | The Kryo serialization buffer is too small to hold the object being serialized; increase "spark.kryoserializer.buffer.max".  | 
org.apache.spark.sql.catalyst.errors.package$TreeNodeException  | Exception observed when executing a Spark SQL query on non-existent data.  | 
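
As a rough illustration of how these errors are commonly addressed, the sketch below adjusts the related properties on SparkConf. The values are placeholders and should be tuned for the actual workload.

```scala
import org.apache.spark.SparkConf

// Placeholder values chosen only to show which property maps to which error.
val conf = new SparkConf()
  // "Container killed by YARN for exceeding memory limits":
  // raise the per-executor off-heap overhead (in MB).
  .set("spark.yarn.executor.memoryOverhead", "2048")
  // java.lang.OutOfMemoryError at the executor level:
  // give each executor more heap.
  .set("spark.executor.memory", "8g")
  // "Kryo serialization failed: Buffer overflow":
  // allow the Kryo serialization buffer to grow larger.
  .set("spark.kryoserializer.buffer.max", "512m")
```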
MapReduce/Hive keywords
MapReduce/Hive key term  | Explanation  | 
|---|---|
hive.exec.parallel  | Determines whether to execute jobs in parallel.  | 
hive.exec.reducers.bytes.per.reducer  | The amount of input data per reducer; used to determine the number of reducers.  | 
io.sort.mb  | The total amount of buffer memory to use while sorting files, in megabytes.  | 
io.sort.record.percent  | The percentage of io.sort.mb dedicated to tracking record boundaries.  | 
mapreduce.input.fileinputformat.split.maxsize  | Maximum chunk size that map input should be split into.  | 
mapreduce.input.fileinputformat.split.minsize  | Minimum chunk size that map input should be split into.  | 
mapreduce.job.reduces  | Default number of reduce tasks per job.  | 
mapreduce.map.cpu.vcores  | Number of virtual cores to request from the scheduler for each map task.  | 
mapreduce.map.java.opts  | JVM heap size for each map task.  | 
mapreduce.map.memory.mb  | The amount of memory to request from the scheduler for each map task.  | 
mapreduce.reduce.cpu.vcores  | Number of virtual cores to request from the scheduler for each reduce task.  | 
mapreduce.reduce.java.opts  | JVM heap size for each reduce task.  | 
mapreduce.reduce.memory.mb  | The amount of memory to request from the scheduler for each reduce task.  |
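
These properties are normally set in mapred-site.xml or on the job configuration, and the hive.* keys are typically set with SET key=value; in a Hive session. The Scala sketch below uses the Hadoop MapReduce API with placeholder values and a hypothetical job name, purely for illustration.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

// Placeholder values for illustration only.
val conf = new Configuration()
conf.set("mapreduce.map.memory.mb", "2048")          // container memory per map task
conf.set("mapreduce.map.java.opts", "-Xmx1638m")     // JVM heap inside that container
conf.set("mapreduce.reduce.memory.mb", "4096")
conf.set("mapreduce.reduce.java.opts", "-Xmx3276m")
conf.setInt("mapreduce.job.reduces", 20)             // number of reduce tasks
conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 256L * 1024 * 1024)

val job = Job.getInstance(conf, "keyword-demo")      // hypothetical job name
```

Keeping the java.opts heap below the corresponding memory.mb container size, as in the sketch, leaves headroom for off-heap usage inside the container.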