Shared variables are the second abstraction in Spark after RDDs that can be used in parallel operations. By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task. Sometimes, however, a variable needs to be shared across tasks, or between tasks and the driver program.
a) Accumulators: An accumulator is a special kind of variable used to aggregate values across tasks, for example to count the error records in a dataset. Updating an accumulator does not trigger any shuffle.
The code that transforms the RDD is evaluated on the executors, while the accumulator's merged value can only be read back on the driver, as sketched below.
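A minimal sketch of an accumulator counting bad records, assuming a local SparkSession (the name badRecords and the sample data are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("AccumulatorExample")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// A long accumulator, registered on the driver
val badRecords = sc.longAccumulator("badRecords")

val lines = sc.parallelize(Seq("1", "2", "oops", "4", "bad"))

// Executors add to the accumulator; no shuffle is triggered
val numbers = lines.flatMap { s =>
  try Some(s.toInt)
  catch {
    case _: NumberFormatException =>
      badRecords.add(1) // the update happens on the executor
      None
  }
}

numbers.count() // an action forces evaluation of the lineage
println(s"Bad records: ${badRecords.value}") // readable only on the driver
```

Note that accumulator updates made inside a transformation can be applied more than once if a task is re-executed; Spark guarantees exactly-once updates only inside actions.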
b) Broadcast variables: A broadcast variable keeps a read-only copy of a value cached on each executor, instead of shipping a copy of it with every task. This is useful for sharing a large lookup table efficiently, as in the sketch below.
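A minimal sketch, reusing the SparkContext sc from the example above (the lookup table is illustrative):

```scala
// Ship the lookup table to each executor once, not once per task
val countryNames = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))

val codes = sc.parallelize(Seq("IN", "US", "IN"))

// Tasks read the cached copy via .value
val named = codes.map(code => countryNames.value.getOrElse(code, "Unknown"))
named.collect().foreach(println)
```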
Spark Static Allocation:
1 thread/task can handle 40 MB to 64 MB of data.
1 CPU core = 4 threads.
Each executor can have a maximum of 5 cores.
To handle 10 GB of data → 80 blocks of 128 MB → 160 tasks of 64 MB → 40 CPU cores → 8 executors.
1 executor runs 5 cores × 4 threads = 20 threads → 20 × 64 MB ≈ 1.3 GB of data, plus ~500 MB of overhead, so each executor requires about 2 GB of RAM.
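The sizing above can be expressed as a static configuration; a sketch using standard Spark properties, with the values taken from the calculation (the split between heap and overhead is one reasonable choice, not the only one):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("StaticAllocation")
  .config("spark.executor.instances", "8")         // 8 executors
  .config("spark.executor.cores", "5")             // 5 cores per executor
  .config("spark.executor.memory", "1536m")        // ~1.3 GB of task data plus headroom
  .config("spark.executor.memoryOverhead", "512m") // JVM/off-heap overhead, ~2 GB total
  .getOrCreate()
```

With a static allocation like this, the executors are acquired up front and held for the lifetime of the application.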
Spark Dynamic Allocation:
Instead of fixing the executor count up front, Spark can scale the number of executors up and down at runtime, requesting more when tasks queue up and releasing executors that sit idle.
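A sketch of the standard properties that enable it (the min/max/initial values are illustrative; on YARN the external shuffle service must also be enabled so executors can be removed safely):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("DynamicAllocation")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "2")
  .config("spark.dynamicAllocation.maxExecutors", "8")
  .config("spark.dynamicAllocation.initialExecutors", "2")
  .config("spark.shuffle.service.enabled", "true") // required for safe executor removal
  .getOrCreate()
```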
Memory (storage) levels in Spark are
a) Cache (default) – MEMORY_ONLY
b) Persist – accepts any of the storage levels below (defaults to MEMORY_ONLY)
c) MEMORY_ONLY_SER – serialized in memory
d) MEMORY_AND_DISK
e) MEMORY_AND_DISK_SER
f) OFF_HEAP
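A short sketch of how these levels are selected in code, reusing sc from above (the RDD contents are illustrative):

```scala
import org.apache.spark.storage.StorageLevel

val data = sc.parallelize(1 to 1000)

data.cache()      // shorthand for persist(StorageLevel.MEMORY_ONLY)
data.unpersist()  // a level must be cleared before choosing a different one

data.persist(StorageLevel.MEMORY_AND_DISK_SER) // serialized; spills to disk when memory is full
data.count()      // an action materializes the RDD under the chosen level
```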