## Question Asked in Interview:
What is the difference between RDD Lineage Graph and Directed Acyclic Graph (DAG) in Spark?
Spark is lazily evaluated: when a transformation (map, filter, etc.) is called, Spark does not execute it immediately. Instead, each RDD maintains a pointer to one or more parent RDDs, along with metadata about the type of relationship it has with each parent.
For example, when we call val b = a.map(...) on an RDD, the RDD b only keeps a reference to its parent a (the data is never copied); that is its lineage. A lineage entry is created for every transformation, and the lineage keeps track of all the transformations that have to be applied to produce that RDD, including the location from which the data has to be read. In effect, it is the logical execution plan.
This RDD lineage is used to recompute the data after a failure, since it records the full recipe of the computation.
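As a minimal sketch (sc is an existing SparkContext and the file path is illustrative), the lineage of a small chain of transformations can be inspected with toDebugString, and nothing runs until an action is called:
val a = sc.textFile("data.txt")            // leaf RDD: only knows where the data has to be read from
val b = a.map(line => line.toUpperCase)    // b keeps a pointer to its parent a
val c = b.filter(line => line.nonEmpty)    // c keeps a pointer to its parent b
println(c.toDebugString)                   // prints the lineage: filter <- map <- textFile
c.count()                                  // only this action actually triggers the computation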
DAG (Directed Acyclic Graph): a graph of all the RDDs and the transformations applied to them. The DAG is built as the user creates RDDs and applies transformations, and when an action is called it is handed to the DAG Scheduler, which divides it into stages. The Spark UI lets the user expand each stage and see the transformations it contains. Because the end-to-end plan is available before execution, it can be optimized; for example, transformations can be combined (pipelined) to reduce the shuffling of data. The DAG has no dedicated storage; it lives in memory, not on disk, and it also helps with fault tolerance.
At a high level, when any action is called on an RDD, Spark creates the DAG and submits it to the DAG scheduler.
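For illustration (file path and data are made up), a simple word count produces a DAG with two stages, split at the shuffle introduced by reduceByKey:
val words = sc.textFile("input.txt").flatMap(line => line.split(" "))   // narrow transformations: stage 1
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)            // reduceByKey needs a shuffle: stage boundary
counts.collect()   // the action: only now is the DAG built and submitted to the DAG scheduler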
Prior to Spark 2.0.0, SparkContext was the channel through which all Spark functionality was accessed.
The Spark driver program uses the SparkContext to connect to the cluster through a resource manager (YARN, Mesos, etc.).
In order to use the SQL, Hive, and Streaming APIs, separate contexts had to be created, for example:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
val conf = new SparkConf()
val sc = new SparkContext(conf)
val hc = new HiveContext(sc)                      // separate context for Hive/SQL
val ssc = new StreamingContext(sc, Seconds(10))   // separate context for streaming, with a batch interval
From Spark 2.0.0 onwards, SparkSession provides a single point of entry for interacting with the underlying Spark functionality and allows programming Spark with the DataFrame and Dataset APIs. All the functionality available through SparkContext is also available through SparkSession.
There is no need to create separate contexts for SQL, Hive, and Streaming, as SparkSession includes all these APIs.
Once the SparkSession is instantiated, we can configure Spark’s run-time config properties.
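A minimal sketch of the Spark 2.x entry point (the application name, config value, and file path are illustrative):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("example")
  .enableHiveSupport()   // Hive support without a separate HiveContext
  .getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", "200")   // run-time config set after instantiation
val df = spark.read.json("people.json")                 // DataFrame API through the same session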
## What is the difference between map() and mapPartitions() in Spark?
Both map() and mapPartitions() are transformations available on the RDD class.
Think of an RDD as a collection of many rows; Spark splits these rows across multiple partitions.
If there are 1000 rows and 10 partitions, each partition will contain 1000/10 = 100 rows.
When we apply map(func) to the RDD, func() is applied to each individual row, so in this case it is called 1000 times. This can be costly in time-critical applications, especially if func() involves expensive setup work.
If we call mapPartitions(func) on the RDD instead, func() receives a whole partition (an iterator of rows) and is therefore called once per partition, i.e. 10 times in this case. This lets you do per-partition work, such as opening a connection, once instead of once per row, which can save significant processing in time-critical applications.
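A sketch of the difference; openConnection() and lookup() below are hypothetical stand-ins for expensive per-record work such as opening a database connection:
def openConnection(): Map[Int, Int] = (1 to 1000).map(i => i -> i * 2).toMap   // hypothetical expensive setup
def lookup(conn: Map[Int, Int], n: Int): Int = conn.getOrElse(n, 0)            // hypothetical per-row work
val rdd = sc.parallelize(1 to 1000, 10)   // 1000 rows spread across 10 partitions
// map: the function body runs once per row, i.e. 1000 times, so the setup is repeated 1000 times
val viaMap = rdd.map { n =>
  val conn = openConnection()
  lookup(conn, n)
}
// mapPartitions: the function body runs once per partition, i.e. 10 times, so the setup runs only 10 times
val viaMapPartitions = rdd.mapPartitions { rows =>
  val conn = openConnection()
  rows.map(n => lookup(conn, n))
}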
Catalyst optimizer: optimizes the logical plan of queries that run through Spark SQL. It is a set of rules applied to the query to rewrite it in a more efficient form and gain performance. For example, SQL WHERE filters are pushed down and applied to the initial data where possible, instead of being applied at the end.
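For example (file, table, and column names are made up), explain(true) shows the optimized plan with the filter pushed down to the scan:
import org.apache.spark.sql.functions.col
val events = spark.read.parquet("events.parquet")
events.filter(col("country") === "US").select("user_id").explain(true)
// explain(true) prints the parsed, analyzed, and optimized logical plans plus the physical plan;
// the optimized plan shows the country filter applied at the Parquet scan (PushedFilters) rather than at the end.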
Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only “added” to, such as counters and sums.
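A minimal sketch of both shared variables (the lookup map and keys are illustrative):
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))   // read-only value cached on every executor
val badRecords = sc.longAccumulator("badRecords")    // counter that tasks can only add to
val mapped = sc.parallelize(Seq("a", "b", "x")).map { key =>
  if (!lookup.value.contains(key)) badRecords.add(1)
  lookup.value.getOrElse(key, 0)
}
mapped.collect()
println(badRecords.value)   // the accumulated count is read back on the driver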
saveAsTable: creates the table structure and stores the first version of the data. However, the overwrite save mode replaces all the partitions, even when dynamic partition overwrite is configured.
insertInto: does not create the table structure (the table must already exist); with the overwrite save mode it overwrites only the needed partitions when dynamic partition overwrite is configured.
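A sketch of the two calls under dynamic partition overwrite (the table name, columns, and data are illustrative; spark is an existing SparkSession):
import spark.implicits._
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
val df = Seq((1, "2024-01-01"), (2, "2024-01-02")).toDF("id", "dt")
// saveAsTable: creates the table (schema and partitioning) on the first run;
// with mode "overwrite" it still replaces the whole table, not just the touched partitions
df.write.mode("overwrite").partitionBy("dt").saveAsTable("events")
// insertInto: the table must already exist; with dynamic mode configured,
// only the partitions present in df are overwritten and the rest are left untouched
df.write.mode("overwrite").insertInto("events")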