In particular, I focus on DataFrames and Datasets, because in Apache Spark 2.0 these two APIs are unified. Our primary motivation behind this unification is our quest to simplify Spark by limiting the number of concepts you need to learn and by offering ways to process structured data.
At its core, an RDD is an immutable distributed collection of elements of your data, partitioned across the nodes in your cluster, that can be operated on in parallel through a low-level API offering transformations and actions.

When to use RDDs?

Consider these common scenarios for using RDDs:

- You want low-level transformations, actions, and control over your dataset;
- Your data is unstructured, such as media streams or streams of text;
- You want to manipulate your data with functional programming constructs rather than domain-specific expressions;
- You do not care about imposing a schema, such as a columnar format, when processing data or accessing data attributes by name or column; and
- You can forgo some of the optimization and performance benefits that DataFrames and Datasets offer for structured and semi-structured data.

What happens to RDDs in Apache Spark 2.0?

You might ask: are RDDs being relegated to second-class citizens?
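For reference, the low-level RDD API described at the start of this section can be sketched roughly as follows. This is a minimal illustration, not a recommended pattern: it assumes Spark 2.x is on the classpath and runs in local mode, and the object name `RddSketch`, the helper `wordCounts`, and the sample lines are all hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object RddSketch {
  // Hypothetical helper: counts words in schema-less lines of text using only RDD operations.
  def wordCounts(spark: SparkSession, lines: Seq[String]): Map[String, Int] = {
    // Distribute the unstructured input across the cluster as an RDD[String].
    val rdd = spark.sparkContext.parallelize(lines)

    // Transformations are lazy; each one returns a new RDD and leaves its parent untouched.
    val counts = rdd
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // collect() is an action: it triggers the computation and returns results to the driver.
    counts.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session; the app name and master are illustrative only.
    val spark = SparkSession.builder()
      .appName("rdd-sketch")
      .master("local[*]")
      .getOrCreate()

    wordCounts(spark, Seq("spark makes rdds", "rdds are immutable")).foreach(println)

    spark.stop()
  }
}
```

Note that nothing here names a column or declares a schema; the structure of each record exists only inside the lambdas, which is exactly what Spark cannot see or optimize.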
Because of this unification, developers now have fewer concepts to learn or remember, and they work with a single high-level, type-safe API called Dataset.

Datasets

Starting with Spark 2.0, Dataset takes on two distinct API characteristics: a strongly typed API and an untyped API, as shown in the table below.
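To make the two flavors concrete, here is a small sketch of moving between them in Scala, where `DataFrame` is a type alias for `Dataset[Row]`. It assumes Spark 2.x in local mode; the case class `DeviceReading`, its fields, and the sample values are hypothetical and exist only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type used only for this illustration.
case class DeviceReading(device: String, temp: Double)

object TypedUntypedSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session; the app name and master are illustrative only.
    val spark = SparkSession.builder()
      .appName("typed-untyped-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Strongly typed API: a Dataset[T] of JVM objects.
    val typed: Dataset[DeviceReading] = Seq(
      DeviceReading("a", 21.5),
      DeviceReading("b", 30.1)
    ).toDS()

    // Untyped API: a DataFrame, i.e. a Dataset[Row] whose fields are looked up by name.
    val untyped: DataFrame = typed.toDF()
    untyped.select("device").show()

    // On the typed side, fields are ordinary members of DeviceReading.
    typed.map(reading => reading.device).show()

    // An untyped DataFrame becomes typed again once you supply the type.
    val roundTrip: Dataset[DeviceReading] = untyped.as[DeviceReading]
    roundTrip.show()

    spark.stop()
  }
}
```

The conversion with `as[DeviceReading]` is checked against the DataFrame's schema when it is called, so a mismatch between the case class and the columns still surfaces at runtime rather than at compile time.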
Static-Typing and Runtime Type-Safety

Consider static typing and runtime safety as a spectrum, with SQL the least restrictive and Dataset the most restrictive.
For example, in Spark SQL string queries you won’t learn of a syntax error until runtime (which can be costly), whereas in DataFrames and Datasets you can catch such errors at compile time (which saves developer time and costs).
Also, when using Datasets, your analysis errors can be detected at compile time too, again saving developer time and costs. All of this translates into a spectrum of type-safety, along with syntax and analysis errors in your Spark code, with Datasets the most restrictive yet productive for a developer.
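The spectrum is easier to see side by side. The sketch below is only an illustration under stated assumptions: Spark 2.x in local mode, a hypothetical `Person` case class, and deliberately misspelled names (`SELEKT`, `nmae`). It wraps the failing calls in `Try` so the program keeps running and reports when each kind of error shows up.

```scala
import scala.util.Try
import org.apache.spark.sql.SparkSession

// Hypothetical record type used only for this illustration.
case class Person(name: String, age: Long)

object TypeSafetySketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session; the app name and master are illustrative only.
    val spark = SparkSession.builder()
      .appName("type-safety-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val people = Seq(Person("Ann", 34), Person("Bob", 19)).toDS()
    people.createOrReplaceTempView("people")

    // SQL string: the misspelled keyword compiles fine and is rejected only
    // when Spark parses the query at runtime (a syntax error).
    val sqlAttempt = Try(spark.sql("SELEKT name FROM people"))
    println(s"SQL syntax error surfaced at runtime: ${sqlAttempt.isFailure}")

    // DataFrame: a misspelled method name would not compile, but a misspelled
    // column name is just a string, so it is rejected at runtime (an analysis error).
    val dfAttempt = Try(people.toDF().select("nmae"))
    println(s"DataFrame analysis error surfaced at runtime: ${dfAttempt.isFailure}")

    // Dataset: fields are members of Person, so the compiler checks them.
    val adults = people.filter(p => p.age >= 21).map(p => p.name)
    adults.show()

    // The next line is left commented out because it does not compile:
    // people.map(p => p.nmae)  // value nmae is not a member of Person

    spark.stop()
  }
}
```

The trade-off is that the typed lambdas are opaque to Spark in the same way RDD lambdas are, so the compile-time checks come at the cost of some of the optimizations available to column expressions.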