
The Three Apache Spark APIs: RDDs vs DataFrames and Datasets

venkat k

Of all the things that delight developers, nothing is more appealing than a set of APIs that makes them productive, that is easy to use, and that is intuitive and expressive. One of Apache Spark's appeals to developers has been its easy-to-use APIs for operating on large datasets, across languages: Scala, Java, Python, and R.

In this blog, I explore the three sets of APIs available in Apache Spark 2.2 and above: RDDs, DataFrames, and Datasets. I cover why and when you should use each set, outline their performance and optimization benefits, and describe scenarios in which to use DataFrames and Datasets instead of RDDs. I focus mostly on DataFrames and Datasets, because in Apache Spark 2.0 these two APIs were unified.

Our primary motivation behind that unification is our quest to simplify Spark by limiting the number of concepts you have to learn and by offering ways to process structured data. Through structure, Spark can offer higher-level abstractions and APIs as domain-specific language constructs.

Resilient Distributed Dataset (RDD)
RDD has been the primary user-facing API in Spark since its inception. At its core, an RDD is an immutable distributed collection of elements of your data, partitioned across the nodes in your cluster, that can be operated on in parallel with a low-level API offering transformations and actions.
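For a feel of what that low-level API looks like, here is a minimal word-count sketch. It assumes a local Spark installation, and the input path data.txt is hypothetical.

    import org.apache.spark.sql.SparkSession

    // Assumes a local Spark installation; the input path is hypothetical.
    val spark = SparkSession.builder.appName("rdd-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val lines = sc.textFile("data.txt")                 // RDD[String]
    val counts = lines
      .flatMap(line => line.split("\\s+"))              // transformation
      .map(word => (word, 1))                           // transformation
      .reduceByKey(_ + _)                               // transformation
    counts.collect().foreach(println)                   // action: triggers execution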

When to use RDDs?
Consider these scenarios or common use cases for using RDDs:

  1. Your dataset requires low-level transformations, actions, and control;
  2. Your data is unstructured, such as media streams or streams of text;
  3. You want to manipulate your data with functional programming constructs rather than domain-specific expressions;
  4. You do not care about imposing a schema, such as a columnar format, while processing or accessing data attributes by name or column; and
  5. You can forgo some of the optimization and performance benefits that come with DataFrames and Datasets for structured and semi-structured data.

What happens to the RDDs in Apache Spark 2.0?
You might ask: Are RDDs being relegated to second-class citizens? Are they being deprecated?

The answer is a resounding no!

What's more, as you will see below, you can move seamlessly between a DataFrame or Dataset and an RDD at will, by simple API method calls, because DataFrames and Datasets are built on top of RDDs.

DataFrames
Like an RDD, a DataFrame is an immutable distributed collection of data. Unlike an RDD, the data is organized into named columns, like a table in a relational database. Designed to make large datasets easier to process, the DataFrame lets developers impose a structure on a distributed collection of data, allowing higher-level abstraction; it provides a domain-specific language API for manipulating your distributed data; and it makes Spark accessible to a wider audience beyond specialized data engineers.

In our preview of the Apache Spark 2.0 webinar and a subsequent blog, we mentioned that in Spark 2.0 the DataFrame APIs would merge with the Dataset APIs, unifying data processing capabilities across libraries. Because of this unification, developers now have fewer concepts to learn or remember, and they work with a single high-level, type-safe API called Dataset.

Datasets
Starting with Spark 2.0, the Dataset takes on two distinct API characteristics: a strongly typed API and an untyped API, as shown in the table below. Conceptually, consider a DataFrame as an alias for a collection of generic objects, Dataset[Row], where a Row is a generic untyped JVM object. A Dataset, by contrast, is a collection of strongly typed JVM objects, dictated by a case class you define in Scala or a class in Java.

Typed and Un-typed APIs

Advantages of Dataset APIs
As a Spark developer, you benefit in many ways from the unified DataFrame and Dataset APIs in Spark 2.0.

1. Static-Typing and Runtime Type-Safety
Consider static-typing and runtime safety as a spectrum, with SQL the least restrictive and Dataset the most restrictive. For instance, in your Spark SQL string queries, you won't know about a syntax error until runtime (which can be costly), whereas in DataFrames and Datasets you can catch such errors at compile time (which saves developer time and costs). That is, if you invoke a function in a DataFrame that is not part of the API, the compiler will catch it. However, it won't detect a non-existent column name until runtime.

The Dataset sits at the most restrictive end of the spectrum. Because Dataset APIs are all expressed as lambda functions over typed JVM objects, any mismatch of typed parameters is detected at compile time. Analysis errors, too, can be caught at compile time when using Datasets, saving developer time and costs.

All of this translates to a spectrum of type-safety along the syntax and analysis errors in your Spark code, with Datasets being the most restrictive yet the most productive for a developer.
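As a rough illustration of that spectrum, here is a small sketch; the Person case class, its fields, and the sample rows are hypothetical and exist only to show where each flavor of error is caught.

    import org.apache.spark.sql.SparkSession

    // Hypothetical schema, used only to illustrate the spectrum
    case class Person(name: String, salary: Double)

    val spark = SparkSession.builder.appName("type-safety").master("local[*]").getOrCreate()
    import spark.implicits._

    val peopleDS = Seq(Person("Ann", 12000.0), Person("Bo", 9000.0)).toDS()
    val peopleDF = peopleDS.toDF()
    peopleDF.createOrReplaceTempView("people")

    // SQL string: a typo inside the string (e.g. "slaary") would surface
    // only at runtime, when the query is parsed and analyzed.
    spark.sql("SELECT * FROM people WHERE salary > 10000").show()

    // DataFrame: calling a non-existent method will not compile, but a
    // misspelled column name ($"slaary") is still caught only at runtime.
    peopleDF.filter($"salary" > 10000).show()

    // Dataset: the lambda operates on typed fields, so a misspelled field
    // name or a type mismatch is rejected at compile time.
    peopleDS.filter(p => p.salary > 10000).show()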

2. High-level abstraction and custom view of structured and semi-structured data
DataFrames, as a collection of Dataset[Row], give you a structured custom view of your semi-structured data. For instance, say you have a huge IoT device event dataset, expressed as JSON. Since JSON is a semi-structured format, it lends itself well to being read as a collection of strongly typed objects, Dataset[DeviceIoTData].
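The code that the steps below refer to looks roughly like this; the field names in DeviceIoTData are assumptions about the JSON schema, and the file path is a placeholder. Later snippets in this post reuse this ds and the spark.implicits._ import.

    import org.apache.spark.sql.{Dataset, SparkSession}

    // Illustrative case class; the field names are assumptions about the JSON schema.
    case class DeviceIoTData(device_id: Long, device_name: String, temp: Long,
                             humidity: Long, c02_level: Long, cca3: String,
                             timestamp: Long)

    val spark = SparkSession.builder.appName("iot-devices").master("local[*]").getOrCreate()
    import spark.implicits._

    // Read the semi-structured JSON and view it as a strongly typed Dataset.
    val ds: Dataset[DeviceIoTData] = spark.read
      .json("/path/to/iot_devices.json")   // hypothetical path
      .as[DeviceIoTData]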

Under the hood in the code above, three things happen:

  1. Spark reads the JSON, infers the schema, and creates a collection of DataFrames.
  2. At this point, Spark converts your data into DataFrame = Dataset[Row], a collection of generic Row objects, since it does not yet know the exact type.
  3. Spark then converts Dataset[Row] -> Dataset[DeviceIoTData], producing type-specific Scala JVM objects as dictated by the class DeviceIoTData.

Most of us who work with structured data are accustomed to viewing and processing data either in a columnar manner or by accessing specific attributes within an object. With a Dataset as a collection of Dataset[ElementType] typed objects, you seamlessly get both compile-time safety and a custom view of strongly typed JVM objects. The resulting strongly typed Dataset[T] can easily be displayed or processed with high-level methods.
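For example, continuing with the hypothetical Dataset[DeviceIoTData] named ds from the sketch above:

    // Columnar view: select named columns, just as you would on a DataFrame.
    ds.select($"device_name", $"temp", $"humidity").show(5)

    // Object view: work directly with the strongly typed fields.
    ds.take(3).foreach(d => println(s"${d.device_name}: ${d.temp}"))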

3. Ease of use of APIs with structure
Although structure may limit control over what your Spark program can do with data, it introduces rich semantics and an easy set of domain-specific operations that can be expressed as high-level constructs. Most computations, however, can be accomplished with the Dataset's high-level APIs. For example, it is much simpler to perform agg, select, sum, avg, map, filter, or groupBy operations by accessing a Dataset of typed DeviceIoTData objects than by digging into the data fields of RDD rows.

Expressing your computation in a domain-specific API is far simpler and easier than with relational-algebra-like expressions (on RDDs). For instance, the code below uses filter() and map() to create another immutable Dataset.
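A sketch of that, again using the assumed DeviceIoTData fields from above:

    // filter() keeps the hotter devices, map() projects a tuple of fields,
    // and groupBy()/avg() aggregate per country code; each step returns
    // another immutable Dataset.
    val dsAvgTmp = ds
      .filter(d => d.temp > 25)
      .map(d => (d.temp, d.humidity, d.cca3))
      .groupBy($"_3")      // the third tuple element, the country code
      .avg()               // average of the remaining numeric columns
    dsAvgTmp.show()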

4. Performance and Optimization
In addition to all of the above benefits, you cannot overlook the space efficiency and performance gains of using the DataFrame and Dataset APIs, for two reasons.

First, because the DataFrame and Dataset APIs are built on top of the Spark SQL engine, they use Catalyst to generate optimized logical and physical query plans. Across the R, Java, Scala, and Python DataFrame/Dataset APIs, all relation-type queries go through the same code optimizer, providing space and speed efficiency. Whereas the typed Dataset[T] API is optimized for data engineering tasks, the untyped Dataset[Row] (an alias for DataFrame) is even faster and well suited for interactive analysis.
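You can inspect the plans Catalyst produces for any query with explain(); a quick sketch on the hypothetical IoT Dataset:

    // explain(true) prints the parsed, analyzed, and optimized logical plans
    // along with the physical plan that Catalyst generates for this query.
    ds.filter($"temp" > 25).groupBy($"cca3").count().explain(true)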

Second, since Spark, as a compiler, understands your Dataset's JVM type, it maps your type-specific JVM objects to Tungsten's internal memory representation using Encoders. As a result, Tungsten Encoders can efficiently serialize and deserialize JVM objects as well as generate compact bytecode that executes at superior speeds.
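If you are curious, you can look at the encoder Spark derives for a case class directly; this short sketch again assumes the hypothetical DeviceIoTData class from above.

    import org.apache.spark.sql.Encoders

    // Spark derives an Encoder for any case class (a Product type); the encoder
    // maps the JVM object to and from Tungsten's compact internal format.
    val encoder = Encoders.product[DeviceIoTData]
    println(encoder.schema)      // the columnar schema the encoder maps to

    // Primitive encoders exist as well, e.g. for a Dataset[Long].
    println(Encoders.LONG.schema)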

When should I use DataFrames or Datasets?
If you want rich semantics, high-level abstractions, and domain-specific APIs, use DataFrame or Dataset.
If your processing demands high-level expressions, filters, maps, aggregations, averages, sums, SQL queries, columnar access, and the use of lambda functions on semi-structured data, use DataFrame or Dataset.
If you want a higher degree of type-safety at compile time, typed JVM objects, the benefit of Catalyst optimization, and Tungsten's efficient code generation, use Dataset.
If you want unification and simplification of APIs across Spark libraries, use DataFrame or Dataset.
If you are an R user, use DataFrames.
If you are a Python user, use DataFrames and resort to RDDs if you need more control.
Note that you can always seamlessly interoperate with or convert a DataFrame and/or Dataset to an RDD by a simple method call, .rdd. For instance:
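A minimal sketch, continuing with the assumed DeviceIoTData Dataset:

    // Drop from the Dataset/DataFrame world down to the underlying RDD
    // with a single call; column names follow the assumed schema above.
    val deviceEvents = ds.select($"device_name", $"cca3", $"c02_level")
                         .where($"c02_level" > 1300)

    // .rdd returns an RDD[Row]; take the first 10 rows on the driver.
    val eventsRDD = deviceEvents.rdd.take(10)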

Bringing it all together
In summary, the choice of when to use an RDD or a DataFrame and/or Dataset seems obvious. While the former offers you low-level functionality and control, the latter allows custom views and structure, offers high-level and domain-specific operations, saves space, and executes at superior speed.

Considering the lessons we learned from Spark's early releases (how to simplify, optimize, and make Spark performant for developers), we decided to elevate the low-level RDD APIs to high-level abstractions as DataFrame and Dataset, and to build this unified data abstraction across libraries atop the Catalyst optimizer and Tungsten.

Choose the DataFrame and/or Dataset or RDD APIs that meet your needs and use cases, but I would not be surprised if you fall into the camp of most developers who work with structured and semi-structured data.
