A word of warning if you typically partition your
DataFrames/RDDs/Datasets in Spark based on a Integer, Long or Timestamp
keys. If you're using Spark's default partitioning settings (or even
something similar) and your values are sufficiently large and regularly
spaced out, it's possible that you'll see poor partitioning performance
partitioning these datasets.
This can manifest itself in DataFrame "repartition" or "groupBy"
commands and in the SQL "DISTRIBUTE BY" or "CLUSTER BY" keywords, among …
As I've mentioned my previous post on shuffles, shuffles in Spark can be
a source of huge slowdowns but for a lot of operations, such as joins,
they're necessary to do the computation. Or are they?
Yes, they are. But you can exercise some more control over your queries
and ensure that they only occur once if you know you're going to be
performing the same shuffle/join over and over again. We'll briefly
I originally intended this to be a much longer post about memory in
Spark, but I figured it would be useful to just talk about Shuffles
generally so that I could brush over it in the Memory discussion and
just make it a bit more digestible. Shuffles are one of the most
memory/network intensive parts of most Spark jobs so it's important to
understand when they occur and what's going on when you're trying …
If you've dug around into the Spark SQL code, whether it's in catalyst,
or the core code, you've probably seen tons of references to Logical and
Physical plans. These are the core units of query planning, a common
concept in database programming. In this post we're going to go over
what these are at a high level and then how they're represented and used
in Spark. The query planning code is especially important because it …
Note: This is being written as of early December of 2015 and currently
assumes Spark 1.5.2 API. The data sources API has been out for a few
versions now but it's still stabilizing so some of this might change to
be out of date.
I'm writing this guide as part of my own exploration on how to write a
data source using the Spark SQL Data Sources API. We'll start with
exploration of …