With the release of Spark 2.2.0, the Spark team has touted the initial release of its cost-based optimizer. This article explains what's included and how it's likely to affect you.
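If you want to try it out, the short version is that the optimizer is opt-in and driven by table statistics. Here's a minimal sketch, assuming a Hive-backed table; the `sales` table and its columns are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cbo-example")
  .enableHiveSupport()
  .getOrCreate()

// The cost-based optimizer ships disabled by default in Spark 2.2.
spark.conf.set("spark.sql.cbo.enabled", "true")

// Cost estimates rely on table- and column-level statistics,
// collected with ANALYZE TABLE.
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS customer_id, amount")
```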
Like many developers who spend a lot of time working with JVM languages,
I also spend a lot of time debugging running applications.
Throughout my interactions with my colleagues over the years, I've
learned that not everyone is aware of the great tools that ship with the
JDK (and some that don't) that can help you debug your application.
Specifically, the focus here is on the tools that shine when you can't
actually attach …
From my post history you can tell that I spend most of my time working
on backend or non-user-facing applications, but occasionally I make my
way back to the frontend in personal projects or otherwise. Lately I've
been playing with React, webpack, and a bunch of other tools in that
ecosystem.
One of the things I really struggled with initially, though, was
how to actually get webpack to load non-JavaScript …
A word of warning if you typically partition your
DataFrames/RDDs/Datasets in Spark on Integer, Long, or Timestamp
keys. If you're using Spark's default partitioning settings (or even
something similar) and your values are sufficiently large and regularly
spaced out, it's possible that you'll see poor partitioning performance
on these datasets.
This can manifest itself in DataFrame "repartition" or "groupBy"
commands and in the SQL "DISTRIBUTE BY" or "CLUSTER BY" keywords, among …
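To get a feel for the failure mode, here's a minimal sketch at the RDD level with `HashPartitioner`, assuming 200 partitions (Spark's default shuffle partition count) and Long keys spaced exactly 100 apart. The DataFrame/SQL path hashes differently under the hood, so treat this as an illustration of the general hazard rather than the post's exact scenario:

```scala
import org.apache.spark.HashPartitioner

// 200 partitions, matching Spark's default shuffle partition count.
val partitioner = new HashPartitioner(200)

// Long keys spaced exactly 100 apart, e.g. values rounded to a bucket size.
val keys = 0L until 1000000L by 100L

// For small Longs, hashCode is the value itself, so each key lands in
// partition (key % 200) -- here that's only ever partition 0 or 100.
val used = keys.map(k => partitioner.getPartition(k)).toSet
println(s"Partitions used: ${used.size} of ${partitioner.numPartitions}")
```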
As I mentioned in my previous post on shuffles, shuffles in Spark can be
a source of huge slowdowns, but for a lot of operations, such as joins,
they're necessary to do the computation. Or are they?
...
Yes, they are. But you can exercise more control over your queries
and ensure that the shuffle only occurs once if you know you're going to be
performing the same shuffle/join over and over again. We'll briefly
explore …
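As a preview of the kind of trick involved, one common approach is to pre-partition both sides on the join key and cache the result, so repeated joins reuse the same shuffle output. A minimal sketch, where the paths and column names (`/data/events`, `userId`, and so on) are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("reuse-shuffle").getOrCreate()
import spark.implicits._

// Shuffle each side by the join key once, then cache the partitioned data.
val events = spark.read.parquet("/data/events")
  .repartition($"userId")
  .cache()

val users = spark.read.parquet("/data/users")
  .repartition($"userId")
  .cache()

// Later joins on userId can reuse the cached, already-partitioned data
// instead of reshuffling both inputs every time.
val allJoined   = events.join(users, "userId")
val clickJoined = events.filter($"eventType" === "click").join(users, "userId")
```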