Hydronitrogen Tech Blog

Hamel Ajay Kothari writes about computers and stuff.



Articles in the Spark category

Spark 2.2.0 - Cost Based Optimizer Explained

With the release of Spark 2.2.0, the Spark team has touted the initial release of the cost-based optimizer. This article explains what's included and how it's likely to affect you.

Overview

The release notes contain the following highlights for the cost based optimizer:

  • SPARK-17075 SPARK-17076 SPARK-19020 SPARK-17077 SPARK-19350: Cardinality estimation for filter, join, aggregate, project and limit/sample …
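Turning it on, as a hedged sketch: the cost-based optimizer ships disabled by default in 2.2.0 and works from statistics you collect yourself, so a minimal setup looks something like the following. The table name `sales` and its columns are placeholders for illustration, not anything from the release notes.

```sql
-- CBO is opt-in: enable the flag (off by default in 2.2.0)
SET spark.sql.cbo.enabled=true;

-- the optimizer's cardinality estimates come from stats, so collect them first
ANALYZE TABLE sales COMPUTE STATISTICS;                         -- table-level
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS price, qty;  -- column-level
```

Without the `ANALYZE` step the optimizer has no cardinality information to work with, so flipping the flag alone changes little.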

Continue reading →

Posted in Spark

Poor Hash Partitioning of Timestamps, Integers and Longs in Spark

A word of warning if you typically partition your DataFrames/RDDs/Datasets in Spark based on Integer, Long or Timestamp keys. If you're using Spark's default partitioning settings (or even something similar) and your values are sufficiently large and regularly spaced out, it's possible that you'll see poorly balanced partitions when partitioning these datasets.

This can manifest itself in DataFrame "repartition" or "groupBy" commands and in the SQL "DISTRIBUTE BY" or "CLUSTER BY" keywords, among …
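A plain-Python model makes the failure mode concrete. This is an assumption-level sketch, not Spark's code: for Integer keys, Spark's default HashPartitioner boils down to `key.hashCode()` mod the partition count, and `Integer.hashCode()` is just the value itself.

```python
# Toy model of hash partitioning skew with regularly spaced integer keys.
num_partitions = 200

def spark_style_partition(key, n):
    # models hashCode % numPartitions; Python's % is already non-negative
    # for positive n, so plain modulo stands in for Spark's nonNegativeMod
    return key % n

# regularly spaced keys, e.g. IDs assigned in steps of 1000
keys = [i * 1000 for i in range(10_000)]
used = {spark_style_partition(k, num_partitions) for k in keys}
print(len(used))  # 1 -- every key lands in a single partition out of 200
```

With 200 partitions and keys spaced 1000 apart, every key satisfies `k % 200 == 0`, so the whole dataset piles into one partition while the other 199 sit empty.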


Continue reading →

Posted in Spark

Shuffle Free Joins in Spark SQL

As I mentioned in my previous post on shuffles, shuffles in Spark can be a source of huge slowdowns, but for a lot of operations, such as joins, they're necessary to do the computation. Or are they?

...

Yes, they are. But you can exercise some more control over your queries and ensure that they only occur once if you know you're going to be performing the same shuffle/join over and over again. We'll briefly explore …
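The idea can be modeled in plain Python (a sketch of the concept, not Spark's implementation): once both sides of a join are partitioned by the same key scheme, each pair of partitions can be joined locally with no further data movement.

```python
# model of a co-partitioned join: records land in a partition by key
def partition_by_key(records, num_partitions):
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[key % num_partitions].append((key, value))
    return parts

# both sides use the same partitioner -> matching keys share a partition index
left = partition_by_key([(1, "a"), (2, "b"), (3, "c")], 4)
right = partition_by_key([(1, "x"), (3, "y")], 4)

# so the join runs partition-by-partition, with no cross-partition traffic
joined = []
for left_part, right_part in zip(left, right):
    lookup = dict(right_part)
    joined += [(k, v, lookup[k]) for k, v in left_part if k in lookup]
print(sorted(joined))  # [(1, 'a', 'x'), (3, 'c', 'y')]
```

That one-time partitioning is exactly the cost you pay up front so that repeated joins on the same key don't pay it again.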


Continue reading →

Posted in Spark

Apache Spark Shuffles Explained In Depth

I originally intended this to be a much longer post about memory in Spark, but I figured it would be useful to just talk about shuffles generally so that I could brush over them in the memory discussion and make it a bit more digestible. Shuffles are one of the most memory- and network-intensive parts of most Spark jobs, so it's important to understand when they occur and what's going on when you're trying …
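As a rough mental model (plain Python, not Spark's actual shuffle machinery): the map side buckets its output by destination reducer, and the reduce side then fetches its bucket from every map task. The bucketing is the serialization/disk cost; the fetching is the network cost.

```python
# Plain-Python sketch of a hash shuffle feeding a word count.
num_reducers = 3
map_inputs = [  # two "map tasks", each holding (key, value) records
    [("a", 1), ("b", 1), ("c", 1)],
    [("a", 1), ("c", 1), ("c", 1)],
]

# map side: write one bucket per destination reducer
map_outputs = []
for records in map_inputs:
    buckets = [[] for _ in range(num_reducers)]
    for k, v in records:
        buckets[hash(k) % num_reducers].append((k, v))
    map_outputs.append(buckets)

# reduce side: reducer r pulls bucket r from every map task, then aggregates
counts = {}
for r in range(num_reducers):
    for buckets in map_outputs:
        for k, v in buckets[r]:
            counts[k] = counts.get(k, 0) + v
print(counts)  # counts: a -> 2, b -> 1, c -> 3
```

Because every reducer must contact every map task, the data transferred grows with both sides, which is why shuffles dominate the runtime of so many jobs.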


Continue reading →

Posted in Spark

In the Code: Spark SQL Query Planning and Execution

If you've dug around in the Spark SQL code, whether in Catalyst or the core code, you've probably seen tons of references to Logical and Physical plans. These are the core units of query planning, a common concept in database programming. In this post we're going to go over what these are at a high level and then how they're represented and used in Spark. The query planning code is especially important because it …
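As a toy illustration (invented classes, not Catalyst's): a logical plan is a tree describing *what* to compute, and planning maps each logical node onto a physical node that knows *how* to execute it.

```python
from dataclasses import dataclass

@dataclass
class LogicalScan:        # "read table t"
    table: str

@dataclass
class LogicalFilter:      # "keep rows where the predicate holds"
    predicate: object     # a callable standing in for a parsed expression
    child: object

@dataclass
class PhysicalScan:
    table: str
    def execute(self, tables):
        return tables[self.table]

@dataclass
class PhysicalFilter:
    predicate: object
    child: object
    def execute(self, tables):
        return [row for row in self.child.execute(tables) if self.predicate(row)]

def plan(logical):
    # the "strategy": translate each logical node into an executable physical one
    if isinstance(logical, LogicalScan):
        return PhysicalScan(logical.table)
    if isinstance(logical, LogicalFilter):
        return PhysicalFilter(logical.predicate, plan(logical.child))
    raise NotImplementedError(type(logical))

logical = LogicalFilter(lambda row: row["x"] > 1, LogicalScan("t"))
physical = plan(logical)
print(physical.execute({"t": [{"x": 1}, {"x": 2}, {"x": 3}]}))  # [{'x': 2}, {'x': 3}]
```

In real Spark the planner can choose between several physical operators for one logical node (e.g. different join strategies), which is where most of the interesting decisions happen.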


Continue reading →

Posted in Spark



Powered by Pelican, Python, Markdown and tons of other helpful stuff.