Hydronitrogen Tech Blog

Hamel Ajay Kothari writes about computers and stuff.

Spark 2.2.0 - Cost Based Optimizer Explained

With the release of Spark 2.2.0, the Spark team has touted the initial release of the cost-based optimizer. This article explains what's included and how it's likely to affect you.


The release notes contain the following highlights for the cost based optimizer:

  • SPARK-17075 SPARK-17076 SPARK-19020 SPARK-17077 SPARK-19350: Cardinality estimation for filter, join, aggregate, project and limit/sample …
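As a rough sketch of how you'd opt in (the table and column names here are hypothetical), the optimizer is gated behind a config flag and feeds on statistics collected with `ANALYZE TABLE`:

```sql
-- Enable the cost-based optimizer (off by default in 2.2.0)
SET spark.sql.cbo.enabled=true;

-- Collect the table- and column-level statistics the CBO's
-- cardinality estimates depend on ("sales" is an example table)
ANALYZE TABLE sales COMPUTE STATISTICS;
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS customer_id, amount;
```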

Continue reading →

Posted in Spark

Profiling a Running JVM: tools you should know

Like many developers working with JVM languages, I spend a lot of time debugging running applications. Through my interactions with colleagues over the years, I've learned that not everyone is aware of the great tools that ship with the JDK (and some that don't) that can help you debug your application.

Specifically, the focus here is on tools that shine when you can't actually attach …
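As a taste of the kind of tooling the post covers, here's a sketch of a few CLI tools that ship with the JDK, run against a hypothetical PID 12345:

```shell
jps -l                     # list running JVMs and their main classes
jstack 12345               # thread dump: find deadlocks and hot threads
jmap -histo 12345          # live heap histogram by class
jstat -gcutil 12345 1000   # GC utilization, sampled every second
```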

Continue reading →

Posted in Java on

Using Bootstrap in an ES6 Webpack Application

From my post history you can tell that I spend most of my time working on backend or non-user facing applications, but occasionally I make my way back to the frontend in personal projects or otherwise. Lately I've been playing with React, webpack and a bunch of other tools that fall into the ecosystem.

One of the things that I was really struggling with initially though was how to actually get webpack to load non-javascript …
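The usual answer involves loaders. A minimal config sketch (assuming the `css-loader`, `style-loader` and `file-loader` packages are installed; not necessarily the exact setup the full post lands on):

```javascript
// webpack.config.js (excerpt)
module.exports = {
  module: {
    rules: [
      // Inline Bootstrap's CSS via JS imports
      { test: /\.css$/, use: ['style-loader', 'css-loader'] },
      // Bootstrap's fonts and icons also need a loader
      { test: /\.(woff2?|ttf|eot|svg)$/, use: ['file-loader'] },
    ],
  },
};
```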

Continue reading →

Posted in Web

Poor Hash Partitioning of Timestamps, Integers and Longs in Spark

A word of warning if you typically partition your DataFrames/RDDs/Datasets in Spark based on Integer, Long or Timestamp keys. If you're using Spark's default partitioning settings (or even something similar) and your values are sufficiently large and regularly spaced out, it's possible that you'll see poor performance when partitioning these datasets.

This can manifest itself in DataFrame "repartition" or "groupBy" commands and in the SQL "DISTRIBUTE BY" or "CLUSTER BY" keywords, among …
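The failure mode is easy to reproduce outside Spark. The sketch below imitates Spark's `HashPartitioner` (`key.hashCode % numPartitions`) in plain Python, where `hash()` of an int is likewise the value itself; the key values and partition count are made-up examples:

```python
# partition = hash(key) mod numPartitions, mimicking Spark's HashPartitioner.
# Java's Integer.hashCode is the value itself, so regularly spaced keys
# collapse into few partitions whenever the stride shares a factor with
# the partition count.

def partition_for(key, num_partitions):
    return hash(key) % num_partitions

num_partitions = 200  # the spark.sql.shuffle.partitions default

keys = range(0, 100_000, 200)  # e.g. timestamps/IDs spaced 200 apart
used = {partition_for(k, num_partitions) for k in keys}
print(len(used))  # 1 -- every key lands in a single partition

keys_prime = range(0, 100_000, 199)  # stride coprime with 200
used_prime = {partition_for(k, num_partitions) for k in keys_prime}
print(len(used_prime))  # 200 -- spread across all partitions
```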

Continue reading →

Posted in Spark

Shuffle Free Joins in Spark SQL

As I mentioned in my previous post on shuffles, shuffles in Spark can be a source of huge slowdowns, but for a lot of operations, such as joins, they're necessary to do the computation. Or are they?


Yes, they are. But you can exercise some more control over your queries and ensure that they only occur once if you know you're going to be performing the same shuffle/join over and over again. We'll briefly explore …
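One way to make that control concrete (a sketch of the general idea, not necessarily the exact approach the full post takes; table and column names are illustrative) is to persist both join sides bucketed on the join key, so Spark can read them back co-partitioned:

```sql
-- Both tables bucketed the same way on the join key
CREATE TABLE users_b (user_id BIGINT, name STRING)
USING parquet
CLUSTERED BY (user_id) INTO 200 BUCKETS;

CREATE TABLE events_b (user_id BIGINT, ts TIMESTAMP)
USING parquet
CLUSTERED BY (user_id) INTO 200 BUCKETS;

-- Repeated joins on user_id can now avoid re-shuffling either side
SELECT * FROM events_b e JOIN users_b u ON e.user_id = u.user_id;
```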

Continue reading →

Posted in Spark

Page 1 / 2 »

Powered by Pelican, Python, Markdown and tons of other helpful stuff.