Hydronitrogen Tech Blog

Hamel Ajay Kothari writes about computers and stuff.



Articles in the sql tag

Poor Hash Partitioning of Timestamps, Integers and Longs in Spark

A word of warning if you typically partition your DataFrames/RDDs/Datasets in Spark based on Integer, Long or Timestamp keys. If you're using Spark's default partitioning settings (or even something similar) and your values are sufficiently large and regularly spaced out, it's possible that you'll see these datasets partitioned poorly.

This can manifest itself in DataFrame "repartition" or "groupBy" commands and in the SQL "DISTRIBUTE BY" or "CLUSTER BY" keywords, among …
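To make the warning concrete, here is a minimal sketch of how regularly spaced keys can collapse onto very few partitions. It is a simplified model that assumes the partition is just the key modulo the partition count; Spark's actual hash function differs by API and version, so treat this as an illustration of the effect, not of Spark's implementation. The function name `partition_counts` and the sample keys are made up for the example; 200 is Spark SQL's default for `spark.sql.shuffle.partitions`.

```python
from collections import Counter

def partition_counts(keys, num_partitions):
    """Count keys per partition under a plain modulo scheme (assumed model, not Spark's real hash)."""
    return Counter(key % num_partitions for key in keys)

num_partitions = 200  # default value of spark.sql.shuffle.partitions

# Hypothetical millisecond timestamps recorded at whole-second granularity:
# large values, regularly spaced 1000 apart.
start = 1_450_000_000_000
spaced_keys = [start + i * 1000 for i in range(10_000)]

# Densely packed keys for comparison.
dense_keys = list(range(10_000))

spaced = partition_counts(spaced_keys, num_partitions)
dense = partition_counts(dense_keys, num_partitions)

print("partitions used by spaced keys:", len(spaced))
print("partitions used by dense keys:", len(dense))
```

Because the spacing (1000) is a multiple of the partition count (200), every spaced key has the same remainder, while the dense keys spread evenly across all partitions.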



Posted in Spark on

In the Code: Spark SQL Query Planning and Execution

If you've dug around in the Spark SQL code, whether in catalyst or the core code, you've probably seen tons of references to Logical and Physical plans. These are the core units of query planning, a common concept in database programming. In this post we're going to go over what these are at a high level and then how they're represented and used in Spark. The query planning code is especially important because it …



Posted in Spark on

Writing a Spark Data Source

Note: This is being written in early December 2015 and assumes the Spark 1.5.2 API. The data sources API has been out for a few versions now, but it's still stabilizing, so some of this may become out of date.

I'm writing this guide as part of my own exploration of how to write a data source using the Spark SQL Data Sources API. We'll start with exploration of …



Posted in Spark on


Powered by Pelican, Python, Markdown and tons of other helpful stuff.