Hydronitrogen Tech Blog

Hamel Ajay Kothari writes about computers and stuff.



Articles in the skew tag

Poor Hash Partitioning of Timestamps, Integers and Longs in Spark

A word of warning if you typically partition your DataFrames/RDDs/Datasets in Spark on Integer, Long or Timestamp keys. If you're using Spark's default partitioning settings (or something similar) and your values are sufficiently large and regularly spaced, you may end up with badly skewed partitions for these datasets.
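To make the failure mode concrete, here is a minimal sketch in plain Python (no Spark required). It assumes a simplified partitioner that takes the key modulo the partition count, which is how Spark's RDD `HashPartitioner` behaves for JVM Integer keys, since an Integer's `hashCode` is the value itself. DataFrames hash with Murmur3 instead, which this sketch does not model. The names `simple_partition` and `partition_counts`, the starting timestamp, and the one-hour spacing are all made up for illustration.

```python
from collections import Counter


def simple_partition(key: int, num_partitions: int) -> int:
    """Assumed model: identity hash followed by a non-negative modulo."""
    # Python's % is already non-negative for a positive modulus.
    return key % num_partitions


def partition_counts(keys, num_partitions):
    """Count how many keys land in each partition."""
    return Counter(simple_partition(k, num_partitions) for k in keys)


num_partitions = 200          # Spark's default for spark.sql.shuffle.partitions
start = 1_500_000_000         # hypothetical epoch-seconds timestamp

# Regularly spaced keys: one per hour. 3600 is a multiple of 200.
regular_keys = [start + i * 3600 for i in range(10_000)]
# Densely packed keys: one per second.
dense_keys = [start + i for i in range(10_000)]

regular = partition_counts(regular_keys, num_partitions)
dense = partition_counts(dense_keys, num_partitions)

print("partitions used (hourly keys):", len(regular))
print("partitions used (dense keys): ", len(dense))
```

Under this model the hourly keys all collapse into a single partition because the spacing is a multiple of the partition count, while the dense keys spread evenly. A partition count that shares no common factor with the key spacing would avoid the collapse in this toy model.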

This can manifest itself in DataFrame "repartition" or "groupBy" commands and in the SQL "DISTRIBUTE BY" or "CLUSTER BY" keywords, among …



Posted in Spark on

