Hydronitrogen Tech Blog

Hamel Ajay Kothari writes about computers and stuff.



Profiling a Running JVM: tools you should know

Like many developers who work mostly in JVM languages, I spend a lot of time debugging running applications. Over the years, I've learned from my colleagues that not everyone is aware of the great tools, some that ship with the JDK and some that don't, that can help you debug your application.

Specifically, the focus here is on applications that shine when you can't actually attach …



Posted in Java

Using Bootstrap in an ES6 Webpack Application

From my post history you can tell that I spend most of my time working on backend or non-user-facing applications, but occasionally I make my way back to the frontend, in personal projects or otherwise. Lately I've been playing with React, webpack and a bunch of other tools in that ecosystem.

One thing I really struggled with at first, though, was how to actually get webpack to load non-JavaScript …



Posted in Web

Poor Hash Partitioning of Timestamps, Integers and Longs in Spark

A word of warning if you typically partition your DataFrames/RDDs/Datasets in Spark on an Integer, Long or Timestamp key. If you're using Spark's default partitioning settings (or even something similar) and your values are sufficiently large and regularly spaced, it's possible that the data will be distributed poorly across partitions.

This can manifest itself in DataFrame "repartition" or "groupBy" commands and in the SQL "DISTRIBUTE BY" or "CLUSTER BY" keywords, among …
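To make the failure mode concrete, here's a minimal sketch in plain Python, not Spark code. It assumes a hypothetical partitioner that simply takes the key's value modulo the partition count, which is a simplification: Spark's actual hashing may differ, especially for DataFrames. The names `modulo_partition` and `partition_sizes` are made up for illustration, and 200 is used because it is the default value of `spark.sql.shuffle.partitions`.

```python
from collections import Counter


def modulo_partition(key: int, num_partitions: int) -> int:
    """Hypothetical partitioner (an assumption, not Spark's implementation):
    the partition is the key's value modulo the partition count."""
    return key % num_partitions


def partition_sizes(keys, num_partitions: int) -> list:
    """Count how many keys land in each partition."""
    counts = Counter(modulo_partition(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


NUM_PARTITIONS = 200

# Large, regularly spaced keys: hourly timestamps in epoch seconds.
# The spacing (3600) is a multiple of 200, so every key maps to the
# same partition under this simplified scheme.
hourly = [1_500_000_000 + i * 3600 for i in range(1000)]
sizes = partition_sizes(hourly, NUM_PARTITIONS)
non_empty = sum(1 for s in sizes if s > 0)

# Consecutive keys, for comparison, spread evenly.
consecutive_sizes = partition_sizes(range(1000), NUM_PARTITIONS)

print(f"hourly keys: {non_empty} non-empty partition(s), largest holds {max(sizes)}")
print(f"consecutive keys: largest partition holds {max(consecutive_sizes)}")
```

In this toy model all 1000 hourly keys pile into a single partition while the other 199 stay empty, which is the kind of skew to watch for whenever the key spacing and the partition count share a large common factor.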



Posted in Spark

Shuffle Free Joins in Spark SQL

As I mentioned in my previous post on shuffles, shuffles in Spark can be a source of huge slowdowns, but for a lot of operations, such as joins, they're necessary to do the computation. Or are they?

...

Yes, they are. But if you know you're going to be performing the same shuffle/join over and over again, you can exercise more control over your queries and ensure that the shuffle only occurs once. We'll briefly explore …



Posted in Spark

Apache Spark Shuffles Explained In Depth

I originally intended this to be a much longer post about memory in Spark, but I figured it would be useful to talk about shuffles on their own first, so that I can skim over them in the memory discussion and keep that post more digestible. Shuffles are one of the most memory- and network-intensive parts of most Spark jobs, so it's important to understand when they occur and what's going on when you're trying …



Posted in Spark


