How to run PySpark on Windows

TL;DR version: install either Java 17 or Java 21. Install PySpark (e.g. via pip). Run it.
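In code form, the TL;DR is roughly this (a minimal sketch, assuming pip and a Java 17 or 21 JDK are already installed):

```python
# Prerequisites (run in a shell first):
#   pip install pyspark
#   make sure JAVA_HOME points at a Java 17 or 21 JDK
from pyspark.sql import SparkSession

# Minimal smoke test: if this prints a two-row table, PySpark works.
spark = SparkSession.builder.appName("smoke-test").getOrCreate()
spark.createDataFrame([(1, "hello"), (2, "world")], ["id", "word"]).show()
spark.stop()
```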


Longer Version

Story Time

As part of work for a client, we’ve been looking at using Microsoft Fabric and Power BI to pull together data, do analysis using the IES platform (my team’s core product), and present the results in a nice & friendly UI.

After a bunch of trial and error, we settled on a decent pattern: we did the bulk of the ETL logic in regular old Python code, and called that from Fabric Notebooks (kind of like Jupyter Notebooks), which did all of the data querying using PySpark.
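The shape of that pattern, sketched with hypothetical names (our actual modules and transforms are more involved than this):

```python
# etl_logic.py: plain Python with no Fabric-specific imports, so it can be
# exercised anywhere a SparkSession exists.
from pyspark.sql import DataFrame, functions as F

def summarise_readings(readings: DataFrame) -> DataFrame:
    # Illustrative transform: average a value column per site.
    return readings.groupBy("site_id").agg(F.avg("value").alias("mean_value"))

# In the Fabric Notebook, only the data access stays Fabric-specific:
#   df = spark.read.parquet("Files/readings")  # lakehouse path (placeholder)
#   summarise_readings(df).show()
```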

I say we, but I personally wasn’t involved in that part of the project – I was dealing with another part of the tech stack. But I was overseeing it, and I decided I didn’t like how much time my teammates needed to spend developing _on the Fabric console_ to troubleshoot things like the PySpark queries. Not to mention the non-Fabric-specific code that was sneaking into those notebooks, away from where we could easily test it.

So I suggested that they look into running PySpark tests locally, using cached Parquet data. I took a bit of time out of my other work, knocked up a spike test showing how it worked, and handed it over. The problem is that I do most of my work on a Mac, where I’ve got sensible development tools, while the rest of my team use work-issued Windows laptops. They couldn’t get the tests to run.
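The spike test looked something like this. It’s a sketch with made-up names and data; pyarrow (assumed to be installed) writes the fixture file, so Spark only ever has to _read_ Parquet, which is the part that matters here:

```python
# test_etl_logic.py: run with pytest. Assumes pyspark and pyarrow are
# installed, and a Java 17 or 21 JDK is on JAVA_HOME. No Winutils anywhere.
import pyarrow as pa
import pyarrow.parquet as pq
import pytest
from pyspark.sql import SparkSession

from etl_logic import summarise_readings  # the plain-Python module above

@pytest.fixture(scope="session")
def spark():
    session = SparkSession.builder.master("local[2]").appName("etl-tests").getOrCreate()
    yield session
    session.stop()

def test_summarise_readings(spark, tmp_path):
    # Build a small Parquet file with pyarrow, standing in for the data
    # we cache from Fabric, so Spark only has to read it.
    path = str(tmp_path / "readings.parquet")
    pq.write_table(
        pa.table({"site_id": ["a", "a", "b"], "value": [1.0, 3.0, 2.0]}), path
    )

    rows = summarise_readings(spark.read.parquet(path)).collect()
    assert {r["site_id"]: r["mean_value"] for r in rows} == {"a": 2.0, "b": 2.0}
```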

So they went around looking for guidance on how to get PySpark to work on Windows. There’s heaps and heaps of advice, and it all boils down to the same thing: you need to install Hadoop Winutils – a set of native Windows binaries that provide I/O support, because for some reason Hadoop didn’t want to use the regular Java I/O.

Problems with Winutils

Here’s the rub: there’s no “official” distribution of Hadoop Winutils. The Hadoop project doesn’t offer one. Many of the guides out there say “get a community-built version”, without saying where to find one. The most official one (which I linked above) is provided by a Hadoop contributor, but he stopped providing updates years ago, with Hadoop 3.0. He links to another repo, which also stopped getting updates years ago, but at least goes up to Hadoop 3.3. Nobody, apparently, is providing binaries for Hadoop 3.4, which is what’s bundled with PySpark 4.x (the current version is 4.1.1 as of the time of writing).

Liars, all of them. Liars!

Maybe it was needed once. Maybe it’s still needed for some things. But at least with PySpark 3.5.8, loading up a Parquet file and running queries works just fine – as long as you’re using Java 17 or Java 21, anyway. You get a runtime error with Java 25, which, for some reason, the devs who looked at this problem had updated to. Unnecessarily, mind you – the Java apps we build for IES are all based on Java 17 (to be updated to Java 21 later this year).
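If you want to convince yourself on a fresh Windows machine, the whole check is something like this (the path and column names are placeholders; no HADOOP_HOME, no winutils.exe anywhere):

```python
# Assumes JAVA_HOME points at a Java 17 or 21 JDK, e.g. in PowerShell:
#   $env:JAVA_HOME = "C:\Program Files\Eclipse Adoptium\jdk-17..."  (placeholder)
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("no-winutils").getOrCreate()
df = spark.read.parquet("cached/readings.parquet")  # placeholder: your cached data
df.groupBy("site_id").count().show()                # any query will do
spark.stop()
```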

The biggest liars here, though, were the LLMs my teammates were using to help them figure out what was going on. The Google AI summary, ChatGPT, Copilot, Gemini – all had been trained on those same articles, and all insisted that you need Hadoop Winutils (without saying where to get it). It even sidetracked me for a day, because I spent time replicating my coworkers’ research to reproduce the problem. It wasn’t until I decided to ignore that research and start from scratch that I discovered the problem.

Takeaways

  1. Don’t overcomplicate things.
  2. Don’t trust LLMs – they are always out of date.
  3. Unit tests rock. Don’t be lazy and find excuses not to test as much of your code as possible locally.
  4. There’s always a way to run more locally.