Author: Robert Watkins

My name is Robert Watkins. I am a software developer and have been for over 20 years now. I currently work for people, but my opinions here are in no way endorsed by them (which is cool; their opinions aren’t endorsed by me either). My main professional interests are in Java development, using Agile methods, with a historical focus on building web based applications. I’m also a Mac-fan and love my iPhone, which I’m currently learning how to code for. I live and work in Brisbane, Australia, but I grew up in the Northern Territory, and still find Brisbane too cold (after 22 years here). I’m married, with two children and one cat. My politics are socialist in tendency, my religious affiliation is atheist (aka “none of the above”), my attitude is condescending and my moral standing is lying down.

How to run PySpark on Windows

TL;DR version: install either Java 17 or Java 21. Install PySpark (e.g. via pip). Run it.

Longer Version

Story Time

As part of work for a client, we’ve been looking at using Microsoft Fabric and PowerBI to pull together data, do analysis using the IES platform (my team’s core product), and present the results in a nice & friendly UI.

After a bunch of trial and error, we settled on a decent pattern where we did the bulk of the logic of the ETL scripts in regular old Python code, and then called that from Fabric Notebooks (kind of like Jupyter Notebooks) which did all of the data querying using PySpark.
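A minimal sketch of that split, to make the pattern concrete. This is not the client code: the `total_by_site` function, the `site` and `tonnes` fields and the `haulage` table are all hypothetical names picked for illustration. The point is only that the logic takes plain Python rows and never imports Spark or Fabric.

```python
from collections import defaultdict
from typing import Any, Iterable, Mapping


def total_by_site(rows: Iterable[Mapping[str, Any]]) -> dict[str, float]:
    """Plain-Python ETL logic: sum the 'tonnes' field per 'site'.

    No Spark or Fabric imports here, so it can be unit tested anywhere.
    """
    totals: dict[str, float] = defaultdict(float)
    for row in rows:
        totals[str(row["site"])] += float(row["tonnes"])
    return dict(totals)


# Notebook side (hypothetical): the notebook only runs the query and hands
# plain rows to the logic above, e.g.
#   rows = [r.asDict() for r in spark.sql("SELECT site, tonnes FROM haulage").collect()]
#   totals = total_by_site(rows)

sample = [
    {"site": "north", "tonnes": 10.0},
    {"site": "south", "tonnes": 4.5},
    {"site": "north", "tonnes": 2.5},
]
print(total_by_site(sample))  # {'north': 12.5, 'south': 4.5}
```

Keeping the notebook down to "query, then call a function" is what makes the rest testable without a Fabric console.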

I say we, but I personally wasn’t involved in that part of the project – I was dealing with another part of the tech stack. But I was overseeing it, and I decided I didn’t like how much time my teammates needed to spend developing _on the Fabric console_ to troubleshoot things like the PySpark queries. Not to mention the non-Fabric-specific code that was sneaking into those notebooks, away from where we could easily test it.

So I suggested that they look into running PySpark tests locally, using cached Parquet data. I took a bit of time out of my other work, and knocked up a spike test showing how it worked, and handed it over. The problem is I do most of my work on a Mac, where I’ve got sensible development tools, but the rest of my team use work-issued Windows laptops. They couldn’t get the tests to run.

So they went around looking for guidance on how to get PySpark to work on Windows. There’s heaps and heaps of advice. They all boil down to the following: you need to install Hadoop Winutils – a native software library that provides I/O support because for some reason Hadoop didn’t want to use the regular Java I/O.

Problems with Winutils

Here’s the rub: there’s no “official” distribution of Hadoop Winutils. The Hadoop project doesn’t offer one. Many of the guides out there say “get a community-built version”, without saying where to find one. The most official one (which I linked above) is provided by a Hadoop contributor, but he stopped providing updates years ago, with Hadoop 3.0. He links to another repo, which also stopped providing updates years ago, but at least goes up to Hadoop 3.3. Nobody appears to be providing binaries for Hadoop 3.4, which is what’s bundled with PySpark 4.x (the current version is 4.1.1 at the time of writing).

Liars, all of them. Liars!

Maybe it was needed once. Maybe it’s still needed for some things. But at least with PySpark 3.5.8, it works just fine to load up a Parquet file and do queries. As long as you’re using Java 17 or Java 21, anyway. You get a runtime error with Java 25 – which, for some reason, the devs who looked at this problem had updated to. Unnecessarily, mind you – the Java apps we build for IES are all based on Java 17 (to be updated to Java 21 later this year).
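For anyone wanting to check their own machine, here is a rough sketch of the two steps: confirm which Java is on the PATH, then load a Parquet file and query it in a local Spark session. The function names, the `cached` view name and the sample path are made up for illustration, and the set of "supported" versions is simply the two reported as working above. The PySpark import is done lazily so the version check can run before PySpark is installed.

```python
import re
import subprocess

SUPPORTED_JAVA_MAJORS = {17, 21}  # the versions reported as working above


def java_major_version(version_output: str) -> int:
    """Extract the major version from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("could not find a Java version string")
    first, second = match.group(1), match.group(2)
    # Pre-Java 9 releases report themselves as 1.8, 1.7, ...
    if first == "1" and second is not None:
        return int(second)
    return int(first)


def installed_java_major() -> int:
    """Ask the `java` on PATH for its version (it normally prints to stderr)."""
    result = subprocess.run(
        ["java", "-version"], capture_output=True, text=True, check=True
    )
    return java_major_version(result.stderr + result.stdout)


def count_rows(parquet_path: str) -> int:
    """Load a Parquet file in a local Spark session and run a SQL query."""
    from pyspark.sql import SparkSession  # lazy import; needs `pip install pyspark`

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("local-parquet-check")
        .getOrCreate()
    )
    try:
        spark.read.parquet(parquet_path).createOrReplaceTempView("cached")
        return spark.sql("SELECT COUNT(*) AS n FROM cached").collect()[0]["n"]
    finally:
        spark.stop()


def main() -> None:
    major = installed_java_major()
    if major not in SUPPORTED_JAVA_MAJORS:
        raise SystemExit(f"Java {major} found; expected 17 or 21")
    print(count_rows("data/sample.parquet"))  # hypothetical path
```

Calling `main()` runs both steps. If the Java check fails, fix that first before going anywhere near Winutils.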

The biggest liars here, though, were the LLMs that my teammates were using to help them figure out what was going on. The Google AI summary, ChatGPT, Copilot, Gemini – all had been trained on those same articles, and all insisted that you need Hadoop Winutils (without saying where to get it). It even sidetracked me for a day (because I spent time replicating the research done by my coworkers to reproduce the problem). It wasn’t until I decided to ignore that research and start from scratch that I discovered the problem.

Takeaways

Don’t overcomplicate things.
Don’t trust LLMs – they are always out of date.
Unit tests rock. Don’t be lazy and find excuses not to test as much of your code as possible locally.
There’s always a way to run more locally.

Is this thing on?

For reasons that should be obvious, I’m just making sure that my blog is still alive. Moribund, but alive…

JobSeeker and JobActive are meant to move people off welfare and into work, but COVID-19 has changed things – ABC News

The privatisation of the employment system is one of the classic examples of failure.

Private employment agencies work well as an opt-in system – you want a job, you go to an agency to help find you one, and if you’re one of the lucky ones, they get you one. It works well for people who are “highly employable”. This system is how I’ve found nearly every job I’ve had, for example.

But it doesn’t work for a lot of people. If you are not “highly employable” – that is, if you don’t have any particularly in-demand skills, and are applying for so-called ‘unskilled labour’ jobs – the private system is full of holes. In particular, they have no incentive to do a lot of work to find candidates a job. They have incentive to do some work – they get paid – but there is no penalty for not finding a job. It’s built on an underlying assumption of “full employment” – that there is a job available for anyone who wants one, and thus the only reason for not finding a job is due to some failure on the part of the job seeker.

The thing is – that’s just not the case. Well before this so-called “COVID recession”, we were experiencing the “jobless recovery” from the GFC (which, BTW, we never recovered from). Headline unemployment was “low” at ~5% give-or-take – which is the Coalition’s goal. They don’t want unemployment really low, because it puts pressure on wages, and they see the ultra-low wage growth of the last few years as a good thing. But there were still plenty more applicants than jobs – about 8 applicants per job in most cities, and up to 30 in rural areas or amongst youths. It was quite possible to be an active job seeker, willing to take on any reasonable job, and not be able to find work – the jobs weren’t there.

The private employment services responded to this the way any for-profit business would: they placed people in jobs if the jobs were available, and that was it. Yes, they would try and find more employers – but it wasn’t looking for a job for you. You’d just go into the bucket (or, if you were lucky, into a queue), and if your name was drawn out, you’d be sent off for an interview (again, one of many for the job), and then you’d roll the dice to see if you got that job. If you didn’t, back in the bucket you went.

Furthermore, once you’ve gone through that process a few times, you’re now costing the service provider more than they will make from you. So you get moved from the bucket of “employable types” and over to the bucket of “unemployable types”. You won’t get as many interviews, and even when you’re sent on them, the employer knows that you’re in the unemployable bucket, and you’ll get a pro-forma interview and be circular filed afterwards. There’s just no incentive to find work for people who, through bad luck, end up in this category.

The simple fact is that there is no penalty for employment agencies to fail to find work for their clients.

The Centrelink system of the 80s and 90s had problems; a centralised system for all unemployed people to go through was inefficient. But it did have a goal of reducing the number of long-term unemployed – a goal that is given lip-service at best these days.

Like many situations where a government service has been outsourced, the failure mode is that the easy, profitable cases are being well served, and the loss-making ones are being discarded. But these are people’s lives and livelihoods being discarded.

However, while the myth that anyone who wants a job can get a job remains prevalent, this will continue.

Source: JobSeeker and JobActive are meant to move people off welfare and into work, but COVID-19 has changed things – ABC News

ExtJS, Ajax, and Error Handling

An example of configuring a global AJAX exception handler

AiL – Subnets! Public Subnets! Wow, they are free, and we can au-to-mate!

(to the tune of the Ewoks Victory Song. Now it can be stuck in your head, and not just mine)

Picking up from last time, I needed to start creating things inside my Virtual Private Cloud, or VPC. The first things to create are subnets – public subnets, in particular. Without a public subnet, nothing that I run in the VPC can be accessed from the internet – nor can they access the internet in turn.

Continue reading “AiL – Subnets! Public Subnets! Wow, they are free, and we can au-to-mate!”

AiL – Baby’s First CloudFormation Stack

As previously discussed, I’m doing a small series of posts around bringing the AWS infrastructure that I use into the current era, and putting it all into CloudFormation. In this post, I’m going to cover setting up the first stack. This is going to set up a Virtual Private Cloud (or VPC), which is where the rest of the stuff I make later will sit.

What’s a Virtual Private Cloud?

A VPC is a virtual network of virtual servers. It’s your own mini-slice of the AWS cloud, and the machines within the VPC are aware of each other – in fact, they are on their own subnet (or subnets).

Why use a VPC?

You don’t have to set up a VPC to use AWS. You can simply create servers. That’s what I’ve been doing up until now. It’s just that it’s a bit limited.

I want to use a VPC for two big reasons:

I want to be able to use more recent/powerful/cheaper machine images, with OpsWorks. They’re only available if I also use a VPC.
I want to use an Elastic Load Balancer, in part to manage HTTPS certificates and connections. This requires a VPC and subnets.

Setting up the Stack

Here’s my config, at this particular stage:

	---
	AWSTemplateFormatVersion: '2010-09-09'
	Description:
	  Global configuration that could be used by multiple related stacks

	# Metadata: # no metadata

	Parameters:
	  Environment:
	    Type: String
	    Description:
	      Stack Environment Prefix.

	Resources:
	  # We need at least one resource. The VPC is the logical one to include here.
	  VPC:
	    Type: AWS::EC2::VPC
	    Properties:
	      # The range of ip addresses used in the VPC. Subnets will be inside this.
	      # Using the /16 lets me easily have subnets that don't overlap without
	      # needing to remember bit masks.
	      CidrBlock: 10.0.0.0/16 # 10.0.0.0 -> 10.0.255.255
	      EnableDnsSupport: true # If false, the servers don't seem to get access to DNS at all.
	      InstanceTenancy: default
	      Tags:
	        - Key: Name
	          Value: !Sub "${Environment} VPC"

	Outputs:
	  VPC:
	    Description: The ID for the Virtual Private Cloud; needed by more or less everything.
	    Value: !Ref VPC
	    Export:
	      Name: !Sub "${Environment}::VPC"


I put the VPC into its own file, because I don’t want to delete it when tearing down an environment for the sake of testing. There are lots of things that get annoying to re-create if the VPC is changed (cough OpsWorks stacks). (Later, when I bring in nested stacks, this will not be in the nested stack.)

Break It Down

First, note that I use YAML for my CloudFormation files. I’m not a huge fan of YAML in general, but the JSON option doesn’t allow you to use comments, and comments are essential. (Sidebar: when parsing JSON, always enable comments. It’s non-standard, but it’s useful). Using YAML also lets me use a more convenient shorthand for accessing some inbuilt functions (the Sub one is used here). I strongly suggest you do the same.

The Parameters block provides me with some configuration options. It allows me, if I want, to create different instances of the stack. In this case, I use an Environment parameter. This particular parameter is common to all my stack files, and I use it to separate test stacks from the prod ones. (I could also do this with AWS sub-accounts)

The Resources block specifies what this file provides. Here I set up the Virtual Private Cloud.
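The comment in the template about /16 and non-overlapping subnets can be checked with a few lines of Python's standard `ipaddress` module. This is only a sketch: the one-/24-per-subnet layout is an assumption made for illustration, not something the template itself defines.

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")  # CidrBlock from the template

# Assumed layout: one /24 per subnet, so the third octet is the subnet number.
subnets = list(vpc.subnets(new_prefix=24))

print(vpc.num_addresses)                 # 65536
print(vpc[0], vpc[-1])                   # 10.0.0.0 10.0.255.255
print(len(subnets))                      # 256
print(subnets[0], subnets[1])            # 10.0.0.0/24 10.0.1.0/24
print(subnets[0].overlaps(subnets[1]))   # False
```

With that layout, picking a new subnet just means picking an unused third octet, which is presumably the "no bit masks to remember" convenience.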

Finally, in the Outputs section, I export the VPC reference out, so that other files can link to it.
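To see what the Environment parameter does to the names, here is a small analogy in Python. It is not how CloudFormation implements `Sub`; it just happens that `string.Template` uses the same `${Name}` placeholder syntax, so the two strings from the template can be reused as-is. The `names_for` helper is a made-up name.

```python
from string import Template

# Same placeholder text as the template's Tags and Export entries.
NAME_TAG = Template("${Environment} VPC")
EXPORT_NAME = Template("${Environment}::VPC")


def names_for(environment: str) -> dict[str, str]:
    """Show the names one Environment value would produce."""
    return {
        "tag": NAME_TAG.substitute(Environment=environment),
        "export": EXPORT_NAME.substitute(Environment=environment),
    }


print(names_for("test"))  # {'tag': 'test VPC', 'export': 'test::VPC'}
print(names_for("prod"))  # {'tag': 'prod VPC', 'export': 'prod::VPC'}
```

Because the export name carries the environment prefix, a test stack and a prod stack can sit side by side without their exports colliding.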

AiL – Exploring AWS CloudFormation

I’ve been an active user of AWS for over six years now. The application I work on – a vertical niche app for the mining industry – is hosted there. But as the lead developer – and sole DevOps type – of a small team with an aggressive development schedule, I haven’t ever managed to explore CloudFormation properly, or even organise our servers properly. The most I’d managed to do to keep them organised and under control was to use OpsWorks and Chef to manage configuration and deployment of our servers.

This Christmas period, however, we brought in a number of university students on work experience programs. Most of them went to help the mining engineers on their research programs – but I got one to help me tame our AWS environments. He did a lot of background research on AWS CloudFormation for me, and this week I’ve taken the time to put it to use. I’m going to write a few posts covering some of what I learnt – not just about CloudFormation, but how to configure networks ‘properly’ in AWS – for my own edification and later reference. If that helps some reader as well, I can live with that. 😉

What is CloudFormation?

Put simply, CloudFormation is a way of managing textual descriptions of AWS resources, and a set of tools to take those descriptions and configure AWS resources – such as servers, networks, and load balancers.

Being text-based, you can put the files into source control (very important!). With the tooling, you can do things like:

create new ‘stacks’ of resources quickly
apply changes to the stacks in a consistent fashion
test configuration changes prior to applying them to your production system
automatically check for ‘drift’ (changes to your infrastructure that aren’t in the files)
automate the above using a build server

You know, all the good sensible adult things you should do with a production system that your livelihood depends on.

What’s the problem I’m solving

First – the app I’m working on isn’t a massive resource hog. It’s an industry-vertical niche web-based engineering app, with less than a thousand active users as a target (we aren’t there yet). It doesn’t need lots of servers, or to cope with typical load patterns. In fact, a lot of the time the app just sits idle, with nobody using it. When they do use it, that use can be fairly sporadic (a few requests per hour) – or they can do large-scale batch jobs that we have to bring up a hundred+ servers to manage. In other words, I’m not looking to solve problems common to consumer-facing sites.

I need to manage approximately 10 servers, each running a slightly different configuration of our app. This includes test environments, a ‘public’ demo site, and then several instances that run the same version but include different run-time plugins based on customer needs. (These plugins model different types of mining equipment, some of which are proprietary to the individual customer). The test environments get updated regularly, while the customer sites get updated roughly quarterly, as feature sets get finished. We don’t need staggered deployments that can be done with no downtime – but we do need to have a controlled process.

We also practice multi-tenancy, at least for customers that don’t have proprietary models. This is done by host-based routing.

What’s my goal here?

Well, besides simply putting as much of my infrastructure into CloudFormation as possible, I also want to improve my network layout to be more secure. So I’m also learning about private & public subnets, load balancers, and the things you need to do to make that all work. This is going to be a series of posts tackling such things as:

Setting up your first CloudFormation Stack – a simple VPC
Adding public subnets
Creating private subnets, with a NAT instance (and bastion server!)
Adding a Load Balancer
Managing multiple files via Nested Stacks
Adding OpsWorks stacks
Autoregistering OpsWorks-managed instances into the Load Balancer
Working with Batch Environments

I’m not going to get all of my infrastructure into CloudFormation. Some of it you can’t (e.g. SSH keys). Some of it already exists outside of CloudFormation, and I don’t want to destroy it just to recreate it (e.g. my Route53 Hosted Zones). Some of it I just won’t get around to at this time. And some of it I do manage I won’t want to write up. But there’ll be a lot of notes here, and it’ll be in a generally agnostic approach.

I can’t promise that the advice here is “good”. In particular – I’m not a network guy. I know enough to get me in trouble, and some decent rules-of-thumb, but there may be some obvious mistakes here. All I can say is that this is what I’m doing, and it’s hopefully working for me.

I’ll update this page with links to the individual articles as they get written.

AWS Lambda finally supports Ruby, shared libraries

So here’s my favourite bit of news out of Amazon for today – they’ve finally got Ruby support in AWS Lambda, AND they’ve got support for shared libraries.

I’ve got a metric tonne of scripts for dealing with AWS environments in Ruby that I’ve written over the last six years of working on IES, mostly to organise data and application deployments. I’ve been wanting to put a web interface over them – to make it easier for other people in my team to use them – for some time, and even to make an iOS app for it, but I also didn’t want to run a dedicated backend server for a back-office app that I would use maybe a couple of times a week.

AWS Lambda is the perfect solution for building a backend for backoffice-style apps. But the thought of rebuilding these scripts as Node or Python was too much. Looking forward to trying this out!

Of course negative gearing benefits go mostly to the already wealthy… they can afford it

ABC fact checks claims about negative gearing

Negative gearing is a system where the rent earned on a property – after removing other expenses – doesn’t cover the interest on the mortgage. You can then take that loss, and use it to offset other taxable income. It’s a system where you deliberately run what is effectively a business at a loss, hoping to offset that loss later by selling the property and pocketing capital gains (which will also be tax discounted).

Continue reading “Of course negative gearing benefits go mostly to the already wealthy… they can afford it”

So you want reliable power?

Want reliable power? Don’t bank on coal says the AEMO

Hmm… one of the key arguments made by the LNP and their supporters is the need for coal (and, to a lesser extent, gas) power stations to provide reliability. Even their nod at a renewables option – the Snowy 2.0 project – is about providing reliable baseline power.

So what does the Australian Energy Market Operator say? “Extreme weather over summer could reduce the output of COAL, GAS and HYDRO power generators and cause problems with the reliability of electricity supply”

Continue reading “So you want reliable power?”
