Example Hadoop Job that reads a cache file loaded from S3

I had all sorts of problems getting my head around how cache files work with Hadoop. Finally, I stumbled across the answer – when you add a cache file (see HadoopMain#48), it’s available to read as a local file inside the mapper (MyMapper#36).

When running in Elastic MapReduce, the file URI can be an S3 file, using either s3://bucket/path or s3n://bucket/path – this may or may not work in other Hadoop implementations, but the general approach would work fine.

See the gist at https://gist.github.com/twasink/8813628