I had all sorts of problems getting my head around how cache files work with Hadoop. Finally, I stumbled across the answer – when you add a cache file (see HadoopMain#48), it’s available to read as a local file inside the mapper (MyMapper#36).
When running in Elastic MapReduce, the file URI can be an S3 file, using either s3://bucket/path or s3n://bucket/path – this may or may not work in other Hadoop implementations, but the general approach would work fine.
See the gist at https://gist.github.com/twasink/8813628