I had all sorts of problems getting my head around how cache files work with Hadoop. Finally, I stumbled across the answer – when you add a cache file (see HadoopMain#48), it’s available to read as a local file inside the mapper (MyMapper#36).
When running in Elastic MapReduce, the file URI can point at an S3 file, using either s3://bucket/path or s3n://bucket/path. S3 URIs may or may not work in other Hadoop implementations, but the general approach (add the file to the cache, then read it as a local file in the mapper) should carry over.
See the gist at https://gist.github.com/twasink/8813628
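The gist holds the actual code. As a rough illustration of the mapper-side half only, here is a minimal plain-Java sketch that is not taken from the gist. It assumes the Hadoop 2.x behaviour where a file registered with `job.addCacheFile(uri)` is linked into the task's working directory under its file name (or under the URI fragment, if one is given). The class name `CacheFileSketch`, its methods, and the tab-separated file format are all hypothetical, and the Hadoop calls appear only in comments so the example runs on a bare JDK.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Hadoop or of the gist.
class CacheFileSketch {

    // Assumed local name of a cache file inside the task's working directory:
    // the URI fragment if present (s3://bucket/path/lookup.txt#names -> "names"),
    // otherwise the last segment of the path.
    static String localName(URI cacheUri) {
        if (cacheUri.getFragment() != null) {
            return cacheUri.getFragment();
        }
        String path = cacheUri.getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    // Parse "key<TAB>value" lines into a lookup map; lines without a tab are skipped.
    static Map<String, String> parseLookup(Reader source) {
        Map<String, String> lookup = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lookup;
    }

    // Roughly what a mapper's setup(Context) could do (assumption): take a URI
    // from context.getCacheFiles() and open the matching file with ordinary
    // local file I/O, relative to the current working directory.
    static Map<String, String> loadFromWorkingDir(URI cacheUri) {
        try {
            return parseLookup(Files.newBufferedReader(
                    Paths.get(localName(cacheUri)), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Driver side (assumption, Hadoop 2.x API): job.addCacheFile(uri);
        URI uri = URI.create("s3://bucket/path/lookup.txt");
        System.out.println(localName(uri)); // prints lookup.txt

        // Stand-in for the cached file's contents, so no real file is needed here.
        Map<String, String> lookup = parseLookup(new StringReader("a\t1\nb\t2\n"));
        System.out.println(lookup.get("b")); // prints 2
    }
}
```

The point of the sketch is that nothing Hadoop-specific is needed to read the file once the task is running: the mapper only has to know the local name and can then use normal file APIs. Older Hadoop releases expose the cache through the `DistributedCache` class instead, so the wiring in the comments may differ there.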
Author: Robert Watkins