Last week I had the opportunity to listen to a very interesting account of field experience with shared Flash technology for Big Data, and more specifically for Hadoop workloads.
It is fascinating (yet not unexpected) to see how Flash has evolved and matured from application acceleration, to general transactional workloads, and now to Big Data. The latter breaks two preconceived ideas about Hadoop workloads: that they are better suited to Direct Attached Storage, and that they run well on dense spinning drives.
The experience today is that not only does low latency have a clear positive impact on those workloads, but the shared model also removes many design complications inherent to the hyper-converged approach. Storage protection is offloaded to the backend (freeing up CPU cycles in the compute layer), and storage can be grown independently of compute, which is very consistent with the way data lakes get formed.
Does this mean that we should start deploying all-Flash data lakes?
Probably not (at least not for now). Although we are seeing a number of options popping up, the reality is that not all data has the same temperature at all times. On top of that, we are seeing more and more flavours of Flash (TLC, 3D NAND, XPoint), all coming to market quite rapidly and all yielding increasing performance and cost benefits.
Enter Software Defined Flash and Unified File Object
Last March, IBM announced a partnership with HortonWorks to support running Hadoop workloads on Spectrum Scale. While this is obviously not the first time Spectrum Scale has been used for Big Data, it is a great demonstration of what SDS and Unified File and Object can deliver together.
I believe three elements are crucial:
- Flash, any flavour at any time – the Software Defined approach of Spectrum Scale allows us to accommodate and leverage ongoing and constantly evolving Flash technologies, as well as to combine them.
- Hot, lukewarm and cold data – it probably does not make sense to store dormant data on Flash, yet it can be complicated to move it out to other media, object or cloud. Spectrum Scale is designed to handle those different media automatically, and to make the movement totally seamless to the application and end-user.
- In-place analytics – most importantly, because Spectrum Scale is a Unified platform, it frees us from the process of 1/ collecting raw information 2/ transferring it into the Hadoop environment 3/ exporting it for publishing. Data is created once and nothing needs to be transferred or moved.
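To make the idea of data temperature more concrete, here is a minimal sketch of an age-based placement decision. It is an illustration only and does not use Spectrum Scale's own policy engine or syntax: the tier names, the `FileInfo` type, the `choose_tier` and `plan_placement` functions and the 7-day and 90-day thresholds are all hypothetical, chosen just to show the principle.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical thresholds, chosen only for illustration.
HOT_MAX_AGE = timedelta(days=7)    # accessed within a week: keep on Flash
WARM_MAX_AGE = timedelta(days=90)  # accessed within ~3 months: spinning disk


@dataclass
class FileInfo:
    """Hypothetical record describing one file in the data lake."""
    path: str
    last_access: datetime


def choose_tier(f: FileInfo, now: datetime) -> str:
    """Return the (hypothetical) tier name for a file based on its access age."""
    age = now - f.last_access
    if age <= HOT_MAX_AGE:
        return "flash"
    if age <= WARM_MAX_AGE:
        return "disk"
    return "object"


def plan_placement(files, now):
    """Group file paths by target tier, preserving input order within a tier."""
    plan = {"flash": [], "disk": [], "object": []}
    for f in files:
        plan[choose_tier(f, now)].append(f.path)
    return plan
```

In a real deployment the decision would of course be driven by the storage platform rather than by application code; the point of the sketch is simply that placement follows access patterns, not where the data was first written.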
The key takeaway of all this is that going fast is good, but going fast AND smart is what we should aim for.
If you wish to drill further into the topic, I recommend the links below:
- 10 cool things you can do with IBM Spectrum Scale Unified File and Object storage
- Analytics for Object Storage Simplified (from which the graphics in this article are taken).