To Swiftly Go: An Adventure in Terabytes
We’ve been working with the SETI Institute to analyze massive quantities of radio data generated by their telescope array. Each telescope observation generates multiple 2.5 Terabyte files in just a few hours, and they want to use cloud-based analytics tools like Apache Spark to gain insight from it. A prerequisite for using cloud analytics on a dataset is, naturally, that the data needs to be in the cloud. This isn’t generally a problem with datasets smaller than a few gigabytes, but the Terabyte scale is a different ball game.
In Bluemix, we use OpenStack Object Storage to store big data. It’s a scalable, sharded solution for fast data access. It does however, have certain limits on data uploads. A single file upload can be no larger than 5GB. This limitation makes working with big data seem impossible, but is actually important for making big data practical in a cloud environment. Rather than uploading a single, massive file, the Object Storage paradigm is to upload many small files that together form a large one. Once you have all of the pieces, there is a simple construct called a Static Large Object (SLO) that allows you to easily pull them all together into the original file. The benefit of this is that the different chunks of the file can be served by multiple cloud servers, resulting in better read times than trying to read all of the data from a single source.
(We’ve blogged about SLOs and their cousins, DLOs, before. You can check that post out for a more detailed explanation.)
Before deciding to write a custom solution we looked at existing technologies for uploading data to OpenStack Object Storage. CyberDuck has the ability to communicate with OpenStack Object Storage, but its 19-hour 2.5TB uploads would periodically fail in the middle and necessitate a full 19-hour retry. There is already a python utility for communicating with OpenStack Object Storage, but we’ve experienced problems configuring it in the past and we worried about its efficiency at this scale. We quickly realized that we needed a solution that could pick up from where it left off when it failed, rather than one that made us start over from the beginning.
We needed to create an SLO out of the 2.5TB files. In order to do that, we needed to upload all of the data from the files as a set of small file chunks, each of which would represent a multi-gigabyte region of the file. Once we got all of the chunks up there, we could make an SLO that listed them all in order and access them as though they were a single file.
To get an idea of what we were trying to generate, here’s an example chunk of an SLO manifest file (the list of file chunks):
"path":"container-name/object-name", //path of the object
"etag":"d3b656f622f99f45ab8c18d3a5f052ec", //hash of the object
"size_bytes":500000 //size of the object in bytes
...more chunk listings
The ‘etag’ in the above JSON is simply the md5 hash of the file’s contents, and is used to verify that no data was corrupted in transit.
To implement the above, we needed the ability to easily create chunks of the data from a file and to manage many uploads. OpenStack Object Storage implements an API called Swift (not to be confused with the Swift Programming Language, which is unrelated) that makes it easy to upload data. There are bindings for Python, but also Golang. We decided to build our utility on top of the existing Golang Swift Client because we felt that Golang’s emphasis on concurrency would be a good fit for managing our many file chunk uploads.
After a week or so of development, we had a prototype ready to go. We kicked off a data upload and sat back to watch the fireworks. It estimated a 12-hour total upload time, so we walked away and worked on other things. Twelve hours later, we checked back only to discover that it had crashed and died after the first Terabyte was done. We retried it, which resumed uploading from where it had left off, but it crashed at exactly the same place. We had some ideas, which led us to build…
We implemented the ability to completely skip the uploading of problematic chunks (the ones that caused the first upload to fail, see below for why), as well as a few other niceties. We tried the upload again, and it worked well until encountering another region that caused serious problems. During our investigation of that region, we discovered that the hard drive that we were reading had some bad sectors that were completely unreadable. Fortunately, this only occurred in two places. After skipping both, we were able to complete the upload.
We patted ourselves on the back and walked away, assuming that we were done. Unfortunately, things are never that simple. We tried to analyze our data with Apache Spark, but discovered that its handling for loading binary data doesn’t overflow to disk when the file being loaded won’t fit in memory. We had file chunks that were too big to fit in memory, and thus we couldn’t analyze them.
A bizarre limitation of SLOs is that they can only reference, at maximum, 1000 files. We needed to create 1GB chunks of a 2.5TB file, which meant 2,500 chunks. This meant that we needed multiple SLOs to reference all of the chunks, and then an SLO that pulled the other ones together into the original file. This wasn’t within the capabilities of our original implementation, so yet another refactor was in order.
We had just checked out this excellent blog post by the Golang Blog that discussed the pipeline software architecture in Go. We realized that our problem fit nicely into this pipeline paradigm, and we refactored accordingly to create the following pipeline:
- Generate empty file chunks from the chunk size and file size
- Copy data from the file into the file chunks
- Ready the chunks to be uploaded
- Upload all of the chunks
- Generate manifest files for all of the chunks
- Generate manifests of manifests (if necessary)
- Upload manifests
We implemented each pipeline stage as a function that accepts and returns channels of a struct representing a chunk of files. These functions can be applied in many different orders to create pipelines that perform different tasks. This architecture is much more flexible than our previous implementation, though it isn’t quite finished at time of writing.
We haven’t yet been able to try our new implementation against the 2.5TB files that our original code was hardened against, but we’re excited by its potential. This post will be updated to reflect the results of our testing when we get the chance.
Why are we telling you all of this? We’ve open-sourced the library that does all of this work! You can find it under the name swiftlygo (yes, Star Trek reference) on GitHub. We’ve also exported all of the pipeline logic so that it’s easy to create your own data pipelines that fulfill your big data upload needs.
Iterating on this design reinforced a lot of agile principles. We thought that the design was done twice, but it needed adjustment. Who knows, we may need to pull another refactor before the code is where we need it to be. The whole point of the agile methodology is to recognize that you don’t know your requirements when you start out, even when you think that you do. We didn’t realize how many different ways our data upload could fail, nor how small our data chunks would need to be to be analyzed by Spark. We also didn’t know how nicely our process fit into Golang’s pipeline architecture. Developing iteratively gave us the freedom to adapt to these requirements as they emerged, rather than trying to think of them all from the beginning.
We hope that by releasing the results of our efforts as an open-source Golang library, you will be able to leverage our work to create your own upload utilities and pipelines. Please let us know if our code is useful to you!
Image credit: Renee French