InterPlanetary File System (IPFS) on Jupyter
The fundamental concept of IPFS is that instead of looking for locations, as with HTTP, you look for the content of a file. (erisindustries)
When you say Blockchain, Git, BitTorrent, I hear: Directed acyclic graphs (like git) with hashed hierarchical checkpoints (like a blockchain or merkle trees) distributed peer to peer (like bit torrent). This is literally what IPFS is, and using those terms is one way to describe it. (HackerNews Comment)
As a problem solver at heart, I like to find problems. My journey through the data science experience has shown me this domain’s unique challenges. Cloud solution development and data science are clearly different beasts, both in development thought process and in skills and tooling. Often, I feel like I’m holding up a sign at the corner of data science and emerging technologies that says,
“Can you spare some open data?
I need less friction in my data science experience” 🙂
An alpha technology, InterPlanetary File System (IPFS) recently piqued my interest – both for its future potential and for its power today to ease my data science pains. In honor of the upcoming Mission Juno, which seeks to unlock Jupiter’s secrets, this is my story of exploring the InterPlanetary File System on Jupyter ….. Notebooks.
There is an amazing trove of existing articles and content discussing IPFS in great technical and strategic detail. For example, how it might one day usurp the role of HTTP as the lingua franca protocol of the World Wide Web or how the decentralized web can reduce the risks of censorship and content loss caused by server failure and decommissioning. This blog post is not one of them. Rather, my goal is to simply share an approach for installing and configuring the IPFS client within a Jupyter notebook and share my thoughts around how this technology could meaningfully impact the data science experience in interesting ways. If you are inclined to hear the hype or want to understand the technology deeper and be inspired by the implications, then check out my collection of references below.
Setting up IPFS on Jupyter
I have created a sample notebook which illustrates the following concepts:
- Installation of the IPFS Go client binary and its prerequisites from within a Python Jupyter notebook
- A Python helper class wrapping common CLI methods to simplify IPFS usage within a notebook
- Validation of IPFS swarm connectivity
- Proof-of-concept content fetching for ~2 GB and ~8.2 GB gzipped Open Library data files
- Proof-of-concept content fetching of images and of HTTP resources served from the IPFS network
- Pinning/Adding of local content for further propagation within the IPFS network
Check it out!
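To make the helper-class idea concrete, here is a minimal sketch of such a wrapper. It assumes the `ipfs` Go client binary is already installed and on the PATH; the class and method names are my own illustration, not those used in the sample notebook.

```python
import subprocess

class IPFSHelper:
    """Illustrative wrapper around the go-ipfs CLI for notebook use.

    Assumes the `ipfs` binary is installed and on the PATH. Names here
    are hypothetical, not from the sample notebook.
    """

    def __init__(self, binary="ipfs"):
        self.binary = binary

    def _build(self, *args):
        # Build the argument vector handed to subprocess.
        return [self.binary, *args]

    def _run(self, *args):
        return subprocess.run(
            self._build(*args), capture_output=True, text=True
        ).stdout

    def cat(self, path):
        """Fetch the content behind an /ipfs/... path."""
        return self._run("cat", path)

    def add(self, filename):
        """Add (and implicitly pin) a local file to the node."""
        return self._run("add", filename)

    def swarm_peers(self):
        """List currently connected swarm peers."""
        return self._run("swarm", "peers")
```

In a notebook cell, something like `IPFSHelper().swarm_peers()` would confirm swarm connectivity before attempting a large content fetch.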
Do I dare
Disturb the universe?
In a minute there is time
For decisions and revisions which a minute will reverse.
~ T.S. Eliot ~
- IPFS as a technology is young and brimming with potential. Alpha means YMMV, but the project is open and evolving rapidly.
- Posit: Can IPFS provide a more frictionless experience to accessing data?
I think the answer is “YES“. Because IPFS is a hypermedia protocol focused on content rather than locations, it can enrich the data science experience.
- I have encountered JDBC Driver implementations that assume and require data files to be located on a local file path. IPFS’s FUSE mount capabilities are appealing because they allow IPFS paths to be accessed as if they were local file paths.
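As a sketch of why that matters, the snippet below reads a resource through the FUSE mount when `/ipfs` is mounted (after running `ipfs mount`) and falls back to the CLI otherwise. The function name and fallback behavior are my own assumptions, not part of IPFS itself.

```python
import os
import subprocess

def read_ipfs_resource(ipfs_path, mount_root="/"):
    """Read an /ipfs/<hash>/... resource.

    If the FUSE mount is available, the path behaves like any local
    file path, so tools that insist on local files (e.g. some JDBC
    drivers) work unchanged. Otherwise, shell out to `ipfs cat`.
    Hypothetical helper, not an IPFS API.
    """
    local = os.path.join(mount_root, ipfs_path.lstrip("/"))
    if os.path.exists(local):
        with open(local, "rb") as f:
            return f.read()
    # Fallback: fetch via the CLI instead of the mount.
    return subprocess.run(["ipfs", "cat", ipfs_path],
                          capture_output=True).stdout
```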
- Centralized open dataset exchanges offer a portal for discovery and download of curated content. Unfortunately, they also surface unwanted friction through higher latency for long-distance, low-bandwidth clients and single-point-of-failure risks. Instead, these exchanges could embrace the resilient characteristics of content decentralization by participating on the IPFS network and surfacing resource hash paths that reflect the content itself – indifferent to where it is hosted. For example, a specific piece of content could be addressed by a path as simple as /ipfs/<content hash>.
The redundancy of a distributed mesh network would ensure the quickest access times and constant availability for the most popular content. Let’s have exchanges that curate data with a focus on content rather than location. Analytics open data exchanges that view content hosting as their differentiator will eventually be commoditized by the cloud. Easy discovery and curation of content metadata (e.g. usage rights, common analytics use cases, recommended complementary datasets, …) will be the true value proposition and sought after by consumers. Portals should differentiate on how they educate data consumers rather than on what content they host.
- Posit: Can IPFS facilitate an easier way of sharing code snippets and data analysis solution patterns between Jupyter notebook users?
As I’ve assembled notebooks for different projects, I frequently find myself re-using analysis approaches as well as integration and configuration code snippets. For example, I use helper methods to connect to Blob Object stores and external databases. IPFS provides a nice facility to cat the content of a network resource.
ipfs cat /ipfs/QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG/readme
Because an IPFS resource path is hashed from its content, content at a specific path is essentially version controlled. Imagine maintaining a list of best-of-breed code snippets that you can load at any time with a single call, reducing friction so you can focus on the domain and interpretation.
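One way to sketch that workflow is below, with the fetch step injectable so the example does not depend on a running IPFS node. The helper names and the exec-based loading are my own illustration, not an IPFS feature.

```python
import subprocess

def ipfs_cat(path):
    # Default fetcher: shell out to the go-ipfs CLI.
    return subprocess.run(["ipfs", "cat", path],
                          capture_output=True, text=True).stdout

def load_snippet(ipfs_path, namespace, fetch=ipfs_cat):
    """Fetch a versioned code snippet by its content hash and exec it
    into the given namespace. Because the path is derived from the
    content, you always get exactly the snippet version you pinned."""
    source = fetch(ipfs_path)
    exec(source, namespace)
    return namespace
```

In a notebook, a call like `load_snippet("/ipfs/<hash>/helpers.py", globals())` (with a real content hash) would make the shared helper methods available in the current session.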
- Posit: Are there opportunities and advantages for real-time analytics and Spark streaming used in conjunction with IPFS’s streaming capabilities?
It is an intriguing thought to consider large dataset resources being streamed via an IPNS mutable pointer into Spark streaming. A possible advantage is early partial analysis on large data sets (“headlights”).
- Posit: Are there opportunities and advantages for triggering “serverless” lambda functions (e.g. OpenWhisk, AWS Lambda, …) via an IPFS client, causing parallel content manipulation programmatically (e.g. spin up lambda, fetch content and put it in a blob store)?
I see opportunities for data automation via this technology. When a request is made for an IPFS resource, the consumer is indicating interest in that content. That interest could in turn automate workflow processes on that same content. Since the resource paths are content focused, we can have confidence in assumptions that:
- the automated processes are working on exactly the same version of data that was originally requested
- the data is being retrieved in the most efficient way possible by the automation
- the automation process has a higher chance of success because data access is less susceptible to server domain failure.
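A minimal sketch of that automation idea follows, with the fetch and store steps passed in as callables. The function name and blob-store interface are assumptions for illustration, standing in for, e.g., an OpenWhisk or AWS Lambda action body.

```python
def mirror_on_request(ipfs_path, fetch, store):
    """Hypothetical 'serverless' action: when a consumer requests an
    IPFS resource, fetch that exact content version and mirror it into
    a blob store keyed by its content path."""
    data = fetch(ipfs_path)       # content-addressed: always the requested version
    key = ipfs_path.lstrip("/")   # e.g. 'ipfs/<hash>/file'
    store(key, data)              # e.g. upload to an object store bucket
    return key
```

Because the key is derived from the content path, the downstream process is guaranteed to operate on exactly the version of the data that was originally requested.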
Food For Thought
- How does IPFS decentralized technology compare to other similar open projects such as DAT, MaidSafe and Storj.io?
- IPFS is a technology focused on distribution, not storage, of content. It is a peer-to-peer hypermedia protocol.
- What use cases does IPFS make better? In what cases is P2P unwise or unnecessary?
- Did you notice that some of the embedded images within this blog are being served via an IPFS resource URL over HTTP at ipfs.io?
- What next? I’d love to hear your thoughts. Let’s have a conversation. I look forward to your comments below.
- IPFS + Jupyter Notebooks Github Discussion Issue
- IPFS Video Intro
- IPFS Stack Image linked from: https://hackaday.io/project/5077-metaverse-lab/log/36232-the-wired