Types of BigData from the Allen Telescope Array
This is the second in the series of SETI project posts.
There are two types of data from the Allen Telescope Array (ATA) that are being analyzed using IBM Apache Spark as a Service:
168 million rows of relational data (SignalDB) that record each signal detected over the last 10 years, capturing information such as sky coordinates, signal frequency, Doppler drift, signal type (carrier vs. pulse), etc. The plan is to access this data using SparkSQL to drive a wide range of interesting analytics.
A typical row from the SignalDB data looks like this:
20 million complex amplitude files (CompAmps), which are binary files that record the actual signal that was detected. For this project, we will only load CompAmps for signals detected from Jan 2013 onwards. Each CompAmp has multiple channels encoded within it. Each of these can be visualized using techniques such as Fast Fourier Transforms to create spectrograms (waterfall plots) that show the frequency (left to right axis) and amplitude (or signal strength, shown as the brightness of the pixel) over time.
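The FFT step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it does not read the real CompAmp binary format, the input is a synthetic drifting tone, and the helper names (fft, waterfall, drifting_tone) are hypothetical. Each row of the result is one time slice and each column is one frequency bin, so a drifting signal shows up as a diagonal.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def waterfall(samples, n_bins):
    """Cut complex samples into consecutive blocks of n_bins samples and
    return one power spectrum per block (rows = time, columns = frequency).
    Negative frequencies are moved to the left, so bin n_bins // 2 is 0 Hz."""
    half = n_bins // 2
    rows = []
    for start in range(0, len(samples) - n_bins + 1, n_bins):
        spectrum = fft(samples[start:start + n_bins])
        shifted = spectrum[half:] + spectrum[:half]
        rows.append([abs(c) ** 2 for c in shifted])
    return rows


def drifting_tone(n_samples, f0, drift):
    """Synthetic test signal (an assumption, not ATA data): a complex tone
    whose frequency is f0 + drift * t, in cycles per sample."""
    return [
        cmath.exp(2j * math.pi * (f0 * t + 0.5 * drift * t * t))
        for t in range(n_samples)
    ]


# 32 time slices of 64 frequency bins each.
samples = drifting_tone(64 * 32, f0=0.10, drift=0.0001)
plot = waterfall(samples, n_bins=64)

# The brightest bin in each row moves steadily to the right over time,
# which is the diagonal line seen in a waterfall plot.
peaks = [row.index(max(row)) for row in plot]
```

In a notebook, the rows in plot could be handed to an image-plotting function to render the brightness map; a production version would use an optimized FFT library rather than this pure-Python one.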
The Waterfall Plot
A typical waterfall plot generated from a CompAmp looks like this. This is a waterfall plot being displayed in an IPython notebook, which is the interactive analytic interface being used for the IBM Spark service. It shows a faint diagonal line on the right… that is the signal detected by the ATA and recorded in a CompAmp file (along with other data in the SignalDB database).
That diagonal shift is called Doppler drift and indicates that the frequency of the signal is changing over time due to an acceleration or deceleration of the radio emitter relative to the ATA telescope. This is the same thing that causes a fire truck siren to change frequency as the fire truck races past you (higher pitch down to lower pitch), except we are dealing with radio waves instead of sound waves. The change in velocity of the radio emitter relative to the ATA receiver can be due to the rotation of the earth, the orbit of the earth around the sun, or movement of the emitter (such as a satellite or aircraft moving overhead).
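As a rough guide, the relationship between motion and drift can be written down in the non-relativistic approximation. The symbols here are introduced for illustration and are not from the SignalDB schema: f_0 is the emitted frequency, v_r is the line-of-sight velocity of the emitter relative to the receiver (positive when receding), a_r is its rate of change, and c is the speed of light.

```latex
f_{\mathrm{obs}} \approx f_0 \left(1 - \frac{v_r}{c}\right),
\qquad
\frac{d f_{\mathrm{obs}}}{dt} \approx -\frac{f_0}{c}\,\frac{d v_r}{dt} = -\frac{f_0\, a_r}{c}
```

A constant relative velocity only shifts the signal sideways in the waterfall plot; it takes a changing velocity (a non-zero a_r) to tilt the line into a diagonal.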
On the subject of planes and satellites, the fact is that Radio Frequency Interference (RFI) is the number one challenge for the SETI program. There are millions of signals detected every year, and almost all of them are immediately explained by human-created RFI: planes, local radio, radar, satellites, even microwave ovens being opened (yes, really… a microwave oven was a recent problem at another radio telescope, and it took them months to figure out what was going on). Discovering new ways to eliminate RFI without discarding potentially interesting signals that demand more scrutiny is a major aspect of ATA operations.
This project aims to attack the RFI problem in novel ways that are only now becoming viable. For example, we can sweep through all the SignalDB records and apply complex calculations to “correct” the Doppler drift by eliminating the effects of the earth’s rotation and orbit. Put another way, since we know the exact time and date when a signal was detected, and we know the exact direction the telescope was pointed, it is possible to compute the acceleration that the earth’s rotation and orbit would cause relative to a distant radio emitter. Once you subtract that effect, you are left with just the acceleration of the emitter, which may be a useful feature for classification and other machine learning algorithms: signals from aircraft, satellites and ground-based radar will all have drift values and other features that will tend to group them into clusters. Conversely, outlier signals which are not within these known RFI segments might need a second look.
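To give a feel for the size of the correction, here is a minimal sketch in Python. It is not the project's actual algorithm: a real correction has to project the earth's acceleration onto the telescope's pointing direction at the moment of observation, which this sketch leaves as an input. It assumes circular motion, a non-relativistic Doppler formula, an approximate latitude of 40.8 degrees for the ATA site, and an example frequency of 1.42 GHz; all function names are hypothetical.

```python
import math

C = 299_792_458.0                     # speed of light, m/s
EARTH_RADIUS = 6.378137e6             # equatorial radius, m
SIDEREAL_DAY = 86164.0905             # one rotation of the earth, s
AU = 1.495978707e11                   # mean earth-sun distance, m
SIDEREAL_YEAR = 365.25636 * 86400.0   # one orbit of the earth, s


def drift_rate(freq_hz, los_accel):
    """Doppler drift (Hz/s) caused by a line-of-sight acceleration (m/s^2).
    Positive acceleration means emitter and receiver are speeding apart."""
    return -freq_hz * los_accel / C


def rotation_accel(latitude_deg):
    """Magnitude of the centripetal acceleration of a site on the rotating
    earth; the line-of-sight component can never exceed this."""
    omega = 2.0 * math.pi / SIDEREAL_DAY
    return omega ** 2 * EARTH_RADIUS * math.cos(math.radians(latitude_deg))


def orbit_accel():
    """Magnitude of the earth's acceleration toward the sun,
    treating the orbit as a circle of radius 1 AU."""
    omega = 2.0 * math.pi / SIDEREAL_YEAR
    return omega ** 2 * AU


def emitter_drift(observed_drift, freq_hz, earth_los_accel):
    """Subtract the earth's contribution from an observed drift.
    earth_los_accel is the earth's acceleration projected onto the line of
    sight; computing that projection from time and sky coordinates is the
    hard part and is outside this sketch."""
    return observed_drift - drift_rate(freq_hz, earth_los_accel)


FREQ = 1.42e9          # example observing frequency, Hz (assumption)
ATA_LATITUDE = 40.8    # approximate site latitude, degrees (assumption)

# Largest drift magnitudes the earth's own motion could produce.
max_rotation_drift = abs(drift_rate(FREQ, rotation_accel(ATA_LATITUDE)))
max_orbit_drift = abs(drift_rate(FREQ, orbit_accel()))
```

Under these assumptions the earth's rotation can account for a drift on the order of a tenth of a Hz per second, and the orbit for a few hundredths. Whatever remains after emitter_drift is applied is attributed to the emitter itself and can be used as a feature for clustering.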
These complex calculations need to be applied to tens of millions of signals, and they represent exciting new territory for SETI signal processing. This is a perfect fit for the distributed analytic capabilities of Spark.