Signal Classification: Powerful Patterns from Simple Features
This is the third in the series of SETI project postings. The earlier posts, "SETI sparks Machine Learning to sift Big Data" and "Types of BigData from the Allen Telescope Array", are available on the IBM Emerging Technologies Blog.
The IBM jStart team has partnered with the SETI Institute to develop a Spark application to analyze the millions of radio events that have been detected by the Allen Telescope Array (ATA) over the past decade. Two types of data have been stored, as explained in the previous post about the ATA database.
Complex Amplitude Files
This article focuses on just one of these data repositories: the complex amplitude files (CompAmps), which record signals detected by the ATA in the 1 GHz to 10 GHz range. The SETI Institute has archived over 15 million of these files, each of which captures a specific signal event by recording the raw voltages across eight 500 MHz channels for 93 seconds.
The most common way to visualize the data in a CompAmp uses the Fast Fourier Transform to produce spectrograms, often called waterfall plots, like the one shown here.
The horizontal axis is frequency, and the vertical axis is time. You can imagine the image streaming upward as a signal is being recorded, with the lower edge being continuously populated with new pixels that correspond to the power of the signal across the frequency range, from left to right.
In this way, the brightness of each pixel in the image corresponds to the power of the signal at that particular frequency at that particular moment in time. In this case, the bright line of a narrow-band signal, which has a slight slant due to a slow increase in frequency over time, jumps out from the background noise.
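As a rough illustration of how a waterfall plot can be built from complex samples, the sketch below splits a one-dimensional complex time series into consecutive blocks, applies an FFT to each block, and stores the power per frequency bin as one row. The function name, the block length, and the synthetic drifting tone are assumptions for illustration; they do not reflect the actual CompAmp file layout or the ATA processing chain.

```python
import numpy as np

def waterfall(samples, n_freq):
    """Return a (time, frequency) power array from 1-D complex samples."""
    n_rows = len(samples) // n_freq
    # Each row holds one block of n_freq consecutive samples.
    blocks = np.reshape(samples[: n_rows * n_freq], (n_rows, n_freq))
    # FFT along each row, shifted so frequency increases from left to right.
    spectrum = np.fft.fftshift(np.fft.fft(blocks, axis=1), axes=1)
    return np.abs(spectrum) ** 2

# Synthetic input (hypothetical): a slowly drifting tone buried in noise.
rng = np.random.default_rng(0)
n_freq, n_rows = 128, 64
t = np.arange(n_freq * n_rows)
f0, drift = 0.10, 2e-6  # start frequency and drift rate, in cycles per sample
phase = 2 * np.pi * (f0 * t + 0.5 * drift * t ** 2)
noise = rng.normal(size=t.size) + 1j * rng.normal(size=t.size)
samples = 3.0 * np.exp(1j * phase) + noise

power = waterfall(samples, n_freq)
```

Here row 0 is the earliest block, so the brightest column moves to higher bins in later rows, which is what produces the slanted line in a plot of `power`.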
The question is, how can we scale up our examination of over 15 million of these images so that we can confidently eliminate human-generated radio interference and focus on the anomalous outliers that warrant further examination?
A myriad of techniques come into play when eliminating radio frequency interference (RFI), such as discarding signals which are detected simultaneously in several directions – the logic being that if all three of the ATA’s independently directed observation beams were swamped by the same signal, then it was almost certainly generated by something nearby (e.g. aircraft).
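The multi-beam test described above can be sketched as a simple filter. The record layout (a beam index and a frequency in MHz per detection), the frequency tolerance, and the function name are assumptions made for illustration, not the ATA pipeline's actual data model.

```python
def flag_rfi(detections, n_beams=3, tol_mhz=0.001):
    """Split detections into (candidates, rfi).

    detections: list of (beam_id, freq_mhz) tuples from one observation.
    A detection is treated as RFI when every beam reports a signal
    within tol_mhz of its frequency.
    """
    candidates, rfi = [], []
    for beam, freq in detections:
        # Set of beams that saw something at (nearly) this frequency.
        beams_hit = {b for b, f in detections if abs(f - freq) <= tol_mhz}
        if len(beams_hit) == n_beams:
            rfi.append((beam, freq))
        else:
            candidates.append((beam, freq))
    return candidates, rfi

# Hypothetical detections from a single observation.
detections = [
    (0, 1420.4050), (1, 1420.4052), (2, 1420.4049),  # seen in all three beams
    (1, 3101.2500),                                   # seen in one beam only
]
candidates, rfi = flag_rfi(detections)
```

With this input, the three near-identical detections are flagged as interference and only the single-beam detection remains a candidate.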
This article focuses on additional techniques that are being developed for this project to further classify the signals so that we can focus our subsequent analytic efforts towards the small subset of CompAmps that are deemed to be of interest.
Dr. Jeff Scargle, in the Astrobiology and Space Science Division of the NASA Ames Research Center, has been collaborating with the IBM jStart team on this project as a part of his broader research efforts at NASA Ames. Through his collaboration with IBM Research, some promising mechanisms are now emerging to help “triage” clusters of waterfall plots for further analysis. Furthermore, this clustering is based on surprisingly simple features of the waterfall plots, which makes the process highly scalable for bulk processing with Apache Spark… an important factor given that we have 15 million CompAmp files waiting to be processed.
His current investigation emerged from the consideration of various quantitative measures derived from the waterfall plots, such as:
- Examining the distribution of the mean intensity of thousands of waterfall plots, corrected for background noise.
- Projecting waterfall plots onto the frequency axis and computing the variance along that dimension – the intuition being that well defined, strong signals will contribute large values to the variance.
- Performing a similar computation of variance, but this time projecting the waterfall plot onto the time dimension.
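One possible reading of the three measures above is sketched below for a single waterfall plot held as a 2-D array, with time along rows and frequency along columns. Using the median as the background estimate, and summing along an axis as the "projection", are assumptions for illustration; the description above does not spell out either step.

```python
import numpy as np

def waterfall_measures(image):
    """Return (mean_intensity, var_freq_projection, var_time_projection)."""
    # Assumed background correction: subtract the median pixel value.
    corrected = image - np.median(image)
    mean_intensity = corrected.mean()
    # Project onto the frequency axis by collapsing time (rows).
    freq_profile = corrected.sum(axis=0)
    # Project onto the time axis by collapsing frequency (columns).
    time_profile = corrected.sum(axis=1)
    return mean_intensity, freq_profile.var(), time_profile.var()

# Hypothetical example: flat noise with one bright narrow-band column.
rng = np.random.default_rng(1)
image = rng.normal(loc=1.0, scale=0.1, size=(64, 128))
image[:, 40] += 5.0
mean_i, var_f, var_t = waterfall_measures(image)
```

A persistent narrow-band signal like this one yields a large variance in the frequency projection and a small one in the time projection; a short broadband pulse would do the reverse.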
In fact, it was by examining the variance of intensity when projected onto both frequency and time that Dr. Scargle determined that these two simple quantitative measures could be used to segment signals into useful classifications.
This is highlighted in the data plot at right, created from just 66,000 of the 15 million waterfall plots in the ATA archives.
An excerpt of simple, non-parallelized Python code that can be used to generate this plot is shown below.
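# file_list = list of waterfall image png files to analyze and plot
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as img
import pandas as pd  # imported in the original excerpt; not used below

# Accumulators for the per-image features. The initial values are missing
# from the original excerpt; empty arrays are assumed here.
std_time_array = np.array([])
std_freq_array = np.array([])

for file in file_list:
    image_this = img.imread(file)
    # Standard deviation down each column (along time), averaged over columns
    std_time = np.mean(np.std(image_this, axis=0))
    # Standard deviation along each row (along frequency), averaged over rows
    std_freq = np.mean(np.std(image_this, axis=1))
    std_time_array = np.append(std_time_array, std_time)
    std_freq_array = np.append(std_freq_array, std_freq)

# Scatter plot of the two features on log10 scales
plt.plot(np.log10(std_freq_array), np.log10(std_time_array), 'b.',
         label='Waterfall Parameters')
plt.xlabel('log10 std (frequency)')
plt.ylabel('log10 std (time)')
plt.show()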
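Because each file is processed independently, the body of the loop above can be factored into a per-image function that a parallel framework could map over the file list. The sketch below uses Python's standard `concurrent.futures` as a stand-in; with Spark, the same function could in principle be passed to an RDD `map`, though that variant is not shown here. The function name and the synthetic arrays are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def std_features(image):
    """Return (std_time, std_freq) for one waterfall image array."""
    std_time = np.mean(np.std(image, axis=0))
    std_freq = np.mean(np.std(image, axis=1))
    return std_time, std_freq

# Stand-in for images loaded from file_list (hypothetical random data).
rng = np.random.default_rng(2)
images = [rng.random((32, 64)) for _ in range(8)]

# Map the per-image function over all images using a small worker pool.
with ThreadPoolExecutor(max_workers=4) as pool:
    features = np.array(list(pool.map(std_features, images)))

std_time_array, std_freq_array = features[:, 0], features[:, 1]
```

Collecting the results once at the end also avoids the repeated array copies that `np.append` makes inside a loop.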
Zooming into specific regions of this scatter plot revealed further sub-structure within the main clusters. Moreover, an exploration of the waterfall visualizations that correspond to specific points in the scatter plot revealed some clear and useful similarities between the signals within certain clusters.
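Looking up the waterfall plots behind a region of the scatter plot amounts to a boolean mask over the two feature arrays. The sketch below assumes the arrays and the file list are aligned by index; the function name, the bounds, and the sample values are made up for illustration.

```python
import numpy as np

def files_in_region(files, std_freq, std_time, x_range, y_range):
    """Return files whose (log10 std_freq, log10 std_time) fall in a box."""
    x, y = np.log10(std_freq), np.log10(std_time)
    mask = (
        (x >= x_range[0]) & (x <= x_range[1])
        & (y >= y_range[0]) & (y <= y_range[1])
    )
    return [f for f, keep in zip(files, mask) if keep]

files = ["a.png", "b.png", "c.png"]  # hypothetical file names
std_freq = np.array([0.01, 0.10, 0.12])
std_time = np.array([0.02, 0.05, 0.50])

selected = files_in_region(files, std_freq, std_time,
                           x_range=(-1.1, -0.8), y_range=(-1.5, -1.0))
```

The selected files can then be opened as waterfall plots and compared side by side.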
“Preliminary studies suggest that the combination of standard deviations of both the time and frequency projections of the waterfall plots is very diagnostic for not only the presence of signals but the morphology of the trace in the waterfall plot. The resulting scatter plot demonstrates the remarkable mapping from this simple parameter space to useful classes of the character of the waterfall plots.”
Dr. Jeff Scargle, NASA Space Science Division
Manual and anecdotal examination of the waterfall plots associated with specific points appears to indicate that certain regions contain an unusually high density of certain types of signals, as noted on the chart to the left.
These are just initial findings derived from an informal exploration of the data. However, the results provide an encouraging indication that untapped information awaits us, deep inside the substantial data repository of the ATA archives.
Next steps could include determining if other observational metrics, such as Doppler drift, can be used as a third dimension to spread out the over-plots and reveal even more structure to the signal clusters.
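As one hedged illustration of how a drift value might be obtained as a third coordinate, the sketch below fits a straight line to the brightest column in each row of a waterfall array. It assumes a single dominant narrow-band signal, and the function name and the synthetic image are hypothetical; this is not a description of how the project measures Doppler drift.

```python
import numpy as np

def drift_bins_per_row(image):
    """Estimate drift as the slope of the peak column versus row index."""
    peak_cols = image.argmax(axis=1)
    rows = np.arange(image.shape[0])
    slope, _intercept = np.polyfit(rows, peak_cols, 1)
    return slope

# Synthetic image (hypothetical): a line that moves one bin every two rows.
image = np.zeros((50, 100))
for r in range(50):
    image[r, 20 + r // 2] = 1.0

drift = drift_bins_per_row(image)
```

The result is in frequency bins per time row; converting it to Hz per second would need the bin width and row duration of the actual data.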