# IBM and Stanford University team up for a new perspective on SETI signal analytics

Co-authors: Frank Fan, Kenny Smith, Jason Wang, Austin Hou, Rafael Setra, Qi Yang

IBM and students from Stanford University have teamed up to use IBM Spark services to analyze astronomical radio signal data for their projects in Mining Massive Data Sets.

Using IBM’s Spark@SETI environment in the IBM Cloud, two different teams tackled the complex problem of signal feature extraction and classification using several terabytes of radio signal data from the Allen Telescope Array (ATA) in northern California.

The objective was to see if the two Stanford teams could use IBM Spark services to take a fresh approach to signal analysis and open the way for improved machine learning models to rapidly classify hundreds of thousands of signals in the SETI Institute’s archives, as well as new signals as they are detected by the ATA.

The result of this collaboration with Stanford was highly successful, and innovative new algorithms and methodologies have been developed that are now being assessed by scientists from IBM, the SETI Institute and NASA for further use and enhancement. In this blog, we outline the concepts and algorithms used by the Stanford teams, and review some of their findings.

**The Allen Telescope Array**

The Allen Telescope Array (ATA) uses sophisticated software and hardware systems to perform real-time detection of narrow band signals. Various tests are performed to eliminate obvious human radio frequency interference (RFI) and focus on signals of interest – called “candidate signals”. These candidate signals are stored as complex amplitude (compAmp) files for later analysis.

The Stanford teams had access to over 360,000 of these “candidate” compAmp files, which were stored in the IBM Cloud and had not yet been classified. The goal of the Stanford projects was to develop new techniques to help classify these signals in a manner which might help the Spark@SETI scientists to isolate signals for deeper analysis, or conversely to better understand the source of signals which should subsequently be filtered out as RFI.

**Taking Aim at Signal Modulation**

The radio signals in a compAmp file are often rendered as a spectrogram, or “waterfall plot”, which shows energy density as a function of time and frequency. Many of the signals detected by the ATA show a narrow band signal with a steady change in frequency, or “Doppler drift”, caused by the relative acceleration between the transmitter and the ATA receiver as the Earth spins on its axis.
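As a concrete sketch, a waterfall plot can be produced by slicing the complex-amplitude time series into blocks and taking the power spectrum of each block. The 129-bin block size and the flat compAmp layout used here are illustrative assumptions, not the actual SETI file format:

```python
import numpy as np

def waterfall(comp_amp, n_freq=129):
    """Turn a 1-D complex-amplitude series into a spectrogram
    (rows = time slices, columns = frequency bins)."""
    n_time = len(comp_amp) // n_freq
    blocks = comp_amp[: n_time * n_freq].reshape(n_time, n_freq)
    # FFT each time slice; shift the zero-frequency bin to the center column
    spectrum = np.fft.fftshift(np.fft.fft(blocks, axis=1), axes=1)
    return np.abs(spectrum) ** 2

# A steady pure tone produces a single bright vertical line in the plot;
# a Doppler-drifting tone would trace a slanted line instead
t = np.arange(129 * 10)
tone = np.exp(2j * np.pi * (32 / 129) * t)
plot = waterfall(tone)
```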

However, some signals show a non-linear drift that modulates in an apparently random or semi-periodic fashion. The source of these “squiggle” signals is not known, which makes them an interesting point of focus for the Stanford research teams.

The Stanford teams used IBM Spark services to analyze hundreds of thousands of signals stored in the IBM Cloud, looking for unique features that could isolate examples of these signals with random modulations.

If the students could develop such a classification capability, the results could be correlated with other archive data and lead researchers to an explanation – which could be as mundane as malfunctioning equipment or nearby interference that could then be eliminated, or perhaps it will deepen the mystery and prompt further investigation.


**Fourier Analysis and Autocorrelation**

One approach to extracting potentially useful scalar features from compAmp files involves the use of IBM Spark services to perform Fourier analysis. The “squiggle” modulated signals are narrow band in the frequency domain, with a central frequency that modulates over time in a range significantly greater than their bandwidth. Therefore, a broad peak in the frequency spectrum is expected when a time series containing a squiggle is Fourier transformed as a whole and plotted as frequency-domain intensities.

This proved to be a useful technique for an “approximation search” to estimate the overall proportion of the data records that might contain squiggles, and helped to isolate additional positive examples of squiggles for subsequent development of more refined classification methodologies, such as machine learning and image processing of signal spectrograms.

A modulated signal represented by (a) a CompAmp, (b) a spectrogram, and (c) the Fourier spectrum of the CompAmp. The autocorrelation of the Fourier spectrum (d) shows a slow decay.

The existence of one or more such broad peaks in a Fourier spectrum leads to a slow decay near zero lag of the autocorrelation (with periodic boundary conditions) of the Fourier spectrum. Since the intensity is always real and positive, the autocorrelation is defined by:

*cor[I](w) = ∮ I(f + w) I(f) df*

A simple criterion was used to quantify such slow decay:

*cor[I](width) > threshold ∙ min(cor[I])*

where the minimum value of the autocorrelation corresponds to the overall background level in the spectrum.
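A minimal sketch of this criterion in Python, with the circular autocorrelation computed via the Wiener–Khinchin theorem; the helper names and the synthetic test spectra are illustrative, not taken from the teams' code:

```python
import numpy as np

def circular_autocorr(intensity):
    """Autocorrelation with periodic boundary conditions:
    cor(w) = sum_f I(f + w) I(f), computed as IFFT(|FFT(I)|^2)."""
    F = np.fft.fft(intensity)
    return np.real(np.fft.ifft(F * np.conj(F)))

def has_broad_peak(intensity, width=15, threshold=1.0027):
    """Slow decay near zero lag indicates a broad peak in the spectrum;
    min(cor) serves as the overall background level (parameters from the text)."""
    cor = circular_autocorr(intensity)
    return bool(cor[width] > threshold * cor.min())

# Synthetic spectra on a constant background: a broad "squiggle" peak
# passes the criterion, a single-bin narrow tone does not
f = np.arange(512)
broad = 0.1 + np.exp(-((f - 256) ** 2) / (2 * 20.0 ** 2))
narrow = 0.1 + (f == 256) * 1.0
```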

The values of the parameters *width* and *threshold* in the criterion were determined by performing an optimization search on squiggle samples. Half of the samples were used in the optimization and the remaining half were reserved for validation. Smaller values of either parameter generally yield a more tolerant criterion, so the false negative rate can be reduced by decreasing either parameter; however, a more tolerant criterion is likely to yield a higher false positive rate. By selecting a tradeoff point where further reducing either parameter no longer significantly reduces the false negative rate, the optimal parameters were chosen to be *width = 15* and *threshold = 1.0027*, at which a false negative rate of 2% was obtained on both the training set and the validation set.

By deploying the routine described above on the IBM Spark services and analyzing all compAmp files stored in the IBM Cloud, roughly 20,000 were found to exhibit broad peaks in their Fourier spectra. By visually examining random samples of the resultant spectrograms, 68% of the records that satisfied the criterion were found to be false positives produced by broad vertical features other than squiggles. Other techniques, described below, were assessed to further reduce these false positives.

**Classification using Image Processing**

One promising approach to signal classification is based on the fact that compAmp signals can be transformed into “waterfall plot” spectrograms, which are images that can then be processed using a myriad of powerful image tagging and classification algorithms available in open source.

The scale-invariant feature transform (SIFT) is one such algorithm that was investigated for ATA signal classification. The SIFT algorithm breaks down an image into the local feature coordinates that capture the “keypoints” of the most descriptive elements of an image, much as humans intuitively do when describing a picture, as shown in this SIFT keypoint analysis of a picture of a house.

This same technique was applied to waterfall plots, which identified the “keypoints” of signal modulation, as shown in this example with the SIFT features circled in green.

An interesting innovation developed by Stanford researchers was to combine the SIFT keypoints into a single Fisher vector that represented key global image features. First, the SIFT features of each waterfall plot were identified; then the SIFT features of all of the images were collected together to fit a Gaussian mixture model by maximum likelihood. The final step is to compare the SIFT features of one particular signal to the Gaussian mixture model; the residuals are referred to as Fisher vectors, as shown below:

SIFT keypoints and the resulting Fisher vector
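A simplified sketch of the Fisher-vector step (first-order statistics only) using scikit-learn's GaussianMixture. Random vectors stand in for real SIFT descriptors, and the normalization is illustrative rather than the teams' exact formulation:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    """First-order Fisher vector: responsibility-weighted residuals of the
    descriptors against each Gaussian component, averaged over descriptors."""
    q = gmm.predict_proba(descriptors)            # (n, k) soft assignments
    diff = descriptors[:, None, :] - gmm.means_   # (n, k, d) residuals
    fv = (q[:, :, None] * diff / np.sqrt(gmm.covariances_)).sum(axis=0)
    return fv.ravel() / len(descriptors)

# Fit the mixture model on descriptors pooled from all images,
# then encode one signal's descriptors as a single fixed-length vector
rng = np.random.default_rng(0)
all_descriptors = rng.normal(size=(500, 16))      # stand-in SIFT descriptors
gmm = GaussianMixture(n_components=8, covariance_type="diag",
                      random_state=0).fit(all_descriptors)
fv = fisher_vector(rng.normal(size=(40, 16)), gmm)
```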

**Enhanced Classification by Combining Features**

The Fisher vector above can be used in conjunction with other scalar features derived from the waterfall plot to provide a powerful new way of clustering signals for further analysis, and to rapidly classify new signals for appropriate processing.

This classification method aggregates scalar features (such as the standard deviations of signal energy over time and frequency, the autocorrelation width, and Shannon entropy) with the Fisher vector to create one larger feature vector. Using k-means, the squiggle signals are then clustered using these vectors, and the non-squiggle spectrogram segments are clustered into 10 additional clusters. This number of clusters was chosen based on empirical results, and resulted in the lowest false negative rate.

In order to classify a new signal, its feature vector is calculated and the closest cluster is determined by Euclidean distance. This resulted in a significant improvement in false positive performance over the previous Fourier and autocorrelation analysis. The classification scheme was trained using a random half of the data, and then validated on the remaining half. Ten such random trials were carried out, and the average misclassification rates were found to be 2.0% false negative and 0.2% false positive.
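A sketch of this cluster-then-classify step with scikit-learn's KMeans; the feature vectors here are random stand-ins for the concatenated scalar-plus-Fisher-vector features described above:

```python
import numpy as np
from sklearn.cluster import KMeans

def combined_feature(scalars, fisher_vec):
    """Aggregate scalar features and the Fisher vector into one vector."""
    return np.concatenate([scalars, fisher_vec])

rng = np.random.default_rng(0)
features = rng.normal(size=(300, 12))         # stand-in combined features
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(features)

# Classify a new signal by its nearest cluster centroid (Euclidean distance)
new_feature = combined_feature(rng.normal(size=4), rng.normal(size=8))
cluster = int(km.predict(new_feature.reshape(1, -1))[0])
```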

**Conversion of spectrograms into discrete timepoints**

Another approach to analyzing the characteristics of modulating “squiggle” signals was to treat each squiggle as a discrete time series, by selecting one frequency from each time slice in a spectrogram. The initial approach relied solely on intensity, selecting the frequency with the maximum intensity from each time slice. However, this approach failed to trace the squiggle accurately in the presence of strong background noise. Furthermore, many squiggles had gaps: sets of time slices over which the signal disappears almost entirely. To solve this issue of interpolation and to exclude outlying points, an optimal path was sought to minimize the following loss:

where *I(x, y)* gives the intensity at some discrete point *(x, y)*, and the neighborhood term represents the intensities of surrounding points. *α* and *β* are the parameters of the loss function, and it was determined that *α = 0.5* and *β = 0* produced the best results. Hence, the final loss function was:

This loss function proved to be a reliable measure of the overall intensity and coherence of the signals in the data set, and a powerful feature for subsequent time series analysis. The other feature used was signal *width*: with *I* the maximum intensity of a given time slice and *i* its corresponding frequency index, the signal width of each of the 129 time slices was estimated as the number of indices in *[i − 10, i + 10]* with intensity ≥ ½ *I*, and the feature was taken as the mean of the signal widths falling between the 10th and 90th percentiles.
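The width feature can be sketched as follows; the spectrogram here is a synthetic stand-in containing one clean single-bin tone per time slice:

```python
import numpy as np

def signal_width(spectrogram):
    """Per-slice half-maximum width around the peak bin, then the mean of
    the widths falling between the 10th and 90th percentiles."""
    widths = []
    for row in spectrogram:
        i = int(row.argmax())                     # index of max intensity I
        lo, hi = max(0, i - 10), min(len(row), i + 11)
        widths.append(int(np.sum(row[lo:hi] >= 0.5 * row[i])))
    widths = np.array(widths)
    p10, p90 = np.percentile(widths, [10, 90])
    inliers = widths[(widths >= p10) & (widths <= p90)]
    return float(inliers.mean())

# Synthetic spectrogram: 129 time slices, a single bright bin per slice,
# so every slice has half-maximum width 1
spec = np.zeros((129, 64))
spec[np.arange(129), 30] = 1.0
```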

**Getting Smarter with Machine Learning**

The researchers explored many different models and combinations of features to cluster and accurately classify “squiggle” modulated signals. To ensure an unbiased classification, the full dataset was split into 90% training and 10% test. Using the training set, they applied 10-fold cross validation to tune model parameters. Performance metrics were then scored by applying the fitted classifier to the 10% held-out test set.

The final model incorporated 72 features, comprising the 63 FFT frequency samples, the 4 parameters from the ARIMA(1,1,1) model, the variance, modulation, and Hurst exponent of the 129-slice time series, the signal width extracted from the spectrogram, and the loss from the dynamic programming algorithm.

Four families of classifiers were then applied: 1) Logistic Regression using L1, L2, and no regularization; 2) Support Vector Machines (SVM) using linear, radial, polynomial, and sigmoid kernels; 3) Tree-Based Methods including boosting, bagging, and random forests; and 4) K-nearest Neighbors (KNN).

In aggregate, the highest ACC score (ACC: accuracy defined as 1 – misclassification error) of 99.2% was achieved using the boosting tree-based method. However, nearly all tree-based methods resulted in relatively low AUC metrics (AUC: area under the Receiver Operating Characteristic curve). Thus, accounting for the unbalanced dataset, the team concluded that unregularized logistic regression is the optimal classifier, with a test AUC of 99.7%.
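The evaluation protocol can be sketched with scikit-learn as below. Synthetic data stands in for the real 72-feature dataset, and `C=1e6` approximates unregularized logistic regression:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in features and labels with a learnable linear signal
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=1000) > 0).astype(int)

# 90/10 train/test split; score the fitted classifier by AUC on held-out data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)
clf = LogisticRegression(C=1e6).fit(X_tr, y_tr)   # effectively unregularized
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```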

Before clustering, all features were normalized and their dimensionality was reduced using principal component analysis (PCA), which projected the 72 features onto 5 principal component vectors capturing > 98% of the variance.
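The PCA step can be sketched as below; a synthetic 72-column matrix with 5 underlying directions of variation plays the role of the real feature matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 72 observed features generated from 5 latent directions
rng = np.random.default_rng(1)
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 72))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 72))

# A float n_components keeps just enough components to reach that
# fraction of explained variance (here >= 98%)
pca = PCA(n_components=0.98).fit(X)
reduced = pca.transform(X)
```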

The hypothesis was that the squiggles likely followed a continuous spectrum rather than occupying distinct subgroups. The team therefore sampled a variety of distance metrics including Euclidean, Manhattan, and Canberra distance, as well as various clustering algorithms, namely k-means and several hierarchical algorithms including single, average, complete, McQuitty, centroid, median, and Ward linkage clustering. They also tried divisive clustering, but this yielded unfavorable results and tended to place outlier points in their own clusters.

For each method, they plotted the average silhouette score over all clusters while varying k from 1 to 15. They observed that 4 clusters produced favorable silhouette scores for the methods which utilized the Euclidean distance metric.
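The silhouette sweep can be sketched as follows; four well-separated synthetic clusters stand in for the real feature vectors, and k-means with Euclidean distance is used as the clustering method:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: 4 well-separated cluster centers in 5 dimensions
rng = np.random.default_rng(2)
centers = 10.0 * np.eye(4, 5)
X = np.vstack([c + rng.normal(size=(50, 5)) for c in centers])

# Average silhouette score for each candidate k (silhouette needs k >= 2);
# the best-scoring k is taken as the cluster count
scores = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10,
                                        random_state=0).fit_predict(X))
          for k in range(2, 9)}
best_k = max(scores, key=scores.get)
```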

The Stanford researchers questioned whether the four clusters uncovered by each of the algorithms are in fact the same four clusters. To determine this, they mapped the four clusters to each other so as to maximize the proportion of points that are in the same cluster across all three methods, as shown in the corresponding plots below.

The high level of concordance between the generated clusters signifies that the results are robust to the exact clustering method used. Below are examples of squiggles sampled randomly from each of the four clusters (of the points that were assigned unanimously to a cluster).

The presence of 4 clusters may imply the existence of 4 distinct sources, which may be further investigated by the Spark@SETI team.

**What’s Next?**

This collaboration between IBM and the Stanford research teams resulted in powerful new algorithms that have been shared with the other scientists who are using IBM Spark services for SETI research. Using these innovative approaches, the IBM Spark@SETI team hopes to greatly improve its signal classification techniques, and dig into the potential sources of the modulated “squiggle” signals in particular.

In fact, this collaboration has been so successful, that IBM has extended an invitation to the participating students to continue on as active members of the Spark@SETI team while they continue their studies at Stanford.

It will be exciting to see what other innovations emerge from their work, and from future Stanford project teams that will follow.

This blog summarizes the innovative work by both of the Stanford teams. The full details of their work can be reviewed in their respective project reports, which can be viewed using the links below.

**REFERENCES**

[1] __Project SETI: Machine Recognition of Squiggles in SETI Signal Data__: Frank Fan, Kenny Smith, Jason Wang. 2016

[2] __Identification of Frequency Modulated Signals in the Allen Telescope Array Data__: Austin Hou, Rafael Setra, Qi Ian Yang. 2016.