Ken introduced me to LIBSVM, a Support Vector Machine library from NTU. Good? We will see.
The standard reference on truncated SVD is Hansen's "The truncated SVD as a method for regularization". I've been using it on the math association experiment with some success, though nothing as good as Ken's Tikhonov regularization. Now, where can I get a copy of BIT? Ahh, here's the man: Per Christian Hansen.
 P.C. Hansen, The truncated SVD as a method for regularization, BIT, 27 (1987) 534--553.
If you cannot read the equations, or if they look like junk, you may need to install MathPlayer. Look for the icon on this page.
Hyvärinen gives the log-likelihood $L$ in a noise-free Independent Component Analysis (ICA) model as

$$\log L = \sum_{t=1}^{T} \sum_{i=1}^{n} \log f_i(\mathbf{w}_i^T \mathbf{x}(t)) + T \log |\det \mathbf{W}|$$
Where does this come from? The index $i$ runs over the output components, i.e., the rows $\mathbf{w}_i^T$ of $\mathbf{W}$. The index $t$ runs over time. Purcell's Maximum Likelihood Estimation Primer was especially helpful here. The $t$ are analogous to the individual coin tosses in the coin-toss example. Essentially, the question asked by maximum likelihood is: given a set of observed data, what is the probability distribution most likely to have produced the results? In this case, the probability density for each component is represented by $f_i$. In this way we can estimate, using data on hand, the probability distribution of each independent component. Now, how does the determinant come into play here?
Forgetting about ICA for the moment, consider a set of uncorrelated random variables $s_i$, each with a "known" distribution $f_i$. We then sample the set of random variables $T$ times. It's clear that the probability of obtaining a certain result will be given by something like

$$L = \prod_{t=1}^{T} \prod_{i=1}^{n} f_i(s_i(t))$$
Hmm, that looks almost like what we want, but there is still no way to obtain the determinant. Consider a similar argument for the mixed signals, so that

$$L = \prod_{t=1}^{T} \prod_{i=1}^{n} p_i(x_i(t))$$

where $p_i$ is the probability density for $x_i$. Ahh, here we are. Let's look at the entire set of inputs:

$$L = \prod_{t=1}^{T} p_x(\mathbf{x}(t))$$
From ICA, I know that

$$\mathbf{x} = \mathbf{A}\mathbf{s}, \qquad \mathbf{s} = \mathbf{W}\mathbf{x}, \qquad \mathbf{W} = \mathbf{A}^{-1}$$
To finish this off, we look to our buddy Hyvärinen (who incidentally looks quite young), who says in his tutorial that, in general, for any random vector $\mathbf{x}$ with density $p_x$ and for any invertible matrix $\mathbf{W}$, the density of $\mathbf{y} = \mathbf{W}\mathbf{x}$ is given by

$$p_y(\mathbf{y}) = \frac{p_x(\mathbf{W}^{-1}\mathbf{y})}{|\det \mathbf{W}|}$$
Oh, wait, I'm confused again... I can see how

$$\log L = \sum_{t=1}^{T} \log p_x(\mathbf{x}(t))$$

though this doesn't match $L$. I can also see how

$$\log L = \sum_{t=1}^{T} \sum_{i=1}^{n} \log p_i(x_i(t))$$

which looks closer, but uses the wrong probability density (the mixed $p_i$ rather than the known $f_i$). Could it be somehow that the summation can allow us to use the "known" density? The question is whether the following statement is true:

$$\sum_{t=1}^{T} \log p_x(\mathbf{x}(t)) = \sum_{t=1}^{T} \sum_{i=1}^{n} \log f_i(\mathbf{w}_i^T \mathbf{x}(t)) + T \log|\det\mathbf{W}|$$
I don't off-hand see how it can be...
I figured it out with some help from Ken. It's really quite straightforward. The hint was not quite right. For $\mathbf{s} = \mathbf{W}\mathbf{x}$, the correct relation is

$$p_x(\mathbf{x}) = |\det\mathbf{W}|\, p_s(\mathbf{W}\mathbf{x}) = |\det\mathbf{W}| \prod_{i=1}^{n} f_i(\mathbf{w}_i^T\mathbf{x})$$
making it quite clear that

$$\log L = \sum_{t=1}^{T} \log p_x(\mathbf{x}(t)) = \sum_{t=1}^{T} \sum_{i=1}^{n} \log f_i(\mathbf{w}_i^T \mathbf{x}(t)) + T \log|\det\mathbf{W}|$$
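As a sanity check on the change-of-variables relation, here is a small numerical sketch in Python. The source densities $f_i$ are taken to be standard normal and the 2x2 mixing matrix is made up for illustration; the point is that $|\det\mathbf{W}|\prod_i f_i(\mathbf{w}_i^T\mathbf{x})$ integrates to 1 over $\mathbf{x}$-space only when the determinant factor is included.

```python
import numpy as np

def f(u):
    """Source density f_i: standard normal."""
    return np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

# Hypothetical 2x2 mixing matrix (for illustration only)
A = np.array([[2.0, 0.5],
              [0.3, 1.0]])   # mixing:   x = A s
W = np.linalg.inv(A)         # unmixing: s = W x

# Grid over x-space, wide enough to capture essentially all the mass
step = 0.02
axis = np.arange(-12, 12, step)
X1, X2 = np.meshgrid(axis, axis)
X = np.stack([X1.ravel(), X2.ravel()])   # shape (2, N)

S = W @ X                                # w_i^T x for each grid point
p_x = np.abs(np.linalg.det(W)) * f(S[0]) * f(S[1])

mass = p_x.sum() * step**2
print(round(mass, 3))   # ~1.0: p_x is a proper density
```

Dropping the $|\det\mathbf{W}|$ factor leaves a "density" that integrates to $|\det\mathbf{A}|$ instead of 1, which is exactly why the determinant shows up in the likelihood.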
After talking with Pat yesterday, it seemed that I was missing the boat on a fundamental technique: stepwise selection. I found a good SAS tutorial at NCSU. In stepwise selection, an attempt is made to remove insignificant variables. The way I had been doing it (though more manually) is also mentioned in the text and is implemented by specifying SELECTION=BACKWARD instead of SELECTION=STEPWISE. The idea there is to eject variables below a certain significance level.
To run stepwise logistic regression, we specify the entry significance level (SLENTRY=0.#) and the staying significance level (SLSTAY=0.#). LACKFIT requests the Hosmer-Lemeshow lack-of-fit test.
title 'Stepwise logistic regression on class 1 versus class 7';
proc logistic data=A outest=params;
model var16 = var1-var15 / selection=stepwise slentry=0.3 slstay=0.35 lackfit;
run;
I took S6A_c37 (C2P from S6A) and randomly split it into two sets, A and B. I then ran PROC CATMOD on each using the algorithm I discussed earlier, discarding redundant variables. Hmm... never mind, maybe this doesn't work. There was no match at all; in fact, even after cutting, the p values for parameters on A were still bad. It may well be that CATMOD is the wrong tool for this.
As I suspected, the trouble comes from the large number of variables I'm using. I found a useful thread on Google Groups discussing sample size. Essentially the posters suggest >10 events per independent variable, where events is the minimum number of trials of any class; we are way off from that. This would suggest that we need >10x61x7 = 4270 trials, which we certainly don't have. (Though we might have it in the case of the math...) The only recourse is to cut down the number of variables. Since we have ~100 of each class, we want <10 variables. Ouch.
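A quick sketch of the arithmetic behind this rule of thumb:

```python
# Rule of thumb from the thread: >10 events per independent variable.
# With 61 variables per channel and 7 classes, the required number of
# trials multiplies out as follows.
events_per_var = 10
n_vars = 61
n_classes = 7

needed = events_per_var * n_vars * n_classes
print(needed)  # 4270
```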
I'd like to be able to randomize and split the dataset so that I can do the so-called 10-fold cross-validation. How can I do this in SAS? I found some helpful hints on Marilyn Collins' website. The first step is the SET statement, which allows you to build a new dataset from an existing one. You can, for instance, choose only trigger values greater than 4, i.e., 5, 6, and 7, using something like (dataset and variable names here are hypothetical):
DATA high; SET full; IF trigger > 4; RUN;
This posting by David Ward puts us even closer. Aha, but I struck gold at University of Texas at Austin Statistical Services...
Randomly selecting an approximate proportion (here, ~2/3):
DATA analysis holdout;
SET full;  /* "full" is a placeholder for the source dataset */
IF RANUNI(0) <= 2/3 THEN OUTPUT analysis;
ELSE OUTPUT holdout;
RUN;
Randomly selecting an exact proportion (here, 67 of 100):
DATA analysis holdout;
RETAIN k 67 n 100;
SET full;
IF RANUNI(358798) <= k/n THEN DO;
OUTPUT analysis;
k = k - 1;
END;
ELSE OUTPUT holdout;
n = n - 1;
DROP k n;
RUN;
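The exact-proportion trick is sequential sampling: walk the rows once, keeping each with probability k/n, where k is the number of rows still needed and n the number of rows left. A small Python sketch of the same logic (row values and seed are arbitrary):

```python
import random

def split_exact(rows, k, seed=358798):
    """Split rows into (analysis, holdout) with exactly k rows in analysis."""
    rng = random.Random(seed)
    analysis, holdout = [], []
    n = len(rows)
    for row in rows:
        # keep with probability k/n; when k == n this is certain,
        # so we are guaranteed to end with exactly k selections
        if k > 0 and rng.random() < k / n:
            analysis.append(row)
            k -= 1
        else:
            holdout.append(row)
        n -= 1
    return analysis, holdout

a, h = split_exact(list(range(100)), 67)
print(len(a), len(h))  # 67 33
```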
After some discussion with Pat, I decided to attempt to run logistic regression using downsampled observations as the input variables. For each channel there are 61 points, so in total we have 61x60 = 3660 points, far too many! After a failed attempt at loading the file into SAS, I chose one channel from S6A, C2P (the best monopolar channel using least squares).
Data. The dataset consists of S6's responses to seven words presented aurally. As mentioned earlier, channel 37, C2P, was selected via earlier results using our traditional method. EEG was recorded with Neuroscan in continuous mode (.cnt), filtered at 25 Hz, and downsampled to 50 Hz. For each trial, the start point was -200 ms and the end point 1000 ms relative to the stimulus trigger. Data were then converted into a comma-delimited format and imported into SAS.
Procedure. I used the CATMOD procedure in SAS. Among other things, this procedure allows one to run nominal logistic regression. I used the following script:
PROC CATMOD DATA=classify.s6a_c37;
MODEL var62=var1-var61 / noprofile;
RUN;
ods html body='WWW/tuning/saslogs/test.html';
title2 'Fifth cut';
MODEL var62=var29 var37 var44 / noprofile;
RUN;
ods html close;
I found an article today from Carnegie Mellon entitled Fast Robust Logistic Regression for Large Sparse Datasets with Binary Outputs. Paul Komarek and Andrew Moore from the Auton Lab compared logistic regression with a number of other classification tools (like Support Vector Machines) and found little difference in performance, contrary to popular belief. Of particular interest to me were their data sets and methodology. The data sets consisted of life-sciences data with up to a million attributes. This is interesting because I was considering doing the same with EEG data but was concerned about overfitting.
Data. The authors used two sets of data, ds1 and ds2. ds1 consists of ~6,000 binary attributes and 27,000 rows; ds2 has ~1,000,000 binary attributes and 90,000 rows. The interesting thing about the latter is that the number of attributes exceeds the number of rows. There are caveats, however. Unlike our attributes (the waveform, and potentially the spectrum), ds1 and ds2 are sparse, meaning there are few non-zero attributes. In fact, they basically end up showing that reducing the data down to 10 and 100 attributes using Principal Component Analysis (PCA) gives similar results to using the full set of data.
In our case the attributes are continuous and dense. ds1 and ds2 are life-sciences data; I suspect they must encode the presence of certain genes, hence the binary attributes. ds1 has a sparsity factor F = 0.0220, while ds2 has a sparsity factor F = 0.0003. Assuming the non-zeros are evenly distributed (which they shouldn't be), we arrive at ~132 non-zero elements/row for ds1 and ~300 for ds2. What I will probably try next is classification based on logistic regression by channel, and then a comparison of channel results. After that, we can see if combining channels (basically concatenating the waveforms) makes any difference.
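A quick check of the non-zeros-per-row estimates:

```python
# Non-zeros per row = number of attributes * sparsity factor,
# assuming (unrealistically) that non-zeros are spread evenly over rows.
ds1 = 6_000 * 0.0220
ds2 = 1_000_000 * 0.0003
print(round(ds1), round(ds2))  # 132 300
```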
Analysis. Two things were interesting in the paper. To compare results, they used 10-fold cross-validation; I should do this on my results as well. They also use an interesting plot called a Receiver Operating Characteristic (ROC) curve. To construct this curve, the dataset rows are first sorted in order of decreasing predicted probability. Then, starting from the graph origin, the rows are stepped through, moving one unit up if the row is positive and one unit right otherwise. On a dataset of P positive and R-P negative rows, a perfect learner starts at the origin, goes up to (0,P), and then goes straight across to (R-P,P). A random guesser moves from the origin directly to (R-P,P).
The success of the learning is measured by the area under the [ROC] curve (AUC). A perfect learner has an AUC of 1.0 while a random guesser has an AUC of 0.5. The data are partitioned 10 ways (10-fold) to calculate the standard deviation of the AUC. The same 10 partitions are used to compare the various algorithms, and comparisons are made pairwise on the same partition.
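The ROC construction and AUC described above can be sketched in a few lines of Python (the scores and labels below are toy values made up for illustration):

```python
def roc_auc(scores, labels):
    """AUC via the step construction: sort rows by decreasing score, step
    up for each positive row and right for each negative one; the area
    under the path, normalized by P*N, is the AUC."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    P = sum(labels)            # positive rows
    N = len(labels) - P        # negative rows
    x = y = 0                  # current position on the path
    area = 0                   # unit cells under the path
    for i in order:
        if labels[i]:
            y += 1             # positive row: step up
        else:
            x += 1             # negative row: step right...
            area += y          # ...adding a column of height y under the path
    return area / (P * N)      # 1.0 = perfect ranking, ~0.5 = random

# perfect ranking: every positive scored above every negative
print(roc_auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # 1.0
```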
In an earlier post, I described my first (rather weak) attempt at logistic regression. It failed because I did not structure the problem properly. I was trying to see which channels contributed best to recognition using each channel's classification rate. Of course, this is exactly what the algorithm gives back. In order to use logistic regression in our problem, we need it to give back not the classification rate of each channel, but the probability that the trial is a particular class given the results of each channel. I've italicized results because we can either look at the output of our least-squares or correlation-coefficient classification, or we can look at a more basic level at the waveforms themselves. I will start with the former. Besides binary responses, logistic regression can be used to classify both nominal and ordinal responses. Nominal responses are unordered, e.g., French, Italian, or Thousand Island. Ordinal responses are ordered, e.g., no pain, slightly painful, or really painful. In our particular application the classification classes are nominal. This type of analysis is known as nominal, multinomial, or polytomous logistic regression.
In SAS, PROC LOGISTIC is used for ordinal logistic regression, while PROC CATMOD is used for nominal logistic regression. I will be using the latter. Fortunately, SAS gives some examples of how to use this in command-line mode. An even better tutorial can be found at Queens. Unfortunately, I could discern no way of running this from Analyst. Incidentally, I found a quick reference on file handling in SAS. Be careful not to erase your file. The DIRECT keyword should not be used, as it specifies that the data be treated as quantitative rather than qualitative; in our case the classification gives the class, which is qualitative.
The data should be formatted with one row per observation: the factor values followed by the response class.
How to structure the problem is a big issue. Usually logistic regression is used in situations where the potential factors are quite clear: for instance, weight, sex, and family history as they relate to a disease like heart disease. In our case, it is not so evident what factors to use in the model. We could, for instance, use logistic regression as a sort of bootstrap for our existing classification scheme: determine how much each channel contributes to correct classification and create a weight. Or we could start at a more basic level and let each observation point be a factor to be analyzed. A consideration in the latter case is the great potential for overfitting.
Potential methods. In the current implementation, we are able to obtain, for each training trial, N channels of match or no match. We have this for each grid point. We can go further by considering the class that each channel matches to (1 through 7). The next level would be to examine the least squares or correlation coefficient to each class. Finally, we can discard our measure altogether and use each observation point (or Fourier component) as a factor in our analysis. Note: From now on I will only consider data downsampled using Marcos' (traditional) method so that there is no controversy regarding the base data.
Match - first cut. In the "match" analysis, rather than simply using results of the best channel, we can consider all other channels. But how can we weight the significance of other channels? We can do it with logistic regression. In this case, the response is binary, 1 or 0: does match or does not match. The independent variables are the per-channel results, each also 1 or 0. In VA monopolar data there are 60 channels. However, in the bipolar data there are 1770 channels - clearly a recipe for overfitting. We can reduce the data set by considering only the best m channels, or we can run all of them and remove channels that seem irrelevant after the logistic regression.
To build the model, we would use something like (dataset and variable names here are hypothetical):
PROC LOGISTIC DATA=matches;
MODEL match = ch1-ch60;
RUN;
I've been playing around with logistic regression on SAS. In case you should happen to need to do the same, here are a few useful things. First of all, as it helps to have a GUI, get VNC installed and running. This will let you run X-windows on a PC. I was fortunate to find some helpful scripts from Chaiyasit Manovit, who incidentally rocks. Unfortunately I had problems saving file templates but still did not have too bad of a time importing my data. A good place to start is the Analyst (under Solutions->Analysis->Analyst).
Pat suggested looking at logistic regression as a natural step from correlation coefficient. It seems quite complicated though.
Now that there is something to write about, there are some additional things to try.
For brain wave classification, which is a better measure of closeness: least squares or correlation coefficient? Today's computation gives some evidence that the correlation coefficient might be the more effective measure. I took data from S6 and S7 from an earlier 60-channel EEG experiment and ran a 3-bin classification on monopolar and derived bipolar montages. Stimuli consisted of 7 words (first, second, third, yes, no, right, and left) and were either auditory (S6A, S7A) or visual (S6V, S7V). In every case of the derived bipolar, use of the correlation coefficient increased the recognition rate by at least 0.7σ, in one case up to 1.8σ. Differences in rates for the monopolar montage were less conclusive, though, for every data set, classification on the derived bipolar using the correlation coefficient performed best. More details here.