December 16, 2003

libsvm - cool

Ken introduced me to LIBSVM, a Support Vector Machine library from NTU. Good? We will see.

Posted by torque at 3:22 PM | Comments (0) | TrackBack

December 11, 2003

Truncated SVD

The standard reference on truncated SVD is Hansen's The truncated SVD as a method for regularization [1]. I've been using it on the math association experiment with some success, though nothing as good as Ken's Tikhonov regularization. Now, where can I get a copy of BIT? Ahh, here's the man: Per Christian Hansen.

[1] P.C. Hansen, The truncated SVD as a method for regularization, BIT, 27 (1987) 534--553.
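For my own reference (my notation, not necessarily Hansen's): given the SVD A = Σ σ_i u_i v_i^T, the truncated-SVD solution to the least-squares problem Ax ≈ b keeps only the k largest singular values:

```latex
x_k = \sum_{i=1}^{k} \frac{u_i^T b}{\sigma_i} \, v_i
```

Dropping the small σ_i suppresses the components where the noise in b gets amplified by 1/σ_i, which is what makes truncation act as a regularizer (compare Tikhonov, which instead damps each term by the filter factor σ_i²/(σ_i² + λ²)).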

Posted by torque at 3:17 PM | Comments (0) | TrackBack

December 2, 2003

Maximum likelihood estimation

If you cannot read the equations, or if they look like junk, you may need to install MathPlayer. Look for the icon on this page.

Hyvärinen gives the log-likelihood L in a noise-free Independent Component Analysis (ICA) model as

L = \sum_{t=1}^{T} \sum_{i=1}^{n} \log f_i(w_i^T x(t)) + T \log|\det W|

Where does this come from? The i index the output components, i.e., the rows of W; the t index time. Purcell's Maximum Likelihood Estimation Primer was especially helpful here. The t are analogous to the individual coin tosses in the coin-toss example. Essentially, the question asked by maximum likelihood is: given a set of observed data, what probability distribution is most likely to have produced it? In this case, the probability density for each component is represented by f_i. In this way we can estimate, using the data on hand, the probability distribution of each independent component. Now, how does the determinant come into play?
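To make the coin-toss analogy concrete, here's a toy sketch (mine, not Purcell's): for a fixed number of tosses, the log-likelihood over candidate head-probabilities p is maximized at the observed frequency of heads.

```python
import math

def log_likelihood(p, heads, tosses):
    """Log-likelihood of observing `heads` heads in `tosses` tosses when P(head) = p."""
    return heads * math.log(p) + (tosses - heads) * math.log(1 - p)

# Scan candidate values of p; the maximum lands at the observed frequency heads/tosses.
heads, tosses = 7, 10
candidates = [i / 100 for i in range(1, 100)]
best = max(candidates, key=lambda p: log_likelihood(p, heads, tosses))
print(best)  # 0.7
```

The same logic drives the ICA likelihood: each time sample t plays the role of a toss, and the f_i play the role of the coin's distribution.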

Forgetting about ICA for the moment, consider a set of uncorrelated random variables y_i, each with a "known" distribution f_i. We then sample the set of random variables T times. It's clear that the probability of having a certain result y will be given by something like

p(y) = \prod_{t=1}^{T} \prod_{i=1}^{n} f_i(y_i(t))

Hmm, that looks almost like what we want, but there is still no way to obtain the determinant. Consider a similar argument for the mixed signals, so that

p(x) = \prod_{t=1}^{T} \prod_{i=1}^{n} p_i(x_i(t))

where p_i is the probability density for x_i. Ahh, here we are. Let's look at the entire set of inputs

p(x) = \prod_{t=1}^{T} p_x(x(t))

From ICA, I know that

x = Ay = W^{-1} y

To finish this off, we look to our buddy Hyvärinen (who incidentally looks quite young), who says in his tutorial that, in general, for any random vector x with density p_x and for any matrix W, the density of y = Wx is given by p_x(Wx)|det W|.

Oh, wait, I'm confused again... I can see how

p(x) = \prod_{t=1}^{T} p_x(x(t)) = \prod_{t=1}^{T} p_y(W^{-1} y(t)) \, |\det W^{-1}|

\log p(x) = \sum_{t=1}^{T} \log p_y(W^{-1} y(t)) - T \log|\det W|

though this doesn't match L. I can also see how

p(y) = \prod_{t=1}^{T} p_y(y(t)) = \prod_{t=1}^{T} p_x(W x(t)) \, |\det W|

\log p(y) = \sum_{t=1}^{T} \log p_x(W x(t)) + T \log|\det W|

which looks closer, but uses the wrong probability density. Could it be that the summation somehow allows us to use the "known" density? The question is whether the following statement is true:

p_x(W x(t)) = \prod_{i=1}^{n} f_i(w_i^T x(t))

I don't off-hand see how it can be...

I figured it out with some help from Ken. It's really quite straightforward. The hint was not quite right. For y = Wx, the correct relation is

p_y(y) = \frac{p_x(x)}{|\det W|}

or, equivalently,

p_x(x) = p_y(y) \, |\det W|

making it quite clear that

\log p(x) = \sum_{t=1}^{T} \log\!\left( p_y(y(t)) \, |\det W| \right) = \sum_{t=1}^{T} \log p_y(y(t)) + T \log|\det W| = \sum_{t=1}^{T} \sum_{i=1}^{n} \log f_i(w_i^T x(t)) + T \log|\det W|
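As a numerical sanity check on the relation p_x(x) = p_y(y)|det W| (my own sketch, with the f_i chosen as standard logistic densities, a common choice in ICA, and an arbitrary invertible W), the density it implies on x should still integrate to 1:

```python
import math

# Component densities f_i: standard logistic density (a common ICA choice).
def f(u):
    return math.exp(-u) / (1.0 + math.exp(-u)) ** 2

# An arbitrary invertible 2x2 "unmixing" matrix W.
W = [[2.0, 1.0], [0.5, 1.5]]
det_W = W[0][0] * W[1][1] - W[0][1] * W[1][0]

def p_x(x1, x2):
    # y = Wx, then p_x(x) = f(y_1) f(y_2) |det W|
    y1 = W[0][0] * x1 + W[0][1] * x2
    y2 = W[1][0] * x1 + W[1][1] * x2
    return f(y1) * f(y2) * abs(det_W)

# Midpoint Riemann sum over a box that holds essentially all of the mass.
h = 0.1
total = 0.0
steps = int(40 / h)
for i in range(steps):
    for j in range(steps):
        x1 = -20.0 + (i + 0.5) * h
        x2 = -20.0 + (j + 0.5) * h
        total += p_x(x1, x2) * h * h
print(round(total, 3))  # ≈ 1.0
```

Without the |det W| factor the integral comes out as |det W^{-1}| instead of 1, which is exactly the term that ends up as T log|det W| in the log-likelihood.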

Posted by torque at 3:46 PM | Comments (0) | TrackBack

April 9, 2003

Stepwise Logistic Regression

After talking with Pat yesterday, it seemed that I was missing the boat on a fundamental technique: stepwise selection. I found a good SAS tutorial at NCSU. In stepwise selection, an attempt is made to remove insignificant variables. The way I had been doing it (though more manually) is also mentioned in the text and is implemented by using SELECTION=BACKWARD instead of SELECTION=STEPWISE. The idea there is to eject variables below a certain significance level.

To run stepwise logistic regression, we specify an entry significance level (SLENTRY=0.#) and a staying significance level (SLSTAY=0.#). LACKFIT requests a lack-of-fit test.

title 'Stepwise logistic regression on class 1 versus class 7';
proc logistic data = A outest=params;
model var16=var1-var15 / selection=stepwise slentry=0.3 slstay=0.35 lackfit;
run;

Posted by torque at 1:10 PM | Comments (2)

April 8, 2003

The tale of two datasets

I took S6A_c37 (C2P from S6A) and randomly split it into two sets, A and B. I then ran PROC CATMOD on each using the algorithm I discussed earlier, discarding redundant variables. Hmm... never mind, maybe this doesn't work. There was no match at all; in fact, even after cutting, the p-values for the parameters on A were still bad. It may well be that CATMOD is the wrong tool for this.

As I suspected, the trouble comes from the large number of variables I'm using. I found a useful thread on Google Groups discussing sample size. Essentially, the posters suggest >10 events per independent variable, where "events" is the minimum number of trials of any class; we are way off from that. This would suggest that we need >10x61x7 = 4270 trials, which we certainly don't have. (Though we might have it in the case of the math...) The only recourse is to reduce the number of variables. Since we have ~100 of each class, we want <10 variables. Ouch.

Posted by torque at 3:00 AM | Comments (0)

SAS binning

I'd like to be able to randomize and split the dataset so that I can do the so-called 10-fold cross-validation. How can I do this in SAS? I found some helpful hints on Marilyn Collins' website. The first step is the SET command, which allows you to build a new dataset from an existing one. You can, for instance, choose only trigger values greater than 4, e.g., 5, 6, and 7, using:

DATA temp;
SET classify.s6a_c37;
IF var62>4;
RUN;

This posting by David Ward puts us even closer. Aha, but I struck gold at University of Texas at Austin Statistical Services...

Randomly selecting an approximate proportion

DATA analysis holdout;
SET alldata;
IF RANUNI(0) <= 2/3 THEN OUTPUT analysis;
ELSE OUTPUT holdout;
RUN;

Randomly selecting an exact proportion

DATA analysis holdout;
SET alldata;
RETAIN k 67 n 100;
IF RANUNI(358798) <= k/n THEN DO;
k = k-1;
OUTPUT analysis;
END;
ELSE OUTPUT holdout;
n = n-1;
DROP k n;
RUN;

The latter is what I want. The webpage is pretty cool, though; it also has code for fancy resampling techniques such as the jackknife, split-sample, and bootstrap.
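The exact-proportion trick is worth understanding on its own: walk the rows once, keeping each with probability (slots still to fill)/(rows still remaining), which guarantees exactly k selections. A quick sketch in Python (my translation of the DATA step, not UT Austin's code):

```python
import random

def exact_split(rows, k, seed=358798):
    """One pass over the rows, keeping each with probability
    (slots still to fill) / (rows still remaining). Same trick as the
    SAS DATA step above; it always selects exactly k rows."""
    rng = random.Random(seed)
    n = len(rows)
    analysis, holdout = [], []
    for row in rows:
        # Strict < is safe here: random() is in [0, 1), so k == n forces a
        # keep (k/n == 1) and k == 0 forces a skip (k/n == 0).
        if rng.random() < k / n:
            analysis.append(row)
            k -= 1
        else:
            holdout.append(row)
        n -= 1
    return analysis, holdout

analysis, holdout = exact_split(list(range(100)), 67)
print(len(analysis), len(holdout))  # 67 33
```

The guarantee follows because once the remaining quota k equals the remaining row count n, every remaining row is selected with probability 1, and once k hits 0 nothing more is selected.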

Posted by torque at 1:50 AM | Comments (0)

April 7, 2003

Observation-based logistic regression

After some discussion with Pat, I decided to attempt to run logistic regression using downsampled observations as the input variables. For each channel there are 61 points, so in total we have 61x60=3660 points, far too many! After a failed attempt at loading the file into SAS, I chose one channel from S6A, C2P (the best monopolar channel using least squares).

Data. The dataset consists of S6's responses to seven words presented aurally. As mentioned earlier, channel 37, C2P, was selected based on earlier results using our traditional method. EEG was recorded with Neuroscan in continuous mode (.cnt), filtered at 25 Hz, and downsampled to 50 Hz. For each trial, the start point was -200 ms from the stimulus trigger and the end point was 1000 ms from the same trigger. Data were then converted into a comma-delimited format and imported into SAS.

Procedure. I used the CATMOD procedure in SAS. Among other things, this procedure allows one to run nominal logistic regression. I used the following script:

PROC CATMOD DATA=classify.s6a_c37;
RESPONSE logits;
DIRECT var1-var61;
MODEL var62=var1-var61 / noprofile;
RUN;

There are several important items to note. By default, CATMOD categorizes the variables; in this case our variables are continuous, so we must use the DIRECT statement. There are a total of 61 observations per trial, var1-var61. The final column, var62, is the class of the trial (1-7).

Results and analysis. The results are difficult to interpret! The format leaves room for improvement... the table to start with occurs halfway through the file and is entitled "Maximum Likelihood Analysis of Variance". The trouble with a multinomial analysis is deciding which variables to throw out. Here's what I did. I looked at the analysis of variance, removed from my model the variables the system considered "redundant" (marked with an '*' in the df column), and re-ran CATMOD. I kept doing this until there were no more "redundant" variables. Then I took out variables whose p-values were much larger than the smallest (>0.025). After five iterations, I was left with var29, var37, and var44. The confusing news is that the likelihood ratio is 1.0. What does this mean? I think it means we have a perfect fit, which may have occurred just because I threw out all the other observations that didn't help. The way to evaluate this is to test it on held-out data, OR to run the model-making algorithm on two sets of data and see how the results compare. If we end up with the same observation numbers, that is awesome news. Most likely they will be completely different.

Other stuff. Oh, I figured out how to output to HTML using ODS. I regenerated the 'fifth cut' nesting the MODEL command:

ods html body='WWW/tuning/saslogs/test.html';
MODEL var62=var29 var37 var44 / noprofile;
title2 'Fifth cut';
ods html close;

Posted by torque at 9:50 PM | Comments (0)

Fast Robust Logistic Regression

I found an article today from Carnegie Mellon entitled Fast Robust Logistic Regression for Large Sparse Datasets with Binary Outputs. Paul Komarek and Andrew Moore from the Auton Lab compared logistic regression with a number of other classification tools (like the Support Vector Machine) and, contrary to popular belief, found little difference in performance. Of particular interest to me were their data set and methodology. The data set consisted of life-sciences data with up to a million attributes. This is interesting because I was considering doing the same with EEG data but was concerned about overfitting.

Data. The authors used two sets of data, ds1 and ds2. ds1 consists of ~6,000 binary attributes and 27,000 rows; ds2 has ~1,000,000 binary-valued attributes and 90,000 rows. The interesting thing about the latter is that the number of attributes exceeds the number of rows. There are caveats, however. Unlike our attributes (the waveform, and potentially the spectrum), ds1 and ds2 are sparse, meaning there are few non-zero attributes. In fact, they basically end up showing that reducing the data to 10 or 100 attributes using Principal Component Analysis (PCA) gives results similar to using the full set.

In our case the attributes are continuous and dense. ds1 and ds2 are life-sciences data; I suspect they encode the presence of certain genes, hence the binary attributes. ds1 has a sparsity factor F = 0.0220 while ds2 has a sparsity factor F = 0.0003. Assuming the non-zeros are evenly distributed (which they surely are not), we arrive at ~132 non-zero elements per row for ds1 and ~300 per row for ds2. What I will probably try next is classification based on logistic regression by channel, comparing channel results. After that, we can see whether combining channels (basically concatenating the waveforms) makes any difference.

Analysis. Two things were interesting in the paper. To compare results, they used 10-fold cross-validation; I should do this on my results as well. They also use an interesting plot called a Receiver Operating Characteristic (ROC) curve. To construct this curve, the dataset rows are first sorted in order of decreasing probability. Then, starting from the graph origin, the rows are stepped through, moving one unit up if the row is positive and one unit right otherwise. On a dataset of P positive and R-P negative rows, a perfect learner starts at the origin, goes straight up to (0,P), and then straight across to (R-P,P). A random guesser's curve hugs the diagonal from the origin to (R-P,P).

The success of the learning is measured by the area under the [ROC] curve (AUC). A perfect learner has an AUC of 1.0 while a random guesser has an AUC of 0.5. The data are partitioned 10 times (10-fold) to calculate the standard deviation of the AUC. The same 10 partitions are used to compare the various algorithms, and comparisons are made pairwise on the same partition.
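The ROC construction and the AUC can be sketched in a few lines (my own toy version, not Komarek and Moore's code): counting one column of height `up` for every rightward step gives the area in grid units, which is then normalized by the P x (R-P) box.

```python
def roc_auc(scored_rows):
    """scored_rows: (probability, is_positive) pairs. Sort by decreasing
    probability, step up for positives and right for negatives, and
    accumulate the area under the resulting staircase."""
    rows = sorted(scored_rows, key=lambda r: -r[0])
    P = sum(1 for _, pos in rows if pos)
    N = len(rows) - P
    up = 0
    area = 0.0
    for _, pos in rows:
        if pos:
            up += 1
        else:
            area += up  # each rightward unit adds a column of height `up`
    return area / (P * N)

# A perfect learner ranks every positive above every negative.
perfect = [(0.9, True), (0.8, True), (0.3, False), (0.1, False)]
print(roc_auc(perfect))  # 1.0
```

This sketch ignores ties in the probabilities; a more careful version would average over tied blocks, but the unit-step picture matches the construction described in the paper.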

Posted by torque at 11:03 AM | Comments (2)

April 4, 2003

Ordinal versus Nominal

In an earlier post, I described my first (rather weak) attempt at logistic regression. It failed because I did not structure the problem properly. I was trying to see which channels contributed most to recognition using each channel's classification rate. Of course, that is exactly what the algorithm gives back. In order to use logistic regression in our problem, we need it to give back not the classification rate of each channel, but the probability that the trial is a particular class given the results of each channel. I've italicized results because we can either look at the output of our least-squares or correlation-coefficient classification, or we can look at a more basic level at the waveforms themselves. I will start with the former. Besides binary responses, logistic regression can be used to classify both nominal and ordinal responses. Nominal responses are unordered, e.g., French, Italian, or Thousand Island. Ordinal responses are ordered, e.g., no pain, slightly painful, or really painful. In our particular application the classes are nominal. This type of analysis is known as nominal, multinomial, or polytomous logistic regression.

In SAS, PROC LOGISTIC is used for ordinal logistic regression while PROC CATMOD is used for nominal logistic regression. I will be using the latter. Fortunately, SAS gives some examples of how to use this in command-line mode. An even better tutorial can be found at Queens. Unfortunately, I could discern no way of running this command from Analyst. Incidentally, I found a quick reference on file handling in SAS. Be careful not to erase your file. The DIRECT keyword should not be used here, as it specifies that the data should be treated as quantitative rather than qualitative; in this case each channel's result is a class, which is qualitative.

The data should be formatted like so:


where the numbers under Ch# indicate the class each channel assigns the trial to. Remember, the goal is to derive the appropriate weighting for the channel results.

Posted by torque at 9:33 AM | Comments (0)

April 3, 2003

Logits and Classification

How to structure the problem is a big issue. Usually logisitic regression is used in situations where the potential factors are quite clear. For instance, weight, sex, family history as they relate to a disease like heart disease. In our case, it is not so evident what the factors to use in the model are. We could, for instance, use logistic regression as a sort of bootstrap for our existing classification scheme. We can determine how each channel contributes to correct classification and create a weight. Or, we could start at a more basic level and let each observation point be a factor to be analyzed. A consideration in the latter case is the great potential for overfitting.

Potential methods. In the current implementation, we are able to obtain, for each training trial, N channels of match or no match. We have this for each grid point. We can go further by considering the class each channel matches to (1 through 7). The next level would be to examine the least squares or correlation coefficient for each class. Finally, we can discard our measure altogether and use each observation point (or Fourier component) as a factor in our analysis. Note: from now on I will only consider data downsampled using Marcos' (traditional) method, so that there is no controversy regarding the base data.

Match - first cut. In the "match" analysis, rather than simply using the results of the best channel, we can consider all the other channels as well. But how can we weight the significance of the other channels? We can do it with logistic regression. In this case, the response is binary, either 1 or 0: does match or does not match. The independent variables are the results for each channel, also 1 or 0. In the VA monopolar data there are 60 channels. However, in the bipolar data there are 1770 channels, clearly a recipe for overfitting. We can reduce the data set by considering only the best m channels, or we can run all of them and drop channels that seem irrelevant after we do the logistic regression.

To build the model, we would do something like:


But does this really help us? I tried running this dummy data set using SAS Analyst and got the following results (under "Analysis of Maximum Likelihood"):

parameter | DF | estimate | std err

What does this mean? Now I'm thinking that I thought about this wrong. I found some notes on deciphering SAS output. I feel like somewhere along the line I needed to tell SAS that I was supposed to get 193/193, but I haven't done that... Basically, the analysis says that the log of the odds is given by

g = -1.7398 - (0.3108 x <1>) + (0.4903 x <2>) - (0.9738 x <3>)

This is probably not what I wanted. The translation table for the channels is given by:


So, in fact, what we have calculated here is the expected odds of classifying correctly using any given channel. This does not mix the channels like I wanted. Let's see how well it worked. Suppose my channel is Pz. Then

g_Pz = -1.7398 + 0.3108 - 0.4903 + 0.9738 = -0.9455

p/(1-p) = e^-0.9455 = 0.3885

p = 0.3885/1.3885 = .2798

Of course, this is right where we started, since

.2798 x 193 = 54.0

So, we are still not thinking about this correctly. Though it is nice to see that the results are not too weird.

Posted by torque at 10:25 AM | Comments (2)

April 2, 2003

SAS and Logistic Regression

I've been playing around with logistic regression in SAS. In case you happen to need to do the same, here are a few useful things. First of all, as it helps to have a GUI, get VNC installed and running; this will let you run X Windows on a PC. I was fortunate to find some helpful scripts from Chaiyasit Manovit, who incidentally rocks. Unfortunately, I had problems saving file templates, but still did not have too bad a time importing my data. A good place to start is the Analyst (under Solutions->Analysis->Analyst).

Posted by torque at 3:54 PM | Comments (1)

March 31, 2003

Logistic Regression

Pat suggested looking at logistic regression as a natural step from correlation coefficient. It seems quite complicated though.

Posted by torque at 10:09 PM | Comments (8)

March 28, 2003

Brain tuning to do's

Now that there is something to write about, there are some additional things to try.

Finer grid
My grid has been quite coarse; a finer grid may or may not lead to better results, but it will certainly take longer to compute.
Ideally I should run several permutations of the data and compare least squares to correlation coefficient, and both against the p=1/7 statistics.
Subtracting the "common" signal
So far I have not made any effort to locate and subtract a common signal across the prototypes. This may be worthwhile. The idea would be to average all the prototypes together and then subtract that average from each prototype. I tried it for the math problems, but the statistics here should be able to show whether or not it makes a difference.

Posted by torque at 11:06 PM | Comments (0)

Least squares vs. correlation coefficient

For brain-wave classification, which is a better measure of closeness, least squares or the correlation coefficient? Today's computation gives some evidence that the correlation coefficient might be the more effective measure. I took data from S6 and S7 from an earlier 60-channel EEG experiment and ran a 3-bin classification on monopolar and derived bipolar montages. Stimuli consisted of 7 words (first, second, third, yes, no, right, and left) and were either auditory (S6A, S7A) or visual (S6V, S7V). In every case of the derived bipolar, use of the correlation coefficient increased the recognition rate by at least 0.7σ, in one case by up to 1.8σ. Differences in rates for the monopolar montage were less conclusive, though for every data set, classification on the derived bipolar using the correlation coefficient performed best. More details here.
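A toy illustration (mine) of why the two measures can disagree: the correlation coefficient is invariant to the gain and offset of a trial, while least squares penalizes both, so a trial with the right shape but the wrong scale still correlates perfectly.

```python
import math

def sse(a, b):
    """Least-squares distance: sum of squared differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def corrcoef(a, b):
    """Pearson correlation coefficient."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

prototype = [0.0, 1.0, 0.0, -1.0, 0.0]
trial = [0.5, 2.5, 0.5, -1.5, 0.5]  # same shape, but scaled 2x and offset by 0.5

print(sse(prototype, trial))        # nonzero: penalized for gain and offset
print(corrcoef(prototype, trial))   # ≈ 1.0: the shape match is perfect
```

If EEG trials vary in overall amplitude or baseline from prototype to prototype, this invariance would be one plausible reason the correlation coefficient wins on the derived bipolar data.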

Posted by torque at 9:42 PM | Comments (22)