April 7, 2003

Fast Robust Logistic Regression

I found an article today from Carnegie Mellon entitled Fast Robust Logistic Regression for Large Sparse Datasets with Binary Outputs. Paul Komarek and Andrew Moore from the Auton Lab compared logistic regression with a number of other classification tools (such as Support Vector Machines) and found little difference in performance, contrary to popular belief. Of particular interest to me were their data sets and methodology. The data sets consisted of life sciences data with up to a million attributes. This is interesting because I was considering doing the same with EEG data but was concerned about overfitting.

Data. The authors used two data sets, ds1 and ds2. ds1 consists of ~6,000 binary-valued attributes and 27,000 rows, and ds2 of ~1,000,000 binary-valued attributes and 90,000 rows. The interesting thing about the latter is that the number of attributes exceeds the number of rows. There are caveats, however. Unlike our attributes (the waveform, and potentially the spectrum), ds1 and ds2 are sparse, meaning that few attributes in any row are non-zero. In fact, the authors end up showing that reducing the data down to 10 and 100 attributes using Principal Component Analysis (PCA) gives results similar to using the full set of data.

In our case the attributes are continuous and dense. ds1 and ds2 are life sciences data; I suspect that they encode the presence of certain genes, hence the binary attributes. ds1 has a sparsity factor of F = 0.0220 while ds2 has a sparsity factor of F = 0.0003. Assuming that the non-zeros are evenly distributed (which they probably are not), we would arrive at about 132 non-zero elements/row for ds1 and 300 non-zero elements/row for ds2. What I will probably try next is classification based on logistic regression by channel, and then compare channel results. After that, we can see whether combining channels (basically putting together the waveforms) makes any difference.
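The per-row arithmetic can be checked with a few lines of Python. This is only a sketch: it assumes F is the fraction of entries that are non-zero, and the function name is my own, not something from the paper.

```python
def expected_nonzeros_per_row(num_attributes, sparsity):
    """Average non-zero attributes per row.

    Assumes `sparsity` (F) is the fraction of all entries that are non-zero.
    """
    return num_attributes * sparsity

# Attribute counts and F values as quoted above (both approximate).
ds1_per_row = expected_nonzeros_per_row(6_000, 0.0220)
ds2_per_row = expected_nonzeros_per_row(1_000_000, 0.0003)

print(round(ds1_per_row), round(ds2_per_row))
```

Both numbers are tiny next to the attribute counts, which is the real contrast with a dense EEG waveform where essentially every sample is non-zero.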

Analysis. Two things were interesting in the paper. To compare results, they used 10-fold cross-validation. I should do this on my results as well. They also use an interesting plot called a Receiver Operating Characteristic (ROC) curve. To construct this curve, the dataset rows are first sorted in order of decreasing predicted probability. Then, starting from the graph origin, the rows are stepped through, moving one unit up if the row is positive and one unit right otherwise. On a dataset of R rows, P positive and R-P negative, a perfect learner starts at the origin, goes straight up to (0,P) and then straight right to (R-P,P). A random guesser moves, on average, along the diagonal from the origin (0,0) directly to (R-P,P).
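The stepping procedure is easy to sketch in Python. This is my own minimal version of the construction as described here, not code from the paper; the function name and the toy scores are made up, and tied scores are simply left in input order.

```python
def roc_points(scores, labels):
    """Trace the ROC staircase: one unit up per positive row, one unit right per negative row.

    `scores` are predicted probabilities; `labels` are 1 (positive) or 0 (negative).
    Rows are visited in order of decreasing score.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    x, y = 0, 0
    points = [(x, y)]  # start at the graph origin
    for i in order:
        if labels[i] == 1:
            y += 1  # positive row: step up
        else:
            x += 1  # negative row: step right
        points.append((x, y))
    return points

# Toy example: R = 5 rows, P = 3 positives, so the curve must end at (R-P, P) = (2, 3).
scores = [0.9, 0.8, 0.7, 0.3, 0.2]
labels = [1, 1, 0, 1, 0]
points = roc_points(scores, labels)
print(points)
```

The more positives the learner ranks ahead of negatives, the longer the curve hugs the left edge before turning right.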

The success of the learning is measured by the area under the ROC curve (AUC), normalized by the area of the full P by (R-P) rectangle. A perfect learner has an AUC of 1.0 while a random guesser has an AUC of 0.5. The data was partitioned into 10 folds to calculate the standard deviation of the AUC. The same 10 partitions are used for every algorithm, and comparisons are made pairwise on the same partition.
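Here is a rough Python sketch of the evaluation loop, under several assumptions of my own. The AUC is the staircase area divided by P*(R-P), so that a perfect ranking scores 1.0. The folds are stratified so every fold contains both classes (the paper may partition differently). The two "learners" are hypothetical stand-ins that ignore the training data; a real test would plug in logistic regression and its competitors.

```python
import random
import statistics

def auc(scores, labels):
    """Area under the ROC staircase, divided by P * (R - P)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    height, area = 0, 0
    for i in order:
        if labels[i] == 1:
            height += 1      # step up
        else:
            area += height   # step right adds a column of the current height
    positives = sum(labels)
    negatives = len(labels) - positives
    return area / (positives * negatives)

def stratified_folds(labels, k=10, seed=0):
    """Deal shuffled positives, then negatives, round-robin into k folds (my assumption)."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (1, 0):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for j, i in enumerate(members):
            folds[j % k].append(i)
    return folds

def per_fold_auc(train_and_score, rows, labels, folds):
    """Train on everything outside each fold, score the fold, and return one AUC per fold."""
    results = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(rows)) if i not in held_out]
        scores = train_and_score(
            [rows[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [rows[i] for i in test_idx],
        )
        results.append(auc(scores, [labels[i] for i in test_idx]))
    return results

# Toy data: one feature per row, positive when the feature is in the upper half.
rows = [i / 99 for i in range(100)]
labels = [1 if i >= 50 else 0 for i in range(100)]

def score_by_value(train_rows, train_labels, test_rows):
    return list(test_rows)  # hypothetical learner A: rank by the feature itself

def score_scrambled(train_rows, train_labels, test_rows):
    return [(round(x * 99) * 37) % 100 for x in test_rows]  # hypothetical learner B

folds = stratified_folds(labels, k=10, seed=0)  # the same folds are reused for both learners
auc_a = per_fold_auc(score_by_value, rows, labels, folds)
auc_b = per_fold_auc(score_scrambled, rows, labels, folds)
paired_diffs = [a - b for a, b in zip(auc_a, auc_b)]  # fold-by-fold comparison

print(statistics.mean(auc_a), statistics.stdev(auc_a))
print(statistics.mean(paired_diffs), statistics.stdev(paired_diffs))
```

Reusing the same folds for both learners is what makes the pairwise differences meaningful: fold-to-fold variation in difficulty cancels out instead of inflating the spread.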

Posted by torque at April 7, 2003 11:03 AM
Comments

I encountered this post by accident. I appreciate your serious reading of this paper. May I know how your experiment with logistic regression and AUC-based cross-validation went?

Posted by: karen at February 25, 2005 12:21 PM

Good job!

Posted by: Markus at December 13, 2006 12:56 AM