April 8, 2003

The tale of two datasets

I took S6A_c37 (C2P from S6A) and randomly split it into two sets, A and B. I then ran PROC CATMOD on each, using the algorithm I discussed earlier of discarding redundant variables. Hmm... never mind, maybe this doesn't work. The two sets did not match at all - in fact, even after cutting variables, the p values for the parameters on A were still bad. It may well be that CATMOD is the wrong tool for this.
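For reference, the random split can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used on S6A_c37: the function name `random_split`, the seed, and the half-and-half split are all assumptions.

```python
import random

def random_split(trials, seed=0):
    """Shuffle a copy of the trials and cut it into two halves, A and B.

    Hypothetical helper: the 50/50 split and the fixed seed are assumptions.
    """
    rng = random.Random(seed)      # seeded so the split is reproducible
    shuffled = list(trials)        # copy, so the caller's data is untouched
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

# Toy example with trial indices instead of real data.
set_a, set_b = random_split(range(10), seed=1)
```

Every trial lands in exactly one of the two sets, so parameters fitted on A can be compared against parameters fitted on B.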

As I suspected, the trouble comes from the large number of variables I'm using. I found a useful thread on Google Groups discussing sample size. Essentially, the posters suggest >10 events per independent variable, where "events" means the number of trials in the smallest class. We are way off from that. This would suggest that we need >10x61x7 = 4270, which we certainly don't have. (Though we might have it in the case of the math...) The only recourse is to cut down the number of variables. Since we have ~100 of each class, we want <10 variables. Ouch.
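The arithmetic behind the rule of thumb is easy to check. The sketch below assumes that the 61 in the product is the number of variables and the 7 is the number of classes, which is my reading of the numbers rather than something stated outright; the function names are hypothetical.

```python
def required_events(n_variables, n_classes, events_per_variable=10):
    """Total trials needed at the given events-per-variable threshold.

    Assumes every class must supply events_per_variable trials per variable.
    """
    return events_per_variable * n_variables * n_classes

def max_variables(events_in_smallest_class, events_per_variable=10):
    """Upper bound on the variable count for the smallest class size."""
    return events_in_smallest_class // events_per_variable

print(required_events(61, 7))   # 4270
print(max_variables(100))       # 10
```

Because the rule asks for more than 10 events per variable, 10 is a ceiling and not a target, which is why the goal is fewer than 10 variables.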

Posted by torque at April 8, 2003 3:00 AM