Statistics & Analytics Consultants Group Blog

The Statistics & Analytics Consultants group is a network of over 9,000 members. Businesses have the ability to work with consulting firms and individual consultants and eliminate costs. There is also a job board where you can post statistics and analytics jobs. Our members offer a variety of courses to ensure that your company can compete on analytics. Courses range from basic applied understanding of statistical concepts and methods involved in carrying out and interpreting research to advanced modeling and programming.

This blog is a place where featured members are invited to share their expertise and opinions. Opinions are not necessarily the opinions of SACG.

Tuesday, November 8, 2011

Potential Explanatory Variables, Not All Qualify by David Abbott

It is tempting to think that all explanatory variables (also called covariates or independent variables) available for a given project would be valid and useful variables to include in a regression model. Well, it is not that simple. Candidate explanatory variables can prove unsuitable for regression for a number of reasons. Analysts can save themselves time and trouble by evaluating the suitability of candidate explanatory variables both prior to and during analysis. Here are six ways that a candidate explanatory variable can fail to qualify…

Insufficient variation
Learning about the effect of an explanatory variable requires that its distribution in the analytic dataset not be too concentrated. For example, you can’t learn much about the effect of age if almost all the subjects you are studying are of retirement age and just a handful are in their 20s, 30s, 40s or 50s. The extreme case of this problem is a categorical variable that takes on only a single value in the analytic dataset.
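
One rough way to screen for this before modeling is to look at how much of each variable is taken up by its single most common value. The sketch below is only an illustration: the column names, the toy data, and the 95% cutoff are all assumptions made here, not part of any standard rule.

```python
from collections import Counter

def dominant_share(values):
    """Fraction of observations taken by the most common value."""
    counts = Counter(values)
    return max(counts.values()) / len(values)

def flag_low_variation(columns, threshold=0.95):
    """Names of columns whose most common value exceeds the threshold share."""
    return [name for name, values in columns.items()
            if dominant_share(values) > threshold]

# Hypothetical toy data: 'site' takes a single value and
# 'age_group' is heavily concentrated in one category.
data = {
    "site": ["A"] * 100,
    "age_group": ["65+"] * 97 + ["20-59"] * 3,
    "sex": ["F"] * 55 + ["M"] * 45,
}

flagged = flag_low_variation(data)
print(flagged)
```

A flagged variable is not automatically disqualified; the flag is just a prompt to look at the distribution before relying on the variable's estimated effect.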

Meaning inconsistent
If the meaning of an explanatory variable differs among the experimental units, substantial bias can result. For example, personal income taken from social security records for subjects aged 10-50 years exhibits this problem. Clearly, low income at ages 10-20 years has a very different meaning from low income at ages 30-50 years. If income is being used as a proxy for socioeconomic status, such a shift in meaning could lead the analyst to markedly overstate the effect of socioeconomic status on automobile accidents.

Excessive measurement error
Some measurement error in explanatory variables is routinely tolerated. However, an abundance of it can wash out the actual effect of the explanatory variable or, worse, introduce bias. This issue is commonly a concern when subjects self-report on emotionally charged measures, e.g., number of sexual partners during the study period. If an explanatory variable is seriously contaminated by measurement error it should either be cleaned up or not used.
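
The "washing out" effect is easy to see in a small simulation. The sketch below assumes the simplest textbook setup (a single explanatory variable, random noise added to it that is unrelated to everything else, and a true slope of 2); all names and numbers are made up for illustration.

```python
import random

def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20000

# Assumed setup: the outcome depends on the true value with slope 2.
true_x = [rng.gauss(0, 1) for _ in range(n)]
y = [2.0 * x + rng.gauss(0, 1) for x in true_x]

# The analyst only sees a noisy version; the noise variance equals
# the variance of the true values.
noisy_x = [x + rng.gauss(0, 1) for x in true_x]

slope_clean = ols_slope(true_x, y)   # close to the true slope of 2
slope_noisy = ols_slope(noisy_x, y)  # pulled toward zero, roughly half
print(round(slope_clean, 2), round(slope_noisy, 2))
```

With noise as large as the signal, the estimated slope shrinks to roughly half its true size. Measurement error that is related to the outcome or to other variables can behave differently and may bias estimates in either direction.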

Meaning not generalizable
Usually, it is important for the findings of a study to be arguably generalizable to situations other than the experiment that generated the data. So, explanatory variables that only have meaning in the context of the study are best avoided, treated as nuisance variables, or reserved for investigating quality/bias issues in the study. For example, the gender of the person administering a survey may be useful to check for surveyor-induced bias, but including it as an explanatory variable in the primary regression results clearly raises questions about the generalizability of study findings.

Substantially duplicative
Each explanatory variable included in the model should measure a distinct dimension of the data. When two explanatory variables are too similar, either in their meaning or in the pattern of their variation (i.e., highly correlated), regression results are unstable and sometimes not even calculable. For example, chronological age and number of years of driving experience are highly correlated in US adults and so are substantially duplicative. Hence, when both are used in a model of accident rates, the variance of both estimates is inflated and results are hard to interpret. This problem is a special case of a more general problem known as multicollinearity.
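
A quick check for this kind of duplication is the correlation between the two candidates and the variance inflation factor (VIF) it implies. The sketch below uses simulated data in which driving experience is simply age minus 17 plus some noise; that relationship, the sample size, and the variable names are assumptions for illustration only. With exactly two explanatory variables, the VIF for each is 1 / (1 - r²).

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical drivers: experience tracks age closely.
age = [rng.uniform(18, 80) for _ in range(500)]
years_driving = [max(0.0, a - 17 + rng.gauss(0, 2)) for a in age]

r = pearson_r(age, years_driving)
vif = 1.0 / (1.0 - r ** 2)  # two-predictor case only
print(round(r, 3), round(vif, 1))
```

A common rule of thumb treats a VIF above about 10 as a warning sign, though the cutoff is a convention rather than a hard rule. With more than two explanatory variables, the VIF comes from regressing each variable on all the others rather than from a single pairwise correlation.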

Influenced by the outcome
One assumption of regression methods is that explanatory variables influence the outcome (also called the response variable or dependent variable) but the outcome does not influence the explanatory variables. This is usually the case. For example, subject age is often used as an explanatory variable, and it is almost always preposterous to think that the outcome being analyzed influences subject age. A subject’s age is what it is regardless of the outcome. However, sometimes the value obtained for a candidate explanatory variable is strongly influenced by the outcome. Consider, for example, a study using students’ ratings of a teacher to explain students’ grades, and further assume the ratings are collected from students after the grades are known. The grade received by a student and his/her rating of the teacher are very much intertwined. It is as easy to argue that the grade influences the rating as it is to argue that the rating influences the grade. In this case, the better way to proceed is to view teacher ratings and student grades as two outcomes of the instructional process, whose success is predicted by other explanatory variables like class size, text used, student success in prior courses, etc. that are not influenced by the grade received by the student or the student’s rating of the teacher. This situation is sometimes called “reverse causation,” and when it is present it distorts and dilutes regression findings and very much muddies the conceptual waters of the study.
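
A small simulation shows how misleading this can be. In the sketch below, ratings have no effect on grades at all by construction; instead, each student's rating is driven by the grade already received. The variable names, coefficients, and sample size are all invented for illustration.

```python
import random

def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(2)  # fixed seed so the run is repeatable
n = 10000

# Assumed data-generating process (standardized units):
# instructional quality drives the grade ...
quality = [rng.gauss(0, 1) for _ in range(n)]
grade = [q + rng.gauss(0, 1) for q in quality]

# ... and the grade, once known, drives the rating.
rating = [0.8 * g + rng.gauss(0, 1) for g in grade]

# Regressing grade on rating still yields a sizable positive slope,
# even though rating never entered the equation for grade.
slope_rating = ols_slope(rating, grade)
print(round(slope_rating, 2))
```

The regression cannot tell which way the influence runs; it reports the association either way. Only knowledge of how and when the data were collected can rule the variable in or out.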

In conclusion
Put your explanatory variables to the test using the six criteria discussed above, drop or improve the variables found lacking, and I think you will find your effort put toward explanatory variable qualification amply repaid.

David Abbott is currently a statistician at Durham Veterans Affairs Health Services Research, where he supports researchers in both medicine and public health. He has advanced degrees in Statistics from the University of North Carolina and Computer Science from Clemson University. He is a heavy user of SAS Base, SAS Stat, and other related SAS products.
