Statistics & Analytics Consultants Group Blog

The Statistics & Analytics Consultants group is a network of over 9,000 members. Businesses can work directly with consulting firms and individual consultants, cutting costs. There is also a job board where you can post statistics and analytics jobs. Our members offer a variety of courses to ensure that your company can compete on analytics. Courses range from a basic applied understanding of statistical concepts and methods involved in carrying out and interpreting research to advanced modeling and programming.

This blog is a place where featured members are invited to share their expertise and opinions. Opinions are not necessarily the opinions of SACG.

Tuesday, January 3, 2012

Adapting Statistical Programming Languages, by David Abbott

The fit between you and any given general purpose statistical analysis language is apt to be far from perfect, especially if you have experience with multiple programming languages. You want to do X but the tool’s language only supports Y. Bummer! But you don’t always have to just tolerate this situation; sometimes, you can make the fit better by clever use of the language’s extensibility features.

Some years ago when using SAS Base, I found myself wanting to repeat chunks of code for each of a small collection of things (SAS datasets, SAS variables, etc.), but I knew from experience that clone-and-edit coding was not the way to go; it makes code bulky and miserable to maintain. Fortunately, I recalled a useful construct in the UNIX shell language, “for kk in ds1 ds2 ds3; do ... done”, that obviated the need to clone and edit. Perhaps I could use the SAS macro feature to achieve something similar in SAS? It seemed worth a try. Basically, I just needed to make three things work to get the intended result:

  • A way to delineate the set of statements that I wanted to be repeatedly invoked, similar to the do and done of the UNIX shell command. Well, the %macro and %mend statements of SAS could fill that bill.


  • A means to specify the list of words to successively substitute, a la the “in ds1 ds2 ds3” of the UNIX shell command. The %let statement of SAS could fulfill that function.


  • A looping construct to drive the execution through the list of words executing the desired set of statements for each word. The %do %while looping construct of SAS seemed up to the job.

I came up with a SAS macro of several lines with the signature %wordLoop(wordList=, contentMacro=), and it did the trick for me:

%macro wordLoop(wordList=, contentMacro=);
%local word cnt;                    /* keep loop variables out of the global scope */
%let cnt = 0;
%do %while(1 eq 1);                 /* loop until the word list is exhausted */
%let cnt = %eval(&cnt+1);
%let word = %scan(&wordList,&cnt,%str( ));   /* grab the next blank-delimited word */
%if &word= %then %return;           /* no more words: leave the macro */
%&contentMacro;                     /* invoke the caller-supplied macro */
%end;
%mend wordLoop;

For example, here is how I used this macro to winnow several datasets down to just the IDs occurring in the dataset subsetOfIdsDs:

%macro tmpMacro;
data &word._subset;                 /* e.g., Ds1 becomes Ds1_subset */
merge &word subsetOfIdsDs(in=in2); by id;
if in2;                             /* keep only records whose id appears in subsetOfIdsDs */
run;
%mend;
%let toBeSubsetted= Ds1 Ds2 DsA DsB DsC;
%wordLoop(wordList=&toBeSubsetted, contentMacro=tmpMacro);


After execution I had five new datasets: Ds1_subset, Ds2_subset, DsA_subset, DsB_subset, and DsC_subset, all of which had been restricted to the patients/subjects included in subsetOfIdsDs.

The only tricky part to implementing %wordLoop was determining how to get SAS to invoke a macro whose name was provided via the contentMacro parameter of %wordLoop. Fortunately, the SAS macro language allows this to be done via the simple construct %&contentMacro. Of course, the stripped-down implementation of %wordLoop above can be improved on, for example by checking arguments for validity and the like; contact me for my latest version if interested.

The need to perform the same group of statements for multiple datasets comes up frequently for me in statistical analysis. Likewise, in data cleaning, the need to apply the same statements to multiple variables comes up frequently. %wordLoop provides a quick and pleasing solution to both situations.
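
For instance, here is a minimal sketch of the data-cleaning case (the dataset name myData and the variable list are hypothetical), producing a frequency check for each variable in the list:

%macro tmpMacro;
proc freq data=myData;              /* myData is a hypothetical dataset */
tables &word / missing;             /* one frequency table per variable, missing values included */
run;
%mend;
%wordLoop(wordList=age gender income, contentMacro=tmpMacro);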

So, let me suggest two take-aways from this blog entry:

  • For the SAS programmer, %wordLoop is a nifty little macro that helps you avoid clone-and-edit coding, and it illustrates how you can use the SAS macro facility to extend the SAS command language.


  • For users of SPSS, R, etc., top-tier statistical programming languages provide language extension mechanisms, and with a little work and cleverness you can use them to make the language work more the way you want it to; you can improve the fit of the language to the way you like to work.

About David: David Abbott has degrees in statistics and computer science and is currently working for Veterans Affairs Health Services Research. He can be reached at david.abbott@alumni.duke.edu.

Sunday, December 11, 2011

Alter Type: It’s Not What You Think, by Steven J. Fink, Evans Analytics

As I was reviewing a colleague’s SPSS syntax code the other day, I came across a command called “Alter Type.”  It sounded like a new scary movie, a psychiatric DSM code, or an abnormal personality attribute. 

I looked up this code in the Command Syntax Reference Manual (available through the Help menu) and there it was: a very useful command that can be applied in many situations.


In brief, it does exactly what the name implies.  It changes the Variable Type (string or numeric) or Format of variables, including the Width of string variables.  As I read the explanation, it struck me as a new and improved Format statement, a so-called Format on steroids!

Format statements are often used to change the width and decimals of numeric variables or the format of a date variable.  The Alter Type command changes the Variable Type of any variable in one short command: no need to write elaborate or unnecessary code, just one easy statement.

As an example, the dataset below comprises 3 variables and 2 lines of data.

DATA LIST FREE
/Numvar (F2)     StringVar (A5)   Datevar (Adate10).

BEGIN DATA
1 1234 10/28/2007
4 5678 10/28/2007
END DATA.

To change a numeric variable to a string (alphanumeric) variable, the command is:

Alter Type Numvar (A2).

To change a string (alphanumeric) variable to a numeric variable, the command is:

Alter Type Stringvar (F6.0).

To change a date variable to a string variable, the command is:

Alter Type Datevar (A10).

One note of caution: The Alter Type command does not allow you to create a new version of the variable.  So you may want to save your data first or create a copy of the variable.
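
If you do want to keep the original, one minimal approach is to copy the variable first and convert only the copy (the copy’s name below is hypothetical):

* Preserve the original Numvar by converting a copy instead.
COMPUTE NumvarCopy = Numvar.
ALTER TYPE NumvarCopy (A2).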

So, the next time you need to perform a calculation or merge data and the variable is not in the right Format or Type, use the Alter Type command.  After all, it’s free, you are not crazy, and it’s cool! 

About Steven: Steven works as a Statistics & Analytics Consultant at Evans Analytics. He has developed or analyzed over 300 surveys including customer satisfaction, work environment, needs assessment, program evaluation, and compensation surveys for public and private sector customers. He has provided SPSS instruction to more than 3,000 analysts, covering a wide variety of topics, including questionnaire design/writing, sample design, data collection strategies, multivariate analyses, and presentation of tables/graphs. He can be reached at: steven@evansanalytics.com

Tuesday, December 6, 2011

What is a Scratch Variable in IBM SPSS Statistics Syntax? by Keith McCormick


This might seem an obscure topic, but it is easily grasped and has the potential to make your SPSS Syntax more readable. Readable is good.

A Scratch Variable is a Variable with a # Symbol in front of it. It is available temporarily for an intermediate step in a Transformation calculation. Once a Procedure occurs, it is no longer available. If the distinction between Transformation and Procedure is new to you, you should put researching that on your to-do list. Start with Appendix B of the Syntax Reference Guide.

You can use the following lines to create a tiny data set.

DATA LIST /LocationName 1-50 (A) .
BEGIN DATA
Raleigh, North Carolina
Durham, North Carolina
Cary, North Carolina
END DATA.

Let’s say that you wanted to pull out just State from the three examples in the data set. The first step would be to identify the location of the comma, because the state name always begins two characters after it (the comma is followed by a space). The comma’s position is not a constant value because the city names are of variable length.

This bit of code will do it:

COMPUTE CommaLocation = INDEX(LocationName,',').

This next step would complete the process, but the approach also leaves behind CommaLocation, an intermediate variable that you don’t need once State has been computed.

STRING State (A50).
COMPUTE State = SUBSTR(LocationName,CommaLocation+2).

Warning: you only need to run the STRING command once.

Do we actually want to create this variable? What are we going to do with it after we complete the calculation? We could use DELETE VARIABLES once we are done, but we have two better options. In this example, DELETE VARIABLES is harmless, but it would be slower on large data sets, and therefore some programmers would consider it inelegant. It is noteworthy that for decades the language got by just fine without the fairly recent addition of DELETE VARIABLES.

(Note that I have not included EXECUTE commands in any of these code examples. Curious why? You really shouldn’t use EXECUTE if there will be any procedures later in the code, and there are always procedures later on in the code. That same Appendix B in the Syntax Reference Guide mentioned earlier is a good place to read more about this.)

We could put a function inside of a function:

STRING State(A50).
COMPUTE State = SUBSTR(LocationName,INDEX(LocationName,',')+2).

We could also use a Scratch variable:

COMPUTE #CommaLocation = INDEX(LocationName,',').
STRING State(A50).
COMPUTE State = SUBSTR(LocationName,#CommaLocation+2).

In an example as straightforward as this, the function inside of a function might be best. As the complexity grows, there will be opportunities to use the Scratch variable option to break up a calculation into two or more steps instead of a single very long, and potentially confusing, line of code.
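
For instance, here is a minimal sketch of that multi-step style (the City variable is a hypothetical addition), pulling out both City and State while leaving no intermediate variable behind:

COMPUTE #CommaLocation = INDEX(LocationName,',').
STRING City (A50) State (A50).
COMPUTE City = SUBSTR(LocationName,1,#CommaLocation-1).
COMPUTE State = SUBSTR(LocationName,#CommaLocation+2).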

And who doesn’t want more tools in their Syntax tool chest?

About Keith: Keith McCormick is an independent data mining professional who blogs at:  http://www.keithmccormick.com/

Thursday, November 10, 2011

Leveraging Text Analytics to Help Answer Your Business Questions, by Dawn Marie Evans

Even though the word “Analytics” has exploded everywhere on the business scene, this field is really still in its infancy.  One of the problems with the word is that “Analytics” means different things to different people.  For example, “Google Analytics” generally refers to web foot-traffic, represented in counts, charts, frequencies, etc.  For statisticians and data miners, “Analytics” refers to taking data, whether financial records, customer data, behavioral data, etc., and building predictive models: models that are not merely descriptive of past or current phenomena but that tell us about likely future behavior.  The purpose is to develop a model that answers important and actionable business questions.

“Analytics” may also refer to using open-ended fields, or textual data, to create categories that can be joined back to structured data sets through a technique known as Natural Language Processing (NLP).  It is important to point out that these methods are sensitive to context.  For example, if the word being viewed is “football,” the algorithms that are applied are able to determine whether the word is being used in a negative, positive, or even neutral way, as in “He hates football” (negative) versus “They were excited about the football game” (positive).  During the process, the analyst, just as with structured data, makes many important choices along the way.

One of the questions I am frequently asked is what type of textual data can be analyzed.  The answer is almost any type, and very large datasets are desirable.  Examples include streaming (RSS) feeds from the web, Twitter feeds, blogs, PDF documents, and open-ended questions on surveys.  Analyzing these datasets by hand can be very labor-intensive and time-consuming.  We are in an age where information has become overwhelming; processing and analyzing such information may be difficult, non-standardized, and expensive.  Text analytics/text mining is a standardized, less expensive approach to glean competitive intelligence and to acquire a better understanding of the voice of customers.  A data mining stream can be run continuously and refreshed to find new and important results at regular intervals.

What does it take to have a text analytics model built? Evans Analytics uses SPSS Modeler, which has a set of premier text analytics tools. SPSS Modeler comes with libraries already built into the software.  A library is a pre-defined set of sensitive terms and algorithms that can identify and categorize words and phrases. These libraries are a great place to start with a new project. 

Many clients will request that an analyst take the project a step or two further. The next step would be for the analyst to build custom libraries, specifically developed for the industry, the company, or the project being analyzed, so that the most relevant terms are developed.  These libraries may be saved and reused as needed.

Some clients may just want simple counts.  For example, a client may only want to know the percentage of customers who preferred product X to product Y, or whether a higher percentage of customers provided positive rather than negative comments about a particular service.  Other clients may request newly created categories to join back to other structured data, followed by predictive modeling or customer segmentation. They may also want to know that customers who preferred product X were also more likely to live in a specific region, be in a certain age range, and drive a minivan!  Text analytics becomes more powerful when added to other data to examine whether differences occur by subgroup.

So, how can you leverage text analytics for your business?  Do you have competitors who are blogging or tweeting, or are there news or RSS feeds out there as competitive intelligence from which you haven’t yet gleaned the important information you should be leveraging?  Do you have open-ended survey responses that have overwhelmed you, but you know important information can be extracted from them? Do you have research that has previously been handled through qualitative methods, but you think it would be stronger if it were analyzed and joined with your structured data?  If you have answered yes to one of these questions, you have a strong case to consider text analytics!

In my next installment, I will explain how to bring previously constructed categories into SPSS Modeler and re-use old qualitative research in a quantitative way. 

Dawn Marie Evans is Group Owner and Manager of SACG. She is an external consultant and trainer at IBM/SPSS and Managing Partner at Evans Analytics.

Tuesday, November 8, 2011

Potential Explanatory Variables: Not All Qualify, by David Abbott

It is tempting to think that all explanatory variables (also called covariates or independent variables) available for a given project would be valid and useful variables to include in a regression model. Well, it is not that simple. Candidate explanatory variables can prove unsuitable for regression for a number of reasons. Analysts can save themselves time and trouble by evaluating the suitability of candidate explanatory variables both prior to and during analysis. Here’s a handful of ways that a candidate explanatory variable can fail to qualify…

Insufficient variation
To learn about the effect of an explanatory variable requires that the distribution of the explanatory variable in the analytic dataset not be too concentrated. For example, you can’t learn much about the effect of age if almost all the subjects you are studying are retirement age and just a handful are in their 20s, 30s, 40s, or 50s. The extreme case of this problem is a categorical variable that takes on only a single value in the analytic dataset.
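
A quick way to screen for this is to examine each candidate’s distribution before modeling. A minimal SAS sketch (the dataset and variable names are hypothetical):

proc freq data=analytic;            /* a table dominated by one level flags trouble */
tables ageGroup;
run;
proc means data=analytic min p25 median p75 max;
var age;                            /* a very tight spread flags insufficient variation */
run;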

Meaning inconsistent
If the meaning of an explanatory variable differs among the experimental units, high bias can result. For example, personal income taken from social security records for subjects aged 10-50 years exhibits this problem. Clearly, low income at ages 10-20 years has a very different meaning from low income at ages 30-50 years. If income is being used as a proxy for socioeconomic status, such a shift in meaning could lead the analyst to markedly overstate the effect of socioeconomic status on automobile accidents.

Excessive measurement error
Some measurement error in explanatory variables is routinely tolerated. However, an abundance of it can wash out the actual effect of the explanatory variable or, worse, introduce bias. This issue is commonly a concern when subjects self-report on emotionally charged measures, e.g., number of sexual partners during the study period. If an explanatory variable is seriously contaminated by measurement error it should either be cleaned up or not used.

Meaning not generalizable
Usually, it is important for the findings of a study to be arguably generalizable to situations other than the experiment that generated the data. So, explanatory variables that only have meaning in the context of the study are best avoided, treated as nuisance variables, or reserved for investigating quality/bias issues in the study. For example, the gender of the person administering a survey may be useful to check for surveyor-induced bias, but including it as an explanatory variable in the primary regression results clearly raises questions about the generalizability of study findings.

Substantially duplicative
Each explanatory variable included in the model should measure a distinct dimension of the data. When two explanatory variables are too similar, either in their meaning or in the pattern of their variation (i.e., highly correlated), regression results are unstable and sometimes not even calculable. For example, chronological age and number of years of driving experience are highly correlated in US adults and so are substantially duplicative. Hence, when both are used in a model of accident rates, the variance of both estimates is inflated and the results are hard to interpret. This problem is a special case of a more general problem known as multicollinearity.
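
In SAS, one quick way to quantify this is the variance inflation and tolerance diagnostics of PROC REG. A minimal sketch (the dataset and variable names are hypothetical):

proc reg data=accidents;
model accidentRate = age drivingYears / vif tol;   /* request collinearity diagnostics */
run;
quit;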

Influenced by the outcome
One assumption of regression methods is that explanatory variables influence the outcome (also called the response variable or dependent variable) but the outcome does not influence the explanatory variables. This is usually the case; for example, subject age is often used as an explanatory variable, and it is almost always preposterous to think that the outcome being analyzed influences subject age. A subject’s age is what it is regardless of the outcome. However, sometimes the value obtained for a candidate explanatory variable is strongly influenced by the outcome. Consider, for example, a study using students’ ratings of a teacher to explain students’ grades, and further assume the ratings are collected from students after the grades are known. The grade received by a student and his/her rating of the teacher are very much intertwined. It is as easy to argue that the grade influences the rating as it is to argue that the rating influences the grade. In this case, the better way to proceed is to view teacher ratings and student grades as two outcomes of the instructional process, whose success is predicted by other explanatory variables (class size, text used, student success in prior courses, etc.) that are not influenced by the grade received by the student or the student’s rating of the teacher. This situation is sometimes called “reverse causation,” and when it is present it distorts and dilutes regression findings and very much muddies the conceptual waters of the study.

In conclusion
Put your explanatory variables to the test using the six criteria discussed above, drop or improve the variables found lacking, and I think you will find your effort put toward explanatory variable qualification amply repaid.

David Abbott is currently a statistician at Durham Veterans Affairs Health Services Research, where he supports researchers in both medicine and public health. He has advanced degrees in Statistics from the University of North Carolina and Computer Science from Clemson University. He is a heavy user of SAS Base, SAS Stat, and other related SAS products.

Saturday, October 22, 2011

SACG – Who We Are and What We Do

Statistics & Analytics Consultants, which was started in 2008, is a group that will very shortly have over 10,000 members worldwide. The Statistics & Analytics Consultants Group is dedicated to providing statisticians the opportunity to network with others in their field, share ideas, and make business contacts.

Our goal is to introduce statisticians and analysts to business contacts for consulting opportunities. We would also like statisticians to start discussions to share ideas and best practices and to connect with each other. Anyone with a statistical background is welcome, and all statistical disciplines are welcome.  The group is made up of people involved in many different aspects of statistics and analytics, including actuaries, academics, programmers, and practitioners in banking, pharmacy, biostatistics, manufacturing, engineering, and more.  However, the focus is on supporting consultants and their skills in the industries in which they practice, from statistical and technical skills to project management and business development.

On LinkedIn, responding to requests from members, we recently started subgroups in different software areas: SPSS, SAS, R-Project, Excel, and Stata.  Heading these subgroups as moderators are leaders in these particular areas.  Some of the discussion topics we have had in the group include:

  • “How to detect fraudulent behavior of sale personnel of the Company through statistical analysis of Sales Data”

  • “Checking for Falsification or Duplication of Records”

  • “Removing Multicollinearity”

  • “Is Statistical analysis a part of Data mining or Data mining is the part of Statistical analysis?”

  • “A bank has a test designed to establish the credit rating of a loan applicant. Of the persons, who default (D), 90% fail the test (F). Of the persons, who will repay the bank (ND), 5% fail the test. Furthermore, it is given that 4% of the population is not worthy of credit; i.e., P(D) = .04. Given that someone failed the test, what is the probability that he actually will default?”
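
(For the curious, that last one is a direct application of Bayes’ theorem: P(D|F) = P(F|D)P(D) / [P(F|D)P(D) + P(F|ND)P(ND)] = (.90)(.04) / [(.90)(.04) + (.05)(.96)] = .036/.084 ≈ .43, so a failed test implies only about a 43% chance of actual default.)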



Our discussions are rich and varied, frequently helpful, and sometimes quite vibrant! We invite you to join us on this website as well as on LinkedIn: to post questions in the forums, to share code and datasets (shortly there will be a place for such), and to submit a guest blog, which can be hyperlinked back to your own website or Twitter account.  To submit a blog, send your submission to info@statisticalconsultants.net.  To join our group on LinkedIn, apply at this link: Statistics & Analytics Consultants.


In addition, because we are interested in which statistics, analytics, and business intelligence tools you are using, please take our survey so that we can better serve you: SACG Survey

Friday, October 21, 2011

Trainer Tip: Multicollinearity, by Steve Poulin, Ph.D., Trainer & Consultant, IBM/SPSS

Multicollinearity (sometimes called Collinearity)
This describes the condition of high correlations among two or more independent variables used in a multiple regression technique.  Based on my experience, it is one of the most common threats to accurately ranking the effects of the independent variables used in a regression analysis.  This condition affects any technique based on regression principles, including linear regression, binary and multinomial logistic regression, the Cox regression survival analysis technique, and an ARIMA time-series analysis that uses transfer functions to measure the effect of two or more external series.

The most direct test of multicollinearity is available in the Linear Regression procedure (Analyze > Regression > Linear) within the IBM SPSS Statistics software and the Regression node in version 14 of the IBM SPSS Modeler software.  Within IBM SPSS Statistics, clicking on the Statistics button in the Linear Regression dialog box opens a subdialog box with a Collinearity diagnostics check box.

[Screenshot: the Linear Regression Statistics subdialog box]
In version 14 of IBM SPSS Modeler, collinearity diagnostics are requested from a very similar dialog box that is invoked from the Regression node.

[Screenshot: the Modeler Regression node dialog box]
The Collinearity diagnostics option will produce two new columns in the Coefficients table and a Collinearity Diagnostics table.  All of these diagnostics will tell a similar story, and the most commonly used diagnostic is the Tolerance statistic that appears in the Coefficients table for each independent variable.  The Tolerance statistic measures how much variance in each independent variable is NOT explained by the other independent variables.  Tolerance values below .3 (30%) are likely to indicate a problem with multicollinearity, which means that the B and Beta coefficients produced for those variables may be incorrect.
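
For those who prefer syntax, the same diagnostics can be requested directly from the REGRESSION command. A minimal sketch (the variable names are hypothetical):

REGRESSION
  /STATISTICS COEFF R ANOVA TOL COLLIN
  /DEPENDENT satisfaction
  /METHOD=ENTER age income tenure.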

There are other methods available in IBM SPSS Statistics and Modeler for detecting multicollinearity.  Bivariate correlation tests can be run for all of the independent variables.  However, as the name implies, these tests can only test for high correlations between two variables at a time, while multicollinearity refers to the correlations between each independent variable and all of the other independent variables.  Nevertheless, if multicollinearity is occurring because of high correlations among a few variables, this method will be sufficient.  Correlation coefficients above .8 or below -.8 on a scale between -1 and 1 usually indicate multicollinearity at a level that will distort regression coefficients.
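
A minimal syntax sketch of that check (again, hypothetical variable names):

CORRELATIONS
  /VARIABLES=age income tenure
  /PRINT=TWOTAIL NOSIG.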

Factor analysis is a better test of multicollinearity because it can detect high correlations among any number of variables.  Another advantage is that factor analysis can produce factor scores that can be used in lieu of the original independent variables.  If orthogonal (uncorrelated) factor scores are created, this method will completely remove multicollinearity (with tolerance values of 1!).  However, the coefficients associated with the factor scores used as independent variables in a regression can be difficult to interpret.
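
A minimal sketch of that approach (hypothetical variable names), using an orthogonal Varimax rotation and saving regression-method factor scores as new variables:

FACTOR
  /VARIABLES=age income tenure assets
  /ROTATION=VARIMAX
  /SAVE=REG(ALL).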

Linear Regression in IBM SPSS Statistics and the Regression node in Modeler 14 are the only statistical procedures that offer collinearity tests.  However, any set of independent variables can be tested in the Linear Regression procedure, regardless of the regression-based procedure that will ultimately be used.  Since the collinearity test only applies to the independent variables, any variable can be designated as the dependent variable (even the subjects’ ID numbers!).  However, to simplify the output, you should deselect the Model fit and Regression coefficients options, which are selected by default.

The simplest way to fix a multicollinearity problem is to pick one variable from a set of variables that are highly correlated, especially if the high correlations suggest redundancy.  Another simple solution is to use the mean of the highly correlated variables, as sketched below.  Regardless of the solution used, tests for multicollinearity should be run before the analyst begins his or her interpretation of the regression coefficients.
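
A minimal sketch of the second option (hypothetical variable names), standardizing first so that the averaged variables are on a common scale:

* DESCRIPTIVES with /SAVE adds z-score versions named Zincome and Zeducation.
DESCRIPTIVES VARIABLES=income education /SAVE.
COMPUTE ses = MEAN(Zincome, Zeducation).
VARIABLE LABELS ses 'Composite of standardized income and education'.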

About Steve:  Steve has been an education consultant for SPSS since November of 1997.  For most of that time he worked as an external consultant, and in March of 2010 he began working for IBM SPSS full-time.  Before he began working full-time for IBM SPSS, he worked as a researcher for the Center for Mental Health Policy and Services Research at the University of Pennsylvania.  Steve received a PhD in Social Policy, Planning, and Policy Analysis from Columbia University.  He loves to travel, and is an avid backpacker with his son.