Statistics & Analytics Consultants Group Blog

The Statistics & Analytics Consultants group is a network of over 9,000 members. Businesses have the ability to work with consulting firms and individual consultants and eliminate costs. There is also a job board where you can post statistics and analytics jobs. Our members offer a variety of courses to ensure that your company can compete on analytics. Courses range from basic applied understanding of statistical concepts and methods involved in carrying out and interpreting research to advanced modeling and programming.

This blog is a place where featured members are invited to share their expertise and opinions. Opinions are not necessarily the opinions of SACG.

Thursday, November 10, 2011

Leveraging Text Analytics to Help Answer Your Business Questions, by Dawn Marie Evans

Even though the word “Analytics” has exploded everywhere on the business scene, this field is really still in its infancy.  One of the problems with the word is that “Analytics” means different things to different people.  For example, when talking about “Google Analytics,” this generally means web foot-traffic, represented in counts, charts, frequencies, etc.  For statisticians and data miners, “Analytics” refers to taking data, whether it is financial records, customer data, behavioral data, etc. and building predictive models – models that tell us about likely future behavior – that are not just descriptive of past or current phenomena but predictive of future phenomena:  The purpose is to develop a model to answer important and actionable business questions.

“Analytics” may also refer to using open-ended fields – or textual data to create categories which can be joined back to structured data sets through a technique known as National Language Processing (NLP).  It is important to point out that these methods are sensitive to the context.  For example, if the word that is being viewed is “football,” the algorithms that are applied are able to determine if the word is being used in a negative or positive or even neutral way, such as, “He hates football, “(negative) versus, “They were excited about the football game” (positive).  During the process, the analyst, just as with structured data, makes many important choices along the way.

One of the questions I am frequently asked is what type of textual data can be analyzed?  The answer is almost any type of data and very large datasets are desirable.  Examples of these datasets include streaming data (RSS) feeds from the web, Twitter feeds, blogs, PDF documents, open-end questions on surveys.  Analyzing these datasets can be very labor-intensive and time-consuming.  We are in an age where information has become overwhelming; processing and analyzing such information may be difficult, non-standardized, and expensive.  Text analytics/text mining is a standardized, less expensive approach to glean competitive intelligence and to acquire a better understanding of the voice of customers.  Using a data mining stream one can continuously run it, and refresh it to find new and important results at regular intervals.

What does it take to have a text analytics model built? Evans Analytics uses SPSS Modeler, which has a set of premier text analytics tools. SPSS Modeler comes with libraries already built-into the software.  A library is a pre-defined set of sensitive terms and algorithms that can identify and categorize words and phrases. These libraries are a great place to start with a new project. 

Many clients will request that an analyst take the project a step or two further. The next step would be for  the analyst to build custom libraries – specifically developed for the industry, the company, or the project that is analyzed so that the most relevant terms are developed.  These libraries may be saved and be reused, as needed.

 Some clients may just want simple counts.  For example, a client may only want to know a percentage of customers who preferred product X to product Y or a higher percentage of customers provided more positive comments than negative comments about a particular service.  Other clients may request  newly created categories to join back to other structured data, and then predictive modeling or customer segmentation. They may also want to know that customers who preferred product X were also more likely to live in a specific region, be in a certain age range, and also drive a minivan!  Text Analytics becomes more powerful when added to other data to examine whether differences occur by subgroup.

So, how can you leverage text analytics for your business?  Do you have competitors who are blogging or Tweeting or are there news or RSS feeds that are out there as competitive intelligence, but you haven’t gleaned the important information from them that you should be leveraging?  Do you have open ends in surveys that have overwhelmed you, but you know that important information can be extracted? Do you have research that has previously been handled through qualitative methods, but you think it would be stronger if it was analyzed and joined with your structured data?  If you have answered yes to one of these questions, you have a strong case to consider text analytics!

In my next installation, I will explain how to bring previously constructed categories into SPSS Modeler and re-use old qualitative research in a quantitative way. 

Dawn Marie Evans is Group Owner and Manager of SACG; She is an external consultant and trainer at IBM/SPSS and Managing Partner at Evans Analytics.

Tuesday, November 8, 2011

Potential Explanatory Variables, Not All Qualify by David Abbott

It is tempting to think that all explanatory variables (also called covariates or independent variables) available for a given project would be valid and useful variables to include in a regression model. Well, it is not that simple. Candidate explanatory variables can prove unsuitable for regression for a number of reasons. Analysts can save themselves time and trouble by evaluating the suitability of candidate explanatory variables both prior to and during analysis. Here’s a handful of ways that a candidate explanatory variable can fail to quality…

Insufficient variation
To learn about the effect of an explanatory variable requires that the distribution of the explanatory variable in the analytic dataset not be too concentrated. For example, you can’t learn much about the effect of age if almost all the subjects you are studying are retirement age and just a handful in their 20s, 30s, 40s or 50s. The extreme case of this problem is a categorical variable that takes on only a single value in the analytic dataset.

Meaning inconsistent
If the meaning of an explanatory variable differs among the experimental units, high bias can result. For example, personal income taken from social security records for subjects aged 10-50 years exhibits this problem. Clearly, low income in ages 10-20 years has a very different meaning from low income in AGES 30-50 years. If income is being used as a proxy for socioeconomic status, such a shift in meaning could lead the analyst to markedly overstate the effect of socioeconomic status on automobile accidents.

Excessive measurement error
Some measurement error in explanatory variables is routinely tolerated. However, an abundance of it can wash out the actual effect of the explanatory variable or, worse, introduce bias. This issue is commonly a concern when subjects self-report on emotionally charged measures, e.g., number of sexual partners during the study period. If an explanatory variable is seriously contaminated by measurement error it should either be cleaned up or not used.

Meaning not generalizable
Usually, it is important for the findings of a study to be arguably generalizable to situations other than the experiment that generated the data. So, explanatory variables that only have meaning in the context of the study are best avoided, treated as nuisance variables, or reserved for investigating quality/bias issues in the study. For example, the gender of the person administering a survey may be useful to check for surveyor induced bias, but including it as an explanatory variable in the primary regression results clearly raises questions about the generalizability of study findings.

Substantially duplicative
Each explanatory variable included in the model should measure a distinct dimension of the data. When two explanatory variables are too similar – either in their meaning or the pattern of their variation (i.e. highly correlated) – regression results are unstable and sometimes not even calculable. For example, chronological age and number of years of driving experience are highly correlated in US adults and so are substantially duplicative. Hence, when both are used in a model of accident rates the variance of both estimates is inflated and results are hard to interpret. This problem is a special case of a more general problem known as multicollinearity.

Influenced by the outcome
One assumption of regression methods is that explanatory variables influence the outcome (also called response variable or dependend variable) but the outcome should not influence the explanatory variables. This is usually the case, for example subject age is often used as an explanatory variable and it is almost always preposterous to think that the outcome being analyzed influences subject age. A subject’s age is what it is regardless of the outcome. However, sometimes the value obtained for a candiate explanatory variable is strongly influenced by the outcome . Consider, for example, a study using students’ ratings of a teacher to explain students’ grades and further assume the ratings are collected from students after the grades are known. The grade received by a student and his/her rating of the teacher are very much intertwined. It is as easy to argue that the grade influences the rating as it is to argue that the rating influences the grade. In this case, the better way to proceed is view teacher ratings and student grades as two outcomes of the instructional process whose success is predicted by explanatory other variables like class size, text used, student success in prior courses, etc. that are not influenced by the grade received by the student or the student’s rating of the teacher. This situation is sometimes called “reverse causation” and when it is present it distorts and dilutes regression findings and very much muddies the conceptual waters of the study.

In conclusion
Put your explanatory variables to the test using the six criteria discussed above, drop or improve the variables found lacking, and I think you will find your effort put toward explanatory variable qualification amply repaid.

David Abbott is currently a statistician at Durham Veterans Affairs Health Services Research where he supports researches in both medicine and public health. He has advanced degrees in Statistics from the University of North Carolina and Computer Science from Clemson University. He is a heavy user of SAS Base, SAS Stat, and other related SAS products.