SACG Blog: October 2011

Statistics & Analytics Consultants Group Blog

The Statistics & Analytics Consultants group is a network of over 9,000 members. Businesses have the ability to work with consulting firms and individual consultants and eliminate costs. There is also a job board where you can post statistics and analytics jobs. Our members offer a variety of courses to ensure that your company can compete on analytics. Courses range from basic applied understanding of statistical concepts and methods involved in carrying out and interpreting research to advanced modeling and programming.

This blog is a place where featured members are invited to share their expertise and opinions. Opinions are not necessarily the opinions of SACG.

Saturday, October 22, 2011

SACG – Who We Are and What We Do

Statistics & Analytics Consultants, which was started in 2008, is a group that will very shortly have over 10,000 members worldwide. Statistics & Analytics Consultants Group is dedicated to providing statisticians the opportunity to network with others in their field, share ideas, and make business contacts.

Our goal is to introduce statisticians and analysts to business contacts for consulting opportunities. We also would like statisticians to start discussions to share ideas and best practices and connect with each other. Anyone with a statistical background is welcome, as are all statistical disciplines. The group is comprised of those who are involved in different aspects of many disciplines related to statistics and analytics – including actuaries, academia, corporations, banking, programmers, pharmacy, biostatistics, manufacturing, engineering, etc. However, the focus is on supporting consultants and their skills in the industries in which they practice – from statistical and technical skills to project management, business development, etc.

On LinkedIn, responding to requests from members, we recently started subgroups in different software areas: SPSS, SAS, R-Project, Excel, and Stata. These subgroups are headed by moderators who are leaders in these particular areas. Some of the discussion topics we have had in the group have included:

· “How to detect fraudulent behavior of sale personnel of the Company through statistical analysis of Sales Data”

· “Checking for Falsification or Duplication of Records”

· “Removing Multicollinearity”

· “Is Statistical analysis a part of Data mining or Data mining is the part of Statistical analysis?”

· “A bank has a test designed to establish the credit rating of a loan applicant. Of the persons, who default (D), 90% fail the test (F). Of the persons, who will repay the bank (ND), 5% fail the test. Furthermore, it is given that 4% of the population is not worthy of credit; i.e., P(D) = .04. Given that someone failed the test, what is the probability that he actually will default?”
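
The last question above is a direct application of Bayes' theorem. As a minimal sketch (the function and variable names are illustrative and not taken from the group discussion), the answer can be computed from the three figures given in the question:

```python
def posterior_default_given_fail(p_default, p_fail_given_default, p_fail_given_repay):
    """Return P(D | F) using Bayes' theorem."""
    p_repay = 1.0 - p_default
    # Total probability of failing the test, over defaulters and non-defaulters.
    p_fail = p_fail_given_default * p_default + p_fail_given_repay * p_repay
    return p_fail_given_default * p_default / p_fail


# Figures from the question: P(D) = .04, P(F | D) = .90, P(F | ND) = .05
p = posterior_default_given_fail(0.04, 0.90, 0.05)
print(f"P(D | F) = {p:.4f}")  # 0.036 / 0.084 = 3/7, about 0.4286
```

With these figures, only about 43% of applicants who fail the test would actually default, because non-defaulters greatly outnumber defaulters in the population.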

Our discussions are rich and varied, frequently helpful, and sometimes quite vibrant! We invite you to join us on this website, as well as on LinkedIn, to post questions in the forums, to share code and datasets (shortly there will be a place for such), and to submit a guest blog, which can be hyperlinked back to your own website or Twitter account. To submit a blog, send your submission to: info@statisticalconsultants.net. To join our group on LinkedIn, apply at this link: Statistics & Analytics Consultants.

In addition, because we are interested in what statistics, analytics, and business intelligence tools you are using, to better serve you, please take our survey: SACG Survey

Friday, October 21, 2011

Trainer Tip: Multicollinearity, by Steve Poulin, Ph.D., Trainer & Consultant, IBM/SPSS

Multicollinearity (sometimes called Collinearity)

This describes the condition of high correlations among two or more independent variables used in a multiple regression technique. Based on my experience, it is one of the most common threats to accurately ranking the effects of the independent variables used in a regression analysis. This condition affects any technique based on regression principles, including linear regression, binary and multinomial logistic regression, the Cox regression survival analysis technique, and an ARIMA time-series analysis that uses transfer functions to measure the effect of two or more external series.

The most direct test of multicollinearity is available in the Linear Regression procedure (Analyze/Linear Regression) within the IBM SPSS Statistics software and the Regression node in version 14 of the IBM SPSS Modeler software. Within IBM SPSS Statistics, clicking on the Statistics button in the Linear Regression dialog box opens the following subdialog box:

In version 14 of IBM SPSS Modeler, collinearity diagnostics are requested from a very similar dialog box that is invoked from the Regression node:

The collinearity diagnostics option will produce two new columns in the Coefficients table and a Collinearity Diagnostics table. All of these diagnostics will tell a similar story, and the most commonly used diagnostic is the Tolerance statistic that appears in the Coefficients table for each independent variable. The Tolerance statistic measures how much variance in each independent variable is NOT explained by the other independent variables. Tolerance values below .3 (30%) are likely to indicate a problem with multicollinearity, which means that the B and Beta coefficients produced for those variables may be incorrect.
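
For readers working outside SPSS, the Tolerance statistic described above can be sketched in a few lines of Python. This is an illustration using NumPy on simulated data, not the SPSS implementation. The function name `tolerance` and the variables `x1` to `x4` are assumptions made for the example. Each predictor is regressed on the others, and its tolerance is the share of its variance left unexplained (1 minus R-squared).

```python
import numpy as np


def tolerance(X):
    """Return 1 - R^2 for each column of X regressed on the remaining columns."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # add an intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        # Unexplained variance divided by total variance equals 1 - R^2.
        out[j] = (resid @ resid) / ((y - y.mean()) ** 2).sum()
    return out


# Simulated predictors (hypothetical): x3 is nearly a combination of x1 and x2.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + 0.1 * rng.normal(size=n)
x4 = rng.normal(size=n)  # unrelated to the others
X = np.column_stack([x1, x2, x3, x4])

tol = tolerance(X)
for name, t in zip(["x1", "x2", "x3", "x4"], tol):
    flag = "possible multicollinearity" if t < 0.3 else "ok"
    print(f"{name}: tolerance = {t:.3f} ({flag})")
```

No dependent variable appears anywhere in the calculation, since tolerance depends only on the independent variables.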

There are other methods available in IBM SPSS Statistics and Modeler for detecting multicollinearity. Bivariate correlation tests can be run for all of the independent variables. However, as the name implies, these tests can only test for high correlations between two variables at a time, while multicollinearity refers to the correlations between each independent variable and all of the other independent variables. Nevertheless, if multicollinearity is occurring because of high correlations among a few variables, this method will be sufficient. Correlation coefficients above .8 or below -.8 on a scale between -1 and 1 usually indicate multicollinearity at a level that will distort regression coefficients.
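
The pairwise screen described above can be sketched as follows. This is a minimal illustration with NumPy. The helper `high_correlation_pairs`, the variable names, and the simulated data are all hypothetical, and the .8 cutoff is the rule of thumb from the paragraph above.

```python
import numpy as np


def high_correlation_pairs(X, names, threshold=0.8):
    """Return (name_a, name_b, r) for every pair of columns with |r| > threshold."""
    r = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(r[i, j]) > threshold:
                pairs.append((names[i], names[j], float(r[i, j])))
    return pairs


# Simulated predictors (hypothetical): spending closely tracks income.
rng = np.random.default_rng(1)
n = 300
income = rng.normal(size=n)
spending = income + 0.3 * rng.normal(size=n)
age = rng.normal(size=n)
X = np.column_stack([income, spending, age])
names = ["income", "spending", "age"]

pairs = high_correlation_pairs(X, names)
for a, b, r in pairs:
    print(f"{a} vs {b}: r = {r:.2f}")
```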

Factor analysis is a better test of multicollinearity because it can detect high correlations among any number of variables. Another advantage is that factor analysis can produce factor scores that can be used in lieu of the original independent variables. If orthogonal (uncorrelated) factor scores are created, this method will completely remove multicollinearity (with tolerance values of 1!). However, the coefficients associated with the factor scores used as independent variables in a regression can be difficult to interpret.
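
To illustrate why orthogonal scores remove multicollinearity, the sketch below uses principal component scores computed with NumPy. Note the substitution: this is principal components analysis, a close relative of factor analysis, and not the factor procedure in SPSS. The data are simulated and all names are assumptions made for the example.

```python
import numpy as np

# Simulated predictors (hypothetical): the first two share a common source.
rng = np.random.default_rng(2)
n = 400
base = rng.normal(size=n)
X = np.column_stack([
    base + 0.2 * rng.normal(size=n),
    base + 0.2 * rng.normal(size=n),
    rng.normal(size=n),
])

# Standardize each column, then decompose with the singular value decomposition.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
scores = Z @ Vt.T  # one column of scores per principal component

mask = ~np.eye(3, dtype=bool)  # selects the off-diagonal correlations
before = np.corrcoef(X, rowvar=False)
after = np.corrcoef(scores, rowvar=False)

print("largest |r| among original variables:", round(float(np.abs(before[mask]).max()), 3))
print("largest |r| among component scores:  ", round(float(np.abs(after[mask]).max()), 3))
```

Because the component scores are mutually uncorrelated, regressing any one of them on the others explains none of its variance, which is what a tolerance of 1 means.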

Linear Regression in IBM SPSS Statistics and the Regression node in Modeler 14 are the only statistical procedures that offer collinearity tests. However, any set of independent variables can be tested in the Linear Regression procedure, regardless of the regression-based procedures that will be used. Since the collinearity test only applies to the independent variables, any dependent variable can be designated as the dependent variable (even the subject’s ID numbers!). However, to simplify the output, you should deselect the Model fit and Regression coefficients, which are selected by default.

The simplest way to fix a multicollinearity problem is to simply pick one of a set of variables that are highly correlated, especially if high correlations suggest redundancy. Another simple solution is to use the mean of highly correlated variables. Regardless of the solution used, tests for multicollinearity should be run before the analyst begins his or her interpretation of the regression coefficients.
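
The averaging approach mentioned above might look like the following sketch. Standardizing each variable before averaging is an assumption added here so that variables on different scales contribute equally. The helper `composite_mean` and the simulated variables are hypothetical.

```python
import numpy as np


def composite_mean(X, cols):
    """Average the z-scores of the selected columns into one composite variable."""
    block = np.asarray(X, dtype=float)[:, cols]
    z = (block - block.mean(axis=0)) / block.std(axis=0, ddof=1)
    return z.mean(axis=1)


# Simulated predictors (hypothetical): satisfaction and loyalty are nearly redundant.
rng = np.random.default_rng(3)
n = 200
satisfaction = rng.normal(size=n)
loyalty = satisfaction + 0.3 * rng.normal(size=n)
tenure = rng.normal(size=n)
X = np.column_stack([satisfaction, loyalty, tenure])

# Replace the two correlated columns with their composite, keep tenure as is.
attitude = composite_mean(X, [0, 1])
reduced = np.column_stack([attitude, X[:, 2]])

r = np.corrcoef(reduced, rowvar=False)[0, 1]
print(f"correlation between composite and remaining predictor: {r:.2f}")
```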

About Steve: Steve has been an education consultant for SPSS since November of 1997. For most of that time he worked as an external consultant, and in March of 2010 he began working for IBM SPSS full-time. Before he began working full-time for IBM SPSS, he worked as a researcher for the Center for Mental Health Policy and Services Research at the University of Pennsylvania. Steve received a PhD in Social Policy, Planning, and Policy Analysis from Columbia University. He loves to travel, and is an avid backpacker with his son.

Thursday, October 6, 2011

The Data Miner’s somewhat surprising role as Honest Broker and Change Agent by Keith McCormick

They say that you can’t be a prophet in your own land. As someone who is always an outsider to the organizations that I Data Mine in, I find this to be true. I find that building a model is rarely more than 10-20% of the time I spend in front of the laptop, and fully a third of my time is not spent in front of a laptop at all. This is an explanation of what I find myself doing in all of those many hours that I am not using Data Mining software, or any software. What else is there to do?

Inspire Calm: I am often greeted with the admission that my new client’s Data Warehouse is not quite as complete, nor quite as sophisticated as they would like. No one’s is! It is interesting that it is one of the first facts that is shared because it implies that if only they had the perfect Data Warehouse, the Data Mining project would be easy. Well, they are never easy. Important work is hard work, and no one really has a perfect Data Warehouse because IT has a hard job to do as well. So, the experienced Data Miner is in a good position to explain that the client really isn’t so far behind everyone else.

Advocate for the Analysis Team’s time within their department: Yes, this is a full-time endeavor! It is surprising how often Data Mining is confused with ad hoc queries like “How many of X did we sell in Q1 in Region A?” I am not sure where this comes from, but new Data Miners are left wondering how they can perform all six stages of CRISP-DM in time for next Tuesday’s meeting. By the time an external consulting resource is involved this confusion is largely cleared up, but sometimes a little bit of it lingers. How can the internal members perform all of their ongoing functions, and commit to a full-time multi-week effort? Of course, they can’t. A bit of realism often sinks in during the first week of a project. This is much better addressed earlier than later.

Inspire loftier goals: Data preparation is said to take 70-90% of the effort. I have experienced little to convince me that this estimate is far off. The ‘let’s do something preliminary’ thing can be inefficient if you aren’t careful because on a daily basis one is making decisions about how the inputs interact. Refreshing the model on more recent data is straightforward, but if you substantively change the recipe of the variable gumbo that you are mining, you have to repeat a lot of work, and revisit a lot of decisions. It is possible, with careful planning, to minimize the impact, but you risk increasing (albeit not doubling) the data preparation time. It is ultimately best to communicate the importance of the endeavor, knock on doors, marshal resources, and do the most complete job you can right now.

Act as a liaison with IT: An almost universal truth is that IT has been warned that the Data Miner needs their data, but IT has not been warned that the Data Miner needs their time and attention. Of course, no one wants to be a burden to another team, but some additional burden is inevitable. The analyst about to embark on a Data Mining project is going to have unanswered questions or unfulfilled needs that are going to require the IT team. The external Data Mining resource will often have to explain to IT management that there is no way to completely eliminate this; that it is natural, and it is not the analysis team’s fault. Concurrent with that, the veteran Data Miner has to anticipate when the extra burden will occur, act to mitigate it, and try to schedule it as conveniently as possible.

Fight for project support (and data) from other departments: Certain players in the organization are expecting to be involved, like IT. Often the word has to get out that a successful Data Mining project is a top to bottom search for relevant data. Some will be surprised that it is a stone in their department that has been left unturned. They may not be pleased. Excited as they may be about the benefit that the entire company will derive, you are catching them at an inopportune moment as they leave for vacation, or as a critical deadline looms. Fair warning is always wise, and it should come early. Done properly, the key player in a highly visible project gets a little (not a lot of) political capital which they should spend carefully.

Help get everyone thinking about Deployment and ROI from the start: Far too often it is assumed that the analysts are in charge of the “insights”, and the management team, having received the magic PowerPoint slides, will pick it up from there, and ride the insights all the way to deployment and ROI. Has this ever happened? The Data Miner must coach, albeit gently, that a better plan must be in place, and the better planning must begin the very first week of a data mining project. Let executives play their critical role, but a little coaching is good for everyone. After all, it might be everyone’s first Data Mining project.

Fade into the background: Everyone wants credit for their hard work, but the wise Data Miner lets the project advocates and internal customers do all the talking at the valedictory meeting. The best place to be is on hand, but quiet. Frankly, if the Data Miner is still shoulder deep in the project, the project isn’t ready for a celebration. The “final” meeting, probably the first of many final meetings, should be about passing the torch, reporting initial (or estimated) ROI, and announcing deployment details.

Keith is an independent consultant who blogs at http://www.keithmccormick.com/

From Survey Questions to Business Applications By Dawn Marie Evans & Steven J. Fink

As a manager you have important business questions you need answered – and with the explosion of analytics, managers are expected to use the data to drive decisions. Buzzwords like “Voice of the Customer,” “Customer Segmentation,” “Competitive Intelligence,” and “Business Intelligence” are bandied about – but how can you nail down a definitive methodology to answer your important question?

One tool for gaining access to the voice of your customers, employees, or population of interest is a survey. How do you know when it is time to launch a survey? The short answer to this is when the available data that you have on hand (generally within your company’s databases) fall short in answering your most pressing business questions. Why hire an expert? Because if not properly constructed or sampled, the survey most likely will yield results that tell you very little of importance, cannot be joined back to your own data with confidence, or are not representative of your population of interest. You want to have confidence in the tool itself and in the results that it yields.

Below are two business case examples where surveys have been used to answer important business questions. You may find these of interest within your own business context:

Customer Segmentation for an Online Company

We worked with a company whose products were sold exclusively online and which had a database of customer records on hand. However, this information was incomplete regarding certain attitudinal information, as well as behavioral information as to how customers were shopping with competitors – both online and in-store. Launching a survey to a large sample of customers allowed us to gain insight into attitudes and behaviors of customers. Using a clustering technique, customers were segmented into several key segments that had very different characteristics, based on attitudes, shopping preferences, demographics, etc.

Using principal components analysis, the survey was then reduced to just a few main questions. When future customers registered on the site and answered these few questions, along with key demographics, they were placed into one of the segments where they would receive targeted marketing messages. This survey helped to answer business questions of: Who are our customers? What are their motivations for shopping with us? What are their buying behaviors by segment and demographics? Who are the major competitors by segments? From here, the marketing department was able to develop the creative messages targeted specifically to each segment.

What Does a Survey Have to Do With Your Salary?

In another key application, an association requested the administration of an annual Compensation Survey to collect data from their members about how much they earn, how much extra they receive in cash bonuses, and deferred compensation. Survey results may be disaggregated by level of education, position, region of the country, academic vs. non-academic, public vs. private, etc. Associations may also examine trend data of their members over 2, 3, or 5 years. In asking such sensitive information of workers, it is important to hire those who are skilled at constructing surveys in such a way that respondents are likely to follow through to the end of the survey. If you start with questions that are too sensitive or too complex early on, it is unlikely that those taking the survey will finish. It is also important that this be done by evaluators external to a person’s place of business – there needs to be a buffer, a sense of safety in answering questions that may be attitudinal with regard to their work, salary, work environment, and so forth.

Who uses this information? Human Resources departments use this information to figure out how much to offer prospective employees or to determine whether their employees are in line with industry practices. Similarly, prospective employees may use this information to know how much they can expect to earn. Current employees may also use this information to compare their compensation to their peers.

So, the next time you want to know whether you are being paid fairly, go to an association website to compare how much you could be earning. Where did they get this information? From a survey, of course!

If you have an important business question, and your current data cannot provide all the answers, ask Evans Analytics at info@evansanalytics.com to design and analyze a survey for you.