PCA, EFA, PA, and CFA
Chong Ho Yu

Why do we use factor analysis?
To find out which observed items can indicate latent constructs.

PCA and EFA
Principal components analysis (PCA): find the optimal way of collapsing many correlated variables into a small number of subsets so that the study is more manageable. The subsets do not need to make any theoretical sense; they are for convenience only.
Exploratory factor analysis (EFA): identify the underlying theoretical structure of diverse variables. If certain items load on a subscale called intrinsic religious orientation, then the items must be related to this construct both mathematically and conceptually.

Example of PCA: Insurance policy
The policy variables (Maitra & Yan):
Fire Protection Class
Number of Building in Policy
Number of Locations in Policy
Maximum Building Age
Building Coverage Indicator
Policy Age

Why is it exploratory?
Haig: EFA is abductive. You should consider alternative factor models.
Haig, B. (2014). Investigating the Psychological World: Scientific Method in the Behavioral Sciences (Life and Mind: Philosophical Issues in Biology and Psychology).

Confusion between PCA & EFA
Although factor analysis and PCA are two different procedures, some researchers have found that they yield almost identical results on many occasions. SPSS uses PCA as the default extraction method when EFA is requested.

JMP
In JMP there are three ways to do factor analysis:
Multivariate
Principal Components
Consumer research
If you go through Principal Components, you can request factor analysis after running PCA, or specify maximum likelihood (ML) as the estimation method.

SAS Enterprise Guide
It is less confusing in SAS Enterprise Guide: both PCA and EFA are shown in the Tasks menu. But if you do programming, PROC FACTOR in SAS uses PCA as the default method.

Check Cronbach Alpha
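As an illustration of the reliability check named on this slide, Cronbach's alpha can be computed from the item variances and the variance of the total score. The sketch below uses Python rather than JMP, SAS, or SPSS; the function name `cronbach_alpha` and the response data are hypothetical and are not taken from the slides.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(rows[0])
    if k < 2:
        raise ValueError("Cronbach's alpha needs at least two items")
    # Sample variance of each item (column) across respondents.
    item_vars = [variance([r[j] for r in rows]) for j in range(k)]
    # Sample variance of each respondent's total score.
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical responses: 5 respondents x 3 Likert-type items.
scores = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
    [1, 2, 2],
]
alpha = cronbach_alpha(scores)
print(f"Cronbach's alpha = {alpha:.3f}")
```

The check requires at least two items, which is why a single-item subscale cannot have an alpha.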

Caution: You may be fooled
When the sample size is extremely large, even random numbers appear to form a pattern.

Vectors
Showing the directions and relationships.

Rotate the eigenvectors
To fully understand how rotation works, you need to understand what eigenvectors are.
Eigenvalue = sum of the squared factor loadings on a factor
Eigenvector = the direction associated with an eigenvalue, which can be drawn as a vector
You need some basic knowledge of matrix algebra and vector geometry to understand eigenvectors.
Optional: This webpage explains the details. http://www.creative-wisdom.com/computer/sas/biplot.html

Factor loading
The number indicates the strength of the correlation between the variable and the factor. For example, Var 1 is more related to Factor A than to Factor B (.75 vs. .32). We could say Var 1 loads on Factor A, and the number is called the factor loading.
In terms of absolute value, higher is better, no matter whether the factor loading is negative or positive.

Simple structure
Simple structure suggests that each variable should be highly related to only one factor.
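A small worked example may make the reading of a loading table concrete. In the sketch below only the Var 1 loadings (.75 and .32) come from the slide; the other rows, the names `assign_factor`, `ss_loadings`, and `communalities`, and the two-factor layout are hypothetical. Note that the sum of squared loadings in a column equals the eigenvalue only for an unrotated principal components solution.

```python
# Loading matrix: each row is a variable, columns are Factor A and Factor B.
# Only the Var 1 row comes from the slide; the rest is made up for illustration.
loadings = {
    "Var 1": [0.75, 0.32],
    "Var 2": [0.20, 0.81],
    "Var 3": [-0.68, 0.15],
    "Var 4": [0.10, -0.72],
}
factor_names = ["A", "B"]


def assign_factor(row):
    """Return the factor with the largest absolute loading (sign is ignored)."""
    best = max(range(len(row)), key=lambda j: abs(row[j]))
    return factor_names[best]


assignment = {var: assign_factor(row) for var, row in loadings.items()}

# Sum of squared loadings per factor (column-wise).
ss_loadings = [
    sum(row[j] ** 2 for row in loadings.values())
    for j in range(len(factor_names))
]

# Communality per variable (row-wise): variance explained by all factors.
communalities = {var: sum(x ** 2 for x in row) for var, row in loadings.items()}

for var, factor in assignment.items():
    print(f"{var} -> Factor {factor} (communality {communalities[var]:.2f})")
print("Sum of squared loadings:", [round(s, 4) for s in ss_loadings])
```

Var 3 and Var 4 show why the sign does not matter for assignment: a loading of -.68 is as strong as +.68.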
If some variables have high loadings on several factors, the researcher should rotate the factors. For instance, in the following case most variables load on Factor A, and variables 3 and 5 have high loadings on both Factor A and Factor B.

Factor rotation
After rotation the structure should be less messy and simpler. In the following case, variables 1, 3, and 5 load on Factor A while variables 2 and 4 load on Factor B.
Rotation methods: Varimax (orthogonal) or Quartimin (non-orthogonal, i.e. oblique).

Various criteria
Kaiser criterion
The scree plot
Parallel analysis (PA)
Many studies have verified that PA is by far the most accurate method (Buja & Eyuboglu, 1992; Glorfeld, 1995; Horn, 1965; Hubbard & Allen, 1987; Humphreys & Montanelli, 1975; Velicer et al., 2000; Zwick & Velicer, 1986).

Scree plot
Determine the number of factors: how much additional information can I get by adding more complexity to the factor model?
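The question on the scree plot slide can be answered numerically: each eigenvalue divided by the total is the share of variance gained by adding that factor. The sketch below assumes a hypothetical set of six eigenvalues from a correlation matrix (so they sum to 6) and applies the eigenvalue >= 1 convention as stated in these slides; none of the numbers come from the author's data.

```python
# Hypothetical eigenvalues of a 6-variable correlation matrix, largest first.
eigenvalues = [2.8, 1.4, 0.7, 0.5, 0.4, 0.2]
total = sum(eigenvalues)

# Kaiser criterion as used in these slides: count eigenvalues of at least 1.
kaiser_count = sum(1 for ev in eigenvalues if ev >= 1.0)

# Cumulative proportion of variance explained as factors are added.
cumulative = []
running = 0.0
for ev in eigenvalues:
    running += ev
    cumulative.append(running / total)

for i, (ev, cum) in enumerate(zip(eigenvalues, cumulative), start=1):
    print(f"{i} factor(s): eigenvalue={ev:.2f}, "
          f"gain={ev / total:.1%}, cumulative={cum:.1%}")
print("Factors retained by the Kaiser criterion:", kaiser_count)
```

The point where the gain per added factor levels off is the "elbow" one looks for in the scree plot.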

Kaiser criterion
Just like the p < .05 cutoff, the Kaiser criterion (eigenvalue >= 1) is just a convention. If necessary, you should override it.
Dr. Shaynah Neshama developed a scale with two constructs, but EFA suggests six factors based on the Kaiser criterion (eigenvalue >= 1).
[Figure: scree plot. Y-axis: Eigenvalue; X-axis: Number of Components.]

Factor loading plot
I forced EFA to use a 2-factor solution. When the variables are represented as vectors, it is clear that there are two clusters. Only one item does not belong to any group. Cut it!

PA: Resampling
The logic of parallel analysis resembles that of resampling: a factor should be retained only if its eigenvalue is greater than the corresponding eigenvalue from random data. The algorithm generates a set of random-data correlation matrices by bootstrapping the data set (resampling with replacement), and then computes the mean and the 95th percentile of the resulting eigenvalues.
The observed eigenvalues are compared against the resampled eigenvalues, and only factors with observed eigenvalues greater than those from resampling are retained. The resampled result functions as a sampling distribution against which the observed eigenvalues are compared. The rationale for using the 95th percentile of the resampled eigenvalues is that it is analogous to setting alpha to .05 in hypothesis testing (Cho, Li, & Bandalos, 2009).

Underfactoring vs. overfactoring
Parallel analysis can be used with PCA or EFA. Which one should be used?
PA with PCA tends to under-factor (extract fewer factors than it should).
PA with EFA tends to over-factor (extract more factors than it should).
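The resampling logic of parallel analysis described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the SAS, SPSS, or EQS program the slides refer to: it uses NumPy, it works on the eigenvalues of the full correlation matrix (the PCA variant of PA), and it builds the "random" data by resampling each column independently with replacement, which is one common way to break the correlations among items. The function name `parallel_analysis`, its parameters, and the simulated two-factor data are all hypothetical.

```python
import numpy as np


def parallel_analysis(data, n_resamples=200, percentile=95, seed=0):
    """Compare observed eigenvalues with eigenvalues from resampled data."""
    rng = np.random.default_rng(seed)
    n, p = data.shape

    def sorted_eigenvalues(x):
        # Eigenvalues of the correlation matrix, largest first.
        return np.sort(np.linalg.eigvalsh(np.corrcoef(x, rowvar=False)))[::-1]

    observed = sorted_eigenvalues(data)
    resampled = np.empty((n_resamples, p))
    for b in range(n_resamples):
        # Assumption: each column is resampled on its own, so the items keep
        # their distributions but lose their relationships with one another.
        random_data = np.column_stack(
            [rng.choice(data[:, j], size=n, replace=True) for j in range(p)]
        )
        resampled[b] = sorted_eigenvalues(random_data)

    mean_eigs = resampled.mean(axis=0)
    cutoff = np.percentile(resampled, percentile, axis=0)

    # Retain leading factors until the first one that fails to beat the cutoff.
    exceeds = observed > cutoff
    n_factors = p if exceeds.all() else int(np.argmin(exceeds))
    return observed, mean_eigs, cutoff, n_factors


# Simulated complete data (no missing values): 300 cases, 6 items, 2 factors.
rng = np.random.default_rng(42)
factors = rng.normal(size=(300, 2))
pattern = np.array([[0.8, 0], [0.8, 0], [0.8, 0], [0, 0.8], [0, 0.8], [0, 0.8]])
data = factors @ pattern.T + 0.6 * rng.normal(size=(300, 6))

observed, mean_eigs, cutoff, n_factors = parallel_analysis(data)
print("Observed eigenvalues:", np.round(observed, 2))
print("95th percentile cutoff:", np.round(cutoff, 2))
print("Factors retained:", n_factors)
```

Plotting `observed`, `mean_eigs`, and `cutoff` against the component number would give the kind of scree plot with raw, PA mean, and 95th percentile lines that the slides mention.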
Under-factoring is a more serious problem than over-factoring. In the former scenario the researcher misses some information entirely. In the latter the result may include some meaningless factors (Crawford, Green, Levy, Lo, Scott, Svetina, & Thompson, 2010), but the researcher can always trim the redundant factors later.
It is better to over-prepare than under-prepare. Consider this analogy: I travel with 2-3 cameras. If I don't need the backup, it is fine. But if I have only one camera and it malfunctions, there is nothing I can do!
If your coauthor sends you a 50-page draft, you can remove the redundant information. If she sends you only two pages, there is nothing you can do!

Scree plot: Raw, PA means and 95th percentile

EQS, SAS or SPSS

SAS
Caution: You must have clean data to run the PA program. If you have missing data, you have to remove those observations; otherwise it won't run.
It is better to retain only the items that will be used for PA and nothing else. That makes the data much easier to read, e.g. you can read all numeric variables into the raw data set.

SPSS
SPSS can omit missing data.

Sample size for EFA
There is no absolute minimum sample size requirement. With a variable-to-factor ratio of at least 7 (e.g. 3 factors and 21 items, or 4 factors and 28 items), the minimum n can be between 150 and 180.
Mundfrom, D., Shaw, D., & Ke, T. L. (2005). Minimum sample size recommendations for conducting factor analyses. International Journal of Testing, 5, 159-168.

EFA is not enough
We need confirmatory factor analysis (CFA). Why?
'EFA is an error-prone procedure even when the scale being analyzed has a strong factor structure, and even with large samples. Our analyses demonstrate that at a 20:1 subject to item ratio there are error rates well above the field standard alpha = .05 level. It should be used only for exploring data, not hypothesis or theory testing, nor is it suited to validation of instruments.'
Osborne, J. W. (2014). Best practices in exploratory factor analysis (Kindle Locations 2305-2310). Amazon Digital Services.

Confirmatory factor analysis
You cannot do CFA in SPSS Base or Standard. You can use:
IBM SPSS AMOS
SAS or JMP connected to SAS
EQS
Mplus

Worst case scenario of factor model
You need at least three observed indicators for a construct. If you use TETRAD, you need four.
Verify whether your factor model is doable in CFA. If you have only two, you will have negative DF!

Example: Brief COPE
Measures coping with adversity.
Original COPE: 60 items, 4 items per subscale (factor); too long, and patients become impatient.
Brief COPE: reduced to 2 items per subscale.
Carver, C. (1997). You want to measure coping but your protocol's too long: Consider the brief COPE. International Journal of Behavioral Medicine, 4, 92-100.
Is it acceptable? Only EFA is used, no CFA.

How about NPI?
The Neuropsychiatric Inventory (NPI) has a two-item factor. The model is validated in CFA. The degrees of freedom are 34.

Someone says it is OK
Raubenheimer (2004): 'Scales with more than one factor may be identified with as little as two items per factor...' (p. 60)
Yeah! It is OK! Cite this:
Raubenheimer, J. (2004). An item selection procedure to maximize scale reliability and validity. SA Journal of Industrial Psychology, 30, 59-64.
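The arithmetic behind the negative-DF warning and the two-items-per-factor exception can be sketched as a simple parameter count. The sketch below assumes a standard CFA in which each item loads on exactly one factor, errors are uncorrelated, factors are free to correlate, and each factor's scale is fixed (for example by setting its variance to 1). Under those assumptions the free parameters are one loading and one error variance per item plus the factor covariances. The function name `cfa_degrees_of_freedom` is hypothetical, and a non-negative count is a necessary condition for identification, not a guarantee.

```python
def cfa_degrees_of_freedom(items_per_factor):
    """Degrees of freedom for a simple-structure CFA.

    items_per_factor: list giving the number of indicators for each factor.
    """
    p = sum(items_per_factor)   # total number of observed items
    k = len(items_per_factor)   # number of factors
    known = p * (p + 1) // 2    # unique variances and covariances in the data
    # Free parameters: p loadings + p error variances + factor covariances.
    free = 2 * p + k * (k - 1) // 2
    return known - free


for spec in ([2], [3], [4], [2, 2]):
    print(f"items per factor {spec}: df = {cfa_degrees_of_freedom(spec)}")
```

With a single factor, two indicators give a negative count and three give exactly zero, which matches the three-indicator rule of thumb. Two factors with two items each give a positive count because the factor covariance links the two pairs of items, consistent with the exception Raubenheimer describes.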

Why do you use exceptional models?
Please cite the entire passage: 'Scales with more than one factor may be identified with as little as two items per factor, although these should be seen as the exception. The usual case is that a minimum of 3 items must load significantly on each factor in a multidimensional scale, for all of the subscales to be successfully identified' (p. 60).

Example: DUREL
Duke University Religion Index (DUREL): a brief measure of religiosity.
Five items and three dimensions:
Organizational religious activity: attending church
Non-organizational religious activity: prayer, meditation
Intrinsic religiosity: subjective

How about one item in a subscale?
Can we do that? Yes, we can! The single item must be treated as ordinal. In CFA it is like an observed item instead of a construct. There will be no issue about DF, but you cannot get Cronbach's alpha from a single item.
The scores of the three-item subscale can be summed or averaged to form a composite score. That number can be treated as continuous.

Assignment
Download the data set resilence.jmp from http://creative-wisdom.com/teaching/462/Unit12/
In JMP, choose Analyze > Consumer Research > Factor Analysis.
Choose maximum likelihood (ML) as the estimation method and covariance as the variance scaling.

Assignment (continued)
Move every item into Y, except factor 1 sum, factor 2 sum, factor 1 average, and factor 2 average.
According to the Kaiser criterion (eigenvalue >= 1), seven factors are extracted. There are only 15 items, meaning that there may be 2 items per subscale!
Press Go and see what will happen.

Assignment (continued)
According to the rotated factor loadings, how many factors should be extracted?
Rerun the factor analysis. In the first dialog box click Recall. This time use quartimax. Choose 2 factors only.
Based on the factor loading plot, how many dimensions are there? Which item does not belong to any dimension?
What is your final solution?