PCA, PCoA, and Tree Plot using NTSYSpc
This tutorial explains how to perform Principal Component Analysis (PCA) and Principal Coordinate Analysis (PCoA) of multivariate data using the NTSYSpc tool. NTSYSpc (Numerical Taxonomy and Multivariate Analysis SYStem for personal computer) is a simple and efficient statistical tool used to find patterns and display structures in multivariate data. The tool used in this tutorial is NTSYSpc v2.11a.
In this tutorial, I use a sample allelic data matrix (shown below). Its rows are molecular weight size markers and its columns are samples/loci.
\[M_{mw,s}=\begin{bmatrix}0&0&1&1&0&0\\0&0&1&0&0&0\\0&0&1&1&0&0\\0&0&1&1&1&1\\1&1&1&0&0&0\\0&1&1&1&0&0\\0&0&1&1&1&1\\0&0&0&0&1&1\\1&0&0&0&0&0\\1&1&0&0&0&0\\1&1&0&0&0&0\\0&1&0&0&0&0\end{bmatrix}\]
The NTSYSpc tool accepts several file formats for input and export: NTSYSpc dataset (.NTS) files, NTSYSpc batch (.NTB) files, Microsoft Excel Worksheet (.XLS/.XLSX) files, Comma Separated Values (.CSV) files, and MatLab (.M) files. It can run in interactive mode (a step-by-step procedure) or in batch mode (command lines). The NTSYSpc data editor (NTedit) preview of the sample data matrix is below.
The common program parameters used in an NTSYSpc batch file are: o, the input data matrix file; r, the output matrix file (for example, a similarity or dissimilarity matrix); output, the display matrix file; c, the type of coefficient (for example, corr for the Pearson correlation coefficient); d, the direction in which the matrix is read, where row (checked by default) treats the rows as the variables and col treats the columns as the variables; n, the sample sizes for each coefficient or the number of dimensions of the output matrix (default is 3); val, the eigenvalue matrix file; and f, the feature vector matrix file.
NTSYSpc File Format
An NTSYSpc matrix consists of 4 records: comments, matrix parameter line, row and column labels, and matrix data lines.
1. Comments (optional) – comment lines describing the data. Each comment line must begin with a double quotation mark (") or a single quotation mark/apostrophe (').
2. Matrix Parameter Line – type and dimensions of the matrix. This line contains 4 integers separated by at least one blank space (the second and third may carry a suffix letter to indicate the presence and location of row and column labels). The first number gives the type of matrix: (1) rectangular data matrix, (2) symmetric dissimilarity matrix, (3) symmetric similarity matrix, (4) diagonal matrix, (5) tree matrix for dissimilarity data, (6) tree matrix for similarity data, (7) graph matrix for dissimilarity data, and (8) graph matrix for similarity data. The second and third numbers give the number of rows and columns of the matrix. They may carry a suffix letter “B”, “E”, or “L” to indicate the presence of labels in the matrix: “B” means the label is the first element at the beginning of each row, “E” means the label is the last element at the end of each row, and “L” means the labels are given as a separate list (row labels and column labels on separate lines) before the matrix data lines. The fourth number is 0 if there are no missing data in the matrix. If there are missing data, the fourth number should be 1, followed by at least one blank and then the numerical code used to denote the missing values; 999 is a popular choice. The code should not be a value that could result from standardization or other transformations of the data matrix (e.g., do not use 0 if you are going to standardize the data).
3. Row and Column Labels – labels of the rows and columns. Labels must be supplied if a “B”, “E”, or “L” is placed after the number of rows and/or an “L” after the number of columns in the matrix parameter line. When the results are plotted, the underscore character “_” is displayed as a blank.
4. Matrix Data Lines – the matrix data. The elements of the matrix are entered row by row, with each row occupying one or more lines. Symmetric matrices are entered row by row as the lower half matrix: each row starts with column 1 and ends with the diagonal element. If all the elements of a row do not fit on a single line, you may continue on as many new lines as needed. It is important that the first element of a new row starts on a new line, even if the previous line is mostly empty. The elements themselves are free format. Values must be separated by one or more blanks or by a comma.
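As a rough illustration of the matrix parameter line described above, the following Python sketch splits such a line into its parts. The names `parse_parameter_line` and `_split_dimension` and the dictionary keys are my own choices for illustration and are not part of NTSYSpc. The sketch assumes that the row count may carry “B”, “E”, or “L” and the column count only “L”.

```python
import re

# Matrix type codes as listed in the parameter line description.
MATRIX_TYPES = {
    1: "rectangular data matrix",
    2: "symmetric dissimilarity matrix",
    3: "symmetric similarity matrix",
    4: "diagonal matrix",
    5: "tree matrix for dissimilarity data",
    6: "tree matrix for similarity data",
    7: "graph matrix for dissimilarity data",
    8: "graph matrix for similarity data",
}


def _split_dimension(token, allowed_suffixes):
    """Split a token such as '12B' into (12, 'B'); the suffix is None if absent."""
    match = re.fullmatch(r"(\d+)([A-Za-z]?)", token)
    if match is None:
        raise ValueError(f"bad dimension token: {token!r}")
    suffix = match.group(2).upper() or None
    if suffix is not None and suffix not in allowed_suffixes:
        raise ValueError(f"unexpected label suffix in {token!r}")
    return int(match.group(1)), suffix


def parse_parameter_line(line):
    """Hypothetical helper: decode an NTSYSpc matrix parameter line."""
    tokens = line.split()
    if len(tokens) < 4:
        raise ValueError("expected at least 4 fields")
    matrix_type = int(tokens[0])
    rows, row_labels = _split_dimension(tokens[1], "BEL")
    cols, col_labels = _split_dimension(tokens[2], "L")
    # The missing-value code follows the fourth number only when it is 1.
    missing_code = float(tokens[4]) if tokens[3] == "1" else None
    return {
        "type": MATRIX_TYPES[matrix_type],
        "rows": rows,
        "row_labels": row_labels,
        "cols": cols,
        "col_labels": col_labels,
        "missing_code": missing_code,
    }


print(parse_parameter_line("1 12B 6L 0"))
```

Returning a plain dictionary keeps the sketch short; a stricter reader would also check that the number of labels and data values matches the stated dimensions.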
For example, the input data matrix in NTSYSpc file format is shown below:
"Marker - Molecular weight x Locus
1 12B 6L 0
S_1 S_2 S_3 S_4 S_5 S_6
3000 0 0 1 1 0 0
2500 0 0 1 0 0 0
2400 0 0 1 1 0 0
2000 0 0 1 1 1 1
1600 1 1 1 0 0 0
1500 0 1 1 1 0 0
1000 0 0 1 1 1 1
700 0 0 0 0 1 1
600 1 0 0 0 0 0
500 1 1 0 0 0 0
400 1 1 0 0 0 0
120 0 1 0 0 0 0
More than one matrix can be stored in a single file. The records for a second matrix (starting with the optional comment lines) directly follow after those for the first.
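The example file above can also be generated programmatically. The Python sketch below writes a labelled rectangular matrix in the same layout as the example (row labels at the beginning of each row, column labels as a separate list, no missing data). The function name `to_nts` is hypothetical, and the sketch covers only this one layout.

```python
# Sample allelic matrix from this tutorial: 12 markers (rows) x 6 samples (columns).
MATRIX = [
    [0, 0, 1, 1, 0, 0], [0, 0, 1, 0, 0, 0], [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 1], [1, 1, 1, 0, 0, 0], [0, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 1], [0, 0, 0, 0, 1, 1], [1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0],
]
MARKERS = [3000, 2500, 2400, 2000, 1600, 1500, 1000, 700, 600, 500, 400, 120]
SAMPLES = [f"S_{i}" for i in range(1, 7)]


def to_nts(matrix, row_labels, col_labels, comment=None):
    """Hypothetical helper: format a rectangular matrix as NTSYSpc text."""
    lines = []
    if comment:
        # Comment lines begin with a double quotation mark.
        lines.append('"' + comment)
    # Type 1 = rectangular; "B" = row labels first in each row;
    # "L" = column labels as a separate list; 0 = no missing data.
    lines.append(f"1 {len(matrix)}B {len(col_labels)}L 0")
    lines.append(" ".join(col_labels))
    for label, row in zip(row_labels, matrix):
        if len(row) != len(col_labels):
            raise ValueError("row length does not match the column labels")
        # Every matrix row starts on its own line.
        lines.append(" ".join(str(value) for value in [label, *row]))
    return "\n".join(lines) + "\n"


NTS_TEXT = to_nts(MATRIX, MARKERS, SAMPLES, "Marker - Molecular weight x Locus")
print(NTS_TEXT)
```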
Principal Component Analysis
Principal component analysis is an ordination (dimensionality reduction) method that reduces high-dimensional data to a few dimensions while retaining as much of the original variation as possible.
The following batch file will standardize a data matrix by variables (rows), compute the correlation matrix (i.e., similarity or dissimilarity indices), extract the first 3 PCA axes from the correlation matrix by computing its eigenvectors and eigenvalues, project the standardized data onto these eigenvectors (i.e., the feature vectors), and then make a 3-dimensional (3D) plot of the projections (objects).
*stand o=data.nts r=sdata.nts d=row
*simint o=sdata.nts r=corr.nts c=corr d=row
*eigen o=corr.nts r=vect.nts val=val.nts n=3
*proj o=sdata.nts r=proj.nts f=vect.nts d=col
*mod3d o=proj.nts s=data.nts
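To make the numerical steps behind the batch file above more concrete, here is a minimal Python sketch of the same sequence: standardize the rows, compute the correlation matrix, extract eigenvalues and eigenvectors, and project the standardized data. It is not NTSYSpc's implementation. The Jacobi eigen solver is my own substitution to keep the example free of third-party libraries, all function names are hypothetical, and the eigenvectors are left at unit length, so the scaling of the projections may differ from what NTSYSpc reports.

```python
import math

# Sample allelic matrix from this tutorial: 12 markers (rows) x 6 samples (columns).
DATA = [
    [0, 0, 1, 1, 0, 0], [0, 0, 1, 0, 0, 0], [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 1], [1, 1, 1, 0, 0, 0], [0, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 1], [0, 0, 0, 0, 1, 1], [1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0],
]


def standardize_rows(matrix):
    """Centre each row on its mean and divide by its sample standard deviation."""
    out = []
    for row in matrix:
        mean = sum(row) / len(row)
        sd = math.sqrt(sum((x - mean) ** 2 for x in row) / (len(row) - 1))
        out.append([(x - mean) / sd for x in row])
    return out


def correlation(z):
    """Pearson correlation between every pair of standardized rows."""
    dof = len(z[0]) - 1
    return [[sum(a * b for a, b in zip(ri, rj)) / dof for rj in z] for ri in z]


def jacobi_eigen(sym, sweeps=100):
    """Eigenvalues and eigenvectors (as columns) of a symmetric matrix."""
    n = len(sym)
    a = [list(map(float, row)) for row in sym]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < 1e-22:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                # Rotation angle that zeroes the off-diagonal element a[p][q].
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q of A
                    a[k][p], a[k][q] = c * a[k][p] - s * a[k][q], s * a[k][p] + c * a[k][q]
                for k in range(n):  # rotate rows p and q of A
                    a[p][k], a[q][k] = c * a[p][k] - s * a[q][k], s * a[p][k] + c * a[q][k]
                for k in range(n):  # accumulate the eigenvectors
                    v[k][p], v[k][q] = c * v[k][p] - s * v[k][q], s * v[k][p] + c * v[k][q]
    return [a[i][i] for i in range(n)], v


def pca(matrix, n_axes=3):
    """Return the largest eigenvalues and the object projections on those axes."""
    z = standardize_rows(matrix)
    values, vectors = jacobi_eigen(correlation(z))
    order = sorted(range(len(values)), key=lambda i: -values[i])[:n_axes]
    axes = [[vectors[k][i] for k in range(len(z))] for i in order]
    # One coordinate per object (column of the data matrix) on each axis.
    proj = [[sum(axis[k] * z[k][j] for k in range(len(z))) for j in range(len(z[0]))]
            for axis in axes]
    return [values[i] for i in order], proj


eigenvalues, projections = pca(DATA)
for value, coords in zip(eigenvalues, projections):
    print(round(value, 4), [round(x, 3) for x in coords])
```

Each printed line gives one eigenvalue followed by the coordinates of the six samples on that axis, which is the information the 3D plot displays.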
Principal Coordinate Analysis
PCoA is an alternative to PCA; when it is applied to Euclidean distances between the objects, it gives the same results as PCA. It converts distances between objects into a map-like plot that makes it easier to see which objects are similar or dissimilar. Moreover, it allows you to identify groups or clusters. When there are fewer points than variables, the computation time will be much less than for PCA.
The following batch file will standardize a data matrix by variables (rows), compute a similarity or dissimilarity coefficient matrix for the data, double-center that matrix, compute the eigenvectors and eigenvalues of the double-centered matrix (the eigenvectors give the projections of the objects), and then make a 3D plot of the projections, read column-wise.
*stand o=data.nts r=sdata.nts d=row
*simint o=sdata.nts r=dist.nts c=corr
*dcenter o=dist.nts r=dcent.nts
*eigen o=dcent.nts r=proj.nts val=val.nts
*mod3d o=proj.nts s=data.nts d=col
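The double-centering step is the least familiar part of this batch file, so here is a small Python sketch of it. It uses the classical (Gower) form, in which each distance is squared, multiplied by -1/2, and then centered by row means, column means, and the grand mean. Whether NTSYSpc's dcenter program uses exactly this form is an assumption on my part, and the function name `double_center` and the three-point distance matrix are hypothetical.

```python
def double_center(dist):
    """Gower double-centering of a square, symmetric distance matrix."""
    n = len(dist)
    # Transform each distance d into -0.5 * d^2.
    a = [[-0.5 * d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in a]
    col_mean = [sum(a[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_mean) / n
    # Remove row and column means, then add back the grand mean.
    return [[a[i][j] - row_mean[i] - col_mean[j] + grand_mean for j in range(n)]
            for i in range(n)]


# Hypothetical example: three points on a line at positions 0, 2, and 4.
DIST = [
    [0, 2, 4],
    [2, 0, 2],
    [4, 2, 0],
]

B = double_center(DIST)
for row in B:
    print([round(x, 6) for x in row])
```

For Euclidean distances, the double-centered matrix holds the inner products of the centered coordinates (here -2, 0, and 2), which is why its eigenvectors, scaled by the square roots of the eigenvalues, recover a configuration of the points.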
Tree Plot
The phenogram (from numerical taxonomy, i.e., classification based on characters) of the allelic data matrix is constructed by first computing a similarity or dissimilarity matrix. This matrix is then clustered using UPGMA (Unweighted Pair Group Method with Arithmetic mean) to generate a tree plot.
The following batch file will standardize a data matrix by rows, compute a variety of similarity and dissimilarity coefficients for the data matrix, perform cluster analysis using the SAHN (Sequential, Agglomerative, Hierarchical, and Nested clustering) program with the UPGMA method, and display the tree plot.
*stand o=data.nts r=sdata.nts d=row
*simint o=sdata.nts r=dist.nts
*sahn o=dist.nts r=tree.nts cm=upgma
*tree o=tree.nts
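To show what the UPGMA step does with a dissimilarity matrix, here is a minimal Python sketch of average-linkage clustering. It is only an illustration of the method, not the code NTSYSpc runs. The function name `upgma`, the labels, and the small distance matrix are all hypothetical.

```python
def upgma(dist, labels):
    """Return the merges as (cluster_a, cluster_b, average distance) tuples."""
    n = len(labels)
    clusters = {i: (labels[i],) for i in range(n)}
    # Pairwise distances keyed by (smaller id, larger id).
    d = {(i, j): float(dist[i][j]) for i in range(n) for j in range(i + 1, n)}
    merges = []
    next_id = n
    while len(clusters) > 1:
        # Join the two closest clusters.
        (i, j), height = min(d.items(), key=lambda item: item[1])
        size_i, size_j = len(clusters[i]), len(clusters[j])
        merges.append((clusters[i], clusters[j], height))
        updated = {}
        for k in clusters:
            if k in (i, j):
                continue
            d_ik = d[(min(i, k), max(i, k))]
            d_jk = d[(min(j, k), max(j, k))]
            # Size-weighted mean, so every original object counts equally.
            updated[(k, next_id)] = (size_i * d_ik + size_j * d_jk) / (size_i + size_j)
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        d.update(updated)
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        next_id += 1
    return merges


# Hypothetical dissimilarities between four objects A, B, C, and D.
LABELS = ["A", "B", "C", "D"]
DIST = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 8],
    [10, 10, 8, 0],
]

for left, right, height in upgma(DIST, LABELS):
    print(left, right, round(height, 3))
```

Each printed line is one join in the tree: the two clusters being merged and the average distance at which they are joined, which corresponds to the level of that node in the phenogram.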