% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/build-moc.R
\name{buildMOC}
\alias{buildMOC}
\title{Build Matrix-Of-Clusters}
\usage{
buildMOC(
  data,
  M,
  K = NULL,
  maxK = 10,
  methods = "hclust",
  distances = "euclidean",
  fill = FALSE,
  computeAccuracy = FALSE,
  fullData = FALSE,
  savePNG = FALSE,
  fileName = "buildMOC",
  widestGap = FALSE,
  dunns = FALSE,
  dunn2s = FALSE
)
}
\arguments{
\item{data}{List of M datasets, each of size N x P_m, where m = 1, ..., M.}

\item{M}{Number of datasets.}

\item{K}{Vector containing the number of clusters in each dataset. If a
single integer is given instead of a vector, every dataset is assumed to have
that number of clusters. If NULL, the true cluster numbers are assumed to be
unknown and are estimated using the silhouette method.}

\item{maxK}{Vector of maximum cluster numbers to be considered for each
dataset if K is NULL. If a single integer is given instead of a vector, the
same maximum number of clusters is considered for every dataset.
Default is 10.}

\item{methods}{Vector of strings containing the names of the clustering
methods to be used to cluster the observations in each dataset. Each can be
"kmeans" (k-means clustering), "hclust" (hierarchical clustering), or "pam"
(partitioning around medoids). If the vector is of length one, the same
clustering method is applied to all the datasets. Default is "hclust".}

\item{distances}{Distances to be used in the clustering step for each
dataset. If only one string is provided, then the same distance is used for
all datasets. If the number of strings provided is the same as the number of
datasets, then each distance will be used for the corresponding dataset.
Default is "euclidean". Please note that not all distances are compatible
with all clustering methods. "euclidean" and "manhattan" work with all
available clustering algorithms. "gower" distance is only available for
partitioning around medoids. In addition, "maximum", "canberra", "binary" or
"minkowski" are available for k-means and hierarchical clustering.}

\item{fill}{Boolean. If TRUE and some observations are missing from one or
more datasets, the corresponding cluster labels are estimated through
generalised linear models on the basis of the available labels.}

\item{computeAccuracy}{Boolean. If TRUE, for each missing element, the
performance of the predictive model used to estimate the corresponding
missing label is computed.}

\item{fullData}{Boolean. If TRUE, the full data matrices are used to estimate
the missing cluster labels (instead of just using the cluster labels of the
corresponding datasets).}

\item{savePNG}{Boolean. If TRUE, plots of the silhouette for each dataset
are saved as png files. Default is FALSE.}

\item{fileName}{If \code{savePNG} is TRUE, this is the string containing the
name of the output files. Can be used to specify the folder path too. Default
is "buildMOC". The ".png" extension is automatically added to this string.}

\item{widestGap}{Boolean. If TRUE, also compute the widest gap index to
choose the best number of clusters. Default is FALSE.}

\item{dunns}{Boolean. If TRUE, also compute Dunn's index to choose the best
number of clusters. Default is FALSE.}

\item{dunn2s}{Boolean. If TRUE, also compute the alternative Dunn's index to
choose the best number of clusters. Default is FALSE.}
}
\value{
This function returns a list containing:
\item{moc}{the Matrix-Of-Clusters, a binary matrix of size N x sum(K)
where element (n,k) contains a 1 if observation n belongs to the
corresponding cluster, 0 otherwise.}
\item{datasetIndicator}{a vector of length sum(K) in which
each element is the number of the dataset to which the cluster belongs.}
\item{number_nas}{the total number of NAs in the matrix of clusters. (If the
MOC has been filled with imputed values, \code{number_nas} indicates the
number of NAs in the original MOC.)}
\item{clLabels}{a matrix that is equivalent to the matrix of clusters, but is
in compact form, i.e. each column corresponds to a dataset, each row
represents an observation, and its values indicate the cluster labels.}
\item{K}{vector of cluster numbers in each dataset. If these are provided as
input, this is the same as the input (expanded to a vector if the input is an
integer). If the cluster numbers are not provided as input, this vector
contains the cluster numbers chosen via silhouette for each dataset.}
}
\description{
This function creates a matrix of clusters starting from a list of
heterogeneous datasets.
}
\examples{
# Load data
data <- list()
data[[1]] <- as.matrix(read.csv(system.file("extdata", "dataset1.csv",
package = "coca"), row.names = 1))
data[[2]] <- as.matrix(read.csv(system.file("extdata", "dataset2.csv",
package = "coca"), row.names = 1))
data[[3]] <- as.matrix(read.csv(system.file("extdata", "dataset3.csv",
package = "coca"), row.names = 1))

# Build matrix of clusters
outputBuildMOC <- buildMOC(data, M = 3, K = 6, distances = "cor")
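
# A hedged sketch of the case described under the K argument: when K is left
# as NULL, the number of clusters in each dataset is estimated with the
# silhouette method, considering at most maxK clusters. The object name
# outputEstimatedK and the choice maxK = 6 are illustrative assumptions, not
# recommended settings.
outputEstimatedK <- buildMOC(data, M = 3, maxK = 6, methods = "hclust",
distances = "euclidean")

# Cluster numbers chosen for each dataset
outputEstimatedK$K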

# Extract matrix of clusters
matrixOfClusters <- outputBuildMOC$moc
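
# The remaining elements of the returned list, as described in the Value
# section, can be inspected in the same way. This is only an illustrative
# sketch; the variable names on the left-hand side are arbitrary.
datasetIndicator <- outputBuildMOC$datasetIndicator
clusterLabels <- outputBuildMOC$clLabels
clusterNumbers <- outputBuildMOC$K

# The matrix of clusters has one column per cluster, i.e. sum(K) columns,
# and datasetIndicator has one entry per column
dim(matrixOfClusters)
length(datasetIndicator)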

}
\references{
The Cancer Genome Atlas, 2012. Comprehensive molecular portraits
of human breast tumours. Nature, 487(7407), pp.61–70.

Rousseeuw, P.J., 1987. Silhouettes: a graphical aid to the
interpretation and validation of cluster analysis. Journal of Computational
and Applied Mathematics, 20, pp.53-65.
}
\author{
Alessandra Cabassi \email{alessandra.cabassi@mrc-bsu.cam.ac.uk}
}
