% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/divide.R
\name{divide}
\alias{divide}
\title{Divide a Distributed Data Object}
\usage{
divide(data, by = NULL, spill = 1000000, filterFn = NULL, bsvFn = NULL,
  output = NULL, overwrite = FALSE, preTransFn = NULL,
  postTransFn = NULL, params = NULL, packages = NULL, control = NULL,
  update = FALSE, verbose = TRUE)
}
\arguments{
\item{data}{an object of class "ddf" or "ddo". In the latter case, \code{preTransFn} must be specified to coerce each subset into a data frame}

\item{by}{specification of how to divide the data: conditional (factor-level or shingles), random replicate, or near-exact replicate (to come). See Details}

\item{spill}{integer giving the number of rows of data to collect in a subset before spilling over into a new key-value pair}

\item{filterFn}{a function applied to each candidate output key-value pair to determine whether it should be part of the resulting division (it is included if the function returns \code{TRUE})}

\item{bsvFn}{a function to be applied to each subset that returns a list of between-subset variables (BSVs)}

\item{output}{a "kvConnection" object indicating where the output data should reside (see \code{\link{localDiskConn}}, \code{\link{hdfsConn}}).  If \code{NULL} (default), output will be an in-memory "ddo" object.}

\item{overwrite}{logical; should an existing output location be overwritten? (\code{overwrite = "backup"} can also be specified to move the existing output to _bak)}

\item{preTransFn}{a transformation function (if desired) to apply to each subset prior to division. Note: this is deprecated; instead use \code{\link{addTransform}} prior to calling \code{divide}}

\item{postTransFn}{a transformation function (if desired) to apply to each post-division subset}

\item{params}{a named list of objects external to the input data that are needed in the distributed computation (most should be taken care of automatically such that this is rarely necessary to specify)}

\item{packages}{a vector of R package names that contain functions used in the functions supplied to \code{divide}, such as \code{filterFn}, \code{bsvFn}, or \code{postTransFn} (most should be taken care of automatically such that this is rarely necessary to specify)}

\item{control}{parameters specifying how the backend should handle things (most likely parameters to \code{rhwatch} in RHIPE). See \code{\link{rhipeControl}} and \code{\link{localDiskControl}}}

\item{update}{logical; should a MapReduce job be run to obtain additional attributes for the result data prior to returning?}

\item{verbose}{logical - print messages about what is being done}
}
\value{
an object of class "ddf" if the resulting subsets are data frames.  Otherwise, an object of class "ddo".
}
\description{
Divide a ddo/ddf object into subsets based on different criteria
}
\details{
The division methods this function will support include conditioning variable division for factors (implemented; see \code{\link{condDiv}}), conditioning variable division for numerical variables through shingles, random replicate division (implemented; see \code{\link{rrDiv}}), and near-exact replicate division.  If \code{by} is a vector of variable names, the data will be divided by these variables.  Alternatively, this can be specified explicitly, e.g. \code{by = condDiv(c("var1", "var2"))}.
}
\examples{
# divide iris data by Species by passing in a data frame
bySpecies <- divide(iris, by = "Species")
bySpecies

# divide iris data into a random partitioning with ~30 rows per subset
irisRR <- divide(iris, by = rrDiv(30))
irisRR

# any ddf can be passed into divide:
irisRR2 <- divide(bySpecies, by = rrDiv(30))
irisRR2
bySpecies2 <- divide(irisRR2, by = "Species")
bySpecies2

# splitting on multiple columns
byEdSex <- divide(adult, by = c("education", "sex"))
byEdSex
byEdSex[[1]]

# splitting on a numeric variable
bySL <- ddf(iris) \%>\%
  addTransform(function(x) {
    x$slCut <- cut(x$Sepal.Length, 10)
    x
  }) \%>\%
  divide(by = "slCut")
bySL
bySL[[1]]
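
# conceptual sketch only, in base R: this is NOT how divide() is implemented,
# just a rough in-memory analogue of the two division methods used above.
# The names condSketch, nSubsets, rrId, and rrSketch are hypothetical and
# are not part of the package.

# conditioning-variable division on a factor is analogous to split()
condSketch <- split(iris, iris$Species)
names(condSketch)
sapply(condSketch, nrow)

# random replicate division is analogous to assigning rows at random
# to subsets of roughly 30 rows each
set.seed(1)
nSubsets <- ceiling(nrow(iris) / 30)
rrId <- sample(rep(seq_len(nSubsets), length.out = nrow(iris)))
rrSketch <- split(iris, rrId)
sapply(rrSketch, nrow)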
}
\author{
Ryan Hafen
}
\references{
\itemize{
 \item \url{http://tessera.io}
 \item \href{http://onlinelibrary.wiley.com/doi/10.1002/sta4.7/full}{Guha, S., Hafen, R., Rounds, J., Xia, J., Li, J., Xi, B., & Cleveland, W. S. (2012). Large complex data: divide and recombine (D&R) with RHIPE. \emph{Stat}, 1(1), 53-67.}
}
}
\seealso{
\code{\link{recombine}}, \code{\link{ddo}}, \code{\link{ddf}}, \code{\link{condDiv}}, \code{\link{rrDiv}}
}

