% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/btm.R
\name{BTM}
\alias{BTM}
\title{Construct a Biterm Topic Model on Short Text}
\usage{
BTM(data, k = 5, alpha = 50/k, beta = 0.01, iter = 1000, window = 15,
  background = FALSE, trace = FALSE)
}
\arguments{
\item{data}{a tokenised data frame containing one row per token with 2 columns 
\itemize{
\item the first column is a context identifier (e.g. a tweet id, a document id, a sentence id, an identifier of a survey answer, an identifier of a part of a text)
\item the second column is a column of type character containing the sequence of words occurring within the context identifier
}}

\item{k}{integer with the number of topics to identify. Defaults to 5.}

\item{alpha}{numeric, indicating the symmetric Dirichlet prior probability of a topic P(z). Defaults to 50/k.}

\item{beta}{numeric, indicating the symmetric Dirichlet prior probability of a word given the topic P(w|z). Defaults to 0.01.}

\item{iter}{integer with the number of iterations of Gibbs sampling. Defaults to 1000.}

\item{window}{integer with the window size for biterm extraction. Defaults to 15.}

\item{background}{logical; if set to \code{TRUE}, the first topic is set to a background topic that 
equals the empirical word distribution. This can be used to filter out common words. Defaults to FALSE.}

\item{trace}{logical indicating whether to print the progress of the Gibbs sampling iterations. Defaults to FALSE.}
}
\value{
an object of class BTM which is a list containing
\itemize{
\item{model: a pointer to the C++ BTM model}
\item{K: the number of topics}
\item{W: the number of tokens in the data}
\item{alpha: the symmetric Dirichlet prior probability of a topic P(z)}
\item{beta: the symmetric Dirichlet prior probability of a word given the topic P(w|z)}
\item{iter: the number of iterations of Gibbs sampling}
\item{background: indicator if the first topic is set to the background topic that equals the empirical word distribution.}
\item{theta: a vector with the topic probability P(z), which is determined by the overall proportion of biterms assigned to each topic}
\item{phi: a matrix of dimension W x K with one row for each token in the data. This matrix contains the probability of the token given the topic P(w|z).
The rownames of the matrix indicate the token w}
}
}
\description{
The Biterm Topic Model (BTM) is a word co-occurrence based topic model that learns topics by modeling word-word co-occurrence patterns (i.e. biterms).

\itemize{
\item A biterm consists of two words co-occurring in the same context, for example, in the same short text window. 
\item BTM models the biterm occurrences in a corpus (unlike LDA models which model the word occurrences in a document). 
\item It is a generative model. In the generation procedure, a biterm is generated by drawing two words independently from the same topic z. 
In other words, the distribution of a biterm \eqn{b=(w_i,w_j)} is defined as: \eqn{P(b) = \sum_z{P(w_i|z)*P(w_j|z)*P(z)}} 
where the sum runs over the k topics you want to extract.
\item The topic model is estimated with the Gibbs sampling algorithm, which provides estimates for \eqn{P(w|z)=phi} and \eqn{P(z)=theta}.
}
}
\note{
A biterm is defined as a pair of words co-occurring in the same text window. 
For example, given a document with the word sequence \code{'A B C B'} and a window size of 3, 
there are two text windows which can generate biterms, namely 
text window \code{'A B C'} with biterms \code{'A B', 'B C', 'A C'} and text window \code{'B C B'} with biterms \code{'B C', 'C B', 'B B'}.
A biterm is an unordered word pair, so \code{'B C' = 'C B'}. Thus, the document \code{'A B C B'} will have the following biterm frequencies: \cr
\itemize{
\item 'A B': 1 
\item 'B C': 3
\item 'A C': 1
\item 'B B': 1
}
These biterms are used to create the model.
}
\examples{
library(udpipe)
data("brussels_reviews_anno", package = "udpipe")
x <- subset(brussels_reviews_anno, language == "nl")
x <- subset(x, xpos \%in\% c("NN", "NNP", "NNS"))
x <- x[, c("doc_id", "lemma")]
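
## x now has the layout described for the data argument: one row per token,
## with a context identifier first and the token second.
## A minimal sketch of building the same layout from raw text with base R;
## toy_docs, toy_tokens and toy_x are illustrative names, not package objects.
toy_docs   <- c(d1 = "cats chase mice", d2 = "dogs chase cats")
toy_tokens <- strsplit(toy_docs, " ", fixed = TRUE)
toy_x <- data.frame(doc_id = rep(names(toy_tokens), lengths(toy_tokens)),
                    token  = unlist(toy_tokens, use.names = FALSE),
                    stringsAsFactors = FALSE)
toy_x
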
model  <- BTM(x, k = 5, alpha = 1, beta = 0.01, iter = 10, trace = TRUE)
model
terms(model)
scores <- predict(model, newdata = x)
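
## A minimal sketch of the biterm probability from the Description,
## P(b) = sum over topics z of P(wi|z) * P(wj|z) * P(z).
## biterm_prob and the toy_ objects are hypothetical, not part of the package.
## Assumption: phi is a token x topic matrix and theta a vector of topic
## probabilities, as documented for the phi and theta elements of the result.
biterm_prob <- function(phi, theta, wi, wj) {
  sum(phi[wi, ] * phi[wj, ] * theta)
}
toy_theta <- c(0.6, 0.4)
toy_phi   <- matrix(c(0.5, 0.3, 0.2,
                      0.1, 0.2, 0.7),
                    nrow = 3,
                    dimnames = list(c("cat", "dog", "mouse"), NULL))
## 0.6 * 0.5 * 0.3 + 0.4 * 0.1 * 0.2 = 0.098
biterm_prob(toy_phi, toy_theta, "cat", "dog")
## With a fitted model, model$phi and model$theta would take the place of
## the toy values.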

## Another small run with first topic the background word distribution
set.seed(123456)
model  <- BTM(x, k = 5, beta = 0.01, iter = 10, background = TRUE)
model
terms(model)
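
## A minimal base R sketch of the biterm counting described in the Note.
## It follows the windowing as worded there and is not the C++ code used by
## BTM, which may count differently; count_biterms is a hypothetical helper.
count_biterms <- function(words, window = 3) {
  stopifnot(window >= 2, length(words) >= 2)
  n <- length(words)
  window <- min(window, n)
  pairs <- character(0)
  for (start in seq_len(n - window + 1)) {
    ## positions covered by the current text window
    idx <- start:(start + window - 1)
    ## all position pairs within the window
    combos <- utils::combn(idx, 2)
    for (j in seq_len(ncol(combos))) {
      ## a biterm is unordered, so sort the two words before pasting
      pair  <- sort(words[combos[, j]])
      pairs <- c(pairs, paste(pair, collapse = " "))
    }
  }
  table(pairs)
}
## For 'A B C B' with window 3 this gives A B: 1, A C: 1, B B: 1, B C: 3
count_biterms(c("A", "B", "C", "B"), window = 3)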
}
\references{
Xiaohui Yan, Jiafeng Guo, Yanyan Lan, Xueqi Cheng. A Biterm Topic Model For Short Text. WWW2013,
\url{https://github.com/xiaohuiyan/BTM}, \url{https://github.com/xiaohuiyan/xiaohuiyan.github.io/blob/master/paper/BTM-WWW13.pdf}
}
\seealso{
\code{\link{predict.BTM}}, \code{\link{terms.BTM}}, \code{\link{logLik.BTM}}
}
