# Maximum Likelihood Estimators (MLE)

---

> [!Abstract]
> For each sample point $\boldsymbol{x}$, let $\hat{\theta}(\boldsymbol{x})$ be a parameter value at which $L(\theta|\boldsymbol{x})$ attains its maximum as a function of $\theta$, with $\boldsymbol{x}$ held fixed. A **maximum likelihood estimator (MLE)** of the parameter $\theta$ based on a sample $\boldsymbol{X}$ is $\hat{\theta}(\boldsymbol{X})$.

(below is taken from [[Likelihood]])

Suppose that you have a stochastic process that takes discrete values (e.g., the outcomes of tossing a coin 10 times, the number of customers who arrive at a store in 10 minutes, etc.). In such cases, we can calculate the probability of observing a particular set of outcomes by making suitable assumptions about the underlying stochastic process (e.g., the probability of the coin landing heads is $p$).

Denote the observed outcomes by $O$ and the set of parameters that describe the stochastic process by $\theta$. Thus, when we speak of probability we want to calculate $P(O|\theta)$. In other words, given specific values for $\theta$, $P(O|\theta)$ is the probability that we would observe the outcomes represented by $O$.

However, when we model a real-life stochastic process, we often do not know $\theta$. We simply observe $O$, and the goal then is to arrive at an estimate for $\theta$ that would be a plausible choice given the observed outcomes $O$. We know that, given a value of $\theta$, the probability of observing $O$ is $P(O|\theta)$. Thus, a natural estimation process is to choose the value of $\theta$ that *maximizes* the probability that we would actually observe $O$. In other words, we define the likelihood function

$$L(\theta|O) = P(O|\theta)$$

and take as our estimate the parameter value that maximizes it:

$$\hat{\theta} = \underset{\theta}{\mathrm{argmax}}\; L(\theta|O) = \underset{\theta}{\mathrm{argmax}}\; P(O|\theta)$$

In machine learning lingo, we usually talk about data $D$ instead of observations $O$ and weights $w$ instead of a set of parameters $\theta$:

$$\hat{w} = \underset{w}{\mathrm{argmax}}\; L(w|D) = \underset{w}{\mathrm{argmax}}\; P(D|w)$$

$L(w|D)$ is called the likelihood function. The likelihood function is by definition conditioned on the observed data $D$ and is a function of the unknown parameters $w$.

---

In the continuous case there is one important difference: we can no longer talk about the probability of observing $D$ given $\theta$, because in the continuous case $P(D|\theta) = 0$ for any particular $D$. Instead we use the probability density function $f(D|\theta)$, and the MLE is given by:

$$\hat{w} = \underset{w}{\mathrm{argmax}}\; f(D|w)$$
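
To make the discrete case concrete, here is a minimal sketch of maximizing $L(p|O)$ for the coin-toss example above. The data array and the helper name `neg_log_likelihood` are illustrative assumptions, not from the source; in practice one maximizes the log-likelihood (equivalently, minimizes its negative) for numerical stability.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical data: outcomes of tossing a coin 10 times (1 = heads, 0 = tails).
O = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

def neg_log_likelihood(p):
    # L(p|O) = prod_i p^{x_i} (1-p)^{1-x_i}; minimize -log L(p|O) instead.
    return -np.sum(O * np.log(p) + (1 - O) * np.log(1 - p))

# Maximize L(p|O) over p in (0, 1).
result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
p_hat = result.x

print(f"numerical MLE: {p_hat:.4f}")    # ~0.7
print(f"closed form:   {O.mean():.4f}") # for Bernoulli data the argmax is the sample mean
```

The numerical optimum matches the closed-form answer (the fraction of heads), which is what you get by setting the derivative of the log-likelihood to zero.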
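
For the continuous case, the same recipe applies with the density $f(D|w)$ in place of $P(O|\theta)$. Below is a minimal sketch, assuming Gaussian data with unknown mean and standard deviation; the synthetic data, the `log_sigma` reparameterization (to keep $\sigma > 0$), and the function names are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
# Hypothetical continuous data: 100 draws from Normal(2, 1.5), parameters "unknown" to us.
D = rng.normal(loc=2.0, scale=1.5, size=100)

def neg_log_likelihood(theta):
    mu, log_sigma = theta  # optimize log(sigma) so sigma stays positive
    return -np.sum(norm.logpdf(D, loc=mu, scale=np.exp(log_sigma)))

# Maximize f(D|mu, sigma) by minimizing the negative log-density of the sample.
result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])

print(f"numerical MLE: mu={mu_hat:.3f}, sigma={sigma_hat:.3f}")
print(f"closed form:   mu={D.mean():.3f}, sigma={D.std():.3f}")  # MLE of sigma uses 1/n, not 1/(n-1)
```

Note that the closed-form MLE of $\sigma$ divides by $n$ rather than $n-1$, which is why `D.std()` (with its default `ddof=0`) matches the numerical optimum.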