# Bayesian Statistics: Techniques and Models, 1.14 (V) Monte Carlo Error and Marginalization

[MUSIC] How good is an approximation by Monte Carlo sampling? Again, we can turn to the central limit theorem, which tells us that the variance of our estimate is controlled in part by m, our sample size. If we want a better estimate, we need to choose a larger end value. For example, if we're looking for the expected value of theta, then we can take the sample mean. In the last video we call this, Theta star bar. By the central limit theorem, the sample mean approximately follows, I'm going to use this symbol to denote approximately distributed, normal, where the mean of this normal distribution is the true expected value of theta. And the variance of this distribution is the true variance of theta, Divided by our sample size m. The variance tells us how far our estimate might be from the true value. One way to approximate this variance, the theoretical variance of theta, is if we replace it with the sample variance. We could calculate the sample variance like this. Let's call it variance of theta hat. We're just approximating that integral for the variance. So this will be 1 over m. So it's just the sample mean of theta i star minus the expected value, which we're also approximating with the sample mean. So let's plug that in here. Sometimes people also divide by m minus 1, but if m is really large then it won't matter that much. [COUGH] The standard deviation of our Monte Carlo estimate is the square root of that. The square root and if we're plugging in the value, we'll have variance hat of theta divided by m under the square root. If m is large, it is reasonable to assume that the true value will likely be within about two of these standard deviations from your Monte Carlo estimate. We're going to call this, the standard error of your Monte Carlo estimate. We can also obtain Monte Carlo samples from hierarchical models. As a simple example, let's go ahead and consider a binomial random variable y. So y given phi is going to follow a binomial distribution with 10 trials and on each trial a probability of success that's equal to phi. Let's further assume that phi comes from its own distribution, that it's random. So this would be, for example, if we had a prior for phi. Let's say it comes from a beta distribution with parameters 2 and 2. Given any hierarchical model, we can always write out the joint distribution of y and phi. The joint distribution of y and phi using the chain rule of probability will be the distribution of phi times the distribution of y given phi. This should look familiar, it's like the prior times the likelihood. To simulate from this joint distribution, we're going to repeat the following steps for a large number of samples, m. The first step to simulate from this. The first step we're going to take to simulate would be to draw a theta star i. From its beta distribution. Then the second step is given this value that we just drew for phi. We would draw a sample from this binomial distribution. We'll call this sample y star i, from this binomial distribution, with 10 trials. And the success probability would be the success probability we just drew, phi i star. If we repeat this process for many samples, we're going to produce m independent pairs. Yi phi i pairs. These pairs right here are drawn from their joint distribution. One major advantage of Monte Carlo simulation is that marginalizing these distributions is easy. Calculating the marginal distribution of y might be difficult here. It would require that we integrate this expression with respect to phi, to integrate out the phis. But, if we have draws from the joint distribution, then we can just discard the phi i stars and use the y i stars as samples from their marginal distribution. This is also called prior predictive distributions which was introduced in the previous course. In the next segment, we're going to demonstrate some of these principles. Remember, we don't yet know how to sample from complicated posterior distributions that were introduced in the previous lesson. But once we learn that, we're going to be able to use the principles from this lesson to make approximate inferences for those posterior distributions. [MUSIC]

[MUSIC] How good is an approximation by Monte Carlo sampling? Again, we can turn to the central limit theorem, which tells us that the variance of our estimate is controlled in part by m, our sample size. If we want a better estimate, we need to choose a larger end value. For example, if we're looking for the expected value of theta, then we can take the sample mean. In the last video we call this, Theta star bar. By the central limit theorem, the sample mean approximately follows, I'm going to use this symbol to denote approximately distributed, normal, where the mean of this normal distribution is the true expected value of theta. And the variance of this distribution is the true variance of theta, Divided by our sample size m. The variance tells us how far our estimate might be from the true value. One way to approximate this variance, the theoretical variance of theta, is if we replace it with the sample variance. We could calculate the sample variance like this. Let's call it variance of theta hat. We're just approximating that integral for the variance. So this will be 1 over m. So it's just the sample mean of theta i star minus the expected value, which we're also approximating with the sample mean. So let's plug that in here. Sometimes people also divide by m minus 1, but if m is really large then it won't matter that much. [COUGH] The standard deviation of our Monte Carlo estimate is the square root of that. The square root and if we're plugging in the value, we'll have variance hat of theta divided by m under the square root. If m is large, it is reasonable to assume that the true value will likely be within about two of these standard deviations from your Monte Carlo estimate. We're going to call this, the standard error of your Monte Carlo estimate. We can also obtain Monte Carlo samples from hierarchical models. As a simple example, let's go ahead and consider a binomial random variable y. So y given phi is going to follow a binomial distribution with 10 trials and on each trial a probability of success that's equal to phi. Let's further assume that phi comes from its own distribution, that it's random. So this would be, for example, if we had a prior for phi. Let's say it comes from a beta distribution with parameters 2 and 2. Given any hierarchical model, we can always write out the joint distribution of y and phi. The joint distribution of y and phi using the chain rule of probability will be the distribution of phi times the distribution of y given phi. This should look familiar, it's like the prior times the likelihood. To simulate from this joint distribution, we're going to repeat the following steps for a large number of samples, m. The first step to simulate from this. The first step we're going to take to simulate would be to draw a theta star i. From its beta distribution. Then the second step is given this value that we just drew for phi. We would draw a sample from this binomial distribution. We'll call this sample y star i, from this binomial distribution, with 10 trials. And the success probability would be the success probability we just drew, phi i star. If we repeat this process for many samples, we're going to produce m independent pairs. Yi phi i pairs. These pairs right here are drawn from their joint distribution. One major advantage of Monte Carlo simulation is that marginalizing these distributions is easy. Calculating the marginal distribution of y might be difficult here. It would require that we integrate this expression with respect to phi, to integrate out the phis. But, if we have draws from the joint distribution, then we can just discard the phi i stars and use the y i stars as samples from their marginal distribution. This is also called prior predictive distributions which was introduced in the previous course. In the next segment, we're going to demonstrate some of these principles. Remember, we don't yet know how to sample from complicated posterior distributions that were introduced in the previous lesson. But once we learn that, we're going to be able to use the principles from this lesson to make approximate inferences for those posterior distributions. [MUSIC]