# Bayesian Statistics: Techniques and Models, 1.09 (V) Posterior Deviation

[MUSIC] So far, we've only drawn the model with two levels. But in reality, there's nothing that'll stop us from adding more layers. For example, instead of fixing the values for the hyper parameters in the previous segment, those hyper parameters were the mu naught, the sigma naught, the nu knot and the beta knot. We could specify just numbers for those, or we could have specified prior distributions for those variables to make this a hierarchical model. One reason we might do this is if the data are hierarchically organized so that the observations are naturally grouped together. We will examine these types of hierarchical models in depth later in the course. Another simple example of a hierarchical model is one you saw already in the previous course. Let's write it as yi given mu and sigma squared, so this is just like the model from the previous lesson, will be independent and identically distributed normal with a mean mu and a variance, sigma squared. The next step, instead of doing independent priors for mu and sigma squared, we're going to have the prior for mu depend on the value of sigma squared. That is given sigma squared, mu follows a normal distribution with mean mu naught, just some hyper parameter that you're going to chose. And the variance of this prior will be sigma squared, this parameter, divided by omega naught. Another hyper parameter that will scale it. We now have a joint distribution of y and mu given sigma squared. So finally, we need to complete the model with the prior 4 sigma squared. We'll use our standard inverse gamma with the same hyper parameters as last time. This model has three layers. And mu depends on sigma right here. The graphical representation for this model looks like this. We start with the variables that don't depend on anything else. So that would be sigma squared and move down the chain. So here, the next variable is mu which depends on sigma squared. And then dependent on both, we have the yi's. We use a double circle because the yi's are observed, their data, and we're going to assume that they're exchangeable. So let's put them on a plate here for i in 1 to n The distribution of yi depends on both mu and sigma squared, so we'll draw curves connecting those pieces there. To simulate hypothetical data from this model, we would have to first draw from the distribution of the prior for sigma squared. Then the distribution for mu which depends on sigma squared. And once we've drawn both of these, then we can draw random draws from the y's, which of course depends on both of those. With multiple levels, this is an example of a hierarchical model. Once we have a model specification, we can write out what the full posterior distribution for all the parameters given the data looks like. Remember that the numerator in Bayes' theorem is the joint distribution of all random quantities, all the nodes in this graphical representation over here from all of the layers. So for this model that we have right here, we have a joint distribution that'll look like this. We're going to write the joint distribution of everything y1 up to yn, mu and sigma squared, Using the chain rule of probability, we're going to multiply all of the distributions in the hierarchy together. So let's start with the likelihood piece. And we'll multiply that by the next layer, the distribution of mu, given sigma squared. And finally, with the prior for sigma squared. So what do these expressions right here look like? The likelihood right here in this level because they're all independent will be a product of normal densities. So we're going to multiply the normal density for each yi, Given those parameters. This, again, is shorthand right here for the density of a normal distribution. So that represents this piece right here. The conditional prior of mu given sigma squared is also a normal. So we're going to multiply this by a normal distribution of mu, where its parameters are mu naught and sigma squared over omega naught. And finally, we have the prior for sigma squared. We'll multiply by the density of an inverse gamma for sigma squared given the hyper parameters mu naught, sorry, that is given, the hyper parameters mu naught and and beta naught. What we have right here is the joint distribution of everything. It is the numerator in Bayes theorem. Let's remind ourselves really fast what Bayes theorem looks like again. We have that the posterior distribution of the parameter given the data is equal to the likelihood, Times the prior. Over the same thing again. So this gives us in the numerator the joint distribution of everything which is what we've written right here. In Bayes theorem, the numerator and the denominator are the exact same expression accept that we integrate or marginalize over all of the parameters. Because the denominator is a function of the y's only, which are known values, the denominator is just a constant number. So we can actually write the posterior distribution as being proportional to, this symbol right here represents proportional to. The joint distribution of the data and parameters, or the likelihood times the prior. The poster distribution is proportional to the joint distribution, or everything we have right here. In other words, what we have already written for this particular model is proportional to the posterior distribution of mu and sigma squared, given all of the data. The only thing missing in this expression right here is just some constant number that causes the expression to integrate to 1. If we can recognize this expression as being proportional to a common distribution, then our work is done, and we know what our posterior distribution is. This was the case for all models in the previous course. However, if we do not use conjugate priors or if the models are more complicated, then the posterior distribution will not have a standard form that we can recognize. We're going to explore a couple of examples of this issue in the next segment. [MUSIC]

[MUSIC] So far, we've only drawn the model with two levels. But in reality, there's nothing that'll stop us from adding more layers. For example, instead of fixing the values for the hyper parameters in the previous segment, those hyper parameters were the mu naught, the sigma naught, the nu knot and the beta knot. We could specify just numbers for those, or we could have specified prior distributions for those variables to make this a hierarchical model. One reason we might do this is if the data are hierarchically organized so that the observations are naturally grouped together. We will examine these types of hierarchical models in depth later in the course. Another simple example of a hierarchical model is one you saw already in the previous course. Let's write it as yi given mu and sigma squared, so this is just like the model from the previous lesson, will be independent and identically distributed normal with a mean mu and a variance, sigma squared. The next step, instead of doing independent priors for mu and sigma squared, we're going to have the prior for mu depend on the value of sigma squared. That is given sigma squared, mu follows a normal distribution with mean mu naught, just some hyper parameter that you're going to chose. And the variance of this prior will be sigma squared, this parameter, divided by omega naught. Another hyper parameter that will scale it. We now have a joint distribution of y and mu given sigma squared. So finally, we need to complete the model with the prior 4 sigma squared. We'll use our standard inverse gamma with the same hyper parameters as last time. This model has three layers. And mu depends on sigma right here. The graphical representation for this model looks like this. We start with the variables that don't depend on anything else. So that would be sigma squared and move down the chain. So here, the next variable is mu which depends on sigma squared. And then dependent on both, we have the yi's. We use a double circle because the yi's are observed, their data, and we're going to assume that they're exchangeable. So let's put them on a plate here for i in 1 to n The distribution of yi depends on both mu and sigma squared, so we'll draw curves connecting those pieces there. To simulate hypothetical data from this model, we would have to first draw from the distribution of the prior for sigma squared. Then the distribution for mu which depends on sigma squared. And once we've drawn both of these, then we can draw random draws from the y's, which of course depends on both of those. With multiple levels, this is an example of a hierarchical model. Once we have a model specification, we can write out what the full posterior distribution for all the parameters given the data looks like. Remember that the numerator in Bayes' theorem is the joint distribution of all random quantities, all the nodes in this graphical representation over here from all of the layers. So for this model that we have right here, we have a joint distribution that'll look like this. We're going to write the joint distribution of everything y1 up to yn, mu and sigma squared, Using the chain rule of probability, we're going to multiply all of the distributions in the hierarchy together. So let's start with the likelihood piece. And we'll multiply that by the next layer, the distribution of mu, given sigma squared. And finally, with the prior for sigma squared. So what do these expressions right here look like? The likelihood right here in this level because they're all independent will be a product of normal densities. So we're going to multiply the normal density for each yi, Given those parameters. This, again, is shorthand right here for the density of a normal distribution. So that represents this piece right here. The conditional prior of mu given sigma squared is also a normal. So we're going to multiply this by a normal distribution of mu, where its parameters are mu naught and sigma squared over omega naught. And finally, we have the prior for sigma squared. We'll multiply by the density of an inverse gamma for sigma squared given the hyper parameters mu naught, sorry, that is given, the hyper parameters mu naught and and beta naught. What we have right here is the joint distribution of everything. It is the numerator in Bayes theorem. Let's remind ourselves really fast what Bayes theorem looks like again. We have that the posterior distribution of the parameter given the data is equal to the likelihood, Times the prior. Over the same thing again. So this gives us in the numerator the joint distribution of everything which is what we've written right here. In Bayes theorem, the numerator and the denominator are the exact same expression accept that we integrate or marginalize over all of the parameters. Because the denominator is a function of the y's only, which are known values, the denominator is just a constant number. So we can actually write the posterior distribution as being proportional to, this symbol right here represents proportional to. The joint distribution of the data and parameters, or the likelihood times the prior. The poster distribution is proportional to the joint distribution, or everything we have right here. In other words, what we have already written for this particular model is proportional to the posterior distribution of mu and sigma squared, given all of the data. The only thing missing in this expression right here is just some constant number that causes the expression to integrate to 1. If we can recognize this expression as being proportional to a common distribution, then our work is done, and we know what our posterior distribution is. This was the case for all models in the previous course. However, if we do not use conjugate priors or if the models are more complicated, then the posterior distribution will not have a standard form that we can recognize. We're going to explore a couple of examples of this issue in the next segment. [MUSIC]