"Graphical models are a marriage between probability theory and graph theory. They provide a natural tool for dealing with two problems that occur throughout applied mathematics and engineering -- uncertainty and complexity -- and in particular they are playing an increasingly important role in the design and analysis of machine learning algorithms. Fundamental to the idea of a graphical model is the notion of modularity -- a complex system is built by combining simpler parts. Probability theory provides the glue whereby the parts are combined, ensuring that the system as a whole is consistent, and providing ways to interface models to data. The graph theoretic side of graphical models provides both an intuitively appealing interface by which humans can model highly-interacting sets of variables as well as a data structure that lends itself naturally to the design of efficient general-purpose algorithms.
Many of the classical multivariate probabilistic systems studied in fields such as statistics, systems engineering, information theory, pattern recognition and statistical mechanics are special cases of the general graphical model formalism -- examples include mixture models, factor analysis, hidden Markov models, Kalman filters and Ising models. The graphical model framework provides a way to view all of these systems as instances of a common underlying formalism. This view has many advantages -- in particular, specialized techniques that have been developed in one field can be transferred between research communities and exploited more widely. Moreover, the graphical model formalism provides a natural framework for the design of new systems."
--- Michael Jordan, 1998.
Note: a version of this page is available in PDF format. Also, Marie Stefanova has made a Swedish translation.
This can be used as a guide to construct the graph structure. In addition, directed models can encode deterministic relationships, and are easier to learn (fit to data). In the rest of this tutorial, we will only discuss directed graphical models, i.e., Bayesian networks.
In addition to the graph structure, it is necessary to specify the parameters of the model. For a directed model, we must specify the Conditional Probability Distribution (CPD) at each node. If the variables are discrete, this can be represented as a table (CPT), which lists the probability that the child node takes on each of its different values for each combination of values of its parents. Consider the following example, in which all nodes are binary, i.e., have two possible values, which we will denote by T (true) and F (false).
We see that the event "grass is wet" (W=true) has two possible causes: either the water sprinkler is on (S=true) or it is raining (R=true). The strength of this relationship is shown in the table. For example, we see that Pr(W=true | S=true, R=false) = 0.9 (second row), and hence, Pr(W=false | S=true, R=false) = 1 - 0.9 = 0.1, since each row must sum to one. Since the C node has no parents, its CPT specifies the prior probability that it is cloudy (in this case, 0.5). (Think of C as representing the season: if it is a cloudy season, it is less likely that the sprinkler is on and more likely that it is raining.)
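As a sketch, the CPTs of this network can be written down as Python dictionaries. Only Pr(C=true) = 0.5 and Pr(W=true | S=true, R=false) = 0.9 are stated above; the remaining entries below are illustrative assumptions, not values from the text.

```python
# CPTs for the cloudy/sprinkler/rain/wet-grass network.
# Each entry gives P(variable = True) for one setting of its parents.
# Only P(C=True) = 0.5 and P(W=True | S=True, R=False) = 0.9 come from
# the text; all other numbers are illustrative assumptions.

P_C = {True: 0.5, False: 0.5}          # prior on Cloudy

P_S_given_C = {True: 0.1, False: 0.5}  # P(S=True | C)  (assumed)

P_R_given_C = {True: 0.8, False: 0.2}  # P(R=True | C)  (assumed)

P_W_given_SR = {                       # P(W=True | S, R)
    (True,  True):  0.99,              # assumed
    (True,  False): 0.9,               # from the text
    (False, True):  0.9,               # assumed
    (False, False): 0.0,               # assumed
}

# Because each CPT row must sum to one, the complementary entry is implicit:
p_w_false = 1 - P_W_given_SR[(True, False)]
print(p_w_false)  # approximately 0.1
```

Storing only P(variable = True) per parent configuration exploits the row-sums-to-one constraint, halving the table size for binary nodes.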
The simplest conditional independence relationship encoded in a Bayesian network can be stated as follows: a node is independent of its ancestors given its parents, where the ancestor/parent relationship is with respect to some fixed topological ordering of the nodes.
By the chain rule of probability, the joint probability of all the nodes in the graph above is
P(C, S, R, W) = P(C) * P(S|C) * P(R|C,S) * P(W|C,S,R)

By using conditional independence relationships, we can rewrite this as
P(C, S, R, W) = P(C) * P(S|C) * P(R|C) * P(W|S,R)

where we were allowed to simplify the third term because R is independent of S given its parent C, and the last term because W is independent of C given its parents S and R.
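The factored form can be evaluated directly from the CPTs. The sketch below computes the joint as the product P(C) P(S|C) P(R|C) P(W|S,R) and checks, by brute force over all 16 configurations, that it is a proper distribution; apart from P(C)=0.5 and P(W=T|S=T,R=F)=0.9, the CPT entries are illustrative assumptions.

```python
from itertools import product

# CPT entries give P(variable = True) for each parent setting.
# Only P(C=True)=0.5 and P(W=True|S=True,R=False)=0.9 come from the text;
# the other numbers are illustrative assumptions.
P_C = {True: 0.5, False: 0.5}
P_S = {True: 0.1, False: 0.5}                    # P(S=True | C)
P_R = {True: 0.8, False: 0.2}                    # P(R=True | C)
P_W = {(True, True): 0.99, (True, False): 0.9,
       (False, True): 0.9, (False, False): 0.0}  # P(W=True | S, R)

def bern(p, value):
    """Probability that a binary variable takes `value` when P(True) = p."""
    return p if value else 1.0 - p

def joint(c, s, r, w):
    """P(C,S,R,W) = P(C) * P(S|C) * P(R|C) * P(W|S,R)."""
    return (bern(P_C[True], c) * bern(P_S[c], s) *
            bern(P_R[c], r) * bern(P_W[(s, r)], w))

# Summing the factored joint over all 2^4 configurations gives 1,
# because each factor is a properly normalized conditional.
total = sum(joint(*vals) for vals in product([True, False], repeat=4))
print(total)  # approximately 1.0
```

Note that we never had to build the full 16-entry joint table explicitly; every query is answered by multiplying four local CPT entries.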
We can see that the conditional independence relationships allow us to represent the joint more compactly. Here the savings are minimal, but in general, if we had n binary nodes, the full joint would require O(2^n) space to represent, but the factored form would require O(n 2^k) space to represent, where k is the maximum fan-in of a node. And fewer parameters make learning easier.
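The gap between O(2^n) and O(n 2^k) can be made concrete with a quick count, a sketch assuming binary nodes and a fixed maximum fan-in k:

```python
# Space needed to represent a joint over n binary variables:
# the full table has 2**n entries, while a factored Bayesian network
# needs at most n * 2**k entries, where k is the maximum fan-in.

def full_joint_size(n):
    return 2 ** n

def factored_size(n, k):
    return n * 2 ** k

k = 2  # e.g. the sprinkler network above has maximum fan-in 2
for n in (4, 10, 20):
    print(n, full_joint_size(n), factored_size(n, k))
# 4  16      16
# 10 1024    40
# 20 1048576 80
```

For the 4-node sprinkler network the two counts coincide, but by n = 20 the factored form is already four orders of magnitude smaller.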