So far, we have looked at discrete and continuous probability distributions, but we have only focused on one random variable at a time. Imagine a situation in which we collect more than one measurement from each member of a population. For example, we might measure the height, weight, shoe size, GPA, and age of a group of students. That collection of measurements would give rise to five different random variables, some of which might be independent (shoe size and GPA, for example) and some of which might be dependent (height and weight, for example). In this situation, the likelihood of any particular combination of measurement values would be given by a joint probability distribution, either a joint probability mass function (PMF) for discrete measurements, or a joint probability density function (PDF) for continuous measurements.

Joint probability distributions arise very often in situations where the variation of one variable affects the value of another variable. In an engineering context, one variable could represent deviations from a component's nominal dimensions while the other variable could represent the performance of the assembly containing the component. In that case, it could be very helpful to understand how strongly the dimensions affect the performance -- as measured by the correlation coefficient. It could also be helpful to understand the probability that a component will fail -- as measured by the marginal probability. The expected value and variance could be helpful in comparing the dimensions of components purchased from different vendors, and the conditional probability could be helpful in answering if-then questions like, "if the component dimensions are within the specified tolerances, what is the probability that the assembly will not fail?" Each of these cases will be discussed in this section. We will deal initially with discrete cases, which may be the most common in engineering, leaving continuous cases for the next chapter. When the random variables are continuous, the summations become integrals, but the overall concepts are the same.

Joint Probability Mass Functions (PMF)

We will restrict ourselves to 2-dimensional distributions so that the probability functions can be easily displayed on the page, but higher dimensions (more than two variables) are also possible. In earlier sections we used only \(X\) to represent the random variable; we now have both \(X\) and \(Y\) as a pair of random variables. Once the theory is understood for two random variables, the extension to \(n\) random variables is straightforward.

A joint probability mass function, representing the probability that the events \(X=x\) and \(Y=y\) occur at the same time, would be defined in the form below:
\begin{align}%\label{}
\nonumber p_{XY}(x,y)=\textrm{Pr}[X=x, Y=y].
\end{align}

We could also replace the comma with the word "and" to say
\begin{align}%\label{}
\nonumber p_{XY}(x,y)= \textrm{Pr}\big[\;(X=x)\textrm{ and }(Y=y)\;\big].
\end{align}

The Cumulative Distribution Function (CDF) for a joint probability distribution is given by:
\begin{align}%\label{}
\nonumber F_{XY}(x,y)= \textrm{Pr}\big[\;(X\le x)\textrm{ and }(Y\le y)\;\big].
\end{align}

Testing for Validity

Note that no probabilities may be negative, and the total probability must still equal one, so for the joint probability mass function to be valid we have two criteria:
\begin{align}%\label{}
\nonumber 0 \le p_{XY}(x,y) &\le 1 \\ \textrm{ } \\
\nonumber \sum_{\textrm{ all } x} \sum_{\textrm{ all } y} p_{XY}(x,y)&=1
\end{align}

For the joint cumulative distribution function to be valid we have three criteria:
\(F_{XY}\) must approach \(0\) at the bottom corner of the joint range: \(F_{XY}(x,y)\to 0\) as \(x\) or \(y\) decreases below the smallest possible values (or toward \(-\infty\)).
\(F_{XY}\) must approach \(1\) at the top corner of the joint range: \(F_{XY}(x,y)\to 1\) as both \(x\) and \(y\) increase past the largest possible values (or toward \(+\infty\)).
\(F_{XY}\) must be non-decreasing. In other words, as either \(x\) or \(y\) increases, \(F_{XY}\) must either increase or stay the same.
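A minimal sketch of how these criteria can be checked numerically is shown below. The joint PMF values are made up purely for illustration, and the code simply builds \(F_{XY}\) by summing the PMF over all outcomes at or below \((x,y)\).

```python
from fractions import Fraction

# A small joint PMF on x in {0, 1} and y in {0, 1} (values chosen only for illustration).
p_xy = {(0, 0): Fraction(1, 8), (0, 1): Fraction(3, 8),
        (1, 0): Fraction(1, 4), (1, 1): Fraction(1, 4)}

def cdf(x, y):
    """F_XY(x, y) = Pr[X <= x and Y <= y], summed directly from the PMF."""
    return sum(p for (xi, yi), p in p_xy.items() if xi <= x and yi <= y)

print(cdf(-1, -1))                       # 0: below the smallest x and y values
print(cdf(1, 1))                         # 1: at (or above) the largest x and y values
print(cdf(0, 0), cdf(0, 1), cdf(1, 0))   # intermediate values; F never decreases
```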

Tabular Representation

A discrete probability function in a single variable may be tabulated in a single column. A discrete joint probability distribution can be tabulated in the same way using both rows and columns. The table below represents the generalized joint probability distribution for two variables where the first variable has four possible outcomes and the second variable has two.

\(p_{XY}(x,y)\) \(x=1\) \(x=2\) \(x=3\) \(x=4\) Row Totals
\(y=1\) \(a\) \(b\) \(c\) \(d\) \(α\)
\(y=2\) \(e\) \(f\) \(g\) \(h\) \(β\)
Column Totals \(γ\) \(δ\) \(ε\) \(ζ\) \(1\)

The letters \(a\) through \(h\) represent the joint probabilities of the different events formed from the combinations of \(x\) and \(y\) while the Greek letters \(α\) through \(ζ\) represent the totals, which are also referred to as the marginal probability mass functions. More formally, the marginal PMFs of \(X\) and \(Y\) are given by:

Marginal probability mass functions of \(X\) and \(Y\)

\begin{align}\label{Eq:marginals}
\nonumber p_{X}(x)&=\sum_{\textrm{ all } y} p_{XY}(x,y), \hspace{20pt} \textrm{ for any } x \in \{x_1,x_2,... \} \\ \textrm{ } \\
p_{Y}(y)&=\sum_{\textrm{ all } x} p_{XY}(x,y), \hspace{20pt} \textrm{ for any } y \in \{y_1,y_2,... \}
\end{align}

The marginal PMF gets its name because the value is written in the margin of the table.

The example below represents the specific situation in which variable \(X\) corresponds to a die being rolled and variable \(Y\) corresponds to a coin being tossed. Note that the marginal probabilities are 1⁄2 and 1⁄6, as expected for coin flipping and die rolling. The marginal PMF value \(p_X(x=2)\), for example, would be \(\frac{1}{12} + \frac{1}{12} = \frac{1}{6}\) because the probability of rolling a 2 is 1⁄6. Similarly, the marginal PMF value \(p_Y(y=0)\) would be \(\frac{1}{12} + \frac{1}{12} + \frac{1}{12} + \frac{1}{12} + \frac{1}{12} + \frac{1}{12} = \frac{1}{2}\), which makes sense given that tails occur half the time.

Discrete, Independent Example: Rolling a Die and Flipping a Coin

\(p_{XY}(x,y)\) \(x=1\) \(x=2\) \(x=3\) \(x=4\) \(x=5\) \(x=6\) Row Totals
\(\textrm{ Tails: } y=0\) 1⁄12 1⁄12 1⁄12 1⁄12 1⁄12 1⁄12 1⁄2
\(\textrm{ Heads: } y=1\) 1⁄12 1⁄12 1⁄12 1⁄12 1⁄12 1⁄12 1⁄2
Column Totals 1⁄6 1⁄6 1⁄6 1⁄6 1⁄6 1⁄6 1
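As a quick computational check of these ideas, the sketch below (illustrative Python; the variable names and the use of exact fractions are choices of this sketch) builds the die-and-coin joint PMF, verifies the two validity criteria, and recovers the marginal PMFs by summing over the other variable.

```python
from fractions import Fraction

# Joint PMF for an independent die roll (x = 1..6) and coin flip (y = 0 tails, 1 heads):
# every one of the 12 outcomes has probability 1/12.
p_xy = {(x, y): Fraction(1, 12) for x in range(1, 7) for y in (0, 1)}

# Validity: all probabilities lie in [0, 1] and the total probability equals 1.
assert all(0 <= p <= 1 for p in p_xy.values())
assert sum(p_xy.values()) == 1

# Marginal PMFs: sum over the other variable.
p_x = {x: sum(p for (xi, yi), p in p_xy.items() if xi == x) for x in range(1, 7)}
p_y = {y: sum(p for (xi, yi), p in p_xy.items() if yi == y) for y in (0, 1)}

print(p_x)  # each die value has marginal probability 1/6
print(p_y)  # each coin outcome has marginal probability 1/2
```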

Testing For Independence

Recall that for two events, A and B, to be independent we must have \(\textrm{Pr}[A \cap B]=\textrm{Pr}[A]\cdot \textrm{Pr}[B]\). For joint probability distributions this amounts to \(p_{XY}=p_X \cdot p_Y\) for every combination of \(x_i\) and \(y_j\). This is not the case for the example below, in which one variable measures the sum of two dice and the other measures the value of the first die. Most of the marginal probabilities do not multiply to equal the PMF value in the same row and column. For example, \(p_{XT}(5,11)=\frac{1}{36}\), but \(p_X(5) \cdot p_T(11) = \frac{1}{6} \cdot \frac{2}{36} = \frac{1}{108} \ne \frac{1}{36}\). In fact, several PMF entries are zero, so if the variables were independent, the product of the corresponding marginal PMFs would also need to be zero. Since none of the marginal PMFs are zero, that is impossible, and the variables must be dependent.

Discrete, Dependent Example: Rolling Two Dice and Checking the Sum

\(X\textrm{ \ }T\) \(2\) \(3\) \(4\) \(5\) \(6\) \(7\) \(8\) \(9\) \(10\) \(11\) \(12\) \(p_{X}(x_i)\)
\(1\) 1⁄36 1⁄36 1⁄36 1⁄36 1⁄36 1⁄36 0 0 0 0 0 1⁄6
\(2\) 0 1⁄36 1⁄36 1⁄36 1⁄36 1⁄36 1⁄36 0 0 0 0 1⁄6
\(3\) 0 0 1⁄36 1⁄36 1⁄36 1⁄36 1⁄36 1⁄36 0 0 0 1⁄6
\(4\) 0 0 0 1⁄36 1⁄36 1⁄36 1⁄36 1⁄36 1⁄36 0 0 1⁄6
\(5\) 0 0 0 0 1⁄36 1⁄36 1⁄36 1⁄36 1⁄36 1⁄36 0 1⁄6
\(6\) 0 0 0 0 0 1⁄36 1⁄36 1⁄36 1⁄36 1⁄36 1⁄36 1⁄6
\(p_{T}(t_j)\) 1⁄36 2⁄36 3⁄36 4⁄36 5⁄36 6⁄36 5⁄36 4⁄36 3⁄36 2⁄36 1⁄36 1
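A brute-force independence test is easy to script. The sketch below (again illustrative Python, not part of the original example) enumerates the 36 equally likely rolls, tabulates the joint PMF of the first die \(X\) and the total \(T\), and confirms that \(p_{XT}(x,t) \ne p_X(x)\cdot p_T(t)\) for at least one cell.

```python
from fractions import Fraction
from collections import defaultdict

# Enumerate the 36 equally likely outcomes of rolling two dice.
p_xt = defaultdict(Fraction)
for d1 in range(1, 7):
    for d2 in range(1, 7):
        p_xt[(d1, d1 + d2)] += Fraction(1, 36)  # X = first die, T = sum

# Marginal PMFs of X and T.
p_x = defaultdict(Fraction)
p_t = defaultdict(Fraction)
for (x, t), p in p_xt.items():
    p_x[x] += p
    p_t[t] += p

# Independence requires p_XT(x,t) == p_X(x) * p_T(t) for every (x, t) pair.
independent = all(p_xt[(x, t)] == p_x[x] * p_t[t]
                  for x in p_x for t in p_t)
print(independent)  # False: X and T are dependent
```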

Expected Values and Variances

Recall that the expected value of a random variable is the sum of every outcome multiplied by its probability of occurring. This definition applies to joint probability distributions as well, with every outcome multiplied by its corresponding marginal PMF. The definition of the variance is adjusted in the same way: each squared value of the random variable is multiplied by its marginal PMF, and the square of the expected value is then subtracted from the sum. As always, the standard deviations \(\sigma_X\) and \(\sigma_Y\) are the square roots of their respective variances.

Expected Value and Variance of \(X\) and \(Y\)

\begin{align}\label{Eq:EVandVar}
\nonumber E(X) &=\sum_{\textrm{ all } x}\; x\cdot p_{X}(x), \\ \textrm{ } \\
\textrm{Var}(X) &= \sum_{\textrm{ all } x}\; x^2\cdot p_{X}(x)-\big(E(X)\big)^2 \\ \textrm{ } \\
\textrm{ for any } x& \in \{x_1,x_2,... \} \\ \textrm{ } \\ \textrm{ } \\
E(Y) &=\sum_{\textrm{ all } y} \;y\cdot p_{Y}(y), \\ \textrm{ } \\
\textrm{Var}(Y) &= \sum_{\textrm{ all } y}\; y^2\cdot p_{Y}(y)-\big(E(Y)\big)^2 \\ \textrm{ } \\
\textrm{ for any } y& \in \{y_1,y_2,... \}
\end{align}

Referring to the example above with \(T\) measuring the sum of two dice and \(X\) measuring the value of the first die, for the expected value of \(T\) we would have
\begin{align}\label{Eq:EV}
\nonumber E(T) &=\sum_{\textrm{ all } t}\; t\cdot p_{T}(t) \\ \textrm{ } \\
E(T) &=2\big(\tfrac{1}{36}\big)+3\big(\tfrac{2}{36}\big)+4\big(\tfrac{3}{36}\big)+5\big(\tfrac{4}{36}\big)+6\big(\tfrac{5}{36}\big)+7\big(\tfrac{6}{36}\big)+8\big(\tfrac{5}{36}\big)+9\big(\tfrac{4}{36}\big)+10\big(\tfrac{3}{36}\big)+11\big(\tfrac{2}{36}\big)+12\big(\tfrac{1}{36}\big) \\ \textrm{ } \\
E(T) &=7
\end{align}
For the variance of \(T\), we would have
\begin{align}\label{Eq:Var}
\nonumber \textrm{Var}(T) &=\sum_{\textrm{ all } t}\; t^2\cdot p_{T}(t)-\big(E(T)\big)^2 \\ \textrm{ } \\
\textrm{Var}(T) &=\sum_{\textrm{ all } t}\; t^2\cdot p_{T}(t)-7^2 \\ \textrm{ } \\
\textrm{Var}(T) &=2^2\big(\tfrac{1}{36}\big)+3^2\big(\tfrac{2}{36}\big)+4^2\big(\tfrac{3}{36}\big)+5^2\big(\tfrac{4}{36}\big)+...+11^2\big(\tfrac{2}{36}\big)+12^2\big(\tfrac{1}{36}\big)-49 \\ \textrm{ } \\
\textrm{Var}(T) &=54.83-49=5.83
\end{align}
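The same arithmetic can be delegated to a few lines of code. The following sketch (illustrative Python) rebuilds the marginal PMF of \(T\) from the table above and reproduces \(E(T)=7\) and \(\textrm{Var}(T)=35/6\approx 5.83\).

```python
from fractions import Fraction

# Marginal PMF of T, the sum of two dice (taken from the table above).
p_t = {t: Fraction(6 - abs(t - 7), 36) for t in range(2, 13)}

mean_t = sum(t * p for t, p in p_t.items())                 # E(T)
var_t = sum(t**2 * p for t, p in p_t.items()) - mean_t**2   # Var(T) = E(T^2) - E(T)^2

print(mean_t)               # 7
print(var_t, float(var_t))  # 35/6, approximately 5.83
```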

Covariance

We found earlier that we can tell if two random variables are independent by comparing the probability mass function with the corresponding product of the marginal probabilities. If random variables \(X\) and \(Y\) are independent, then \(p_{XY}=p_X \cdot p_Y\) for every combination of \(x_i\) and \(y_j\). When two random variables are not independent, we can measure the relationship between them using the covariance, as defined below.
\begin{align}\label{Eq:cov}
\nonumber \textrm{Var}(X) &= E\big((X-\mu_X)^2\big) \\ \textrm{ } \\
\textrm{Cov}(X,Y) &= E\big((X-\mu_X)(Y-\mu_Y)\big)
\end{align}

The covariance calculation assumes that the two random variables are measured together, so that it is sensible to express the measurements as ordered pairs. This is very often the case when collecting multiple measurements from the same pool of subjects. Just like the variance, the covariance has units that are not particularly helpful. With the variance, we took the square root to form the standard deviation, which has the same units as the original measurements. With the covariance, we divide by the standard deviations of \(X\) and \(Y\) to eliminate the units entirely. The resulting ratio is called the correlation coefficient.

Covariance and Correlation between random variables \(X\) and \(Y\)

\begin{align}\label{Eq:cov_cor}
\nonumber \textrm{Cov}(X,Y) &= \sum_{\textrm{ all } x} \sum_{\textrm{ all } y} \; xy \cdot p_{XY}(x,y) - E(X)\cdot E(Y) \\ \textrm{ } \\
\rho_{XY} &= \textrm{Cor}(X,Y) = \displaystyle \frac{\textrm{ Cov }(X,Y)}{\sigma_X \cdot \sigma_Y}
\end{align}

The correlation coefficient \(\rho_{XY}\) will always have values between -1 and +1.

The formula for the covariance of \(X\) and \(Y\) is analogous to the variance of \(X\): instead of multiplying \(x^2\) by \(p_X\) and subtracting \(\big(E(X)\big)^2\), we multiply \(xy\) by \(p_{XY}\) and subtract \(E(X)\cdot E(Y)\). In fact, the covariance of \(X\) with itself is exactly the variance: \(\textrm{Cov}(X,X)=\textrm{Var}(X)\).
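For the two-dice example above, with \(X\) the first die and \(T\) the total, a short script makes these formulas concrete. The sketch below (illustrative Python) computes \(\textrm{Cov}(X,T)\) and \(\rho_{XT}\); because \(T\) is the first die plus an independent second die, the covariance works out to \(\textrm{Var}(X)=35/12\) and the correlation to \(1/\sqrt{2}\approx 0.707\).

```python
from fractions import Fraction
from math import sqrt

# Joint PMF of X (the first die) and T (the sum of both dice): 36 equally likely rolls.
p_xt = {}
for d1 in range(1, 7):
    for d2 in range(1, 7):
        key = (d1, d1 + d2)
        p_xt[key] = p_xt.get(key, Fraction(0)) + Fraction(1, 36)

# Marginal PMFs of X and T.
p_x, p_t = {}, {}
for (x, t), p in p_xt.items():
    p_x[x] = p_x.get(x, Fraction(0)) + p
    p_t[t] = p_t.get(t, Fraction(0)) + p

def mean(pmf):
    return sum(v * p for v, p in pmf.items())

def var(pmf):
    return sum(v**2 * p for v, p in pmf.items()) - mean(pmf)**2

# Cov(X,T) = E(XT) - E(X)E(T); the correlation divides by both standard deviations.
e_xt = sum(x * t * p for (x, t), p in p_xt.items())
cov_xt = e_xt - mean(p_x) * mean(p_t)
rho_xt = float(cov_xt) / (sqrt(var(p_x)) * sqrt(var(p_t)))

print(var(p_x), cov_xt)  # both 35/12: Cov(X,T) equals Var(X) here
print(rho_xt)            # about 0.707
```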

When the correlation coefficient is close to zero, there is no linear relationship between the variables. When the correlation coefficient is positive, then there is a direct relationship (when one variable increases, the other increases as well). When the correlation coefficient is negative, then there is an inverse relationship (when one variable increases, the other variable decreases). In other words, on a scatter plot of \(Y\) vs. \(X\), the sign of the correlation coefficient matches the sign of the slope of a line through the data. The closer the correlation coefficient is to +1 or -1, the stronger the linear relationship between the two variables.

Warning

If the covariance is not zero, then the variables are not independent, but the converse is not true.

If the covariance is zero, that does not mean the variables are independent!

A covariance of zero, or a correlation coefficient of zero, just means that there is no linear relationship between the variables. The variables could very easily be dependent and related quadratically, or logarithmically, or have any other nonlinear relationship.
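A tiny concrete instance of this warning: let \(X\) take the values \(-1\), \(0\), and \(1\) with equal probability and let \(Y=X^2\). Then \(Y\) is completely determined by \(X\), yet \(\textrm{Cov}(X,Y)=0\). The sketch below (illustrative Python) verifies the calculation.

```python
from fractions import Fraction

# X takes -1, 0, 1 with equal probability; Y = X^2 is completely determined by X,
# yet the covariance (and correlation) between X and Y is exactly zero.
p_xy = {(x, x**2): Fraction(1, 3) for x in (-1, 0, 1)}

e_x = sum(x * p for (x, y), p in p_xy.items())       # E(X) = 0
e_y = sum(y * p for (x, y), p in p_xy.items())       # E(Y) = 2/3
e_xy = sum(x * y * p for (x, y), p in p_xy.items())  # E(XY) = 0

print(e_xy - e_x * e_y)  # Cov(X, Y) = 0, even though X and Y are dependent
```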

A Visual Explanation of Covariance That Might Be Clearer

Monica Cellio posted this explanation of covariance on Stack Exchange on 11/10/2011. It might help build a useful intuition for the parameter.

Given paired \((x,y)\) data, draw their scatterplot. Each pair of points \((x_i,y_i)\), \((x_j,y_j)\) in that plot determines a rectangle: it's the smallest rectangle, whose sides are parallel to the axes, containing those points. Thus the points are either at the upper right and lower left corners (a "positive" relationship) or they are at the upper left and lower right corners (a "negative" relationship).

Draw all possible such rectangles. Color them transparently, making the positive rectangles red (say) and the negative rectangles "anti-red" (blue). In this fashion, wherever rectangles overlap, their colors are either enhanced when they are the same (blue and blue or red and red) or cancel out when they are different.

The covariance is the net amount of red in the plot (treating blue as negative values).

Here are some examples with 32 binormal points drawn from distributions with the given covariances, ordered from most negative (bluest) to most positive (reddest).

They are drawn on common axes to make them comparable. The rectangles are lightly outlined to help you see them.

Let's deduce some properties of covariance. Understanding of these properties will be accessible to anyone who has actually drawn a few of the rectangles. 🙂

Bilinearity. Because the amount of red depends on the size of the plot, covariance is directly proportional to the scale on the x-axis and to the scale on the y-axis. (Note: when we divide the covariance by the standard deviations in x and y to get the correlation coefficient, we effectively scale all of the plots to be the same size and allow them to be compared to each other)

Correlation. Covariance increases as the points approximate an upward sloping line and decreases as the points approximate a downward sloping line. This is because in the former case most of the rectangles are positive and in the latter case, most are negative.

Relationship to linear associations. Because non-linear associations can create mixtures of positive and negative rectangles, they lead to unpredictable (and not very useful) covariances. Linear associations can be fully interpreted by means of the preceding two characterizations: bilinearity and correlation.

Sensitivity to outliers. A geometric outlier (one point standing away from the mass) will create many large rectangles in association with all the other points. It alone can create a net positive or negative amount of red in the overall picture.

Interpreting Covariance

Covariance and the correlation coefficient measure the strength of the linear relationship between the variables. Just because the covariance is zero does not mean that the variables are unrelated, only that they are not linearly related. A famous group of four graphs, called Anscombe's Quartet, is shown below. Each of the four scatterplots shows the relationship between an X variable and a Y variable. All four cases have \(\mu_x=9\), \(\mu_y=7.5\), \(\sigma_x^2=11\), \(\sigma_y^2=4.125\), and \(\rho_{xy}=0.816\); in other words, they have exactly the same summary statistics, but they have very different relationships. The fact that each pair has the same correlation coefficient just means that a linear trendline through the data has the same slope.

Expected Value, Variance, and Covariance of Linear Combinations of \(X\) and \(Y\)

If \(X\) and \(Y\) are random variables, then a linear combination of \(X\) and \(Y\) could be given by \(aX + bY\). Since the expected value is a linear function, the expected value of the combination is given by:
\(E(aX+bY)=aE(X)+bE(Y)\)

The variance, however, is not a linear function. The results below can be proven without much difficulty using matrices or by expanding the sums. The variance of \(aX\) is given by:
\(\textrm{Var}(aX)=a^2 \textrm{Var}(X)\)

The variance of \(X+Y\) is given by:
\(\textrm{Var}(X+Y)=\textrm{Var}(X)+\textrm{Var}(Y)+2 \textrm{Cov}(X,Y)\)

Conceptually, this result has the same form as the algebraic expansion \((X+Y)^2=X^2+Y^2+2XY\).

Thus the variance of the linear combination is given by:
\(\textrm{Var}(aX+bY)=a^2 \textrm{Var}(X) + b^2 \textrm{Var}(Y) + 2 ab \textrm{Cov}(X,Y)\)

Note that if the two variables are independent, then \(\textrm{Cov}(X,Y)=0\), and the variance of the linear combination simplifies to
\(\textrm{Var}(aX+bY)=a^2 \textrm{Var}(X) + b^2 \textrm{Var}(Y)\)

Similarly, the covariance of a linear combination is given by:
\(\textrm{Cov}(aX + bY, cW + dZ) = ac\cdot {\textrm{Cov}}(X,W) + ad\cdot {\textrm{Cov}}(X,Z) + bc\cdot {\textrm{Cov}}(Y,W) + bd\cdot {\textrm{Cov}}(Y,Z)\)
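These formulas can be checked exactly by enumeration. The sketch below (illustrative Python; the choice of \(a=2\), \(b=3\), and the dependent pair \(X\) and \(T\) from the dice example is arbitrary) computes \(\textrm{Var}(aX+bT)\) directly from its definition and compares it with \(a^2\textrm{Var}(X)+b^2\textrm{Var}(T)+2ab\,\textrm{Cov}(X,T)\).

```python
from fractions import Fraction

# Exact check of Var(aX + bT) = a^2 Var(X) + b^2 Var(T) + 2ab Cov(X,T),
# using the dependent pair X = first die and T = sum of two dice, with a = 2, b = 3.
a, b = 2, 3
outcomes = [(d1, d1 + d2) for d1 in range(1, 7) for d2 in range(1, 7)]
p = Fraction(1, 36)  # each of the 36 rolls is equally likely

def expect(f):
    """Expected value of a function f(x, t) of the outcome."""
    return sum(f(x, t) * p for x, t in outcomes)

var_x = expect(lambda x, t: x**2) - expect(lambda x, t: x)**2
var_t = expect(lambda x, t: t**2) - expect(lambda x, t: t)**2
cov_xt = expect(lambda x, t: x * t) - expect(lambda x, t: x) * expect(lambda x, t: t)

# Left side: the variance of aX + bT computed directly from its definition.
lhs = expect(lambda x, t: (a*x + b*t)**2) - expect(lambda x, t: a*x + b*t)**2
# Right side: the linear-combination formula.
rhs = a**2 * var_x + b**2 * var_t + 2 * a * b * cov_xt

print(lhs == rhs)  # True: the decomposition holds exactly
```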

Conditional Probability

To find the probability of an event, given that another event has occurred, we refer back to the product rule:
\(\textrm{Pr}[A \cap B]=\textrm{Pr}[A|B]\cdot \textrm{Pr}[B]\) and then isolate the conditional probability by dividing both sides by \(\textrm{Pr}[B]\):

\(\textrm{Pr}[A|B] = \displaystyle \frac{\textrm{Pr}[A \cap B]}{\textrm{Pr}[B]}\)

The same holds true for probability distributions: the conditional probability mass function of \(Y\), given that \(X\) takes the value \(x\), is the joint probability mass function of \(X\) and \(Y\) evaluated at \(X=x\) and \(Y=y\), divided by the marginal probability mass function of \(X\) at \(X=x\). The conditional probability mass function is only defined when the marginal probability forming the denominator is nonzero. In other words, if \(X=x\) cannot occur, then it does not make sense to find the probability of \(Y\) given \(X=x\).

Conditional Probability Mass Functions of \(X\) and \(Y\)

\begin{align}\label{Eq:conditionals}
\nonumber p_{X|Y}(x|y) &= \displaystyle \frac{p_{XY}(x,y)}{p_Y(y)} \hspace{20pt} \textrm{ for any } y \in \{y_1,y_2,... \} \textrm{ where } p_Y(y)\ne 0 \\ \textrm{ } \\
p_{Y|X}(y|x) &= \displaystyle \frac{p_{XY}(x,y)}{p_X(x)} \hspace{20pt} \textrm{ for any } x \in \{x_1,x_2,... \} \textrm{ where } p_X(x)\ne 0
\end{align}
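For instance, in the two-dice example the conditional PMF of the total, given that the first die shows a 1, should be uniform over the values 2 through 7. The sketch below (illustrative Python) computes \(p_{T|X}(t\,|\,x=1)\) directly from the definition.

```python
from fractions import Fraction
from collections import defaultdict

# Joint PMF of X (first die) and T (sum), as before.
p_xt = defaultdict(Fraction)
for d1 in range(1, 7):
    for d2 in range(1, 7):
        p_xt[(d1, d1 + d2)] += Fraction(1, 36)

# Marginal of X at x = 1, then the conditional PMF p_{T|X}(t | x = 1).
x = 1
p_x1 = sum(p for (xi, t), p in p_xt.items() if xi == x)  # 1/6
p_t_given_x1 = {t: p_xt[(x, t)] / p_x1
                for t in range(2, 13) if p_xt[(x, t)] > 0}

print(p_t_given_x1)  # every total from 2 through 7 has conditional probability 1/6
```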

Examples

Many new topics have been introduced in this section: validity, independence, marginal probability, expected value, variance, covariance, correlation, and conditional probability. The examples below work through many of these concepts in the context of a problem involving theoretical probabilities.

Example

A random sample of 4 rings is selected from a bag containing 3 silver rings, 2 red rings, and 3 blue rings. Given that \(X\) is the number of silver rings and \(Y\) is the number of red rings, find the joint probability mass function of \(X\) and \(Y\), and then find the marginal probabilities.
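One way to set up this problem is by counting: sampling without replacement gives \(p_{XY}(x,y)=\binom{3}{x}\binom{2}{y}\binom{3}{4-x-y}\big/\binom{8}{4}\). The sketch below (illustrative Python, not part of the original solution) tabulates this joint PMF and its marginals.

```python
from math import comb
from fractions import Fraction

# 8 rings in the bag: 3 silver, 2 red, 3 blue; 4 are drawn without replacement.
# X = number of silver rings drawn, Y = number of red rings drawn.
total = comb(8, 4)
p_xy = {}
for x in range(0, 4):          # at most 3 silver rings available
    for y in range(0, 3):      # at most 2 red rings available
        blue = 4 - x - y       # the rest of the sample must be blue
        if 0 <= blue <= 3:
            p_xy[(x, y)] = Fraction(comb(3, x) * comb(2, y) * comb(3, blue), total)

assert sum(p_xy.values()) == 1  # validity check

# Marginal PMFs of X and Y.
p_x = {x: sum(p for (xi, y), p in p_xy.items() if xi == x) for x in range(4)}
p_y = {y: sum(p for (x, yi), p in p_xy.items() if yi == y) for y in range(3)}
print(p_x)
print(p_y)
```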

Example

Use the results from the previous example for choosing rings, and show that \(p_{XY}\) is a valid probability mass function and that random variables \(X\) and \(Y\) are dependent.

Example

Use the results from the previous example for choosing rings, and find the expected values, variances, and standard deviations of the random variables \(X\) and \(Y\).

Since \(X\) and \(Y\) are dependent, measure the strength of the linear relationship between them by calculating the covariance and correlation coefficient.

Example

Use the results from the previous example for choosing rings to answer the following questions.

  1. What is the probability of picking fewer than three silver rings?
  2. If you picked no red rings, what is the probability that you picked at least two silver rings?
  3. If you picked one silver ring, what is the probability that you picked at least one red ring?

Example

Often, the PMF is not derived from theoretical probabilities but is instead estimated from experimental results. The data below shows the outcomes of vehicle crashes when using different types of safety equipment. The random variable \(X\) corresponds to the level of injury and the random variable \(Y\) corresponds to the type of safety equipment used. Use the table to answer the following questions:

  1. If you sustained minor injuries, what is the probability that you had both a seat belt and an airbag?
  2. If you used both a seat belt and an airbag, what is the probability that you sustained minor injuries?
  3. It's reasonable to assume that the injury level is not independent of the safety equipment used. Confirm this, and find the correlation coefficient.
\(X\textrm{ \ }Y\) None (0) Belt Only (1) Belt + Airbag (2) \(p_{X}(x_i)\)
None (0) 0.065 0.075 0.06 0.20
Minor (1) 0.175 0.16 0.115 0.45
Major (2) 0.135 0.10 0.065 0.30
Death (3) 0.025 0.015 0.01 0.05
\(p_{Y}(y_j)\) 0.40 0.35 0.25 1
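One way to work through these questions is to load the table into a short script. The sketch below is illustrative Python (the use of NumPy is a choice of this sketch, not something required by the example); it applies the conditional-probability and correlation formulas from earlier in this section to the tabulated values.

```python
import numpy as np

# Joint PMF from the crash table: rows are injury level X (0 = None, 1 = Minor,
# 2 = Major, 3 = Death), columns are equipment Y (0 = None, 1 = Belt Only, 2 = Belt + Airbag).
p = np.array([[0.065, 0.075, 0.060],
              [0.175, 0.160, 0.115],
              [0.135, 0.100, 0.065],
              [0.025, 0.015, 0.010]])

p_x = p.sum(axis=1)  # marginal PMF of the injury level
p_y = p.sum(axis=0)  # marginal PMF of the equipment type

# Question 1: Pr[Y = 2 | X = 1] = p_XY(1, 2) / p_X(1)
print(p[1, 2] / p_x[1])
# Question 2: Pr[X = 1 | Y = 2] = p_XY(1, 2) / p_Y(2)
print(p[1, 2] / p_y[2])

# Question 3: confirm dependence, then compute the covariance and correlation coefficient.
print(np.allclose(p, np.outer(p_x, p_y)))  # False: X and Y are not independent
x_vals, y_vals = np.arange(4), np.arange(3)
e_x, e_y = x_vals @ p_x, y_vals @ p_y
e_xy = x_vals @ p @ y_vals
cov_xy = e_xy - e_x * e_y
rho_xy = cov_xy / np.sqrt((x_vals**2 @ p_x - e_x**2) * (y_vals**2 @ p_y - e_y**2))
print(cov_xy, rho_xy)  # a small negative relationship: more equipment, less severe injury
```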