A linear probability model is a statistical model used to predict whether or not some event occurs, given certain characteristics about it. This type of model is often used to predict the likelihood of something happening, such as buying a particular product, based on factors like age, gender, income level, etc. A linear probability model assumes that there exists a linear relationship between those factors and the outcome.
This relationship is typically represented by a straight line equation, where the predicted value of the outcome is equal to a constant plus a weighted sum of the independent variables multiplied by their respective coefficients. Linear probability models are commonly used in marketing research, political science, economics, and sociology.
How intuitive are odds ratios?
Odds ratios are just another way of looking at logistic regression coefficients. They make sense because you know how to calculate the odds of something happening given certain conditions. For example, if I told you that there was a 90% chance that my friend had chicken pox, what would you think about that? You probably wouldn’t believe me. But if I told you that the chances of getting sick were 10 times greater if she did have chicken pox, you might start thinking differently.
If you look at the odds ratio, however, you see that it tells us exactly the same thing. There’s no difference between saying that the odds of having chicken pox are ten times greater if she does have it and saying that the odds of her having chicken pox are twice as great if she doesn’t have it. In fact, the odds ratio is simply the exponential of the logarithm of the odds. So, if the odds of having chickenpox is 0.9, the odds ratio for someone without it is 0.9 × 0.9 0.81; while the odds ratio for someone who already has it is 0.9/0.1 9.
So, why do people often use odds ratios rather than the logistic regression coefficient itself? Well, the logistic regression coefficient is much harder to understand than the odds ratio. And even though the odds ratio is easier to explain, it’s still very difficult to interpret.
Interpretability
We will begin by comparing the two models explicitly. A dichotomy is a situation where the outcome Y has values 1 and 0, and p represents the probability that Y will be 1, given some value of the regressor X. Based on these models, we arrive at the following:
The equation p is equal to a0 + a1X1 + a2X2 + … + a kXk
The ln[p/(1-p)] value is equal to the product of the numbers b0 + b1X1 + b2X2 + … + bkXk
A linear model assumes that probability p depends linearly on the regressors, whereas a logistic model assumes that p/(1-p) depends linearly on the regressors.
Linear models are more interpretable than other models. For example, if a1 is .05, a 1-unit increase in X1 increases the probability of Y being 1 by 5 percentage points. Most people understand what it would mean to increase their chances of voting, dying, or becoming obese by five percentage points.
A logistic model is less interpretable than a linear model. A .05 increase in b1 in the logistic model results in an increase in log odds that Y is 1 for a one-unit increase in X1. In what sense does that mean? In my experience, I have never encountered anyone with any intuition for log odds.
A rule of thumb
If you’re building a machine learning model, it’s important to know whether the predictions you’re making are extremely likely or very unlikely. This distinction affects how you interpret the output of the model, and what you do next.
For example, suppose you’re trying to predict whether someone will vote in an election. You might build a model that predicts whether someone will vote based on his or her age, gender, income level, race, religion, education level, occupation, political affiliation, marital status, number of children, and state of residence.
You could choose to make the prediction based on each of those factors separately, or you could combine them into one big model, like this:
Age + Gender + Income Level + Race + Religion + Education Level + Occupation + Political Affiliation + Marital Status + Number Of Children + State Of Residence Voter
The problem here is that there are four categories where we’d expect to see almost no correlation between voter behavior and the listed factors. For instance, people of different races tend to live in states with different economic conditions. And while voters of different religions tend to hold similar views on politics, they don’t necessarily agree on everything else.
So if we take the average of these four variables, we’ll end up predicting that most people will vote based on factors that are completely unrelated to their true motivations. We might even end up predicting that a majority of people will vote based on their gender alone.
That’s why it makes sense to look at the predicted probabilities from our combined model. If the predicted probability of voting is close to zero, then our model doesn’t really tell us anything useful. But if the predicted probability is high, then we know that something interesting is happening.
How nonlinear is the logistic model?
The logistic model is often used in practice because it is easier to understand. If we want to know whether someone will buy something, we might ask how likely he or she is to do so. This question is answered by some simple math. We take the probability of buying the product, multiply it by the price of the product, add up those products across all people, and divide by the total number of people.
This method works fine for most things. But what about something like whether someone will vote for you in an election? Or whether someone is likely to get cancer? These questions don’t really make sense in terms of probabilities. They’re more like percentages—how likely is it that I’ll win the election? How likely is it that I’m going to get cancer?
When we talk about probabilities, we mean “the chances of something happening.” So we’d say that there’s a 50% chance that I’ll win the next presidential election. We could use our old formula to calculate that likelihood. However, we usually prefer to use the logistic model. Why? Because it makes it easy to see why the answer is 50%.