Logistic Regression is Worth Learning
Logistic Regression (LR) is not a divine truth that exists as a Platonic ideal. It is possible to design LR from first principles. LR is a great gateway to learn a variety of topics in mathematics, statistics and machine learning. And as a bonus, LR is useful in practice. Let’s talk about LR.
Corporate roleplay
It is Tuesday. You are enjoying a cup of coffee when your manager calls you. She says that you need to predict whether a customer goes away (churns) within one month. You listen to her until you agree to do it, probably because your coffee is getting cold. Eventually you start to think about the problem and come up with a very short specification:
- Churn happens or doesn’t. It is a binary outcome per customer.
- Your manager wants to prioritize her time with customers that are the most likely to churn. She needs the estimated probability of churn per customer.
- Your manager will be interested to understand how the predictions are made, especially if they are accurate. She might learn useful insights on how different factors affect customer churn.
- The probability estimates should not fluctuate heavily all the time.
Start simple
Time to define the churn prediction task. We need a model that outputs the estimated probability of churn \(\hat{p}\):
$$ \hat{p}=P(churn | \vec{x}) \in [0,1] $$
The symbol \(\vec{x}\) denotes a \(d\)-dimensional vector \(\vec{x}=\{x_1,x_2,…,x_d\}\) of \(d\) numbers. Let’s call them data vectors. They are also known as feature vectors in machine learning lingo. If a specific customer has a corresponding data vector \(\vec{x}\), then this customer has an estimated churn probability of \(P(churn | \vec{x})\). A math person might write \(\vec{x} \in \mathbb{R}^d\), meaning that the data vectors are contained in \(d\)-dimensional Euclidean space. Relevant data vectors for churn prediction might be available in your Customer Relationship Management (CRM) system database.
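To make this concrete, here is what a data vector might look like in code. The features and their values are made up for illustration; they are not from any real CRM schema:

```python
import numpy as np

# Hypothetical churn-related features for one customer:
# months as a customer, monthly spend, support tickets in the last 90 days.
x = np.array([14.0, 29.9, 3.0])  # a data vector in R^3, so d = 3
```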
We need to define a statistical model for \(\hat{p}\). Like, explicitly tell how \(\hat{p}\) is manifested in our world. Let’s define a simple model for easy computation, inference and implementation. You are also more likely to see consistent results. Not so much with large neural networks where random initialization can give you a different winning lottery ticket every time, even if the data do not change. Machine learning is probably the only field where winning the lottery means that you have to work and grind even harder.
You learn to expect that training a large neural network over and over again using the same data gives you different results.
However.
The definition of insanity is doing the same thing over and over again and expecting different results.
Deep learning, not even once. Makes you crazy by definition… You can stop throwing tomatoes at me. I understand that if you fix every initial condition, you will get identical results over repetitions. And that there are techniques making the results more stable.
Can you imagine how unhappy your manager might become, and how hard her job would be, if “nothing changes” and the estimates change significantly? Of course we are assuming that you get acceptable results using a simple model. You can always test if a simple model is accurate enough in your business context. Anyway, I believe it is always a good idea to start with a simple model when it makes sense. Tip: you might need a complex model for speech recognition.
Ok, sorry for that digression, let’s get back to designing a simple model for \(\hat{p}\). How about the following linear functional form using a dot product?
$$ \hat{p}(\vec{x},\vec{w}) = \vec{x} \cdot \vec{w} = \sum\nolimits_i^d{x_iw_i} = x_1w_1 + … + x_dw_d $$
How is the dot product useful for our purposes? It matches patterns exhibited by customers with a high probability of churn and activates on them. It maps a customer-specific \(\vec{x}\) to a large value if \(\vec{x}\) contains patterns that indicate churn. Notice that you can only tweak the parameter vector \(\vec{w}\) because \(\vec{x}\) is already fixed and given. This is why \(\vec{w}\) contains the model’s knowledge, why it is called a parameter vector and why finding the values of \(\vec{w}\) is called training the model.
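A minimal numeric sketch of that intuition, with made-up numbers:

```python
import numpy as np

x = np.array([14.0, 29.9, 3.0])   # fixed, given data vector
w = np.array([-0.1, 0.02, 0.8])   # tweakable parameter vector

# The dot product activates strongly when x aligns with the pattern in w.
score = np.dot(x, w)              # 14*(-0.1) + 29.9*0.02 + 3*0.8 = 1.598
print(score)
```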
Unfortunately, the functional form that we designed sucks:
- The range of the dot product \(\vec{x} \cdot \vec{w}\) is \((-\infty,\infty)\), which is not great if we want \(\hat{p}\) to represent a probability. Our earlier definition of \(P\), and the definition of a probability in general, says that a probability takes values within \([0,1]\). How about using the dot product as an unnormalized probability? No thanks, it’s an ad hoc hack. How large would the value have to be to indicate churn? It is a nightmare to find a reliable magic threshold.
- If the norms of \(\vec{x}\) or \(\vec{w}\) are large, then large changes are required for significant change in \(\hat{p}\). It is a gateway to overfitting your models to noise or having some variables dominate others just because they are measured in larger absolute units.
Time for a new iteration.
Responsive math design
How to make \(\hat{p}\) more sensitive to smaller changes in \(\vec{x}\) or \(\vec{w}\)? Exponents. Those have some serious firepower. Let’s use the natural exponential function as follows:
$$ \hat{p}(\vec{x},\vec{w}) = e^{\vec{x} \cdot \vec{w}} $$
Now changes in \(\vec{x}\) or \(\vec{w}\) result in multiplicative change in \(\hat{p}\).
The previously mentioned problem of \(\hat{p}\) not being a probability is still here. The value of \(\hat{p}\) is between zero and positive infinity: \(e^{\vec{x} \cdot \vec{w}} \in (0,\infty)\). Therefore, \(\hat{p}\) is not a valid probability unless we use duct tape and define \(\hat{p}=\min(1,e^{\vec{x} \cdot \vec{w}})\). Just… no. Let’s come up with something better.
Houston, we have a probability
What is a suitable function that maps \((0,\infty)\) to \((0,1)\)?
Time passes, the head scratching is audible and nothing comes up. Time to quit? No. We should try something different. So far we have been focusing on the right side of the equation.
Let’s define a function called odds to map a probability \(p \in [0,1)\) to \([0,\infty)\):
$$ odds(p) = \frac{p}{1-p} $$
The inverse function of odds maps \([0,\infty)\) back to a probability \([0,1)\):
$$ p(odds) = \frac{odds}{1+odds} $$
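These two functions are simple to express in code. A sketch:

```python
def odds(p: float) -> float:
    """Map a probability p in [0, 1) to odds in [0, inf)."""
    return p / (1.0 - p)

def p_from_odds(o: float) -> float:
    """Inverse of odds: map odds in [0, inf) back to a probability in [0, 1)."""
    return o / (1.0 + o)

print(odds(0.8))         # 4.0, i.e. 4:1 odds
print(p_from_odds(4.0))  # 0.8, round trip back to the probability
```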
Ok. So we have the following functions available:
| Function | Domain (from) | Range (to) |
|---|---|---|
| \(odds(p) = \frac{p}{1-p}\) | \([0,1)\) | \([0,\infty)\) |
| \(p(\vec{x},\vec{w})=e^{\vec{x} \cdot \vec{w}}\) | \((-\infty,\infty)\) | \((0,\infty)\) |
Interesting. Two equations with matching legos. Since the ranges are compatible, it is possible to model the odds with the exponentiated dot product:
$$ odds(p) = p(\vec{x},\vec{w}) $$
Now throw the inverse odds function into the mix:
$$ p(odds(\vec{x},\vec{w})) = \frac{e^{\vec{x} \cdot \vec{w}}}{1+e^{\vec{x} \cdot \vec{w}}} = \frac{1}{1+e^{-\vec{x} \cdot \vec{w}}} $$
You’ll get the last form by multiplying \(\frac{e^{\vec{x} \cdot \vec{w}}}{1+e^{\vec{x} \cdot \vec{w}}}\) with \(\frac{e^{-\vec{x} \cdot \vec{w}}}{e^{-\vec{x} \cdot \vec{w}}}\). Since \(p\) behaves like a probability, and we want a probability, let’s redefine \(\hat{p}\) using \(p\) like a coder who re-assigns a variable value:
$$ \hat{p} = P(churn | \vec{x}) = p(odds(\vec{x},\vec{w})) = \frac{1}{1+e^{-\vec{x} \cdot \vec{w}}} $$
Well, well, look who’s here. Our good friend Logistic Regression. The model outputs a valid probability estimate and its decision boundary is linear. We have a model that fulfills our short specification for churn prediction. We might be wrong but at least we are not guaranteed to be wrong.
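Here is the model as a code sketch. The clipping of the exponent is a small numerical-stability guard I added; it is not part of the math above:

```python
import numpy as np

def predict_proba(x: np.ndarray, w: np.ndarray) -> float:
    """Estimated churn probability: 1 / (1 + exp(-x . w))."""
    z = np.clip(np.dot(x, w), -500.0, 500.0)  # guard against overflow in exp
    return 1.0 / (1.0 + np.exp(-z))
```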
Please wait. Do not deploy your model in production yet. Your manager is not going to be happy with the results. Why? The model spews out noise; it makes random guesses, no intelligence involved. Why? There is zero knowledge in the model since \(\vec{w}\) still holds its arbitrary initial values. The model needs to be trained and the parameters adjusted using training data.
Why the name?
Why is LR called Logistic Regression? Let’s take some shortcuts instead of repeating all the previous steps. The following function is called logit and it is the logarithm of our previously defined odds function:
$$ logit(p) = ln(\frac{p}{1-p}) \in (-\infty,\infty) $$
Let’s build a model where the logit is defined as the previously used dot product. It is fine since \(\vec{x} \cdot \vec{w} \in (-\infty,\infty)\):
$$ ln(\frac{p}{1-p}) = \vec{x} \cdot \vec{w} $$
Solve for \(p\) and you’ll get:
$$ p = \frac{e^{\vec{x} \cdot \vec{w}}}{1+e^{\vec{x} \cdot \vec{w}}} = \frac{1}{1+e^{-\vec{x} \cdot \vec{w}}} = \sigma(\vec{x} \cdot \vec{w}) $$
We find LR again. The function \(\sigma\) is called the sigmoid function and it is a special case of the logistic function. The sigmoid function is actually the inverse of the logit function.
Regression comes from the linear regression model \(\vec{x} \cdot \vec{w}\) that we throw into the sigmoid function. That’s the Logistic and the Regression in Logistic Regression.
Learning in machine learning
We have defined the model, which turned out to be LR. The next step is to put learning in machine learning. For our LR, learning means finding a parameter vector \(\vec{w}\) that outputs a large value of \(\vec{x} \cdot \vec{w}\) when \(\vec{x}\) corresponds to a customer with a high probability of churn, and vice versa.
What is a principled basis for tweaking the parameter vector? First of all, it is mathematically impossible to create information from nothing. We need to continue injecting knowledge and assumptions into the system, which we already started by defining the functional form of our LR model. How about observations of what has happened so far? Well, at least they might reflect the real world whereas our own guesses might not. Luckily your CRM system maintains a database of past and current customer behavior. We can extract \(n\) pairs of data vectors \(\vec{x}_i\) and churn outcomes \(y_i \in \{0,1\}\). This set of actual, historical behavior is called a dataset:
$$ \mathbf{X}=\{(\vec{x}_1,y_1),…,(\vec{x}_n,y_n)\} \subset \mathbb{R}^d \times \{0,1\} $$
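In code, such a dataset is just a matrix of data vectors paired with a vector of outcomes. A sketch with made-up toy values:

```python
import numpy as np

# n = 4 customers, d = 3 features per customer (all values made up).
X = np.array([
    [14.0, 29.9, 3.0],
    [60.0,  9.9, 0.0],
    [ 2.0, 49.9, 5.0],
    [36.0, 19.9, 1.0],
])
y = np.array([1, 0, 1, 0])  # churn outcomes, one per customer
```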
Okay. How to get started in figuring out a suitable \(\vec{w}\)? How do we even know what is a good \(\vec{w}\)? Well, it is impossible to define what is good if there is no reference of bad.
If you can not measure it, you can not improve it.
Thanks, Lord Kelvin. Indeed, it is a fundamental necessity to have a function \(L(\vec{w} | \mathbf{X})\) that tells us how good a fit a given parameter vector is for the available dataset. Let’s call it a likelihood function. This formulation lets us find a parameter vector \(\vec{w}_*\) that is consistent with the dataset by maximizing the likelihood function value:
$$ \vec{w}_*=\operatorname*{argmax}_{\vec{w}}L(\vec{w} | \mathbf{X}) $$
Therefore, \(\vec{w}_*\) gives us an LR model that is aligned with the dataset. The implicit assumption for making predictions using previously unseen data vectors is that the patterns in past behavior are indicative and predictive of future behavior. If this assumption does not hold for churn prediction, then you’ll get incorrect predictions. Therefore, we will continue by assuming that many customers share similar reasons for churn. Let’s call \(\vec{w}_*\) a solution since it is the parameter vector of the highest likelihood for the given data.
Notice that if we had a dataset of infinite size (\(n=\infty\)), then we wouldn’t need LR at all. We could calculate the churn probability as the ratio between the number of churn positives and total observations for a data vector. It would be a database query. However, since obviously we don’t have an infinite amount of data, let alone a computer that runs for an infinite time, we have to make assumptions and find patterns. And hope that the customers now and in the future exhibit the same patterns.
Ok. What next? We need to define the functional form of the likelihood function that computes the compatibility of a parameter vector with a given dataset. Let’s start by defining the likelihood function for a single pair of a data vector and outcome. Notice that it is a dataset of size \(n=1\). How about the following definition?
$$ L(\vec{w} | \vec{x},y) = \begin{cases} \hat{p}(\vec{x},\vec{w}) & \text{if } y=1 \\ 1-\hat{p}(\vec{x},\vec{w}) & \text{if } y=0 \end{cases} $$
If a customer did actually churn (\(y=1\)), then the likelihood is the estimated probability of churn. If the customer didn’t churn (\(y=0\)), then the likelihood is the estimated probability of not churning. No matter the outcome, if we maximize the value of \(L(\vec{w} | \vec{x},y)\), then we maximize the fit of our LR model to the data. It is exactly what we want. In other words, maximizing the likelihood function maximizes the amount of correctly assigned probability. Let’s do what coders do and refactor it into a single expression for mathematical convenience:
$$ L(\vec{w} | \vec{x},y) = \hat{p}(\vec{x},\vec{w})^{y}(1-\hat{p}(\vec{x},\vec{w}))^{1-y} $$
Done. The definition has not changed. Let’s assume that the pairs of \(\vec{x}_i,y_i\) are independent and identically distributed. Now we can generalize the definition of the likelihood function for datasets of size \(n \ge 1\) by defining a joint probability:
$$ L(\vec{w} | \mathbf{X}) = \prod\nolimits_i^n \hat{p}(\vec{x_i},\vec{w})^{y_i}(1-\hat{p}(\vec{x_i},\vec{w}))^{1-y_i} $$
Phew! There it is. Maximize \(L(\vec{w} | \mathbf{X})\) to get a solution and call it a day. However, aren’t we missing something quite essential here? We have to maximize the likelihood function and it’s not going to happen by wishful thinking alone.
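Before we continue, here is the likelihood as a code sketch. Note that a product of many probabilities underflows towards zero for large \(n\); the logarithm in the next section fixes that in practice:

```python
import numpy as np

def likelihood(w: np.ndarray, X: np.ndarray, y: np.ndarray) -> float:
    """L(w | X): product of per-observation likelihoods."""
    p_hat = 1.0 / (1.0 + np.exp(-X @ w))  # estimated churn probability per row
    return np.prod(np.where(y == 1, p_hat, 1.0 - p_hat))
```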
Learning to learn continues
The situation is as follows. We have a dataset and we have a likelihood function. The next step is to find a parameter vector that maximizes the likelihood function. It smells like an optimization task. Where and how to get started? Maybe you could gamble and sample random parameter vectors from a uniform distribution for ten minutes, and save the \(\vec{w}\) with the highest likelihood function value. Well… you might get lucky and get reasonable results. Or not. It is not deterministic; there are no guarantees of optimality even if you throw dice for the rest of your life.
Do you happen to remember calculus? Derivatives, integrals, rate of change. The stuff that was butchered by a boring teacher. The stuff that you never used? Good news. You did not study for nothing. We have a real use case for calculus. Let’s start.
Calculus is the mathematics of change where derivatives measure the rate of change of a function at a given point. Like, how stable the particular location of a function is. Or how sensitive the function is to a given input. We can use derivatives to define the change in our likelihood function given a parameter vector as an input. Sounds promising. Actually, we can calculate the direction where we need to move the parameter vector to increase the value of the likelihood function. That’s exactly what we need, a tool served for us on a silver platter. To use this mighty tool, we need to define the derivatives of the likelihood function with respect to the individual elements of a parameter vector. They are called partial derivatives. When we collect the partial derivatives into a vector, then it is called a gradient. Now, the gradient gives us the direction from \(\vec{w}\) that increases the \(L(\vec{w} | \mathbf{X})\) the fastest.
We have already spelled out the algorithm for maximizing the likelihood function. Push the parameter vector towards the direction defined by the gradient of the likelihood function with respect to the parameter vector. If we take repeated turns between calculating the gradient and updating the parameter vector, we arrive at an optimization algorithm called gradient ascent. Hopefully math people are not offended by our abuse of notation when we define the following assignment using the equality sign:
$$ \vec{w} = \vec{w} + \alpha\nabla_wL $$
The alpha is a parameter of the gradient ascent, and it is not a parameter of the LR model. In machine learning lingo, the alpha is a hyperparameter and it is called a learning rate in this gradient ascent context. The alpha is utilized to stabilize the gradient ascent by not taking too large or too small steps in the direction of the gradient. You can try \(\alpha=\frac{0.05}{n}\) as a starting point. The meaning behind the name gradient ascent is hopefully clear. The parameter vector ascends towards the direction of increased likelihood.
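We have not derived the gradient yet, but as a sketch, you could already approximate it numerically with central finite differences and take ascent steps. It is slow, but it shows the mechanics before the analytic gradient arrives in the next section:

```python
import numpy as np

def numerical_gradient(f, w: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Approximate the gradient of f at w with central finite differences."""
    grad = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (f(w + step) - f(w - step)) / (2.0 * eps)
    return grad

# One gradient ascent step, assuming the likelihood function sketched earlier:
# alpha = 0.05 / n
# w = w + alpha * numerical_gradient(lambda w: likelihood(w, X, y), w)
```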
Now, to train the LR model, we can repeat the gradient ascent step until the likelihood function does not increase anymore. Done, lights out, time to go home now. Or not. We haven’t derived the gradient yet. I promise that this is as deep as the recursion gets. Light is visible at the end of the tunnel already.
Gradient without colors
The parameter vector has \(d\) elements. However, without loss of generality, we can derive the partial derivative of the likelihood function with respect to the \(j\):th element, and just copy it for the remaining elements to form the gradient. As a refresher, the following was the definition of the likelihood function:
$$ L(\vec{w} | \mathbf{X}) = \prod\nolimits_i^n \hat{p}(\vec{x_i},\vec{w})^{y_i}(1-\hat{p}(\vec{x_i},\vec{w}))^{1-y_i} $$
The product is quite hairy and feels troublesome. Like, the equation is a giant multiplication. Have fun differentiating monster-sized equations. No thanks.
Let’s use a transformation that doesn’t change the solution \(\vec{w}_*\) while it makes the derivation easier. One such function is the natural logarithm, which is a monotonic function. It has the awesome property of converting products into sums:
$$ ln(L(\vec{w} | \mathbf{X})) = l(\vec{w} | \mathbf{X}) = \sum\nolimits_i^n [y_iln(\hat{p}(\vec{x_i},\vec{w})) + (1-y_i)ln(1-\hat{p}(\vec{x_i},\vec{w}))] $$
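The log-likelihood is also what you would implement in practice, since it avoids the underflow of the product form. A sketch:

```python
import numpy as np

def log_likelihood(w: np.ndarray, X: np.ndarray, y: np.ndarray) -> float:
    """l(w | X): sum of per-observation log-likelihoods."""
    p_hat = 1.0 / (1.0 + np.exp(-X @ w))
    # In practice, a tiny epsilon inside the logs guards against log(0).
    return np.sum(y * np.log(p_hat) + (1.0 - y) * np.log(1.0 - p_hat))
```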
The partial derivative of \(l(\vec{w} | \mathbf{X})\) with respect to the \(j\):th element of \(\vec{w}\) is denoted as \(\frac{\partial l(\vec{w} | \mathbf{X})}{\partial w_j}\). Time to wear a calculus hat and start deriving the derivative. Let’s temporarily use a shorter notation \(\hat{p}(\vec{x},\vec{w})=\sigma(z)=\hat{p}\) where \(z=\vec{x} \cdot \vec{w}\) to make the equations shorter.
$$ \begin{aligned} \frac{\partial l(\vec{w} | \mathbf{X})}{\partial w_j} &= \frac{\partial}{\partial w_j}[\sum\nolimits_i^n y_iln(\hat{p}) + (1-y_i)ln(1-\hat{p})] \cr &= \sum\nolimits_i^n \frac{\partial}{\partial w_j}[y_iln(\hat{p}) + (1-y_i)ln(1-\hat{p})] \cr &= \sum\nolimits_i^n \frac{\partial}{\partial w_j}[y_iln(\hat{p})] + \frac{\partial}{\partial w_j}[(1-y_i)ln(1-\hat{p})] \cr &= \sum\nolimits_i^n y_i\frac{\partial ln(\hat{p})}{\partial \hat{p}}\frac{\partial \hat{p}}{\partial w_j} + (1-y_i)\frac{\partial ln(1-\hat{p})}{\partial \hat{p}}\frac{\partial \hat{p}}{\partial w_j} \cr &= \sum\nolimits_i^n y_i\frac{1}{\hat{p}}\frac{\partial \hat{p}}{\partial z}\frac{\partial z}{\partial w_j} + (1-y_i)\frac{-1}{1-\hat{p}}\frac{\partial \hat{p}}{\partial z}\frac{\partial z}{\partial w_j} \cr &= \sum\nolimits_i^n \frac{y_i}{\hat{p}}\frac{\partial \hat{p}}{\partial z}\frac{\partial z}{\partial w_j} + \frac{y_i-1}{1-\hat{p}}\frac{\partial \hat{p}}{\partial z}\frac{\partial z}{\partial w_j} \cr \end{aligned} $$
Okay… let’s pause here and derive \(\frac{\partial \hat{p}}{\partial z}\) separately so that we can have a break from the previous monster.
$$ \begin{aligned} \frac{\partial \hat{p}}{\partial z} &= \frac{\partial}{\partial z}\frac{1}{1+e^{-z}} \cr &= \frac{\partial}{\partial z}(1+e^{-z})^{-1} \cr &= -(1+e^{-z})^{-2}\frac{\partial}{\partial z}(1+e^{-z}) \cr &= -(1+e^{-z})^{-2}e^{-z}\frac{\partial}{\partial z}[-z] \cr &= (1+e^{-z})^{-2}e^{-z} \cr &= \frac{e^{-z}}{(1+e^{-z})^2} \cr &= \frac{1}{1+e^{-z}}\frac{e^{-z}}{1+e^{-z}} \cr &= \hat{p}(1-\hat{p}) \cr \end{aligned} $$
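You can sanity-check the neat result \(\hat{p}(1-\hat{p})\) numerically by comparing it against a finite-difference approximation. A sketch:

```python
import numpy as np

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + np.exp(-z))

z, eps = 0.7, 1e-6
analytic = sigmoid(z) * (1.0 - sigmoid(z))
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2.0 * eps)
print(abs(analytic - numeric))  # should be very close to zero
```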
Back to the likelihood business:
$$ \begin{aligned} \frac{\partial l(\vec{w} | \mathbf{X})}{\partial w_j} &= \sum\nolimits_i^n \frac{y_i}{\hat{p}}\frac{\partial \hat{p}}{\partial z}\frac{\partial z}{\partial w_j} + \frac{y_i-1}{1-\hat{p}}\frac{\partial \hat{p}}{\partial z}\frac{\partial z}{\partial w_j} \cr &= \sum\nolimits_i^n \frac{y_i}{\hat{p}}\hat{p}(1-\hat{p})\frac{\partial z}{\partial w_j} + \frac{y_i-1}{1-\hat{p}}\hat{p}(1-\hat{p})\frac{\partial z}{\partial w_j} \cr &= \sum\nolimits_i^n \frac{y_i}{\hat{p}}\hat{p}(1-\hat{p})x_i^j + \frac{y_i-1}{1-\hat{p}}\hat{p}(1-\hat{p})x_i^j \cr &= \sum\nolimits_i^n y_i(1-\hat{p})x_i^j + (y_i-1)\hat{p}x_i^j \cr &= \sum\nolimits_i^n y_ix_i^j-y_i\hat{p}x_i^j+y_i\hat{p}x_i^j-\hat{p}x_i^j \cr &= \sum\nolimits_i^n y_ix_i^j-\hat{p}x_i^j \cr &= \sum\nolimits_i^n (y_i-\hat{p})x_i^j \cr \end{aligned} $$
There it is! The partial derivative that can save us. Let’s say it aloud once more:
$$ \frac{\partial l(\vec{w} | \mathbf{X})}{\partial w_j} = \sum\nolimits_i^n (y_i - \hat{p}(\vec{x_i},\vec{w}))x_i^j $$
The symbol \(x_i^j\) means the \(j\):th element of the \(i\):th data vector in the dataset. Now, the gradient is as follows:
$$ \nabla_wl(\vec{w} | \mathbf{X}) = \{\frac{\partial l(\vec{w} | \mathbf{X})}{\partial w_1},…,\frac{\partial l(\vec{w} | \mathbf{X})}{\partial w_d}\} $$
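With NumPy, the whole gradient collapses into a single matrix expression: stack the data vectors as rows of \(\mathbf{X}\), and each column then pairs with one partial derivative. A sketch:

```python
import numpy as np

def gradient(w: np.ndarray, X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Gradient of the log-likelihood: element j is sum_i (y_i - p_hat_i) * x_i^j."""
    p_hat = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (y - p_hat)
```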
Now you are ready to train your LR for churn prediction:
1. Get a dataset.
2. Initialize \(\vec{w}\) with random values.
3. Train the LR:
    - (3a) Calculate \(\nabla_wl(\vec{w} | \mathbf{X})\).
    - (3b) Update \(\vec{w} = \vec{w} + \alpha\nabla_wl(\vec{w} | \mathbf{X})\).
    - (3c) If the likelihood function \(l(\vec{w} | \mathbf{X})\) increased, go to 3a. If not, set \(\vec{w}_*=\vec{w}\) and go to 4.
4. Use your LR to make churn predictions as \(\hat{p}(\vec{x},\vec{w}_*) = \sigma(\vec{x} \cdot \vec{w}_*)\)
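Putting the whole recipe together, here is a minimal from-scratch training sketch in NumPy. The toy data, the random seed, the convergence tolerance and the default learning rate are all assumptions on my part:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, X, y):
    p_hat = sigmoid(X @ w)
    return np.sum(y * np.log(p_hat) + (1.0 - y) * np.log(1.0 - p_hat))

def train_lr(X, y, alpha=None, tol=1e-8, max_iter=100_000):
    n, d = X.shape
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=d)        # step 2: random initialization
    if alpha is None:
        alpha = 0.05 / n                      # the starting-point learning rate
    ll = log_likelihood(w, X, y)
    for _ in range(max_iter):                 # step 3: gradient ascent
        w = w + alpha * (X.T @ (y - sigmoid(X @ w)))  # steps 3a and 3b
        new_ll = log_likelihood(w, X, y)
        if new_ll - ll <= tol:                # step 3c: stop when no improvement
            break
        ll = new_ll
    return w

# Toy data (made up): two features per customer plus a constant bias column.
X = np.array([[1.0, 2.0, 1.0], [1.0, 0.5, 1.0], [4.0, 3.0, 1.0], [5.0, 2.5, 1.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w_star = train_lr(X, y)
print(sigmoid(X @ w_star))                    # step 4: churn predictions
```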
Conclusion
LR is worth learning since it covers multiple topics. LR is also relevant and useful in practice. I skipped some topics that might be helpful to know. Nevertheless, you are ready to implement the first version of the LR for churn prediction.
A coder can import scikit-learn in Python and use the provided LR without any knowledge of the previously explained stuff. That’s fine until it’s not. For example, the gradient ascent of the likelihood function might not converge properly. If you understand the concepts behind LR, then you might have an intuition of how to fix the problem. Otherwise, you have to guess what to ask on Stack Overflow.
LR has some direct connections to more complex models:
- If you Kaggle, then you might be aware that the sigmoid function is available in XGBoost for binary classification.
- If you think in neural networks, then LR is a neural network with one sigmoid-activated unit. LR is also often the last transformation in deep neural networks for binary classification. The deep neural network tries to unfold the data before it arrives at the LR, which throws a linear decision boundary at the data. The transformations find a data representation that is linearly separable for the LR. Therefore, the purpose of the millions and billions of parameters is to serve the LR by finding a data transformation that the mighty LR can comprehend, also known as feature learning. I find it funny; it’s like building a simple one-button joystick to fly a massive passenger airplane. Anyway.
Further topics, exercises and tasks for you:
- How would you measure the accuracy of your LR as a churn predictor for unseen data vectors? What is overfitting? What is cross-validation?
- What is regularization? What is the difference of \(l_1\) and \(l_2\) regularization? Derive the gradient \(\nabla_wl(\vec{w} | \mathbf{X})\) when the likelihood function is re-defined as \(l(\vec{w} | \mathbf{X}) - l_2(\vec{w})\).
- Check out softmax function for generalizing LR to have a categorical output. Derive the gradient for this softmax multiclass LR. What is the intuition behind naming softmax as softmax?
- There are arguably more efficient optimization algorithms than plain gradient ascent. Find some of them.
- What is the difference when a parameter vector is at a global optimum or at a local optimum?
- What is a greedy algorithm? Why is gradient ascent a greedy algorithm?
- Why do you not need to worry about ending up at a local optimum with LR, even when using gradient ascent, a greedy algorithm?
- Implement the LR evaluation and training from scratch in Python. Utilize some public dataset. Use NumPy for vectorizing the computations.
- What is entropy in information theory? How is \(l(\vec{w} | \mathbf{X})\) related to entropy and what is the intuition? Why is the negative of \(l(\vec{w} | \mathbf{X})\) often called binary cross-entropy loss?
- What is the connection of \(L(\vec{w} | \mathbf{X})\) and Bernoulli distribution?
Peace and harmony.