Bayes Classifier (with Python code)

Negar Khojasteh
2 min read · Jan 26, 2020

Until last year I didn’t take machine learning (ML) classes because I always thought that only students with a deep background in mathematics and statistics could do ML and data science. During my PhD, I had to take a computational methods course as a requirement; I absolutely loved the class and realized that anyone can learn some ML and data science algorithms with practice. I shared my excitement with some friends and realized that a few blog posts might be a useful resource. This post helps you start and write your very first ML code, even if you’ve never done it before.

I’m not an expert, and this series won’t be perfect, of course. I won’t go into too much detail, but for each part I’ll cover the concepts and basics you’d need to understand the method and the code.

Where to start?

Python can be a bit scary if you don’t have any experience. Jupyter notebooks make the whole thing much easier because you get to see the output of each segment of your code separately, which makes debugging much simpler. (After installing, open a terminal on your Mac and type “jupyter notebook” to begin.)

Concepts:

  • probability & counting
  • permutations & combinations
  • conditional probability (to learn Bayes)
  • Bayes’ rule or theorem (and Naive Bayes algorithm) which could be used as a simple classifier
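
To make Bayes’ rule concrete, here is a tiny worked example in Python. The numbers are made up for illustration: assume 40% of emails are spam, and a particular word appears in 60% of spam emails but only 5% of non-spam ones.

```python
# Hypothetical numbers for illustration only.
p_spam = 0.40             # prior: P(spam)
p_word_given_spam = 0.60  # likelihood: P(word | spam)
p_word_given_ham = 0.05   # likelihood: P(word | non-spam)

# Total probability of seeing the word in any email (law of total probability)
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' rule: P(spam | word) = P(word | spam) * P(spam) / P(word)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))  # → 0.889
```

So even with these rough numbers, seeing that one word pushes the probability of spam from 40% up to about 89%.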

Learning the Basics:

  • test set vs. train set
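
The train/test split itself is simple: shuffle your labeled data and hold some of it back for evaluation. Here is a minimal sketch with a made-up toy dataset, assuming an 80/20 split.

```python
import random

# Toy labeled data: (email text, label), where 1 = spam, 0 = non-spam.
emails = [("win money now", 1), ("meeting at noon", 0),
          ("free offer inside", 1), ("lunch tomorrow?", 0),
          ("cheap pills", 1), ("project update", 0),
          ("claim your prize", 1), ("see you friday", 0),
          ("urgent offer", 1), ("notes attached", 0)]

random.seed(0)        # fixed seed so the split is reproducible
random.shuffle(emails)

split = int(0.8 * len(emails))       # 80% train / 20% test
train_set, test_set = emails[:split], emails[split:]
print(len(train_set), len(test_set))  # → 8 2
```

The model only ever sees the train set; the held-out test set tells you how well it generalizes to emails it has never seen.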

Application:

One application of Naive Bayes is binary classification. For example, based on the probability of a set of words appearing in spam and non-spam emails (the train set), a Naive Bayes algorithm can classify new emails (the test set). Naive Bayes makes a key assumption of independence, though (hence “naive”): we assume that any two words in our set occur independently of each other, which is not exactly true in practice. However, the algorithm still works relatively well, since we are not interested in exact probability values; instead, we set a threshold (e.g., 0.9) and classify emails that score above it as spam and those below it as non-spam.
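The idea above can be sketched in a few lines of plain Python. This is not the notebook’s code, just a minimal illustration with a made-up toy training set; it counts word frequencies per class, applies Laplace (add-one) smoothing, multiplies the per-word probabilities in log space, and compares the resulting posterior to a threshold.

```python
from collections import Counter
import math

# Toy training data: (text, label), where 1 = spam, 0 = non-spam.
train = [("win money now", 1), ("free offer win", 1),
         ("claim free prize", 1), ("meeting at noon", 0),
         ("project notes attached", 0), ("see you at lunch", 0)]

spam_words = Counter(w for text, y in train if y == 1 for w in text.split())
ham_words = Counter(w for text, y in train if y == 0 for w in text.split())
vocab = set(spam_words) | set(ham_words)

p_spam = sum(1 for _, y in train if y == 1) / len(train)  # prior P(spam)

def word_prob(word, counts):
    # Laplace (add-one) smoothing so unseen words don't zero out the product
    return (counts[word] + 1) / (sum(counts.values()) + len(vocab))

def spam_score(text):
    # Work in log space to avoid underflow when multiplying many
    # small probabilities; "naive" independence lets us just add logs.
    log_spam = math.log(p_spam)
    log_ham = math.log(1 - p_spam)
    for w in text.split():
        log_spam += math.log(word_prob(w, spam_words))
        log_ham += math.log(word_prob(w, ham_words))
    # Convert back to the posterior P(spam | text)
    return 1 / (1 + math.exp(log_ham - log_spam))

THRESHOLD = 0.9  # the kind of cutoff mentioned above
print(spam_score("free money offer") > THRESHOLD)     # → True
print(spam_score("lunch meeting notes") > THRESHOLD)  # → False
```

Note how the smoothing and the log-space trick are both practical necessities rather than part of the theory: without them, a single word unseen in training would make the whole product zero, and long emails would underflow floating point.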

Based on what I learned in class, I’ve written a simple program that classifies spam vs. non-spam emails.

See my notebook here

I got the data from here. The associated blog post is here.

Note: I got permission from the professor who teaches this class at Cornell to write the codes that are inspired by his class.

Raw data from:

https://medium.com/analytics-vidhya/building-a-spam-filter-from-scratch-using-machine-learning-fc58b178ea56

https://github.com/Gago993/SpamFilterMachineLearning
