Machine Learning to Identify the Author of an Email

Welcome to our first post about machine learning! It's on the topic of Naive Bayes. We're going to do something that may sound easy, but we think it's a really interesting application of machine learning: identifying the author of an email.

So we have a bunch of emails sent from Brainattica, and the question is: can we identify the author of a new, unseen email?

Our goal is to classify the sender based only on the text of the email. So, let's get started!

Features and Labels

In machine learning we often take features as input and try to produce labels. Let me explain what they are.

We're going to use music as an example. If we take a song, we have to extract what are called features. These might be things such as:

  1. Intensity
  2. Tempo of the song
  3. Genre
  4. Voice gender

After that, our brain processes them into one of two categories (labels):

  1. Like
  2. Don't like.

We're going to produce what's called a scatter plot. For simplicity, we'll use just two features: tempo and intensity.

Imagine we have different songs (each a data point on the chart) whose tempo and intensity vary according to the chart. Then suppose the green data points are the songs that you like.
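A chart like this can be produced with just a few lines of matplotlib. The songs, tempos, and intensities below are made-up values purely for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# hypothetical songs: (tempo in BPM, intensity, whether you like it)
songs = [(120, 0.8, True), (128, 0.9, True), (70, 0.3, False), (62, 0.2, False)]

tempos = [tempo for tempo, _, _ in songs]
intensities = [intensity for _, intensity, _ in songs]
# green for songs you like, red for songs you don't
colors = ["green" if liked else "red" for _, _, liked in songs]

plt.scatter(tempos, intensities, c=colors)
plt.xlabel("Tempo (BPM)")
plt.ylabel("Intensity")
plt.savefig("songs.png")
```

The resulting image shows the liked songs clustering in one region of the tempo/intensity plane, which is exactly what a classifier will exploit.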

Now, given a new song, we can classify it into one of these categories:

  1. Like
  2. Don't like.

Naive Bayes

The Naive Bayes classifier is a supervised learning algorithm based on applying Bayes' theorem with the "naive" assumption of independence between every pair of features.

One particular strength of Naive Bayes is that it works well for text classification. When dealing with text, it's very common to treat each unique word as a feature, and since the typical person's vocabulary is many thousands of words, this makes for a large number of features. The relative simplicity of the algorithm and the independent-features assumption make Naive Bayes a strong performer for classifying texts.
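To make this concrete, here is a tiny sketch of Naive Bayes text classification using word counts. The emails and labels are toy data invented for this example, and we use scikit-learn's MultinomialNB, a Naive Bayes variant commonly paired with word counts (later in this post we use GaussianNB on TF-IDF features):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# toy training emails and their authors (0 = first author, 1 = second author)
emails = [
    "pleas send the forego list to richard",
    "enron wholesal servic houston",
    "energi trade counterparti evalu",
    "nobl gas market power energi",
]
authors = [0, 0, 1, 1]

# each unique word becomes one feature (a column of counts)
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(emails)

clf = MultinomialNB()
clf.fit(features, authors)

# classify a new, unseen email
new_email = vectorizer.transform(["pleas send the list"])
print(clf.predict(new_email))  # words overlap with author 0's emails
```

Every word in the new email appears in author 0's training emails and none in author 1's, so the classifier predicts author 0.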

For our task, the words of each email are the features, and the authors are the labels.

For example, our labels are:

  1. Enrique has label 0.
  2. Juan has label 1.

An example of the list of features could be:

[' sbaile2 nonprivilegedpst susan pleas send the forego list to richard thank   enron wholesal servic 1400 smith street eb3801a houston tx 77002 ph 713 8535620 fax 713 6463490', ' sbaile2 nonprivilegedpst 1 txu energi trade compani 2 bp capit energi fund lp may be subject to mutual termin 2 nobl gas market inc 3 puget sound energi inc 4 virginia power energi market inc 5 t boon picken may be subject to mutual termin 5 neumin product co 6 sodra skogsagarna ek for probabl an ectric counterparti 6 texaco natur gas inc may be book incorrect for texaco inc financi trade 7 ace capit re oversea ltd 8 nevada power compani 9 prior energi corpor 10 select energi inc origin messag from tweed sheila sent thursday januari 31 2002 310 pm to   subject pleas send me the name of the 10 counterparti that we are evalu thank']

We will start by giving you a list of strings. Each string is the text of an email that has undergone some basic preprocessing; we will then provide the code to split the dataset into training and testing sets. (In future posts you'll learn how to do this preprocessing and splitting yourself, but for now we'll give you the code.)

The preprocess function takes a pre-made list of email texts (by default word_data.pkl) and the corresponding authors (email_authors.pkl) and performs a number of preprocessing steps:

  1. Splits the data into training/testing sets (10% testing).
  2. Vectorizes the text into a TF-IDF matrix.
  3. Selects and keeps the most helpful features.

After this, the features and labels are put into numpy arrays, which play nicely with sklearn functions.
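The steps above could be sketched as follows. The toy email texts and labels are stand-ins for the real pickled data, and the vectorizer/selector parameters are just one reasonable choice, not necessarily the ones used in the repo:

```python
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif

# toy stand-ins for the real pickled email texts and author labels
word_data = [
    "pleas send the trade list to richard",
    "energi counterparti evalu for houston",
    "send the fax to susan at enron",
    "gas market inc power energi trade",
    "pleas send the forego report today",
    "nevada power compani energi corpor",
    "send richard the wholesal list",
    "puget sound energi trade counterparti",
    "thank susan pleas send the street address",
    "texaco natur gas financi trade inc",
]
authors = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

# 1. split into training/testing sets (10% testing)
features_train, features_test, labels_train, labels_test = train_test_split(
    word_data, authors, test_size=0.1, random_state=42)

# 2. vectorize into a TF-IDF matrix (fit on training data only)
vectorizer = TfidfVectorizer()
features_train = vectorizer.fit_transform(features_train)
features_test = vectorizer.transform(features_test)

# 3. keep only the highest-scoring features
selector = SelectPercentile(f_classif, percentile=50)
selector.fit(features_train, labels_train)
features_train = selector.transform(features_train).toarray()
features_test = selector.transform(features_test).toarray()

print(features_train.shape, features_test.shape)
```

Note that the vectorizer and selector are fitted on the training set only and merely applied to the test set; fitting them on the test data would leak information into the model.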

To get good results, we set the test size to 10% of the dataset. To split the dataset into a training set and a test set, we use the train_test_split function from scikit-learn's cross_validation module.

from sklearn import cross_validation

features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(word_data, authors, test_size=0.1, random_state=42)  

features_train and labels_train are the datasets we are going to use to train our learning model.

features_test is the dataset whose labels we are going to try to predict.

Having done that, we are going to train our model:

from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()  
clf.fit(features_train, labels_train)  

Now the last step: how can we predict, or identify, the author of other emails with our trained model?

prediction = clf.predict(features_test)  
print("Prediction: {}".format(prediction))  

We can calculate the accuracy of our prediction too:

accuracy = clf.score(features_test, labels_test)  
print("Accuracy: {}".format(accuracy))  
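Putting the training, prediction, and scoring steps together on a toy dataset (the feature values and labels below are made up, simply to make the example self-contained and runnable):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# hypothetical dense features: each row is an email, each column a word score
features_train = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
labels_train = np.array([0, 0, 1, 1])  # 0 = Enrique, 1 = Juan
features_test = np.array([[0.85, 0.15], [0.15, 0.85]])
labels_test = np.array([0, 1])

# train the model
clf = GaussianNB()
clf.fit(features_train, labels_train)

# predict the authors of the unseen emails and score the model
prediction = clf.predict(features_test)
accuracy = clf.score(features_test, labels_test)

print("Prediction: {}".format(prediction))
print("Accuracy: {}".format(accuracy))
```

On this tiny, cleanly separated dataset the classifier recovers both authors, so the accuracy is 1.0; on real email data you should expect something lower.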

Of course, here is the repo with all the code:

I hope you found this post useful and fun.
Have more fun!
