A name gender classifier

Something I've needed to do a couple of times is take a long list of names and classify them into male and female. For example, I've looked at lists of people who attended events to see whether they were reaching more men or women - this then helps target future events.

Note: there are lots of reasons why this is a terrible idea. For a start, gender is more complicated than male/female, and there are lots of reasons why a name is a bad guide to someone's gender (and other things programmers wrongly believe about names). But I've justified it here on the basis that I'm not trying to get every individual's gender exactly right, just to get a general idea of the split across a population.

This is a good place to use a naïve Bayesian classifier. The best-known example of this technique is a spam classifier - this works out whether an email is spam by comparing it to a model generated from a training data set. The "naïve" part comes from the fact that it treats each word that appears in the training data set as an independent piece of evidence, without considering the meaning of or connections between those words.
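To make the idea concrete, here is a minimal from-scratch sketch of a naïve Bayes classifier in plain Python. The function names (`train`, `classify`) and the toy emails are made up for illustration, and the add-one smoothing is one common choice rather than anything specific to this post.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Count labels and words from a list of (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary

def classify(text, model):
    """Return the label with the highest log-probability score."""
    label_counts, word_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Start from the log prior: how common is this label overall?
        score = math.log(label_counts[label] / total_examples)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # The "naive" step: each word contributes independently.
            # Add-one smoothing stops unseen words zeroing the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up toy data, purely for illustration.
toy_emails = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting agenda attached", "ham"),
    ("lunch meeting tomorrow", "ham"),
]
model = train(toy_emails)
prediction = classify("money offer now", model)
```

Swapping emails for names and spam/ham for female/male gives the same machinery used in the rest of this post.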

To do this I'm going to use Python (3.6 to be exact) and the scikit-learn library, which has a naïve Bayesian classifier built in. It's based on the tutorial found here.
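As a rough sketch of what the scikit-learn side could look like: the snippet below pairs `CountVectorizer` with `MultinomialNB` in a pipeline. Choosing those two classes is my assumption (the tutorial may do it differently), and the names and labels are invented placeholders, not the real training data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented placeholder names, not drawn from the real training data.
names = [
    "mary smith", "jane jones", "susan brown",
    "john smith", "peter jones", "david brown",
]
genders = ["female", "female", "female", "male", "male", "male"]

# CountVectorizer turns each name into word counts;
# MultinomialNB is a naive Bayes classifier that works on those counts.
classifier = make_pipeline(CountVectorizer(), MultinomialNB())
classifier.fit(names, genders)

predictions = classifier.predict(["mary brown", "john jones"])
```

Because the surnames here appear equally often under both labels, the first names end up deciding the prediction.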

Step 1: getting some training data

The training data needs to be a list of names with their genders. I've found this a tricky thing to come by, but luckily my work with the Charity Commission Register of Charities throws up a good source.

The large data extract they publish, under an open license, contains a list of trustees for each charity. It's just a list of names - they hold information like address, date of birth, etc but don't publish it for data protection reasons - but it's possible to work out the gender of a proportion of the names.

Out of the X,000 names, X,000 (X%) include a title (Mrs, Miss, Ms, Mr). Those four titles are gender-specific and cover X% of the titles used. I wrote a Python script to parse the name CSV file, look for names that start with Mr, Mrs, etc. and put them into a male list or a female list. This produces lists of X male names and X female names.
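A sketch of how that split might work is below. The column name `trustee_name`, the helper `split_by_title` and the sample rows are all hypothetical; the real extract's layout may well differ, so treat this as an outline rather than the actual script.

```python
import csv
import io

# The four gender-specific titles; anything else (Dr, Rev, ...) is skipped.
TITLE_GENDER = {"mr": "male", "mrs": "female", "miss": "female", "ms": "female"}

def split_by_title(rows, name_column="trustee_name"):
    """Return (male_names, female_names) for rows that start with a known title."""
    male, female = [], []
    for row in rows:
        # Split off the first word, which may or may not be a title.
        parts = row[name_column].strip().split(None, 1)
        if len(parts) < 2:
            continue  # a single word, so no title plus name to work with
        title = parts[0].rstrip(".").lower()
        gender = TITLE_GENDER.get(title)
        if gender == "male":
            male.append(parts[1])
        elif gender == "female":
            female.append(parts[1])
    return male, female

# Invented sample standing in for the real CSV extract.
sample_csv = io.StringIO(
    "trustee_name\n"
    "Mr John Smith\n"
    "Mrs Mary Jones\n"
    "Dr Alex Brown\n"
    "Ms. Susan Green\n"
    "Peter White\n"
)
male_names, female_names = split_by_title(csv.DictReader(sample_csv))
```

The title itself is dropped from the stored name, so the classifier later sees only the name and can't simply learn "Mr means male".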

A couple of notes here. Like any training data set, this is not unbiased. Charity trustees represent a particular slice of the population - people from all walks of life are trustees, but the average trustee is older, whiter and richer than the average person. So the names we're using are going to reflect those demographics.

Note too that I'm not doing anything to separate out first names and last names - I'm just using the whole string. This means that last names will be included in the model - with the hope that they'll be equally distributed between the male and female lists and so won't make a difference to the final total. But this isn't a guarantee.

Step 2: training the model

Step 3: testing the model

Step 4: using the model