Part I - add charity numbers
findthatcharity.uk is a site I've created to help access data about charities. One of the ways you can use it is to add details about a list of charities if you only know their names.
This is often called "reconciliation" - turning a list of names into something that unambiguously identifies the charity. Charities are often known by different names, or have variations like "Ltd" rather than "Limited", so by adding charity numbers we can make sure we're talking about the same charity whether we say "Comic Relief" or "Charity Projects".
In this post I'm going to show you how to use OpenRefine and findthatcharity.uk to turn charity names into numbers. In the next part, I'll show how to use charity numbers to fetch extra information about a charity - like their web address or latest income.
1. Get Open Refine
OpenRefine is a great data cleaning/wrangling tool. It was originally started by Google (and so is sometimes still called Google Refine), but is now run by a team of volunteer developers.
You can find the download for your platform on the download page.
Note that if you're a Windows user you won't "install" it in the same way you might for
other programs - instead you extract the zip file into a directory and then run
from that directory.
Again, OpenRefine might be a bit different from other programs you've seen in that it doesn't open in it's own window - instead it has a "server" that runs in the background, and you interact with it through a web browser. The address you visit (normally http://localhost:3333) is only accessible through your computer though - other parts of the internet can't see it.
2. Load your data
When you first open OpenRefine you'll see a page that looks a bit like this:
You can follow the instructions for importing your data. OpenRefine has some very flexible options for importing data. The important thing about your data is that it should be a "flat file" - a simple table with rows and columns. You might need to use another program (like Excel) to get your data in this format first.
I've chosen to use some data published by the Cheshire Community Foundation [xlsx - 57Kb] on the grants they've published. This data is published to the 360 Giving grant data standard and I found it in the 360 Giving data registry. It already includes charity numbers so we wouldn't need to do this reconciliation process, but it's useful to have them there to check how good the reconiliation process is.
I can point OpenRefine at this URL and it will download and load it itself. Here's what the
data looks like once it's been imported. I've hidden the
description field to get more
rows on the screen.
3. Reconcile charity names
We're going to be using the
Recipient Org:Name field to reconcile - this contains the
name of the organisation that received the grant. Click on the arrow next to the column
title to bring up the menu, and click
Reconcile > Start reconciling....
Now you need to add the Find that Charity reconciliation service. This is a web service that will recieve data from OpenRefine (in this case the name of a charity) and tries to match it to an entity. It returns a list of possible matches, and allows us to extract the ID of each match (in this case the charity number).
Add Standard Service... and then paste
in the textbox, and click
Add Service. This will then (possibly after a short delay)
add a service called "charitysearch".
You'll see a list with just "charity" on the left hand side (you might need to hide the sidebar). You can leave the default options (at the moment findthatcharity doesn't take any additional properties to help matching, but this might be something we add later).
Start Reconciling to make the magic happen. This might take a while if you
have lots of records. OpenRefine will batch up charities and send a few at a time
to findthatcharity, but the server can only cope with so many requests at a time.
If you need to do lots of requests (>5,000 in one go), speak to me directly.
You'll see a counter with the percentage complete at the top, and when finished each field will show the matches that have been made.
4. Check the matches
You now need to check the matches. Depending on your time and the number of records, there's a few options for this.
Manually check every record
You can manually check each record - just click on the single tick next to the correct match. You can also click the box with two ticks if you'd like every instance of this record to be matched to this entity.
Match all records to the best result
If you trust the algorithm, or have done some manual checking and think they're all correct, you
can just take the best match for each record. To do this, click the arrow menu next to the
Recipient Org:Name column heading, and select
Reconcile > Actions > Match each cell to its best candidate.
Filter and match
You can also combine these two approaches using filters. On the left hand side of the screen, OpenRefine has added two "facets" based on the results of the reconciliation.
judgement facet at the top shows the overall result - either
(blank) where no matches
have been found,
none where there are some matches or
matched where a single match has
been selected (some records are marked as
matched already if findthatcharity thinks they
are an exact match).
best candidate's score facet shows the score of the best match candidate, and allows you
to select different numbers by dragging the range sliders. You can also remove
these are the ones with no match.
You can drag the score fact to select only those with high matching scores - this will filter
to show only those rows with the selected scores. The
Match each cell to its best candidate
action also respects this filter, so you can use that to match all the records found by the
filter, with the remainder remaining unmatched.
5. Add charity number
Once you're happy with the matches you can add a new column with the matched charity number in.
To do this, use the arrow menu next to the column heading, and click
Edit column > Add column based on this column...
The box that pops up allows you to specify lots of different columns. Add a name for your column
at the top ("Charity Number" would make sense), and then add
cell.recon.match.id in the "Expression"
box. This will look at the cell reconcilation object, and find the
id of the matched candidate.
OK to add this column.
That's it for part one - hopefully you've now got a charity number for all the charities in your data. In the next part I'll take a look at using the charity number to add more details about the charity, like it's web address or latest income.