Saturday, April 20, 2013

Is the Programming Historian 2 a MOOC?

'Evil Robot' by Jennifer Morrow (cc-by)
A few months ago I was asked if the Programming Historian 2 is a MOOC. For the uninitiated, a MOOC is a Massive Open Online Course. They’ve been popping up online for the past couple of years, principally at major American universities like MIT and Stanford, claiming to be able to teach thousands or even hundreds of thousands of students at the same time – for free. They’ve so far had mixed results but it seems most people in academia have an opinion on them – either, meh it’s a fad, damn we gotta get one of those at our school, or the robots have come for our jobs! Defend! Defend!

I can’t speak for the other editors of the Programming Historian 2 (PH2). But I can say: No. I don’t think the PH2 is a MOOC. If you haven’t found us yet, the PH2 is an open access series of tutorials designed to let humanities researchers get their toes wet with computer programming. The lessons involve learning simple programming tasks that are immediately useful to ordinary working humanists. That might be automatically downloading historical records from the Internet, or analyzing a collection of sources with topic modeling. All of the lessons are online – like a MOOC – and there is no teacher in the room with you – like a MOOC.

So why no MOOC? For me, what sets a MOOC apart from a classroom-based course is a belief that the tutor-tutee relationship can be depersonalized and made redundant. MOOCs replace this relationship with a series of steps. If you learn the steps in the right order and engage actively with the material, you learn what you need to know, and who needs a teacher?

I don’t think that’s what we’re about. Instead, some of the most exciting feedback we’ve got at the PH2 has been from academics who have used the PH2 as a teaching tool in their classroom. Either they’ve assigned lessons for their students to work through, they’ve challenged students to write lessons of their own, or they’ve used the PH2 to teach themselves a skill that they can then pass along to their students.

That’s not to say you can’t use the PH2 to teach yourself some programming if you haven’t got a teacher. It’s to say the PH2 is not the evil robot looking to take your job away. It’s the friendly robot looking to give your teaching toolkit a few more options, and maybe a new skill or two with which to impress your friends and colleagues. Not unlike a book. And books haven’t put literature professors out of a job, but they have made English lit courses more interesting.

Monday, April 15, 2013

Trust Me: The Old Bailey Online as a model for digitization projects

The Old Bailey Online (OBO) turned 10 years old this week, and to celebrate, Sharon Howard has been encouraging blog posts and tweets from the project's wide network of contributors. I thought I'd add just a few brief thoughts on what I like about the OBO, and why I avoid so many other competing digitization projects. Rather than explain what the OBO is, I thought I'd save time and steal the explanation from their own website:
A fully searchable edition of the largest body of texts detailing the lives of non-elite people ever published, containing 197,745 criminal trials held at London's central criminal court.
The trials run from 1678 to 1914, making it a great resource for social historians or historians of crime. I broadly fit into both of those categories, but what really interests me is knowledge management. I want to know how we can extract useful knowledge from bodies of text far larger than we could ever read in our lifetime. I'm interested in the historical research questions I pursue, but I'm more interested in the processes of understanding and discovery that the pursuing of those questions lets me explore. That is to say: I'm more interested in how we can know something than what we find out. This all means I have slightly different criteria for a good resource than does a typical historian. When I'm planning a project I'm not looking for 'gaps in the literature'. Instead, I'm really only looking for 2 things:
  1. A corpus of downloadable electronic text
  2. A corpus that does not assume I want to read anything

1) A Corpus of Electronic Text

At the moment my work is almost exclusively based on textual analysis. By that I mean I work with words rather than sounds or images or smells or physical objects. I want to know what human knowledge is contained in the symbols on pages. That means for me the best thing you can give me is a good clean set of electronic text. The Old Bailey Online does this beautifully - better than just about anyone else actually - by providing more than a hundred million words of transcription. Most important: the OBO is entirely downloadable. That means I can put it on my own computer and I can measure it, twist it around, write programs to analyse it, use other people's programs...anything I like. No one is going to threaten to sue me or press criminal charges for downloading the records. And best of all, once I have the records I don't have to read them. Because that's not the focus of what I do.

2) A Corpus That Does Not Assume I Want to Read Anything

I'm certainly not one to suggest reading is obsolete, or that historians should stop going to the archives. But I'm always disheartened to see new scholarly - usually commercial - databases come online that only allow reading. I'm talking about the ones that cost an arm and a leg to university libraries, let you keyword search, but then force you to read a scanned copy of the original while hiding the electronic text layer.

I find these projects infuriating, and would rather pretend they don't exist than struggle to find a research question that's appropriate for their limited interface. The thing that bothers me most about these gated resources is that the publishers who create them are implicitly saying: we don't trust you. They don't trust us because the only thing they possess that allows them to sell their product is the electronic text. That's the part of the project that cost the most and took the longest to create. They think if that starts floating around on the Internet they won't be able to make money anymore.

The OBO is different because it's non-commercial. The OBO trusts us and encourages anyone interested to use the records to explore human knowledge in any way they see fit. For some that means sitting down and reading from digital copies of the original source. For others like me, it means downloading the entire corpus and measuring the rates of transcription errors, the impact of courtroom reporters on the vocabulary used in the records, or the pace of migration in eighteenth-century London.

The OBO and its team have trusted us. And from that has poured forth far more research about early modern crime in London than anyone ever could have imagined. Perhaps more research than we need. Meanwhile, researchers like myself continue to ignore the large commercial databases that lock up access to their resources, and hope intently that their publishers will learn from what is still the best online scholarly database I've worked with. We're starting to see steps forward from some (see the Library of Wales' Newspaper Collection for a good example), but overall there's room to improve.

Until we see a shift away from mandated reading, I'll stick to resources like the OBO. So happy birthday to the OBO and cheers to the project team for trusting us. I hope it's paid off.

Wednesday, April 3, 2013

Programming Historian 2 Lessons I'd Like to See

I've been actively part of the Programming Historian 2 team for the past two years and I've been really pleased to see so many people using and learning from the site, including a number of university courses. I learned to write Python code from the original Programming Historian, and I still regularly reference skills and techniques found in the lessons in my day-to-day research.

My role as an editor of the project means I help guide lessons contributed by others through peer review and editing. I'm also always looking around the blogosphere for people working on cool new techniques or writing guides of their own that I think would be useful for practicing historians. For the most part this is a passive process. I sit, I wait, and I watch. But every once in a while I come across something I'd really like to see. So rather than wait, I thought I'd post my personal wish list of Programming Historian 2 lessons I'd like you to write for all of us.

In no particular order:

  • How do you turn a spreadsheet into a database and write custom queries? The jump from an Excel spreadsheet, which you can see, to a MySQL or sqlite3 database, which you can't, is not an easy one. A lesson on making this leap would, I imagine, be well received and widely used.
  • What the heck do you do with topic models? The entire digital humanities world seems fixated on topic models these days. Our most popular lesson by far is a tutorial on Getting Started with Topic Modeling and MALLET. But what are the cool things we can do once we HAVE generated topic models? What can we know? How do we use it responsibly? How do I interpret all these numbers and topics?
  • What can we do with our sources once they've been downloaded? I see so many people using programming to curate sources, but far fewer people asking historical questions of their sources using programming. What are some of the ways we can actually answer questions about the past with programming?
I'd be very happy to hear from anyone who'd like to take on these challenges and create a Programming Historian 2 lesson, or from anyone with an idea of their own they think others could benefit from. Check out our submission guidelines and be in touch.
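
To give a flavour of the first item on the list, here is a minimal sketch in Python of the spreadsheet-to-database jump, using only the standard library's csv and sqlite3 modules. It is not a finished lesson, just one way the leap might look. The table name, column names, and the three rows of data are all invented for illustration; a real spreadsheet would be exported to CSV and opened as a file instead of the inline text used here.

```python
import csv
import io
import sqlite3

# Invented sample data standing in for a spreadsheet exported to CSV.
# The column names and rows are hypothetical, not real trial records.
CSV_TEXT = """name,year,offence
Mary Smith,1780,theft
John Brown,1781,forgery
Ann Jones,1780,theft
"""


def load_rows(conn, csv_file):
    """Create a 'trials' table and copy each spreadsheet row into it."""
    conn.execute("CREATE TABLE trials (name TEXT, year INTEGER, offence TEXT)")
    reader = csv.DictReader(csv_file)
    rows = [(r["name"], int(r["year"]), r["offence"]) for r in reader]
    # Placeholders (?) let sqlite3 handle quoting of the values safely.
    conn.executemany("INSERT INTO trials VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def count_by_offence(conn, year):
    """A custom query: how many trials of each offence in a given year?"""
    cur = conn.execute(
        "SELECT offence, COUNT(*) FROM trials "
        "WHERE year = ? GROUP BY offence ORDER BY offence",
        (year,),
    )
    return cur.fetchall()


# An in-memory database keeps the sketch self-contained; pass a filename
# to sqlite3.connect() instead to keep the database on disk.
conn = sqlite3.connect(":memory:")
n_loaded = load_rows(conn, io.StringIO(CSV_TEXT))
print(n_loaded)                      # number of rows copied in
print(count_by_offence(conn, 1780))  # offences counted for one year
```

The point of the exercise is the last function: once the rows are in a database, a question like "how many of each offence in 1780?" is a single query rather than an afternoon of sorting and filtering by hand.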