Visualizing Locations in the Internet Archive .ca Wide Scrape Sample


Taking the full text of my sample of Canadian (.ca only) websites (currently being refined, it amounts to 622,365 URLs out of a scrape total of 8,512,275, or 7.31%), I ran it through Stanford NER and extracted the most frequently mentioned locations, organizations, and people. This was a morning’s work, done mainly while I let my desktop crunch away at some other tasks, so I should preface the post by noting that the data has not been cleaned up.
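For readers curious about the mechanics, something like the following Python sketch, using NLTK’s wrapper around Stanford NER, would produce the kind of location counts discussed below. The model, jar, and input filenames are placeholders rather than the actual setup, and the real run of course covered hundreds of thousands of pages rather than a single text file.

from collections import Counter

from nltk import word_tokenize
from nltk.tag.stanford import StanfordNERTagger

# Placeholder paths to a local download of Stanford NER, not the configuration actually used.
tagger = StanfordNERTagger(
    "english.all.3class.distsim.crf.ser.gz",  # LOCATION / ORGANIZATION / PERSON model
    "stanford-ner.jar",
)

# Hypothetical file holding the extracted full text of the sample.
with open("ca_sample_fulltext.txt", encoding="utf-8") as f:
    tokens = word_tokenize(f.read())

# Count every token that the classifier labels as a location.
locations = Counter(
    token for token, label in tagger.tag(tokens) if label == "LOCATION"
)
print(locations.most_common(20))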

The results were interesting but fairly dry: “Canada” was the top location, for example, followed by Ontario, Toronto, Ottawa, Alberta, etc. The United States comes out as under-represented, mainly because its name appears under so many variant spellings (US, u s, America, United States, etc.). There will be a similar issue with the United Kingdom. If this turns into ‘real’ research rather than tinkering, there is, again, a lot of cleaning up to do. But overall, we can get a rough sense of the different countries and how often they appear in this sample.
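That clean-up would amount to folding the variant spellings into one canonical label before counting, along the lines of this sketch (the alias table and the counts are purely illustrative, not figures from the sample):

from collections import Counter

# Illustrative alias table; a real clean-up list would be much longer.
ALIASES = {
    "us": "United States", "u s": "United States", "usa": "United States",
    "america": "United States", "united states": "United States",
    "uk": "United Kingdom", "u k": "United Kingdom",
    "britain": "United Kingdom", "united kingdom": "United Kingdom",
}

def canonical(place):
    return ALIASES.get(place.lower().strip(". "), place)

# Toy counts for demonstration only.
raw = Counter({"Canada": 9000, "United States": 200, "US": 150, "America": 90})
cleaned = Counter()
for place, n in raw.items():
    cleaned[canonical(place)] += n
print(cleaned.most_common())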

Thanks to IBM Many Eyes we can throw this stuff at the wall and see what comes out.

[Screenshot: Many Eyes map visualization of location mentions in the .ca sample]

In this graphic, we see the different countries and how often they are mentioned. With the caveat that this is rough data, the big players emerge. Canada drowns out all the others, which is unsurprising given the sample, but the United States, Russia, China, Brazil, Australia, and Western Europe all show up clearly. There is almost nothing in sub-Saharan Africa, which tells you something about how this sample came together. The way the map emerges suggests to me that the method is working, at least a bit.

Let’s take Canada alone:

Why Canada’s Open Data Initiative Matters to Historians


OK, you’re all forgiven: when you hear ‘open data,’ the first thing that springs to mind probably isn’t a historian (for some historians, it’s the first episode of the BBC show ‘Yes, Minister’). In general, you’d be right: most open data releases tend to deal with scientific, technical, statistical, or other applications (releasing bus route information, for example, or the location of geese on the UW campus). Increasingly, however, we’re beginning to see a trickle of historical open data.

Open government is, in a nutshell, the idea that the people of a country should be able to access, read, and even manipulate the data that their country generates. It is not new to Canada: Statistics Canada has been running the Data Liberation Program since at least late 1996, and there were predecessors before that, but the current government has been pushing an action plan that has materialized in data.gc.ca.

While I am not a fan of the current government’s approach to knowledge more generally, I am happy with the encouraging moves in this realm. Criticism of the government is often very deserved, but we should celebrate good moves when they do happen, however slowly this may occur. Indeed, if the government is opening up their data, maybe it should inspire publicly-funded scholars to do the same: think of what we could learn from the quantitative findings of the Canadians and their Pasts project, for example!

In this post, I want to show some of the potential that is there for learning about the past through Canadian open data (drawing on some of the provincial datasets too), in the hopes that this will spur interest in getting more of it released. I even have a little bit for everybody: there’s data here from which political, military, and social historians can draw. Let me show you how.

Comparing Web Archives by Using Large Numbers of Images


In my last post, I walked people through my thoughts as I explored a large number of images from the Wide Web Scrape (using, as noted there, methods from Lev Manovich). In this post, I want to put up three images and think about how this method might help us as historians. Followers of my research might know that I am also playing around with the GeoCities web archive. GeoCities was arranged into neighbourhoods, from the child-focused EnchantedForest to the Heartland of family and faith or the car enthusiasts of MotorCity. Each neighbourhood was, in some ways, remarkably homogeneous.

Let’s take every JPG from the ‘Athens’ neighbourhood (the teaching/philosophy-focused area of GeoCities) and see what we find.

Exploring 50,000 Images from the Wide Web Scrape, Initial Thoughts


As followers of this blog know, one of my major research activities involves the exploration of the 80TB Wide Web Scrape, a complete scrape of the World Wide Web conducted in 2011 and subsequently released by the Internet Archive. Much of this work to date has involved textual analysis: extracting keywords, running named entity recognition routines, topic modelling, clustering, setting up a search engine on it, etc. One blind spot of my approach, of course, has been that I am dealing primarily with text, whereas the Web is obviously a multimedia experience.

Inspired by Lev Manovich’s work on visualizing images [click here for the Google Doc that explains how to do what I do below], I wondered if we could learn something by extracting images from WARC files. I took the WARC files connected to the highest overall percentage of .ca domain files, drawing on my CDX work, and quickly used unar to decompress them. The files that I drew on were the ten WARC.GZ files from this collection, totalling 10GB compressed or
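For illustration, here is a rough Python equivalent of that extraction step using the warcio library instead of unar and Mathematica; the input filename is a placeholder, and this is a sketch rather than the workflow actually used.

import os
from warcio.archiveiterator import ArchiveIterator

os.makedirs("jpgs", exist_ok=True)
count = 0
with open("example.warc.gz", "rb") as stream:  # placeholder filename
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        content_type = record.http_headers.get_header("Content-Type") or ""
        if "image/jpeg" in content_type:
            # Write each JPEG payload out to its own numbered file.
            with open(os.path.join("jpgs", f"{count:06d}.jpg"), "wb") as out:
                out.write(record.content_stream().read())
            count += 1
print(f"Extracted {count} JPEGs")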

I then used Mathematica to go through the decompressed archives, look for JPGs (as a start; I’ll expand the file-type list later), and transform each image into a 250 x 250 pixel square. As there were 50,680 images, this was a lower resolution than I normally use, but I felt it worked well for this purpose. Using Manovich’s documentation above, I then took these 50,680 images and created a montage of them. Each image was shrunk down even further so the file size would be manageable and so that I wouldn’t have to worry about copyright when I posted it here.
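The montage step itself can be approximated outside Mathematica; here is one way to do it in Python with Pillow, with the thumbnail size and filenames chosen for illustration rather than matching the settings above.

import glob
import math
from PIL import Image

THUMB = 50  # pixels per side; the post worked at 250 x 250 before shrinking further
files = sorted(glob.glob("jpgs/*.jpg"))
cols = max(1, math.ceil(math.sqrt(len(files))))
rows = math.ceil(len(files) / cols)

# Paste each thumbnail into its cell in one large grid image.
montage = Image.new("RGB", (cols * THUMB, rows * THUMB), "white")
for i, path in enumerate(files):
    try:
        im = Image.open(path).convert("RGB").resize((THUMB, THUMB))
    except OSError:  # skip truncated or corrupt images
        continue
    montage.paste(im, ((i % cols) * THUMB, (i // cols) * THUMB))
montage.save("montage.jpg", quality=85)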

Exploring Canada’s Parliamentary History through data.gc.ca: Occupations, 1867-2010


I’m not a political historian, although I did serve as the Secretary-Webmaster of the Political History Group for two years. Today, with my class prepared but not quite in the right frame of mind to look over some manuscript drafts, I decided to play with some of the data that you can find in Canada’s open data repository.

I view this as sort of “putting in the hours,” like a pilot would: I am now so wrapped up in writing and in using off-the-shelf analysis software that it’s good to keep my data analysis and programming skills honed. But I was also thinking: it’s a great example of what, with a bit of computational skill, you can learn in about thirty minutes of unstructured data play. And, like my post from yesterday, my dream with all of this is that somebody will stumble across it and decide that there is some potential for their own work.

In any event, this morning I stumbled across the History of the Federal Electoral Ridings, 1867-2010 dataset and grabbed the English-language CSV file. It’s a big one, containing information on 38,778 candidates for federal office in Canada. It’s a thirteen-column file, with the following entries:
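To give a flavour of what thirty minutes of unstructured data play can look like, here is a minimal pandas sketch. The filename and the “Occupation” column name are guesses for illustration; the real headers come from the CSV itself.

import pandas as pd

df = pd.read_csv("federal-ridings-1867-2010-en.csv")  # placeholder filename
print(df.shape)  # expect roughly (38778, 13)

# Most common self-reported occupations across all candidates
# ("Occupation" is a hypothetical column name).
print(df["Occupation"].str.strip().str.lower().value_counts().head(20))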

Historians Love JSON, or One Quick Example of Why it Rocks


I was looking into the Canadiana Discovery Portal API at the behest of a colleague, and while tweeting my excitement at the results, had another Canadian colleague note that he also loves the JSON format. It made me realize that all of this probably does seem a bit incomprehensible to outsiders. So why should a historian like JSON, and what’s so cool about an API like Canadiana’s?

Note: this post runs on the assumption that you’ve read or are open to such things as the Programming Historian.

In a nutshell, JSON is a format that lets you transmit attributes and values. Say I own three iPhones and one iPad (I don’t); a query about my devices might return results that look like this:

{
"iphones" : "3",
"ipads" : "1"
}
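Reading that record from a program is a one-liner, which is the whole appeal:

import json

record = json.loads('{"iphones": "3", "ipads": "1"}')
print(int(record["iphones"]) + int(record["ipads"]))  # 4 devices in total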

What Canadiana has done is basically let you grab the data that would normally come in HTML format as JSON instead, which makes it really easy for a computer program that you’re writing to talk to it. So if you’re looking for search results relating to Waterloo, you’d get these results if you did it the normal way – and this way if you requested JSON format by appending &fmt=json to the URL.
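A request along those lines might look like the sketch below. The endpoint URL and the field names in the response are assumptions for illustration only; the documentation mentioned in the next paragraph has the real details.

import requests

# Assumed search endpoint and parameters, based on the fmt=json switch described above.
resp = requests.get(
    "http://search.canadiana.ca/search",
    params={"q": "waterloo", "fmt": "json"},
)
resp.raise_for_status()
data = resp.json()

# Field names below are hypothetical; inspect data.keys() to see the real structure.
for doc in data.get("docs", []):
    print(doc.get("label"))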

I won’t reproduce the documentation, which you can grab here, but to me the most exciting facet of this API is that you can request full-text documents via it. It’s a scraper’s dream.

An Example of What We Can Do With Such Things
Enough abstract details. Here’s what I did today to grab a ton of material relating to a specific query.

Herrenhausen Poster: Historians and Web Archives


I meant to post this earlier, but the winter holiday interfered and I decided to actually take a week and a half away from work. This was the poster that accompanied the lightning talk I gave at the Herrenhausen Digital Humanities conference in Germany; the talk itself has already been posted on my blog. Giving a poster was a completely new experience for me, but it was extremely useful for distilling my thoughts and experimenting with a new format.

If you have any questions about this, as always, don’t hesitate to touch base with me. Hat tip to Timothy Bristow who provided design work on the poster, and to the receptive audience that gave me some fantastic ideas for future work on this project.

[Poster image: Herrenhausen_PRINT]