What, exactly, is Open Science?

Posted by Dan on July 28, 2009 at 11:45 am | Categories: Open Data, Policy, Science, open science | 25 Comments

I was recently asked to define what Open Science means. It would have been relatively easy to fall back on a litany of “Open Source, Open Data, Open Access, Open Notebook”, but these are just shorthand for four fundamental goals:

  • Transparency in experimental methodology, observation, and collection of data.
  • Public availability and reusability of scientific data.
  • Public accessibility and transparency of scientific communication.
  • Using web-based tools to facilitate scientific collaboration.

The idea I’ve been most involved with is the first one, since granting access to source code is really equivalent to publishing your methodology when the kind of science you do involves numerical experiments. I’m an extremist on this point, because without access to the source for the programs we use, we rely on faith in the coding abilities of other people to carry out our numerical experiments. In some extreme cases (e.g. when simulation codes or parameter files are proprietary or are hidden by their owners), numerical experimentation isn’t even science. A “secret” experimental design doesn’t give skeptics the ability to repeat (and hopefully verify) your experiment, and the same is true of numerical experiments. Science has to be “verifiable in practice” as well as “verifiable in principle”.

In general, we’re moving towards an era of greater transparency in all of these topics (methodology, data, communication, and collaboration). The problems we face in gaining widespread support for Open Science are really about incentives and sustainability. How can we design or modify the scientific reward systems to make these four activities the natural state of affairs for scientists? Right now, there are some clear disincentives to participating in these activities. Scientists are people, and we’re motivated by most of the same things as normal people:

  • Money, for ourselves, for our groups, and to support our science.
  • Reputation, which is usually (but not necessarily) measured by citations, h-indices, download counts, placement of students, etc.
  • Sufficient time, space, and resources to think and do our research (which is, in many ways, the most powerful motivator).

Right now, the incentive network that scientists work under seems to favor “closed” science. Scientific productivity is measured by the number of papers in traditional journals with high impact factors, and the importance of a scientist’s work is measured by citation count. Both of these measures help determine funding and promotions at most institutions, and doing open science is either neutral or damaging by these measures. Time spent cleaning up code for release, or setting up a microscopy image database, or writing a blog is time spent away from writing a proposal or paper. The “open” parts of doing science just aren’t part of the incentive structure.

Michael Faraday’s advice to a junior colleague, “Work. Finish. Publish.”, needs to be revised. It shouldn’t be enough to publish a paper anymore. If we want open science to flourish, we should raise our expectations to: “Work. Finish. Publish. Release.” That is, your research shouldn’t be considered complete until the data and metadata are put up on the web for other people to use, until the code is documented and released, and until the comments start coming in on the blog post announcing the paper. If our general expectations of what it means to complete a project are raised to this level, the scientific community will start doing these activities as a matter of course.

If you meet a scientist who tells you that they did a fantastic experiment and have wonderful data, you naturally ask them to email you a reprint. Any working scientist would be perplexed if the response was: “Oh, I’m not going to be writing this work up for publication.” It would be absolute nonsense in the culture of science not to publish a report in a journal on the work you have done. And yet, no one seems surprised when scientists are too busy or too secretive to release their data to the community. We should be just as perplexed by this. Instead of complaining about the reward and incentive systems, we should be setting the standard higher: “What do you mean you haven’t gotten around to putting your data on the web? You aren’t done yet!” Or: “How can I possibly review this paper if I can’t see the code they were using? There’s no way for me to tell whether they did the calculation right.” We’re going to have to raise the expectations on completing a scientific project if we want to change the culture of science.

Saros: Distributed Pair Programming

Posted by Dan on June 26, 2009 at 4:11 pm | Categories: Science, Software | No Comments

I’m a big fan of pair programming, which is one of the primary modes of software development in my research group. Usually, two people sitting together can spot errors that one alone can’t, and the pace of coding and debugging is often much higher than when the same two people work separately. I don’t know if my graduate students are as appreciative of this technique as I am — how many students want their advisor right next to them for the entire afternoon, taking over their keyboard, and seeing all the IM requests coming across the screen? But as a researcher, I find it gives me a much greater feel for what we’re actually doing in the lab, sort of like a small group meeting where we’re both looking at the same data or the same plot. It would be great if there were a way to separate the pair programming from the “sitting in the same cramped cubicle” part of the equation.

Christopher Oezbek just let us know about a cool open source Eclipse plugin called Saros. This lets two people sitting in different locales collaboratively edit and work on the same project. I’ve seen similar things in the editor SubEthaEdit (which is not open source), but Saros will let two programmers do this at the project level (with multiple files open), not just at the file level. It looks like a very cool tool to avoid those overcrowded cubicles (or the famous PairOn chair pictured above).

Saros is listed in our Software Engineering and Tools sections.

Machine learning open source software

Posted by Dan on June 12, 2009 at 8:37 am | Categories: Software, open science | 3 Comments

Cheng Soon Ong just emailed me about mloss.org, a community creating a comprehensive open source machine learning environment. Mloss.org is essentially a community portal with lots of detailed information about each of the listed projects. One of the more interesting features of the site is that they’ve tied specific software to publication in an associated journal, the Journal of Machine Learning Research, making it easy for users of the software to find and maintain a citation trail to the work of the original developers. The journal itself encourages open source submissions and automatically ties publication of papers related to the software to an appearance on the portal.

This last bit is a very clever idea. Would a broader electronic journal (perhaps a Journal of Open Science) be a useful way to give open projects (Open Source, Open Data, Open Notebook) more citation currency?

Scientific Software Wants To Be Free

Posted by Dan on May 26, 2009 at 12:20 pm | Categories: Policy, Science, open science | 4 Comments

Go read this wonderful manifesto over at arXiv: Astronomical Software Wants To Be Free: A Manifesto by Weiner et al. The authors talk about some barriers to astronomical software development that hold in all scientific fields. The chief barrier they see is that there are no incentives (and some real disincentives) for authors to release software and documentation to other users. The recommendations are great (modified here only to include all scientific fields):

  • We should create an open central repository at which authors can release software and documentation.
  • Software release should be an integral and funded part of projects.
  • Software release should become an integral part of the publication process.
  • The barriers to publication of methods and descriptive papers should be lower.
  • Programming, statistics and data analysis should be an integral part of the curriculum.
  • There should be more opportunities to fund grass-roots software projects of use to the wider community.
  • We should develop institutional support for science programs that attract and support talented scientists who generate software for public release.

The whole thing is a great read. Check it out!

Quantum Espresso!

Posted by Dan on January 13, 2009 at 9:46 pm | Categories: Science, Software, open science | 3 Comments

I just got an email from Brandon Wood about an open source project called Quantum Espresso (formerly known as PWSCF), a rather extensive package for DFT-based electronic structure calculations. It appears to be a refactoring of some established codes (PWscf, PHONON, CP90, FPMD, Wannier) that have been developed and tested by some of the original authors of novel electronic-structure algorithms – from Car-Parrinello molecular dynamics to density-functional perturbation theory – and applied over the last twenty years by some of the leading materials modeling groups worldwide.

There are definitely some scientific niches which desperately need open source codes (plane wave DFT is one of the ones that comes to mind), so I’m very pleased to learn about this project.

New Software: Reference Tools, Atomic Physics, and Engineering

Posted by Dan on December 2, 2008 at 12:02 pm | Categories: Science, Software | No Comments

Some new software to point out today:

  • In the Tools section, we have a new link to cb2bib, a tool for rapidly extracting unformatted bibliographic references from email alerts, journal web pages, and PDF files.
  • In the Atomic & Molecular Physics section we have a new link to FELLA, which stands for Free Electron Laser Atomic, Molecular, and Optical Physics Program Package. FELLA is a joint project of Christian Buth from LSU and Robin Santra at Argonne National Laboratory.
  • In the Engineering section, we have two new links: one for View3D, a command-line tool for evaluating radiation view factors for scenes with complex 2D and 3D geometry, and one for OSIV, a program that performs cross-correlation analysis of particle image velocimetry (PIV) images.

Check them out, and as always, be sure to suggest your favorite open source scientific software!

Earmarks for Science

Posted by Dan on October 8, 2008 at 8:51 am | Categories: Policy, Science | 2 Comments

At the debate last night, John McCain brought up (twice!) for special scorn an example of earmark spending. His target? The “overhead projector for a planetarium”. It wasn’t the first time he’d brought this earmark request up, either. Bad Astronomy had a good post on how McCain’s comments on planetaria make him “literally antiscience”. The projector in question is hardly your run-of-the-mill overhead projector. The Adler Planetarium in Chicago has a “Sky Theater”, a hemispherical dome onto which it can project just about anything, given the right equipment. Notre Dame (where I teach) has a very similar set-up in our digital visualization theater. The projectors we use were modeled on the current system at the Hayden Planetarium, and just to give you some sense of scope, we have a 50-foot-high domed ceiling over a hexagonal array of chairs that seats 136 students. The system is run by 10 computers, 8 of which do nothing but render 3D objects and transform them for hemispherical projection. It is a million-dollar facility that goes a long way toward making all aspects of science visible to our students. In fact, as earmarks go, the planetarium projector at the Adler is a lot less offensive than some other projects (notably a certain bridge in Alaska).

In the past, McCain has also targeted for scorn an expenditure to study the “DNA of bears in Montana”. To be fair, other earmarks have also been his target: The Woodstock museum, and the bridge to nowhere (at least until he picked a running mate who was in favor of that same bridge) have also been the targets of McCain’s anti-pork ire. But last night, he seemed to express a special loathing for earmarks for science.

Now, a good case can be made (and should be made) that using earmarks to fund basic science research or science outreach is just bad policy. In fact, I’d be happier if the budgets for science-related earmarks were turned over to the NSF in order to fund peer-reviewed and merit-based proposals. But if earmarks are the only way to fund science outreach projects like the Adler’s planetarium, then count me in. It is certainly a better use of money than David Vitter’s proposed earmark of $100,000 for a group that promotes “creation science”. In fact, the examples of religious earmarks pointed out by Americans United for the Separation of Church and State are all worse than the Adler planetarium project.

Exhibit: make your data web-accessible

Posted by Dan on September 18, 2008 at 10:07 am | Categories: Open Data, Software, open science | 2 Comments

David Karger’s lab at MIT has developed some neat web software called exhibit, which is designed to let non-ultra-sophisticated individuals publish data in ways that make it immediately accessible and interactive for people encountering it on the web. With exhibit, a scientist with a lot of data doesn’t need to manage a database (mysql, etc.) and program a front end for it. Instead, they can put a data file (as simple as a spreadsheet) and a presentation file (written in basic html) on their web site, and they’re done. There are a couple of great examples, including an interactive table of the elements that one of Karger’s undergraduates put together.

Exhibit is a three-tier web application framework written in JavaScript, which you can include the same way you would include Google Maps. The integration with Google Maps is quite impressive; one can imagine using it to display geographic or other spatial data. In fact, here’s an exhibit of Danish monthly weather records since 1874. And here’s a great example of exhibit being used to display a bibliography for the MIT Haystack group.
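To give a sense of just how little is involved, here is a rough sketch of what an exhibit presentation file might look like. This is a hedged illustration, not taken from the exhibit documentation: the file name “my-data.json” and the “category” field are invented for the example, and you should check the project’s site for the current API URL and attribute conventions.

```html
<html>
  <head>
    <title>My Data Exhibit</title>
    <!-- the data file: a JSON file listing items, each with a "label"
         and whatever other fields you like (here, "category") -->
    <link href="my-data.json" type="application/json" rel="exhibit/data" />
    <script src="http://static.simile.mit.edu/exhibit/api-2.0/exhibit-api.js"></script>
  </head>
  <body>
    <!-- a facet widget for interactively filtering items by the
         (hypothetical) "category" field -->
    <div ex:role="facet" ex:expression=".category"></div>
    <!-- the view panel renders whatever items survive the filters -->
    <div ex:role="viewPanel">
      <div ex:role="view"></div>
    </div>
  </body>
</html>
```

That’s the whole front end: no database, no server-side code, just the data file and this page sitting next to each other on a web server.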

Other useful related projects are Timeplot and Timeline for placing interactive time data on a web page.

New Software: Data Mining

Posted by Dan on August 7, 2008 at 12:52 pm | Categories: Science, Software | 6 Comments

Some new software is in our Knowledge Discovery and Data Mining section. I can remember a time when “data mining” was a bit of an epithet in science (like “fishing expedition”), but it has now become an established way of finding links and connections in large data sets. Three new open source data mining programs appeared on our radar recently:

  • KNIME, pronounced [naim], is a modular data exploration platform that enables the user to visually create data flows (often referred to as pipelines), selectively execute some or all analysis steps, and later investigate the results through interactive views on data and models.
  • RapidMiner (formerly known as YALE) – not much detail is available about this package yet.
  • Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.
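The “data flow” or pipeline idea that KNIME presents visually can be sketched in a few lines of plain Python. To be clear, this is purely illustrative and has nothing to do with KNIME’s actual API; every function name below is invented. Each “node” transforms a stream of records, and nodes are chained together just as one would wire them up in a workflow editor.

```python
def read_rows(rows):
    """Source node: yields raw records one at a time."""
    for row in rows:
        yield row

def filter_missing(stream):
    """Pre-processing node: drop records with missing values."""
    for row in stream:
        if all(v is not None for v in row):
            yield row

def normalize(stream):
    """Transform node: rescale each record so its values sum to 1."""
    for row in stream:
        total = sum(row)
        yield tuple(v / total for v in row)

# Chain the nodes into a pipeline; only the surviving, transformed
# records come out the other end.
raw = [(2, 2), (1, None), (3, 1)]
pipeline = normalize(filter_missing(read_rows(raw)))
print(list(pipeline))  # [(0.5, 0.5), (0.75, 0.25)]
```

The appeal of the graphical tools is that the wiring, the selective re-execution of individual steps, and the interactive inspection of intermediate results are all handled for you.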

Researching Open Science

Posted by Dan on July 31, 2008 at 2:41 pm | Categories: Science, open science | 1 Comment

I don’t know how I missed this before, but there’s a really interesting article from 2006 up at the Harvard Business School “Working Knowledge” site. It details some of Karim Lakhani’s results from a paper called ‘The Value of Openness in Scientific Problem Solving’. The paper itself is genuine, detailed research on different methods of scientific problem solving, and it is really worth a read for anyone in the Open Science movement. The authors went looking to see whether “Broadcast Search” (i.e. telling the world what problem you are working on) is an effective means of problem solving. My favorite part of the paper:

Our most counter-intuitive finding was the positive and significant impact of the self-assessed distance between the problem and the solver’s field of expertise on the probability of creating a winning solution. This finding implies that the farther the solvers assessed the problem as being from their own field of expertise, the more likely they were to create a winning submission. We reason that the significance of this effect may be due to the ability of “outsiders” from relatively distant fields to see problems with fresh eyes and apply solutions that are novel to the problem domain but well known and understood by them.

