The Impoverished Social Scientist's Guide to Free Statistical Software and Resources

Last Updated: December 18, 2008

Free/Open Source Software

For statistical computing resources and other software for accurate computing, such as high-precision libraries, optimizers, and random number generators, see our statistical computing page. For software written by me for data distribution, accuracy, and replication, see my software page. For sources of research data, see my Data Resources page.

Where to Start
The R Statistical Language The open source statistical language of choice for most tasks. Based on the 'S' language. Thousands of contributed packages.
GPL
Other General Statistics Packages
ADE A modular multivariate analysis program which includes modules for spatial data analysis. Plays well with R. GPL
Adamsoft A general purpose package that specializes in client-server based data management and large-data/low-memory computations. Good for large datasets. GPL
DataPlot A powerful but somewhat byzantine package from the National Institute of Standards and Technology. OSS
Gretl An open source econometrics package that plays nicely with R GPL
ExaStat Basic statistics and regression on large data, using Windows.
OSS
Macanova Reasonably powerful & programmable, if not easy to use.
GPL
OpenStat General package focusing on teaching, IRT.
OSS
PSPP Aspires to replace SPSS. Reads SPSS files and provides the data manipulation functions, but is missing most of the analytical features. GPL
Simfit Reasonably powerful, with emphasis on simulation; command-line.
OSS
WinIADAMS A free Windows package for exploratory analysis, time series, and linear models. Nice interactive multi-dimensional table browser and interactive plots. No Source
Accurate Statistics

(The following modules for R are very useful for highly accurate statistical computing on hard problems. For more resources and computing libraries, see my Resources for Accurate Computing page.)
accuracy Sensitivity analysis and true random number generation GPL
gmp Multiple precision arithmetic
GPL
OpenTURNS Tools for modeling uncertainty and risks.
GPL
rgenoud Optimizer using genetic algorithms and derivatives GPL
rstream Parallelizable random number generators GPL
trust Trust region based optimization GPL
UNF Universal Numeric Fingerprints -- format independent data validation. GPL
Data-Interactive Graphics (Data Visualization)

Also see the plotting category.
Gaugin Grouping, glyphs, tableplots, oh my.
GPL
GGobi Supports data interactive visualization, exploration, comp, and analysis. Includes automated projection pursuit in high-dimensions.
GPL
Improvise A Java toolkit for linked visualizations.
GPL
KLIMT Interactive analysis of classification and regression trees
GPL
LabPlot
Data analysis and visualization
GPL
Mondrian Mondrian is especially useful for interactive visualization of categorical data, and very large datasets.
No Source
OPEN DX Generates visualizations and animations for very large scale scientific data
OSS
ParaView
Parallel visualization of large datasets.
GPL
prefuse Java visualization toolkit
OSS
Processing A language for rapid development of interactive data visualizations. Well integrated with Java and can produce polished visualizations.
OSS
VISIT
Parallel large data visualization software

VISTA
Dynamic, interactive, multi-view graphics. Plus a very interesting visual user-interface, akin to data-desk, but more advanced statistically.
GPL
Data Plotting (and publication-ready graphics)

Almost all of the tools listed on this page have some sort of graphing capabilities. These packages specialize in it. Also see the visualization category.
Gnuplot Command-line driven plots in 2D and 3D.
GPL
GUPPI Extensible plotting tool for Gnome.
GPL
Jas3 A visualization and curve fitting package in Java. GPL
SciGraphica High performance plotting package similar to Microcal Origin. GPL
Image and Plot Analysis

These packages can be used to manipulate images and to extract quantitative information from them, including recovering data from published plots and graphs.
DataScan Extracts information from topographic images, microscopic images, and others.
OSS
g3data
Specifically for extracting data from published graphs.
GPL
Image/J Can extract data from scanned maps, charts, graphs and even photos. OSS
Scion Image
Programmable image program with data capture capabilities.
No source
Data Mining

Also see the categories on text mining and machine learning.
Auton Labs Software Dozens of independent packages for machine learning, including many classifiers. Source Available (registration required)
Databionic Clustering, visualization, and classification using emergent self-organizing maps. GPL
Knime Supports data pipelines for data processing, clustering, supervised learning, etc. GUI, CLI and API based. OSS
Orange Predictive modeling, ensemble methods, clustering and validation, using C components, GUI widgets, and Python integration. GPL
Rattle A Gnome-based interface that glues together a large number of modules in R (clustering, association, machine learning, evaluation) for data mining.
GPL
Shogun Machine learning toolbox with multiple SVM, LDA, and LPM classifiers. C++ with interfaces for Octave, R, Matlab, and Python. GPL
Tanagra Supports data processing streams including clustering, supervised learning, meta-spv, and cross-validation. Provides a GUI.
OSS
Qualitative Data Manipulation, Management, Mining and Analysis

A list of commercial and non-commercial tools for qualitative analysis is part of the open directory project and a well-subscribed discussion list about software can be found as part of jisc, and a comparison of QDAS packages is here. The Natural Language Processing TaskView describes many R packages (interfaces to external toolkits) for text understanding. The ML-Interfaces package on BioConductor provides a uniform interface to a large set of machine learning packages in R.
Advene Video annotation
OSS
AnSWR From the CDC, for mixed qualitative/quantitative analysis.
No Source
Automap/ORA Text tagging (similar to Atlas-TI), with more linguistic coding options, plus visualization and analysis of the network of concepts identified. No Source
Elan For complex annotation of audio and video. GPL
EZ-Text
From the CDC, for textual data analysis.
No Source
Gate A toolkit for information extraction from text. GPL
Judge
Performs automatic classification and clustering of documents. GPL
Lingpipe Java library for linguistic processing and analysis.
No Commercial Use
Kea Performs automated key phrase extraction. GPL
Language Archiving Technology
A hosted service for text management and analysis.
Hosted
NLTK
A Python toolkit for natural language processing. Includes tutorials on NLP.
GPL
Perl The programming language for supreme text mangling. OSS
Pliny For annotating documents, text and images, and generating maps and graphs of relationships. OSS
SIL tools
If you have a lot of text on-line, the concordance, indexing, and database tools from the Summer Institute of Linguistics may be what you need. No Source
Tabari Uses special purpose rules for categorizing news events from news text. GPL
Tams Textual analysis and markup. Similar to Atlas-TI.
GPL
TextStat Another indexing/concordance package. GPL
VUE
Visual understanding environment. Allows you to create annotated networks of multimedia objects for presentation and commentary. A sort of non-linear, scholarly, PowerPoint.
OSS
Weft
For qualitative data management and coding.
No Source
Weka Weka is a collection of machine learning algorithms for data mining, including text mining. (R-Weka connects Weka and R, and is available on CRAN). GPL
Wordfish
Scaling software for estimating political positions from texts.
GPL
YALE (now RapidMiner) A flexible standalone package that contains many data mining algorithms. GPL
Spatial Statistics and GIS

In addition to the individual packages below, the Free GIS Site and OpenSourceGis sites maintain lists of many open-source GIS packages. The CISSS Tools Clearinghouse maintains links to many spatial analysis programs. Kelly Pace gives a list of links to software for advanced spatiotemporal econometrics. The AI-geostats software page has links to geo-spatial statistics programs and code. And Rgeo lists lots of contributed packages for doing geospatial statistics with R, including 'fields', 'geoR', 'graper', 'grass', and 'spatstat'.
Choroware Choropleth maps with genetic algorithm generated class intervals. GPL
CrimeStat Network, spatial and statistical analysis for crime data. Created for the National Institute of Justice. No Source
Fragstats Designed to compute a wide variety of landscape metrics for categorical map patterns GPL
Geoda Unusual in its combination of GIS and spatial econometrics. No Source
Geovista Studio General GIS toolkit and exploratory data analysis system GPL
Grass One of the most powerful free geographic information systems for the display of spatial data. GPL
LandSerf Land surface visualization and analysis No Source
SatScan Space-time scan statistics -- for analysis of disease and other clusters distributed in space and time No Source
SAGA Combines GIS with kriging and terrain analysis GPL
Spatial Econometrics Lib.
A library of Matlab functions for advanced spatial and spatiotemporal econometric analysis
OSS

STARS
Space time analysis of regional systems. Designed for the dynamic exploratory analysis of data measured for areal units at multiple points in time. If you have spatial time-series data, check this.
GPL
Survey Data Collection and Analysis

The general software packages above have some facilities for survey analysis. The programs below specialize in data collection and/or the analysis of complex surveys. Also see the Epidemiology section.
AM Handles analysis of complex survey samples, such as NAEP and TIMSS No Source
dopoxtools Free research web survey hosting Hosted
Mod_survey A very mature open source survey system. It is implemented as a drop-in Apache module. It supports creation of survey templates using XML, and export of the resulting data in a number of interchange formats. Mod_survey can be configured in a decentralized way, so that all users on a particular web server can administer their own surveys independently. (Also see YaaCs, below.)
GPL
OpenSurveyPilot Server based web survey system
GPL
PHPEsp PHP based web survey system GPL
Lime Survey PHP based web survey system
GPL
PEBL A programming environment for building interactive psychology experiments GPL
protogenie Free research web survey hosting Hosted
PsychExps A repository of experimental design scripts to be run under the Macromedia Authorware environment. Mixed
Quex Suite Web based CATI system with integrated VoIP (Asterisk), XML form language, and paper form scanning capability. GPL
SurveyWiz Simple JavaScript based web survey system
GPL
TESS Time-Sharing Experiments for the Social Sciences. An NSF-funded infrastructure to provide both web and phone surveys. Hosted
WebExp2 A Java-based system for on-line psych experiments. No Source
YaaCs A CATI system that uses Mod_survey for the data collection, and offers additional management of other phases of the survey work flow -- questionnaire building, interviewer management, etc. GPL
Agent-Based Simulation

The International Society for Artificial Life maintains a list of links to many agent-based simulation frameworks.
Ascape Agent based simulation package
GPL
breve Simulation in a 3-D world, using Python or a simple scripting language. GPL
EVO
A simulation environment for co-evolution, based on SWARM
OSS
MASON
A Java-based agent-based modeling system popular in political science OSS
NetLogo
An updated dialect of the Logo language for multi-agent simulation
No Source
REPAST A multi agent simulation toolkit, with multiple implementations and built in adaptive features
OSS
Sesam Simulation system with cool visual model building interface. OSS
SOAR Agent based modeling based on cognitive/AI constructs. GPL
Swarm A mature, full-featured framework for agent-based modeling, built in Objective C
GPL
Dynamic Event Simulation

This overlaps with Agent-Based Simulation above. I have listed only packages below, but several programming libraries are also available, including: DSOL (Java), SimPy (Python), Adevs (C++) and DeX (Python, C++, Scripting).
Desmo-J Discrete event simulation framework GPL
OMNet++ OMNeT++ is a component-based, modular and open-architecture simulation environment with strong GUI support and an embeddable simulation kernel, focussing on communication networks, but general enough to be used for network, systems, and business process simulation. Academic Source License (not open source)
Monte Carlo and Markov-Chain Monte Carlo (MCMC) Simulation

R and many of the other general packages above can be used for MC simulation. R also has a number of modules to perform Bayesian MCMC analysis, both directly and by communicating with BUGS and JAGS.
JAGS
Just Another Gibbs Sampler. A program for Bayesian hierarchical models. ("Not unlike BUGS")
GPL
MCMCpack
An R module to perform MCMC-based analysis. Very easy to use, since it contains a large variety of pre-configured models.
GPL
McSim A specially tailored Monte Carlo simulation package. Goes well beyond general packages.
GPL
OpenBugs Open source rewrite of BUGS for Bayesian simulation GPL
WinBUGS
Still the best BUGS for Windows, but not OSS.
No Source
Specialized Statistical Packages
Blossom multi-response permutation tests No Source
Fityk
Nonlinear peak fitting.
GPL
Gambit game theory made simple(r) OSS
gSwing
Election result tracking and display
GPL
M.D. Anderson Cancer Center Has useful biostat software from the biostats department.
Mixed.
MDSX Multidimensional Scaling Routines for Windows No Source
MPCA
Discrete and independent component analysis.
GPL
MX Structural Equation Modeling (like LISREL)
No Source
PAST PAlaeontological STatistics. Not strictly social science, of course, but the correspondence analysis, geometric analysis and cladistics could be applied fruitfully. No Source
Sitkis Computes common bibliometric network statistics. No Source
Permap
Perceptual maps created through interactive multidimensional scaling.
No Source
TETRAD A LISREL-like structural equation modeling program GPL
TDA Transition Data Analysis. A system for analyzing event data; supports lots of options and models. GPL
Voteview
Voteview and NOMINATE are for viewing and analyzing roll-call voting.
GPL
Epidemiology

The CDC Software Page also offers a set of special packages for sampling design factors, meta-analysis, and spatial analysis. Also see the WWW Virtual Epidemiology Library and the category on survey tools.
MIX Guided interactive meta-analysis.
GPL
Epidata Provides for programmed data entry and simple analysis.
No source.
Epigrass Epigrass is software for visualizing, analyzing, and simulating epidemic processes on geo-referenced networks.
GPL
Epi-info Epidemiological statistics, maps, reports.
No Source
Openepi
Javascript-based (on or off-line) simple epidemiological statistics.
OSS
Netepi
Web based secure data entry and analysis for epidemiology.
GPL
WinPepi
Over 75 modules for common epidemiological methods.
No Source
Data Cleaning and Management

For managing qualitative data, see the Text Tools section. For other database options see the Free SQL List and The ACM's Sigmod List.
Berkeley DB
A fast key-value based DB. Very lightweight (much more lightweight than SQL, and does not require a separate server running). Very fast for key-based retrievals. Also see the filehash and R.huge packages for using key-value DBs in R.
OSS
CCOUNT Does data cleaning, advanced cross-tabulation, and other market research functions. Also reads many mainframe-style data formats (e.g. EBCDIC, Column Binary). Modeled after SPSS Quantum. GPL
CSPRO Does form-based data entry, cross-tabulation, and mapping. From the U.S. Census.
GPL
DataCleaner Tools for data review and editing.
OSS
HDF
Hierarchical Data Format -- a portable format for representing and manipulating large scientific datasets. The latest version is compatible with netcdf. Also see the netcdf packages for R.

IVEware Multiple imputation for missing data OSS
MySql One of the most mature and stable open source SQL databases. GPL
netCDF
A portable format for representing and manipulating large scientific datasets. Also see the netcdf package in R, the NCO package for manipulating netcdf data on the command line, and the Parallel-NetCDF package for high-speed access to netCDF data.
GPL
PostGRES One of the most mature and stable open source SQL databases.
GPL
R DBI
Connects R and SQL databases.
GPL
Matrix Algebra, Symbolic Algebra, and Computational Algebra Systems

These are standalone systems. For related programmer's libraries see my Resources for Numerical Accuracy listing. The following feature comparison contrasts these and a dozen other more specialized packages.
Axiom
Computer algebra. Lots of functions. Good documentation GPL
Giac/Xcas A computer algebra system. Includes limited compatibility with Maple, MuPad and TI89 syntax; arbitrary precision.
GPL
Ginac A computer algebra system. (C++ Library)
GPL
FreeMat Matrix algebra system. Matlab compatibility and built-in parallelization.
GPL
GAP
Computer algebra system for group theory. Computational discrete algebra.
OSS
JACAL A computer algebra system. GPL
Magnus
Computer algebra system for group theory. GPL
matrex
A 'spreadsheet' where each cell is a matrix. Provides graphing, presentations, multi-threaded function-based calculations GPL
Mathomatic
Yet another computer algebra system
GPL
Maxima A computer algebra system. GPL
OCTAVE A matrix manipulation/mathematics environment like Matlab. Mature. GPL
PARI/GP A computer algebra system with arbitrary precision arithmetic, like Maple or Mathematica. GPL
RLAB A matrix manipulation environment.
GPL
SAGE General purpose mathematical computing environment
GPL
SciLab A matrix manipulation/mathematics environment like Matlab. Mature. GPL
Tela Tensor computing
GPL
YACAS Yet another computer algebra system. (Eponymous)
Comes with Euler, for numerical programming.
GPL
Yorick
An older matrix language.
OSS
Social Network Analysis

Also see the Spatial category above for software with complementary and overlapping spatial network and display features.
Bibexcel
Bibliometric citation analysis.
No Source
CiteSpace
Visualizes networks over time.
No Source
Cfinder
Uses the clique percolation method to find overlapping dense groups of nodes.
No Source
Egonet
Collection and analysis of egocentric network data.
No Source
GraphViz
Mathematical graph visualization
OSS
Insoshi
A social network platform -- useful for data collection.
GPL
Nettvis
Analyze and visualize social networks. Includes an on-line service.
GPL
NetworkX
Python toolkit for visualization and analysis
OSS
NWD
Network workbench, visualization and descriptives.
OSS
Pajek Graph clustering, partitioning, citation analysis, network comparison (differences, unions), metrics. No Source
Proximity
Visualization and knowledge discovery from heterogenous relational networks.
OSS
R Modules for Network Analysis
A number of R modules maintained by Carter Butts, including SNA, network, nettheory, metamatrix. Also see Statnet for more R network packages.
OSS
Sitkis Computes common bibliometric network statistics. No Source
SocNetV
Provides core graph measures for social network analysis
GPL
Sonia
Animated visualizations of longitudinal social networks
GPL
STOCNET Analysis of some interesting models, including evolution of social networks, blockmodeling, dyadic variable and actor analysis, maximum likelihood analysis of longitudinal (evolution of) networks (through SIENA), and core network analysis.
GPL
Tulip Visualization for extremely large graphs. Plugins are available for clustering and core graph metrics. GPL
VISONE Provides core graph measures for social network analysis No source
WinMine Bayesian and dependency (decision-tree) network builder No source
Differential Equations and Dynamic Simulation

A good list of dynamic simulation packages is maintained by the SIAM activity group on dynamic systems.
PETSc Scientific toolkit for differential equations No Source
scirun A scientific environment for simulation and PDEs. No source
SUNDIALS Nonlinear and differential/algebraic equation Solver OSS
Machine Learning

A good list of machine learning tools is at mloss.org. Also see the categories on text mining and data mining.
dysii C++ library for probabilistic learning within dynamic systems; high performance. GPL

Open source software, since it is inherently extensible, offers the researcher unparalleled opportunities to do cutting-edge research. Because it is free, it offers opportunities to the student or practitioner on a limited budget. This list concentrates on statistical packages that offer high-level statistical functions and that make source code freely available. Free software that is not open source is included only when it offers significant functionality that is not otherwise available. A number of software companies offer academic discounts, limited trials or other closed but usable software. See below for other lists that include commercial software.

Analyzing Data

There are some web-based statistics tutorials out there, but none that I like. I recommend some readings:

Other Lists of Statistical Software Packages

Caveats

"Entia non sunt multiplicanda sine necessitate" - William of Ockham's rule
"Ad indicia spectate." - Micah's corollary
"Doing econometrics is like trying to learn the laws of electricity by playing the radio." - Orcutt's observation
"One problem with political science is that its laboratories are unsecured, allowing real people to roam around inside them, spitting in test tubes and fiddling with computers" - Walter Kirn
"You can see a lot, just by looking." - Yogi Berra

Copyright © 1995-2010 Micah Altman