scikit-learn: machine learning in Python
Easy-to-use and general-purpose machine learning in Python
Scikit-learn integrates machine learning algorithms into the tightly knit scientific Python ecosystem, building on NumPy, SciPy, and matplotlib. As a machine-learning module, it provides versatile tools for data mining and analysis in any field of science and engineering. It strives to be simple and efficient, accessible to everybody, and reusable in various contexts.
License: Open source, commercially usable: BSD license (3 clause)
Documentation for scikit-learn version 0.13.1. For other versions and printable format, see Documentation resources.
User Guide
- 1. Installing scikit-learn
- 2. Tutorials: From the bottom up with scikit-learn
- 2.1. An introduction to machine learning with scikit-learn
- 2.2. A tutorial on statistical-learning for scientific data processing
- 2.2.1. Statistical learning: the setting and the estimator object in scikit-learn
- 2.2.2. Supervised learning: predicting an output variable from high-dimensional observations
- 2.2.3. Model selection: choosing estimators and their parameters
- 2.2.4. Unsupervised learning: seeking representations of the data
- 2.2.5. Putting it all together
- 2.2.6. Finding help
- 3. Supervised learning
- 3.1. Generalized Linear Models
- 3.1.1. Ordinary Least Squares
- 3.1.2. Ridge Regression
- 3.1.3. Lasso
- 3.1.4. Elastic Net
- 3.1.5. Multi-task Lasso
- 3.1.6. Least Angle Regression
- 3.1.7. LARS Lasso
- 3.1.8. Orthogonal Matching Pursuit (OMP)
- 3.1.9. Bayesian Regression
- 3.1.10. Logistic regression
- 3.1.11. Stochastic Gradient Descent - SGD
- 3.1.12. Perceptron
- 3.1.13. Passive Aggressive Algorithms
- 3.2. Support Vector Machines
- 3.3. Stochastic Gradient Descent
- 3.4. Nearest Neighbors
- 3.5. Gaussian Processes
- 3.6. Partial Least Squares
- 3.7. Naive Bayes
- 3.8. Decision Trees
- 3.9. Ensemble methods
- 3.10. Multiclass and multilabel algorithms
- 3.11. Feature selection
- 3.12. Semi-Supervised
- 3.13. Linear and Quadratic Discriminant Analysis
- 3.14. Isotonic regression
- 4. Unsupervised learning
- 4.1. Gaussian mixture models
- 4.2. Manifold learning
- 4.3. Clustering
- 4.3.1. Overview of clustering methods
- 4.3.2. K-means
- 4.3.3. Affinity Propagation
- 4.3.4. Mean Shift
- 4.3.5. Spectral clustering
- 4.3.6. Hierarchical clustering
- 4.3.7. DBSCAN
- 4.3.8. Clustering performance evaluation
- 4.4. Decomposing signals in components (matrix factorization problems)
- 4.5. Covariance estimation
- 4.6. Novelty and Outlier Detection
- 4.7. Hidden Markov Models
- 5. Model selection and evaluation
- 5.1. Cross-Validation: evaluating estimator performance
- 5.2. Grid Search: setting estimator parameters
- 5.3. Pipeline: chaining estimators
- 5.4. FeatureUnion: Combining feature extractors
- 5.5. Model evaluation
- 5.5.1. Classification metrics
- 5.5.1.1. Accuracy score
- 5.5.1.2. Area under the curve (AUC)
- 5.5.1.3. Average precision score
- 5.5.1.4. Confusion matrix
- 5.5.1.5. Classification report
- 5.5.1.6. Precision, recall and F-measures
- 5.5.1.7. Hinge loss
- 5.5.1.8. Matthews correlation coefficient
- 5.5.1.9. Receiver operating characteristic (ROC)
- 5.5.1.10. Zero one loss
- 5.5.2. Regression metrics
- 5.5.3. Clustering metrics
- 5.5.4. Dummy estimators
- 6. Dataset transformations
- 6.1. Preprocessing data
- 6.2. Feature extraction
- 6.2.1. Loading features from dicts
- 6.2.2. Feature hashing
- 6.2.3. Text feature extraction
- 6.2.3.1. The Bag of Words representation
- 6.2.3.2. Sparsity
- 6.2.3.3. Common Vectorizer usage
- 6.2.3.4. Tf–idf term weighting
- 6.2.3.5. Applications and examples
- 6.2.3.6. Limitations of the Bag of Words representation
- 6.2.3.7. Vectorizing a large text corpus with the hashing trick
- 6.2.3.8. Customizing the vectorizer classes
- 6.2.4. Image feature extraction
- 6.3. Kernel Approximation
- 6.4. Random Projection
- 6.5. Pairwise metrics, Affinities and Kernels
- 7. Dataset loading utilities
- 7.1. General dataset API
- 7.2. Toy datasets
- 7.3. Sample images
- 7.4. Sample generators
- 7.5. Datasets in svmlight / libsvm format
- 7.6. The Olivetti faces dataset
- 7.7. The 20 newsgroups text dataset
- 7.8. Downloading datasets from the mldata.org repository
- 7.9. The Labeled Faces in the Wild face recognition dataset
- 7.10. Forest covertypes
- 8. Reference
- 8.1. sklearn.cluster: Clustering
- 8.2. sklearn.covariance: Covariance Estimators
- 8.2.1. sklearn.covariance.EmpiricalCovariance
- 8.2.2. sklearn.covariance.EllipticEnvelope
- 8.2.3. sklearn.covariance.GraphLasso
- 8.2.4. sklearn.covariance.GraphLassoCV
- 8.2.5. sklearn.covariance.LedoitWolf
- 8.2.6. sklearn.covariance.MinCovDet
- 8.2.7. sklearn.covariance.OAS
- 8.2.8. sklearn.covariance.ShrunkCovariance
- 8.2.9. sklearn.covariance.empirical_covariance
- 8.2.10. sklearn.covariance.ledoit_wolf
- 8.2.11. sklearn.covariance.shrunk_covariance
- 8.2.12. sklearn.covariance.oas
- 8.2.13. sklearn.covariance.graph_lasso
- 8.3. sklearn.cross_validation: Cross Validation
- 8.3.1. sklearn.cross_validation.Bootstrap
- 8.3.2. sklearn.cross_validation.KFold
- 8.3.3. sklearn.cross_validation.LeaveOneLabelOut
- 8.3.4. sklearn.cross_validation.LeaveOneOut
- 8.3.5. sklearn.cross_validation.LeavePLabelOut
- 8.3.6. sklearn.cross_validation.LeavePOut
- 8.3.7. sklearn.cross_validation.StratifiedKFold
- 8.3.8. sklearn.cross_validation.ShuffleSplit
- 8.3.9. sklearn.cross_validation.StratifiedShuffleSplit
- 8.3.10. sklearn.cross_validation.train_test_split
- 8.3.11. sklearn.cross_validation.cross_val_score
- 8.3.12. sklearn.cross_validation.permutation_test_score
- 8.3.13. sklearn.cross_validation.check_cv
- 8.4. sklearn.datasets: Datasets
- 8.4.1. Loaders
- 8.4.1.1. sklearn.datasets.fetch_20newsgroups
- 8.4.1.2. sklearn.datasets.fetch_20newsgroups_vectorized
- 8.4.1.3. sklearn.datasets.load_boston
- 8.4.1.4. sklearn.datasets.load_diabetes
- 8.4.1.5. sklearn.datasets.load_digits
- 8.4.1.6. sklearn.datasets.load_files
- 8.4.1.7. sklearn.datasets.load_iris
- 8.4.1.8. sklearn.datasets.load_lfw_pairs
- 8.4.1.9. sklearn.datasets.fetch_lfw_pairs
- 8.4.1.10. sklearn.datasets.load_lfw_people
- 8.4.1.11. sklearn.datasets.fetch_lfw_people
- 8.4.1.12. sklearn.datasets.load_linnerud
- 8.4.1.13. sklearn.datasets.fetch_mldata
- 8.4.1.14. sklearn.datasets.fetch_olivetti_faces
- 8.4.1.15. sklearn.datasets.fetch_california_housing
- 8.4.1.16. sklearn.datasets.load_sample_image
- 8.4.1.17. sklearn.datasets.load_sample_images
- 8.4.1.18. sklearn.datasets.load_svmlight_file
- 8.4.1.19. sklearn.datasets.dump_svmlight_file
- 8.4.2. Samples generator
- 8.4.2.1. sklearn.datasets.make_blobs
- 8.4.2.2. sklearn.datasets.make_classification
- 8.4.2.3. sklearn.datasets.make_circles
- 8.4.2.4. sklearn.datasets.make_friedman1
- 8.4.2.5. sklearn.datasets.make_friedman2
- 8.4.2.6. sklearn.datasets.make_friedman3
- 8.4.2.7. sklearn.datasets.make_hastie_10_2
- 8.4.2.8. sklearn.datasets.make_low_rank_matrix
- 8.4.2.9. sklearn.datasets.make_moons
- 8.4.2.10. sklearn.datasets.make_multilabel_classification
- 8.4.2.11. sklearn.datasets.make_regression
- 8.4.2.12. sklearn.datasets.make_s_curve
- 8.4.2.13. sklearn.datasets.make_sparse_coded_signal
- 8.4.2.14. sklearn.datasets.make_sparse_spd_matrix
- 8.4.2.15. sklearn.datasets.make_sparse_uncorrelated
- 8.4.2.16. sklearn.datasets.make_spd_matrix
- 8.4.2.17. sklearn.datasets.make_swiss_roll
- 8.5. sklearn.decomposition: Matrix Decomposition
- 8.5.1. sklearn.decomposition.PCA
- 8.5.2. sklearn.decomposition.ProbabilisticPCA
- 8.5.3. sklearn.decomposition.ProjectedGradientNMF
- 8.5.4. sklearn.decomposition.RandomizedPCA
- 8.5.5. sklearn.decomposition.KernelPCA
- 8.5.6. sklearn.decomposition.FactorAnalysis
- 8.5.7. sklearn.decomposition.FastICA
- 8.5.8. sklearn.decomposition.NMF
- 8.5.9. sklearn.decomposition.SparsePCA
- 8.5.10. sklearn.decomposition.MiniBatchSparsePCA
- 8.5.11. sklearn.decomposition.SparseCoder
- 8.5.12. sklearn.decomposition.DictionaryLearning
- 8.5.13. sklearn.decomposition.MiniBatchDictionaryLearning
- 8.5.14. sklearn.decomposition.fastica
- 8.5.15. sklearn.decomposition.dict_learning
- 8.5.16. sklearn.decomposition.dict_learning_online
- 8.5.17. sklearn.decomposition.sparse_encode
- 8.6. sklearn.dummy: Dummy estimators
- 8.7. sklearn.ensemble: Ensemble Methods
- 8.7.1. sklearn.ensemble.RandomForestClassifier
- 8.7.2. sklearn.ensemble.RandomTreesEmbedding
- 8.7.3. sklearn.ensemble.RandomForestRegressor
- 8.7.4. sklearn.ensemble.ExtraTreesClassifier
- 8.7.5. sklearn.ensemble.ExtraTreesRegressor
- 8.7.6. sklearn.ensemble.GradientBoostingClassifier
- 8.7.7. sklearn.ensemble.GradientBoostingRegressor
- 8.7.8. partial dependence
- 8.8. sklearn.feature_extraction: Feature Extraction
- 8.9. sklearn.feature_selection: Feature Selection
- 8.9.1. sklearn.feature_selection.SelectPercentile
- 8.9.2. sklearn.feature_selection.SelectKBest
- 8.9.3. sklearn.feature_selection.SelectFpr
- 8.9.4. sklearn.feature_selection.SelectFdr
- 8.9.5. sklearn.feature_selection.SelectFwe
- 8.9.6. sklearn.feature_selection.RFE
- 8.9.7. sklearn.feature_selection.RFECV
- 8.9.8. sklearn.feature_selection.chi2
- 8.9.9. sklearn.feature_selection.f_classif
- 8.9.10. sklearn.feature_selection.f_regression
- 8.10. sklearn.gaussian_process: Gaussian Processes
- 8.10.1. sklearn.gaussian_process.GaussianProcess
- 8.10.2. sklearn.gaussian_process.correlation_models.absolute_exponential
- 8.10.3. sklearn.gaussian_process.correlation_models.squared_exponential
- 8.10.4. sklearn.gaussian_process.correlation_models.generalized_exponential
- 8.10.5. sklearn.gaussian_process.correlation_models.pure_nugget
- 8.10.6. sklearn.gaussian_process.correlation_models.cubic
- 8.10.7. sklearn.gaussian_process.correlation_models.linear
- 8.10.8. sklearn.gaussian_process.regression_models.constant
- 8.10.9. sklearn.gaussian_process.regression_models.linear
- 8.10.10. sklearn.gaussian_process.regression_models.quadratic
- 8.11. sklearn.grid_search: Grid Search
- 8.12. sklearn.hmm: Hidden Markov Models
- 8.13. sklearn.isotonic: Isotonic regression
- 8.14. sklearn.kernel_approximation: Kernel Approximation
- 8.15. sklearn.semi_supervised: Semi-Supervised Learning
- 8.16. sklearn.lda: Linear Discriminant Analysis
- 8.17. sklearn.linear_model: Generalized Linear Models
- 8.17.1. sklearn.linear_model.ARDRegression
- 8.17.2. sklearn.linear_model.BayesianRidge
- 8.17.3. sklearn.linear_model.ElasticNet
- 8.17.4. sklearn.linear_model.ElasticNetCV
- 8.17.5. sklearn.linear_model.Lars
- 8.17.6. sklearn.linear_model.LarsCV
- 8.17.7. sklearn.linear_model.Lasso
- 8.17.8. sklearn.linear_model.LassoCV
- 8.17.9. sklearn.linear_model.LassoLars
- 8.17.10. sklearn.linear_model.LassoLarsCV
- 8.17.11. sklearn.linear_model.LassoLarsIC
- 8.17.12. sklearn.linear_model.LinearRegression
- 8.17.13. sklearn.linear_model.LogisticRegression
- 8.17.14. sklearn.linear_model.MultiTaskLasso
- 8.17.15. sklearn.linear_model.MultiTaskElasticNet
- 8.17.16. sklearn.linear_model.OrthogonalMatchingPursuit
- 8.17.17. sklearn.linear_model.PassiveAggressiveClassifier
- 8.17.18. sklearn.linear_model.PassiveAggressiveRegressor
- 8.17.19. sklearn.linear_model.Perceptron
- 8.17.20. sklearn.linear_model.RandomizedLasso
- 8.17.21. sklearn.linear_model.RandomizedLogisticRegression
- 8.17.22. sklearn.linear_model.Ridge
- 8.17.23. sklearn.linear_model.RidgeClassifier
- 8.17.24. sklearn.linear_model.RidgeClassifierCV
- 8.17.25. sklearn.linear_model.RidgeCV
- 8.17.26. sklearn.linear_model.SGDClassifier
- 8.17.27. sklearn.linear_model.SGDRegressor
- 8.17.28. sklearn.linear_model.lars_path
- 8.17.29. sklearn.linear_model.lasso_path
- 8.17.30. sklearn.linear_model.lasso_stability_path
- 8.17.31. sklearn.linear_model.orthogonal_mp
- 8.17.32. sklearn.linear_model.orthogonal_mp_gram
- 8.18. sklearn.manifold: Manifold Learning
- 8.19. sklearn.metrics: Metrics
- 8.19.1. Classification metrics
- 8.19.1.1. sklearn.metrics.accuracy_score
- 8.19.1.2. sklearn.metrics.auc
- 8.19.1.3. sklearn.metrics.auc_score
- 8.19.1.4. sklearn.metrics.average_precision_score
- 8.19.1.5. sklearn.metrics.classification_report
- 8.19.1.6. sklearn.metrics.confusion_matrix
- 8.19.1.7. sklearn.metrics.f1_score
- 8.19.1.8. sklearn.metrics.fbeta_score
- 8.19.1.9. sklearn.metrics.hinge_loss
- 8.19.1.10. sklearn.metrics.matthews_corrcoef
- 8.19.1.11. sklearn.metrics.precision_recall_curve
- 8.19.1.12. sklearn.metrics.precision_recall_fscore_support
- 8.19.1.13. sklearn.metrics.precision_score
- 8.19.1.14. sklearn.metrics.recall_score
- 8.19.1.15. sklearn.metrics.roc_curve
- 8.19.1.16. sklearn.metrics.zero_one_loss
- 8.19.2. Regression metrics
- 8.19.3. Clustering metrics
- 8.19.3.1. sklearn.metrics.adjusted_mutual_info_score
- 8.19.3.2. sklearn.metrics.adjusted_rand_score
- 8.19.3.3. sklearn.metrics.completeness_score
- 8.19.3.4. sklearn.metrics.homogeneity_completeness_v_measure
- 8.19.3.5. sklearn.metrics.homogeneity_score
- 8.19.3.6. sklearn.metrics.mutual_info_score
- 8.19.3.7. sklearn.metrics.normalized_mutual_info_score
- 8.19.3.8. sklearn.metrics.silhouette_score
- 8.19.3.9. sklearn.metrics.silhouette_samples
- 8.19.3.10. sklearn.metrics.v_measure_score
- 8.19.4. Pairwise metrics
- 8.19.4.1. sklearn.metrics.pairwise.additive_chi2_kernel
- 8.19.4.2. sklearn.metrics.pairwise.chi2_kernel
- 8.19.4.3. sklearn.metrics.pairwise.distance_metrics
- 8.19.4.4. sklearn.metrics.pairwise.euclidean_distances
- 8.19.4.5. sklearn.metrics.pairwise.kernel_metrics
- 8.19.4.6. sklearn.metrics.pairwise.linear_kernel
- 8.19.4.7. sklearn.metrics.pairwise.manhattan_distances
- 8.19.4.8. sklearn.metrics.pairwise.pairwise_distances
- 8.19.4.9. sklearn.metrics.pairwise.pairwise_kernels
- 8.19.4.10. sklearn.metrics.pairwise.polynomial_kernel
- 8.19.4.11. sklearn.metrics.pairwise.rbf_kernel
- 8.20. sklearn.mixture: Gaussian Mixture Models
- 8.21. sklearn.multiclass: Multiclass and multilabel classification
- 8.21.1. Multiclass and multilabel classification strategies
- 8.21.2. sklearn.multiclass.OneVsRestClassifier
- 8.21.3. sklearn.multiclass.OneVsOneClassifier
- 8.21.4. sklearn.multiclass.OutputCodeClassifier
- 8.21.5. sklearn.multiclass.fit_ovr
- 8.21.6. sklearn.multiclass.predict_ovr
- 8.21.7. sklearn.multiclass.fit_ovo
- 8.21.8. sklearn.multiclass.predict_ovo
- 8.21.9. sklearn.multiclass.fit_ecoc
- 8.21.10. sklearn.multiclass.predict_ecoc
- 8.22. sklearn.naive_bayes: Naive Bayes
- 8.23. sklearn.neighbors: Nearest Neighbors
- 8.23.1. sklearn.neighbors.NearestNeighbors
- 8.23.2. sklearn.neighbors.KNeighborsClassifier
- 8.23.3. sklearn.neighbors.RadiusNeighborsClassifier
- 8.23.4. sklearn.neighbors.KNeighborsRegressor
- 8.23.5. sklearn.neighbors.RadiusNeighborsRegressor
- 8.23.6. sklearn.neighbors.BallTree
- 8.23.7. sklearn.neighbors.NearestCentroid
- 8.23.8. sklearn.neighbors.kneighbors_graph
- 8.23.9. sklearn.neighbors.radius_neighbors_graph
- 8.24. sklearn.pls: Partial Least Squares
- 8.25. sklearn.pipeline: Pipeline
- 8.26. sklearn.preprocessing: Preprocessing and Normalization
- 8.26.1. sklearn.preprocessing.Binarizer
- 8.26.2. sklearn.preprocessing.KernelCenterer
- 8.26.3. sklearn.preprocessing.LabelBinarizer
- 8.26.4. sklearn.preprocessing.LabelEncoder
- 8.26.5. sklearn.preprocessing.MinMaxScaler
- 8.26.6. sklearn.preprocessing.Normalizer
- 8.26.7. sklearn.preprocessing.OneHotEncoder
- 8.26.8. sklearn.preprocessing.StandardScaler
- 8.26.9. sklearn.preprocessing.add_dummy_feature
- 8.26.10. sklearn.preprocessing.balance_weights
- 8.26.11. sklearn.preprocessing.binarize
- 8.26.12. sklearn.preprocessing.normalize
- 8.26.13. sklearn.preprocessing.scale
- 8.27. sklearn.qda: Quadratic Discriminant Analysis
- 8.28. sklearn.random_projection: Random projection
- 8.29. sklearn.svm: Support Vector Machines
- 8.30. sklearn.tree: Decision Trees
- 8.31. sklearn.utils: Utilities
Example Gallery
- Examples
- General examples
- Examples based on real world datasets
- Clustering
- Covariance estimation
- Dataset examples
- Decomposition
- Ensemble methods
- Tutorial exercises
- Gaussian Process for Machine Learning
- Generalized Linear Models
- Manifold learning
- Gaussian Mixture Models
- Nearest Neighbors
- Semi Supervised Classification
- Support Vector Machines
- Decision Trees
Development
- 1. Contributing
- 2. How to optimize for speed
- 3. Utilities for Developers
- 4. Developers’ Tips for Debugging
- 5. Maintainer / core-developer information
- 6. About us