Wednesday, October 23, 2013

Python and Data Science

I was exploring the LinkedIn profiles of well-known data scientists such as Jeff Hammerbacher, Hilary Mason, DJ Patil and Gilad Lotan to get an idea of their technical skill sets. The first thing they had in common was Python. So I decided to explore the general Python projects and the data mining/machine learning libraries associated with the language.

General Python Projects:

Python: A general-purpose, high-level programming language. Python supports multiple programming paradigms, including object-oriented, imperative, procedural and functional styles. [1] The Python implementation is released under an open source license that makes it freely usable and distributable, even for commercial use. [2]
Created by: Guido van Rossum

CPython: It is the default, most widely used implementation of Python.
Written in: C
Maintained by: Python core developers and the Python community, supported by the Python Software Foundation

Difference between Python and CPython: Python is the programming language and CPython is its default implementation. So when people talk about the Python programming language in general, they usually mean Python as run by CPython. There are several other implementations as well, such as Jython and IronPython.
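To make the language/implementation distinction concrete, here is a minimal sketch that asks the running interpreter which implementation it is. It only uses the standard library `platform` module; the helper name `describe_interpreter` is my own and not part of any library.

```python
import platform


def describe_interpreter():
    """Return (implementation name, language version) for the running interpreter."""
    # Typical values are "CPython", "PyPy", "Jython" or "IronPython".
    implementation = platform.python_implementation()
    # The language version is reported separately from the implementation name.
    language_version = platform.python_version()
    return implementation, language_version


implementation, language_version = describe_interpreter()
print("Implementation:", implementation)
print("Python language version:", language_version)
```

Running the same script under a different implementation should change the first line of output, while the code itself stays the same.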

Jython: An implementation of Python in Java. It has several differences from, and incompatibilities with, CPython.
Written in: Java and Python
Successor of: JPython

RPython and PyPy: RPython (Restricted Python) is a restricted subset of Python. PyPy is a Python interpreter written in RPython.
Project goal: Speed, efficiency and compatibility with the CPython interpreter.

IronPython: A Python implementation targeting the .NET Framework.
Written in: C#
Created by: Jim Hugunin
Currently maintained by: Volunteers at Microsoft's CodePlex open-source repository

Cython: A superset of the Python language that lets you write Python-like code which can call back and forth, to and from C or C++ code natively. Cython code is compiled into C extension modules for Python.

IPython: Interactive Python. It is motivated by scientific computing and exploratory analysis, where we want to play with data and files directly. The default interactive Python shell has limited functionality, which IPython addresses.
Created by: Fernando Perez and others.
Getting started: "IPython: Python at your fingertips" (talk at PyCon 2012 by the IPython creators)




Specific projects related to data (processing, mining, visualization and machine learning):

SciPy Ecosystem: A computing environment and open source ecosystem of Python packages used by scientists, analysts and engineers for scientific and technical computing.
  • pandas: A Python library providing high-performance, easy-to-use data structures and data analysis tools.
  • NumPy: A Python library that supports large, multi-dimensional arrays and provides high-level mathematical functions to operate on these arrays. [written in: C and Python]
  • SciPy: The name also refers to a Python package (library) of algorithms and mathematical functions, which is a core element of the SciPy environment for technical computing.
  • matplotlib: A Python library for 2D plotting (used to create various types of graphs and charts).
  • IPython: The IPython project mentioned above is also part of the core SciPy stack.
  • SciKits: Add-on Python packages for scientific computing that build on SciPy. They are not a core part of SciPy.
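As a rough illustration of how two of these pieces fit together, the sketch below builds a small NumPy array and wraps it in a pandas DataFrame. The numbers and the column names (`height`, `width`, `area`) are made up for the example; it assumes NumPy and pandas are installed.

```python
import numpy as np
import pandas as pd

# A 2-D NumPy array: 3 rows (samples) by 2 columns (features). Values are made up.
data = np.array([[1.0, 2.0],
                 [3.0, 4.0],
                 [5.0, 6.0]])

# Vectorized math on the whole array: the mean of each column.
column_means = data.mean(axis=0)

# Wrap the same data in a pandas DataFrame to get labeled columns.
frame = pd.DataFrame(data, columns=["height", "width"])

# Column arithmetic is element-wise, so this adds one derived value per row.
frame["area"] = frame["height"] * frame["width"]

print(column_means)
print(frame.describe())
```

NumPy supplies the array and the fast math, and pandas adds labels and table-style operations on top of it.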

StatsModels: A Python module that enables users to explore data, estimate statistical models and perform statistical tests.

scikit-learn: An open source machine learning library built on top of NumPy, SciPy and matplotlib. Note that it is one specific package, distinct from the general scikit add-on packages listed above.
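A small sketch of what using scikit-learn looks like: its estimators share the same `fit` and `predict` methods, so two different algorithms can be swapped with no other code changes. The toy data below is invented for the example, and the choice of these two classifiers is mine; it assumes scikit-learn is installed.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Toy, made-up data: one feature, two well-separated classes.
X = [[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0], [13.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Two unrelated algorithms, driven through the same fit/predict calls.
models = [LogisticRegression(), KNeighborsClassifier(n_neighbors=3)]

predictions = {}
for model in models:
    model.fit(X, y)  # learn from the labeled examples
    # Predict the class of two unseen points, one near each cluster.
    predictions[type(model).__name__] = model.predict([[1.5], [11.5]]).tolist()

print(predictions)
```

Because the interface is shared, trying another classifier is usually just a matter of adding it to the `models` list.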

PypeR: Lets us use R (a language widely preferred by data scientists) from Python through pipes.

NetworkX: Python package for the creation, manipulation and study of the structure and functions of complex networks.
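For a feel of the "creation, manipulation and study" part, here is a minimal sketch that builds a tiny undirected graph and asks two simple questions about it. The node names and edges are invented for the example; it assumes NetworkX is installed.

```python
import networkx as nx

# Creation: an undirected graph with four nodes, built from a made-up edge list.
graph = nx.Graph()
graph.add_edges_from([("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")])

# Study: how many neighbors does each node have?
degrees = dict(graph.degree())

# Study: which route from "a" to "d" uses the fewest edges?
path = nx.shortest_path(graph, "a", "d")

print(degrees)
print(path)
```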


Evolution of the Python data science ecosystem:

Below is a great answer from Jeff Hammerbacher on Quora about how the Python data mining/machine learning ecosystem evolved.

"The Python community invested in the mid-1990s in Numeric which later evolved into NumPy. After  few years, the plotting functionality from Matlab was brought to Python called matplotlib. Libraries for scientific computing were built around NumPy and matplotlib and bundled into the SciPy package. From R, the data frame and associated manipulations (from the plyr and reshape packages) have been implemented by the pandas library. The scikit-learn project gives a common interface of machine learning algorithms, similar to the caret package in R."


