10 Python modules you must know about

Without any doubt, Python is one of the most talked-about programming languages in this universe. It is everywhere, and because it is so simple to learn, many beginners start their careers with it. Python's syntax is very similar to the English language, with the exception of a few extra characters here and there. In this post you will learn about 10 important Python modules you should know.

 

1. NumPy

 

NumPy adds advanced mathematical functionality to Python. The package is mainly designed to efficiently manipulate large multi-dimensional arrays of arbitrary records without sacrificing too much speed for small multi-dimensional arrays. It also provides basic facilities for the discrete Fourier transform, basic linear algebra and random number generation.
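
A quick, minimal sketch of what working with NumPy looks like (the numbers here are arbitrary):

import numpy as np

a = np.arange(9).reshape(3, 3)        # a 3x3 array of the numbers 0..8
b = np.random.rand(3, 3)              # a random 3x3 matrix
print(a.dot(b))                       # matrix multiplication
print(np.fft.fft(np.ones(8)))         # a discrete Fourier transform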

 

2. SciPy

 

SciPy is a library of algorithms and mathematical tools for Python, and it has caused many scientists to switch from Ruby to Python. It is mainly used for scientific and technical computing. SciPy contains modules for optimization, linear algebra, integration, interpolation, special functions, signal and image processing, ODE solvers and other tasks common in science and engineering.
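
A minimal sketch using two of those modules, integration and optimization (the functions are arbitrary examples):

from scipy import integrate, optimize

# Numerically integrate x^2 from 0 to 1 (the exact answer is 1/3)
area, error = integrate.quad(lambda x: x ** 2, 0, 1)
print(area)

# Find the minimum of (x - 2)^2
result = optimize.minimize_scalar(lambda x: (x - 2) ** 2)
print(result.x)   # approximately 2.0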

 

3. Pandas

 

Pandas is a Python package designed to make working with “labeled” and “relational” data simple and intuitive. It is a perfect tool for data wrangling, designed for quick and easy data manipulation, aggregation, and visualization.
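
A minimal sketch of labeled data and aggregation (the data here is made up):

import pandas as pd

df = pd.DataFrame({'city': ['Delhi', 'Mumbai', 'Delhi'],
                   'sales': [250, 300, 150]})
print(df.groupby('city')['sales'].sum())   # total sales per city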

 

4. matplotlib

 

matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It is mainly used for making 2D plots of arrays in Python, and it is very useful for any data scientist or data analyst.
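
A minimal sketch of a 2D plot (the sine curve is just an example):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x))      # a basic 2D line plot of an array
plt.xlabel('x')
plt.ylabel('sin(x)')
plt.show()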

 

5. IPython

 

IPython is a command shell for interactive computing, originally developed for Python but now supporting other languages as well. It offers introspection, rich media, shell syntax, tab completion, and history.

 

6. PyGame

 

This is an interesting Python module for developers who like to play games and develop them. The library will help you achieve your goal of 2D game development.
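
A minimal sketch of the typical PyGame setup: open a window and run the event loop until the player closes it (the window size is arbitrary):

import pygame

pygame.init()
screen = pygame.display.set_mode((640, 480))   # open a 640x480 window
running = True
while running:
    for event in pygame.event.get():
        if event.type == pygame.QUIT:          # window close button pressed
            running = False
    screen.fill((0, 0, 0))                     # clear the frame
    pygame.display.flip()                      # show the new frame
pygame.quit()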

 

7. Twisted

 

Twisted is an event-driven networking engine written in Python and licensed under the open-source MIT license. It is a key tool for network application developers: it has a very clean API and is used by a lot of well-known Python projects.
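
A minimal sketch of Twisted's event-driven style: an echo server that writes back whatever it receives (the port number is arbitrary):

from twisted.internet import protocol, reactor

class Echo(protocol.Protocol):
    def dataReceived(self, data):
        # Event handler: called whenever data arrives on the connection
        self.transport.write(data)

class EchoFactory(protocol.Factory):
    def buildProtocol(self, addr):
        return Echo()

reactor.listenTCP(8000, EchoFactory())   # listen on port 8000
reactor.run()                            # start the event loop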

 

8. PrettyTable

 

As the name suggests, PrettyTable is a table-printing library that displays tabular data as a neatly formatted table on the console.
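
A minimal sketch (the rows here are made up):

from prettytable import PrettyTable

table = PrettyTable()
table.field_names = ['Module', 'Purpose']
table.add_row(['NumPy', 'Numerical arrays'])
table.add_row(['Scrapy', 'Web crawling'])
print(table)   # renders an ASCII-formatted table on the console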

 

9. nose

 

nose is a testing framework for Python. Its tagline is: “nose extends unittest to make testing easier.” It is widely used by Python developers and is a must-have if you do test-driven development.
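
A minimal sketch, assuming a file named test_math.py (nose collects functions whose names start with 'test'):

# test_math.py
def add(a, b):
    return a + b

def test_add():
    assert add(2, 3) == 5

Running the nosetests command in the same directory discovers and runs the test.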

 

10. Scrapy

 

Scrapy is a free and open-source web-crawling framework written in Python. Originally designed for web scraping, it can also be used to extract data using APIs or as a general-purpose web crawler.
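
A minimal sketch of a spider (quotes.toscrape.com is a public practice site, used here purely as an example):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        # Yield one item per quote found on the page
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}

It can be run with scrapy runspider, without setting up a full project.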

 

 

Use Python to get more than 0.75 accuracy in the Titanic problem on Kaggle

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 of the 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.

The problem statement was to complete the analysis of what sorts of people were likely to survive, and to apply machine learning tools to predict which passengers survived the tragedy. The Kaggle Titanic problem page can be found here. The full solution in Python can be found here on GitHub. The data for the problem is given in two CSV files, test.csv and train.csv. The variables used in the data and their descriptions are as follows:

Variable Name   Definition
survival        If an individual survived the tragedy or not (0 = No, 1 = Yes)
pclass          Ticket class
name            Name of the individual
sex             Sex
age             Age in years
sibsp           # of siblings/spouses aboard the Titanic
parch           # of parents/children aboard the Titanic
ticket          Ticket number
fare            Passenger fare
cabin           Cabin number
embarked        Port of embarkation

The target variable here is Survived, which takes the value 0 or 1. Since the target takes binary values, this is a classification problem. We have to train the model using train.csv, predict the outcomes for test.csv, and submit the predictions on Kaggle. First we read the data from the CSV file using the pandas library, which loads it into a pandas DataFrame. We can see what the data looks like using the dataframe.head() command, list the columns using dataframe.columns, and check the number of rows and columns using dataframe.shape.

import pandas as pd

dataframe = pd.read_csv('train.csv')   # load the training data
print(dataframe.head(10))              # first 10 rows
print(dataframe.columns)               # column names
print(dataframe.shape)                 # (rows, columns)

The ‘Name’ variable present here contains the full name of the person. In this format the ‘Name’ variable is not useful to us, as almost all the entries will be unique, which would not help our prediction. We can make it useful by partitioning the string to extract the title of the individual.

dataframe['Name'] = dataframe['Name'].str.partition(',')[2]   # drop the surname
dataframe['Name'] = dataframe['Name'].str.partition('.')[0]   # keep only the title (e.g. 'Mr')

We can check the values taken by a variable with the value_counts() function, and also check whether a variable contains NaN (Not a Number) values. Checking value_counts() for the Age variable shows that it contains NaN values. We can either drop the NaN values using the dropna() function or impute them. In this example we will impute the missing values using MICE (Multiple Imputation by Chained Equations). To use MICE we have to import a Python library called ‘fancyimpute’. MICE uses the other variables to impute the missing values and iterates until the values converge, so that the imputed values balance the bias and variance of that variable.

print(dataframe['Age'].value_counts(dropna=False))
from fancyimpute import MICE
solver = MICE()
# MICE expects an all-numeric matrix, so impute the numeric columns only
numeric_cols = dataframe.select_dtypes(include=['number']).columns
dataframe[numeric_cols] = solver.complete(dataframe[numeric_cols].values)

Since this is a classification problem, we can solve it with methods such as logistic regression, decision trees, or random forests. Here we will use a random forest. The variables present are categorical and continuous, and some of the categorical variables are strings. We first have to convert those to integer categories so that we can feed the data directly to our model. We can use the factorize() function from the pandas library to do this.

# Encode the string-valued categorical columns (including 'Sex') as integers
cols = ['Name', 'Sex', 'Ticket', 'Pclass', 'Fare', 'Embarked', 'Cabin']
for x in cols:
    a = pd.factorize(dataframe[x])
    dataframe[x] = a[0]

We can create an additional variable named ‘family_size’ using the variables ‘Parch’ and ‘SibSp’. Now our data is ready and we can train our model. We first divide the data into the target variable and the features, Y and X, and then train the model on X and Y.

from sklearn.ensemble import RandomForestClassifier

# Family size = siblings/spouses + parents/children + the passenger themselves
dataframe['family_size'] = dataframe['SibSp'] + dataframe['Parch'] + 1
y = dataframe['Survived'].values
# Features: drop the identifier and the target column
x = dataframe.drop(['PassengerId', 'Survived'], axis=1).values
model = RandomForestClassifier(n_estimators=100, criterion='gini',
                               max_depth=10, min_samples_split=10)
model = model.fit(x, y)

Now our model has been trained, and we can use it to predict whether a person survived the tragedy. We read the test.csv file and use the variables present as test features for prediction.

test = pd.read_csv('test.csv')
# Apply the same 'Name' transformation as on the training data
test['Name'] = test['Name'].str.partition(',')[2]
test['Name'] = test['Name'].str.partition('.')[0]
cols = ['Name', 'Sex', 'Ticket', 'Pclass', 'Fare', 'Embarked', 'Cabin']
for x in cols:
    a = pd.factorize(test[x])
    test[x] = a[0]
# Recreate the engineered feature and fill any remaining missing ages
test['family_size'] = test['SibSp'] + test['Parch'] + 1
test['Age'] = test['Age'].fillna(test['Age'].median())
test_features = test.drop('PassengerId', axis=1).values
my_prediction = model.predict(test_features)

Now we have our predictions in the ‘my_prediction’ variable, and we are ready to submit the solution to Kaggle. Kaggle specifies a submission format for each problem, so we will write our solution to a CSV file in the required format.

import numpy as np

PassengerId = np.array(test['PassengerId']).astype(int)
# Build a DataFrame indexed by PassengerId with the predicted Survived values
my_solution = pd.DataFrame(my_prediction, PassengerId, columns=['Survived'])
my_solution.to_csv('my_solution.csv', index_label=['PassengerId'])

Now we have our solution in the CSV file ‘my_solution.csv’. We can submit this file on the Kaggle submission page and check our accuracy.

 

 

Use Python for Data Cleaning and Analyzing Nicotine Dependence


The objective of the project was to study nicotine dependence, an addiction to tobacco products caused by the drug nicotine. First, the association between smoking behaviour and nicotine dependence was explored; social behaviour and depression history were then analyzed as explanatory variables. A study at Universiti Teknologi MARA, Malaysia concluded that among smokers 5% have high nicotine dependence whereas 18% have moderate nicotine dependence. Another study, by Durham University, UK, found that life events and stressful experiences are associated with nicotine dependence.

The data was taken from a survey conducted by the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC). Earlier, the National Institute on Alcohol Abuse and Alcoholism had performed a landmark study on alcohol abuse and alcohol dependence using the NESARC dataset. The dataset is a representative sample of the United States adult population aged over 18. Initially the data consisted of over 43,000 data points.

The data was in Comma-Separated Values (CSV) format. The variables used for the regression model were as follows:

Variable Name   Description
CHECK321        The cigarette smoking status of the individual
S3AQ3B1         The usual frequency with which the individual smoked cigarettes
S3AQ3C1         The usual quantity the individual smoked
TAB12MDX        The dependent variable: whether the respondent had a nicotine dependence in the last 12 months

The dataset contained various missing values. Python was used for the data cleaning and analysis. First, the dataset was loaded using the pandas library. After loading the data, missing values of the CHECK321 variable were checked. An unknown value of CHECK321 means the cigarette smoking status of that individual is unknown, so the analysis cannot be done for that individual; hence, the data points with missing CHECK321 values were deleted. After deleting the missing values we were left with 18,000 data points. The target variable TAB12MDX takes only two values, ‘1’ denoting that the individual is nicotine dependent and ‘0’ denoting that the individual is not.


For analysis we split the data into a training and a testing set, and the training data was used to train our model. Since the target variable is binary, we decided to use logistic regression. For splitting the data we used the Python library Scikit-learn, also popular as sklearn, with an 80:20 train/test split. The cleaned data was fed to the model and the model was trained; we then tested it on the held-out test data, as in the sketch below. The model correctly predicted whether a person is nicotine dependent 58% of the time.
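
A minimal sketch of this split-and-train step. The variable names come from the table above; the file name 'nesarc.csv' is an assumption, and dropping rows with missing values in the used columns is a simplification of the cleaning described above:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

data = pd.read_csv('nesarc.csv', low_memory=False)   # file name is an assumption
# Keep only respondents whose smoking status is known, as described above
data = data[data['CHECK321'].notna()]
data = data.dropna(subset=['S3AQ3B1', 'S3AQ3C1', 'TAB12MDX'])

features = data[['S3AQ3B1', 'S3AQ3C1']]
target = data['TAB12MDX']

# 80:20 train/test split, as in the post
x_train, x_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=0)

model = LogisticRegression().fit(x_train, y_train)
print(model.score(x_test, y_test))   # fraction of correct predictions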

Since the study by Durham University, UK found that life events and stressful experiences are also associated with nicotine dependence, we decided to introduce two new variables, MAJORDEPLIFE and SOCPDLIFE, to make our model more accurate.

 

Variable Name   Description
MAJORDEPLIFE    Whether the individual suffered depression or not
SOCPDLIFE       Whether the individual has social phobia or not

We again split our data into training and testing sets and trained the model on the training set, now with the two new variables. With the introduction of these two variables associated with life events and stressful experiences, the accuracy of our model increased to 65%: in 65% of cases we correctly predicted whether a person is nicotine dependent from their smoking behaviour and social life.
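
Continuing the sketch above (same assumptions), the extended model only changes the feature list:

# Add depression history and social phobia to the feature set
features = data[['S3AQ3B1', 'S3AQ3C1', 'MAJORDEPLIFE', 'SOCPDLIFE']].dropna()
target = data.loc[features.index, 'TAB12MDX']
x_train, x_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=0)
model = LogisticRegression().fit(x_train, y_train)
print(model.score(x_test, y_test))
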
Research concludes that individuals who smoke more heavily are more prone to nicotine dependence than those who don't, and that individuals with a history of depression or social phobia are also more prone to nicotine dependence than those without. Our results seem consistent with these previous findings.