Posted on March 4, 2017 (updated July 3, 2017) by Abhilash Kumar

Use Python to get more than 0.75 accuracy in Titanic problem by Kaggle

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

The problem statement is to analyze what sorts of people were likely to survive and to apply machine learning tools to predict which passengers survived the tragedy. The problem is hosted on Kaggle's Titanic competition page, and the full solution in Python is available on GitHub. The data is given in two CSV files, train.csv and test.csv. The variables in the data and their descriptions are as follows:

Variable Name	Definition
survival	If an individual survived the tragedy or not
pclass	Ticket class
Name	Name of the individual
sex	Sex
Age	Age in years
sibsp	# of siblings/spouses aboard the Titanic
parch	# of parents/children aboard the Titanic
ticket	Ticket Number
fare	Passenger fare
cabin	Cabin Number
embarked	Port of Embarkation

The target variable is Survived, which takes the value 0 or 1. Since the target is binary, this is a classification problem. We have to train the model using train.csv, predict the outcomes for test.csv, and submit the predictions on Kaggle. First we read the data from the CSV file using the pandas library, which we have to import. The data is loaded as a pandas DataFrame. We can see what the data looks like using dataframe.head(), list the columns using dataframe.columns, and check the number of rows and columns using dataframe.shape.

import pandas as pd
dataframe=pd.read_csv('train.csv')
print(dataframe.head(10))
print(dataframe.columns)
print(dataframe.shape)

The ‘Name’ variable contains the full name of the person. In this form it is not useful to us, because almost every entry is unique and so would not help our prediction. We can make it useful by partitioning the string to extract the person's title (such as Mr, Mrs or Miss), which is shared by many passengers.

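# Names look like "Surname, Title. Given names": keep the part after the comma...
dataframe['Name']=dataframe['Name'].str.partition(',')[2]
# ...then the part before the first period, which is the title
dataframe['Name']=dataframe['Name'].str.partition('.')[0]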
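
To see what the two partition steps do, here is a minimal sketch on a few made-up names. The names and the variable names `names`, `after_surname` and `titles` are illustrative assumptions, not part of the original solution.

```python
import pandas as pd

# Made-up names in the "Surname, Title. Given names" layout
names = pd.Series(["Doe, Mr. John", "Roe, Mrs. Jane Ann", "Poe, Miss. Mary"])

# Keep everything after the first comma, then everything before the first period
after_surname = names.str.partition(",")[2]
titles = after_surname.str.partition(".")[0].str.strip()

print(titles.tolist())  # ['Mr', 'Mrs', 'Miss']
```

The extra `str.strip()` only removes the space left behind after the comma.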

We can check the values taken by a variable with the value_counts() function, and we can also check whether a variable contains NaN (Not a Number) values. Calling value_counts on the Age variable shows that it contains NaN values. We can either drop the rows with NaN values using the dropna() function or impute the values. In this example we impute the missing values using MICE (Multiple Imputation by Chained Equations). To use the MICE function we have to import a Python library called ‘fancyimpute’. MICE uses the other variables to predict the missing values and repeats the process until the imputed values converge.

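# dropna=False includes NaN in the counts
print(dataframe['Age'].value_counts(dropna=False))
from fancyimpute import MICE
solver=MICE()
# complete() expects an all-numeric matrix, so string columns must be encoded first
Imputed_dataframe=solver.complete(dataframe.values)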
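
If the `MICE` class is not available in your installed fancyimpute version, a similar chained-equations imputation can be sketched with scikit-learn's `IterativeImputer`. This is a swapped-in alternative, not the method used in the original solution, and the toy values below are made up for illustration.

```python
import numpy as np
import pandas as pd
# IterativeImputer is flagged experimental, so this enabling import must come first
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy numeric data (made up); NaN marks a missing age
toy = pd.DataFrame({
    "Age":    [22.0, 38.0, np.nan, 35.0, np.nan, 54.0],
    "Pclass": [3, 1, 3, 1, 3, 1],
    "Fare":   [7.25, 71.28, 7.92, 53.10, 8.05, 51.86],
})

# Each column with gaps is modelled from the other columns, repeated up to max_iter rounds
imputer = IterativeImputer(max_iter=10, random_state=0)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled["Age"].isna().sum())
```

Observed values are left untouched; only the NaN cells are replaced with model-based estimates.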

Since this is a classification problem, we can solve it with methods such as Logistic Regression, Decision Tree or Random Forest. Here we use Random Forest. The variables are a mix of categorical and continuous, and some of the categorical variables are stored as strings. We first have to convert those to integer codes so that we can feed the data directly to the model. We can use the factorize function from the pandas library to do this.

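# 'Sex' is also stored as a string, so it is encoded along with the other columns
cols=['Name','Sex','Ticket','Pclass','Fare','Embarked','Cabin']
for x in cols:
    # factorize returns (integer codes, unique values); keep the codes
    a=pd.factorize(dataframe[x])
    dataframe[x]=a[0]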
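
A short sketch of what factorize returns may help here. It also shows one possible way to reuse the training labels on another dataset, so that the same string maps to the same integer. The port codes and the names `train_ports` and `test_ports` are made up for illustration, and the `pd.Categorical` step is an assumption, not something the original solution does.

```python
import pandas as pd

# Made-up port codes; None stands for a missing value
train_ports = pd.Series(["S", "C", None, "S", "Q"])
test_ports = pd.Series(["Q", "S", "X"])  # "X" never appears in training

# factorize returns (integer codes, unique labels in order of first appearance)
codes, uniques = pd.factorize(train_ports)
print(codes.tolist())    # [0, 1, -1, 0, 2]  (missing values become -1)
print(uniques.tolist())  # ['S', 'C', 'Q']

# Reuse the training labels so the same string gets the same integer elsewhere
test_codes = pd.Categorical(test_ports, categories=uniques).codes
print(test_codes.tolist())  # [2, 0, -1]  (unseen labels also become -1)
```

Factorizing two files independently numbers the labels in order of first appearance in each file, so the integers are only comparable when a shared label list like `uniques` is used.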

We can create an additional variable named ‘family_size’ from the variables ‘parch’ and ‘sibsp’. Now our data is ready and we can train our model. We first divide the data into the target variable Y and the features X, then train the model using X and Y.

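import numpy as np
from sklearn.ensemble import RandomForestClassifier
dataframe['family_size']=dataframe['SibSp'] + dataframe['Parch'] + 1
# Target: 1-D array of 0/1 labels
y=dataframe['Survived'].values
# Features: drop the target (keeping it would leak the answer) and the PassengerId identifier
# Any remaining NaN values (e.g. in Age) must be filled before fitting
x=dataframe.drop(columns=['Survived','PassengerId']).values
model=RandomForestClassifier(n_estimators=100,criterion='gini',max_depth=10,min_samples_split=10)
model=model.fit(x,y)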
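
Before submitting, it can be useful to estimate accuracy locally by holding back part of the training rows. The following is a minimal sketch on synthetic data, not the Titanic data; `x_demo`, `y_demo`, the 80:20 split and the random seeds are all assumptions made for illustration, while the forest settings mirror the ones used above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared feature matrix and labels (not Titanic data)
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 5))
y_demo = (x_demo[:, 0] + x_demo[:, 1] > 0).astype(int)

# Hold back 20% of the rows for validation, keeping the class balance
x_train, x_val, y_train, y_val = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

clf = RandomForestClassifier(
    n_estimators=100, criterion="gini", max_depth=10,
    min_samples_split=10, random_state=0,
)
clf.fit(x_train, y_train)

# Fraction of held-back rows that the model classifies correctly
val_accuracy = accuracy_score(y_val, clf.predict(x_val))
print(f"validation accuracy: {val_accuracy:.2f}")
```

With the real data, `x_demo` and `y_demo` would be replaced by the prepared features and the Survived labels.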

Now our model has been trained. We can use this model to predict whether a person survived the tragedy or not. Now we will read the test.csv file and use the variables present as test features for prediction.

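test=pd.read_csv('test.csv')
test['Name']=test['Name'].str.partition(',')[2]
test['Name']=test['Name'].str.partition('.')[0]
cols=['Name','Sex','Ticket','Pclass','Fare','Embarked','Cabin']
for x in cols:
    a=pd.factorize(test[x])
    test[x]=a[0]
# Build the same engineered feature as in training
test['family_size']=test['SibSp'] + test['Parch'] + 1
# Use the same feature columns as in training (no PassengerId)
# Any remaining NaN values (e.g. in Age) must be filled before predicting
test_features=test.drop(columns=['PassengerId']).values
my_prediction = model.predict(test_features)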

Now we have our predictions in the ‘my_prediction’ variable and are ready to submit the solution to Kaggle. Kaggle specifies a submission format for each problem, so we write our solution to a CSV file in that format.

PassengerId =np.array(test["PassengerId"]).astype(int)
my_solution = pd.DataFrame(my_prediction, PassengerId, columns = ["Survived"])
my_solution.to_csv("my_solution.csv", index_label = ["PassengerId"])

Now we have our solution in a CSV file, ‘my_solution.csv’. We can submit this file on the Kaggle submission page and check our accuracy.

Posted on January 28, 2017 (updated July 11, 2017) by Abhilash Kumar

Use Python for Data Cleaning and Analyzing nicotine dependence

The objective of the project was to study nicotine dependence, an addiction to tobacco products caused by the drug nicotine. First, the association between smoking behaviour and nicotine dependence was explored. Social behaviour and depression history were then analyzed as explanatory variables. A study at Universiti Teknologi MARA Malaysia concluded that among smokers 5% have high nicotine dependence whereas 18% have moderate nicotine dependence. Another study by Durham University, UK found that life events and stressful experiences are associated with nicotine dependence.

The data was taken from the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC). The National Institute on Alcohol Abuse and Alcoholism has previously performed a landmark study on alcohol abuse and alcohol dependence using the NESARC dataset. The dataset is a representative sample of the United States adult population aged over 18. Initially the data consisted of over 43,000 data points.

The data was in Comma Separated Values (CSV) format. The variables used for the regression model were:

Variable Name	Description
CHECK321	the cigarette smoking status of the individual
S3AQ3B1	the usual frequency when the individual smoked cigarettes
S3AQ3C1	the usual quantity when the individual smoked
TAB12MDX	the dependent variable which represented if the respondent had a nicotine dependence in the last 12 months

The dataset contained many missing values. Python was used for data cleaning and analysis. First the dataset was loaded using the pandas library. After loading the data, the CHECK321 variable was checked for missing values. An unknown value of CHECK321 means that the individual's cigarette smoking status is unknown, so the analysis cannot be done for that individual. Hence the data points with missing values of CHECK321 were deleted, which left us with 18,000 data points. The target variable TAB12MDX takes only two values: ‘1’ denotes that the individual is nicotine dependent and ‘0’ that the individual is not.
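
The cleaning step can be sketched roughly as follows. The rows are made up, and the sketch assumes unknown smoking status is stored as NaN; the raw survey file may use a numeric code for unknown values instead, in which case that code would first have to be converted.

```python
import numpy as np
import pandas as pd

# Made-up rows using the column names from the table above; NaN marks an unknown status
survey = pd.DataFrame({
    "CHECK321": [1.0, np.nan, 2.0, 1.0, np.nan],
    "S3AQ3B1":  [1.0, 3.0, 6.0, 1.0, 2.0],
    "S3AQ3C1":  [20.0, 5.0, 1.0, 10.0, 3.0],
    "TAB12MDX": [1, 0, 0, 1, 0],
})

# Count respondents with no recorded smoking status
n_unknown = int(survey["CHECK321"].isna().sum())
print(n_unknown)      # 2

# Keep only respondents whose smoking status is known
cleaned = survey.dropna(subset=["CHECK321"]).reset_index(drop=True)
print(cleaned.shape)  # (3, 4)
```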

For the analysis we split the data into a training set and a testing set. The training data was used to train our model. Since the target variable is binary, we decided to use logistic regression. For splitting the data we used the Python library scikit-learn, also known as sklearn. The training/test split ratio was 80:20. The cleaned data was fed to the model and the model was trained. After training, we tested the model on the test data we had split off earlier. The model correctly predicted whether a person is nicotine dependent 58% of the time.
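
A minimal sketch of this workflow, assuming scikit-learn's `train_test_split` and `LogisticRegression` with default settings, looks like this. The data is synthetic and only stands in for the cleaned predictors and the binary target, so the printed accuracy says nothing about the survey results.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned predictors and the binary target (not NESARC data)
rng = np.random.default_rng(42)
features = rng.normal(size=(500, 3))
target = (features[:, 0] + rng.normal(scale=1.0, size=500) > 0).astype(int)

# 80:20 split between training and testing rows
x_train, x_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)

logit = LogisticRegression()
logit.fit(x_train, y_train)

# score() reports the fraction of test rows classified correctly
test_accuracy = logit.score(x_test, y_test)
print(f"test accuracy: {test_accuracy:.2f}")
```

Adding further explanatory variables amounts to adding columns to `features` before the split.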

Because the Durham University, UK study found that life events and stressful experiences are also associated with nicotine dependence, we decided to introduce two new variables, MAJORDEPLIFE and SOCPDLIFE, to make our model more accurate.

Variable Name	Description
MAJORDEPLIFE	If the individual suffered depression or not
SOCPDLIFE	If the individual has socio-phobia or not

We again split our data into training and testing sets. The model was trained on the training set, now with the two new variables. With the introduction of these two variables, associated with life events and stressful experiences, the accuracy of our model increased to 65%: 65% of the time we correctly predicted whether a person is nicotine dependent from their smoking behaviour and social life.
The analysis concludes that individuals who smoke more heavily are more prone to nicotine dependence than those who smoke less. Individuals who have a history of depression or are socio-phobic are also more prone to nicotine dependence than those without. Our results seem consistent with the previous findings.

Posted on October 12, 2016 by Ajay Nayak

3 Advantages of Learning D3JS

Learning D3 can take time, especially if you have no prior web development experience. Hence, D3 is probably not for people who just want to quickly expand their visualization skills. Now, the top 3 reasons for learning D3:

1. Lots of examples.

Here’s a secret about creating great data visualizations: take ideas from other examples you’ve liked! That’s often the most effective way to make you look like and become a master data visualizer.

And that’s the great news about D3: there are thousands – thousands! – of great D3 examples to work from.

Excellent curated lists of D3 examples are available online. Many of these examples are posted because developers want others to re-use their code. Just be sure to give credit where credit is due 🙂

2. Vibrant open-source community.

When I have a question about D3, I often Google the issue and quickly find a great StackOverflow answer or blog post that addresses it. These extensive (free!) resources are available because of the very large and vibrant open-source community behind D3.

3. Knowing D3 = Hirable skills.

Data science and analytics skills are top trends, and when it comes to analytics on the web, D3JS is preferred because of its powerful libraries and responsive capabilities.
