Using Python for Data Cleaning and Analysis of Nicotine Dependence

The objective of this project was to study nicotine dependence, an addiction to tobacco products caused by the drug nicotine. First, the association between smoking behaviour and nicotine dependence was explored. Social behaviour and depression history were then analyzed as additional explanatory variables. A study at Universiti Teknologi MARA, Malaysia, concluded that among smokers 5% have high nicotine dependence and 18% have moderate nicotine dependence. Another study, by Durham University, UK, found that life events and stressful experiences are associated with nicotine dependence.

The data was taken from the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC). The National Institute on Alcohol Abuse and Alcoholism has previously performed a landmark study on alcohol abuse and alcohol dependence using the NESARC dataset. The dataset is a representative sample of the United States adult population aged over 18. Initially the data consisted of over 43,000 data points.

The data was in comma-separated values (CSV) format. The variables used for the regression model were:

Variable Name	Description
CHECK321	The cigarette smoking status of the individual
S3AQ3B1	How frequently the individual usually smoked cigarettes
S3AQ3C1	The quantity the individual usually smoked
TAB12MDX	The dependent variable, indicating whether the respondent had nicotine dependence in the last 12 months
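
As an illustration of reading only these four columns with pandas, here is a minimal sketch. The inline CSV text, the IDNUM column, and the helper name load_variables are made up for the example; the actual file name and layout are not given in this write-up.

```python
import io

import pandas as pd

# Hypothetical stand-in for the survey CSV; the rows below are invented.
SAMPLE_CSV = """IDNUM,CHECK321,S3AQ3B1,S3AQ3C1,TAB12MDX
1,1,1,20,1
2,,,,0
3,1,3,5,0
4,9,,,0
"""

COLUMNS = ["CHECK321", "S3AQ3B1", "S3AQ3C1", "TAB12MDX"]


def load_variables(source, columns=COLUMNS):
    """Read only the listed columns and coerce them to numeric values."""
    df = pd.read_csv(source, usecols=columns, low_memory=False)
    # Blank or non-numeric cells become NaN so they can be counted later.
    return df.apply(pd.to_numeric, errors="coerce")


df = load_variables(io.StringIO(SAMPLE_CSV))
print(df.shape)
```

Passing a file path instead of the StringIO object would read from disk in the same way. Restricting usecols keeps memory use low when the source file has many more columns than the analysis needs.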

The dataset contained many missing values. Python was used for both data cleaning and analysis. First, the dataset was loaded using the pandas library. After loading the data, the CHECK321 variable was checked for missing values. An unknown value of CHECK321 means that the individual's cigarette smoking status is unknown, so the analysis cannot be done for that individual. Hence the data points with missing values of CHECK321 were deleted, which left 18,000 data points. The target variable TAB12MDX took only two values: ‘1’, denoting that the individual is nicotine dependent, and ‘0’, denoting that the individual is not nicotine dependent.
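
The cleaning step described above might look roughly like the following sketch. The small data frame is invented, and treating the code 9 as "unknown" is an assumption that should be checked against the survey codebook; the function name is also hypothetical.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for the loaded survey data.
df = pd.DataFrame({
    "CHECK321": [1, np.nan, 1, 9, 2],
    "S3AQ3B1": [1, np.nan, 3, np.nan, 6],
    "S3AQ3C1": [20, np.nan, 5, np.nan, 1],
    "TAB12MDX": [1, 0, 0, 0, 0],
})

UNKNOWN_CODE = 9  # assumed code for "unknown"; verify in the codebook


def drop_unknown_smoking_status(frame, unknown_code=UNKNOWN_CODE):
    """Remove rows whose CHECK321 value is blank or flagged as unknown."""
    keep = frame["CHECK321"].notna() & (frame["CHECK321"] != unknown_code)
    return frame.loc[keep].reset_index(drop=True)


# Count blank smoking-status values before dropping anything.
n_missing = int(df["CHECK321"].isna().sum())
cleaned = drop_unknown_smoking_status(df)
print(n_missing, len(cleaned))
```

Counting the missing values before deleting rows makes it easy to report how much data the cleaning step removed.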

For the analysis we split the data into a training set and a testing set. The training data was used to train our model. Since the target variable is binary, we decided to use logistic regression. To split the data we used the Python library scikit-learn, also known as sklearn, with a training-to-test ratio of 80:20. The cleaned training data was fed to the model and the model was trained. We then tested the model on the test data we had set aside earlier. The model correctly predicted whether a person is nicotine dependent 58% of the time.
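
A minimal sketch of the split, training, and evaluation steps with scikit-learn is shown below. The data is randomly generated, so the accuracy it prints says nothing about the NESARC results; the random seeds, max_iter value, and the rule used to build the toy target are assumptions made only so the example runs.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned data; values are random, not survey data.
rng = np.random.default_rng(0)
n = 500
data = pd.DataFrame({
    "CHECK321": rng.integers(1, 3, size=n),
    "S3AQ3B1": rng.integers(1, 7, size=n),
    "S3AQ3C1": rng.integers(1, 41, size=n),
})
# Made-up rule so that the toy target depends on the quantity smoked.
data["TAB12MDX"] = (data["S3AQ3C1"] + rng.normal(0, 10, size=n) > 20).astype(int)

features = ["CHECK321", "S3AQ3B1", "S3AQ3C1"]

# 80:20 split; random_state only makes the example repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    data[features], data["TAB12MDX"], test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)  # fit on the training portion only

# Accuracy is the share of test rows whose class was predicted correctly.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```

Evaluating on rows the model never saw during fitting is what makes the reported accuracy an estimate of how the model behaves on new respondents.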

Because the Durham University study found that life events and stressful experiences are also associated with nicotine dependence, we decided to introduce two new variables, MAJORDEPLIFE and SOCPDLIFE, to make our model more accurate.

Variable Name	Description
MAJORDEPLIFE	Whether or not the individual has suffered from depression
SOCPDLIFE	Whether or not the individual has social phobia

We again split our data into training and testing sets. The model was trained on the training set, now including the two new variables. With the introduction of these variables, which relate to life events and stressful experiences, the accuracy of our model increased to 65%. In other words, 65% of the time we correctly predicted whether a person is nicotine dependent from their smoking behaviour and social life.
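
The comparison between the two feature sets can be sketched as follows. The helper fit_and_score is hypothetical, and the data is synthetic: the toy target is deliberately built so that the two extra variables carry signal, so the printed accuracies illustrate the procedure only and are not the 58% and 65% reported here.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def fit_and_score(data, features, target="TAB12MDX", seed=42):
    """Train on 80% of the rows and return accuracy on the held-out 20%."""
    X_train, X_test, y_train, y_test = train_test_split(
        data[features], data[target], test_size=0.2, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


# Synthetic stand-in data; none of these values come from the survey.
rng = np.random.default_rng(1)
n = 1000
data = pd.DataFrame({
    "CHECK321": rng.integers(1, 3, size=n),
    "S3AQ3B1": rng.integers(1, 7, size=n),
    "S3AQ3C1": rng.integers(1, 41, size=n),
    "MAJORDEPLIFE": rng.integers(0, 2, size=n),
    "SOCPDLIFE": rng.integers(0, 2, size=n),
})
# Made-up target that depends on smoking quantity and both new variables.
score = data["S3AQ3C1"] / 10 + 3 * data["MAJORDEPLIFE"] + 3 * data["SOCPDLIFE"]
data["TAB12MDX"] = (score + rng.normal(0, 0.5, size=n) > 5).astype(int)

smoking = ["CHECK321", "S3AQ3B1", "S3AQ3C1"]
extended = smoking + ["MAJORDEPLIFE", "SOCPDLIFE"]

# Same seed for both calls, so both models are judged on the same test rows.
baseline_acc = fit_and_score(data, smoking)
extended_acc = fit_and_score(data, extended)
print(f"smoking only: {baseline_acc:.2f}, with new variables: {extended_acc:.2f}")
```

Keeping the split seed fixed across both runs means any change in accuracy comes from the added variables and not from a different choice of test rows.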

Previous research concludes that individuals who smoke more heavily are more prone to nicotine dependence than those who do not. Individuals who have a history of depression or social phobia are also more prone to nicotine dependence than those without. Our results seem consistent with these findings.