Farzana Patel

Data Scientist| Psychologist| Lead Data Consultant

Experience

OFFICE FOR NATIONAL STATISTICS, UK - Data Scientist (Senior Executive Officer)
OMDENA, REMOTE- Junior Machine Learning Engineer
MARUMA CONSULTANCY, INDIA- Lead Consultant (Data and Analysis)
ACCENTURE, INDIA- Application Development Associate

Education

MSC (Psychological Reaserach Methods with Data Science), The University of Sheffield, UK.
Post-Grad Diploma (DataScience specialization in Deep Learning), IIIT- Bangalore, India.
MA (Psychology), IGNOU, India.
BE (Computer Engineering), MITCOE, India.

Languages and Technological skills

Project 1: World Happiness Report 2021

This project integrates various visulizations pertaining happiness scores across the globe along with other important parameters.

R libraries used: Plotly, Renv, Dplyr, Tidyverse, Ggalt
Input: Country, Year, Happiness Score, Regions
Output: Happines trend during Covid.
Key Insight: There was significant dip in happiness levels across the globe when Covid struck and it didn’t go back to original levels in 2021.

Project 2: Understanding the space of facebook political adverts using topic modelling.

The aim of this project was to understand the space of Facebook political adverts (i.e. by topic), as well as differences by timing, party, type of source (e.g. local political actors, campaigners, national parties) using automated content analysis, will provide meaningful insights of the inner workings of the political ad domain to the community.

In the project, I implemented topic modelling on political advert texts to detect latent topics and temporal trends with the dataset. By conducting a series of experiments, I finally landed on the optimal topic model I used to answer the research questions.

Top and bottom 10 happiest countries

Python libraries used:
numpy==1.20.1
pandas==1.2.4
matplotlib==3.3.4
seaborn==0.11.1
nltk==3.6.1
gensim==3.8.3
spacy==3.1.1
pyLDAvis==2.1.2

Project 3: Credit Card Fraud Detection

This project predicts fraudulent credit card transactions using machine learning models.

Python libraries used: LogisticRegression, roc_auc_score, RandomForestClassifier,KMeans
Input: Vectors, Amount, Class
Output: Recall scores were used to identify best model based on their performance.
Key Insights: Random forest is the best performing model and KNN performed the worst. So, I recommend using random forest to identify fraud transactions with current class imbalance dataset

AUC scores of models Accuracy scores of models Recall scores of models

Project 4: Advanced Statistics- Investigation of knowledge and skill development in a lifetime

In this project, I investigated how people develop skills and knowledge throughout their lifetime. In particular, I investigated how language exposure impacts later linguistic skills, cognitive abilities, and academic achievement. The goal was to make several models, which quantify and test postulated theoretical assumptions. Previous studies in language acquisition showed that language skills depend on the richness of the environment as well as number of other factors. In the case of this exam, I focused on the question of how people learn language and whether this influences other outcomes, such as university enrollment.

R libraries used: tidySEM, lavaan, statmod, semPlot, rcompanion, lme4, lmerTest, sjPlot, glmmTMB
Input: item_id, WordType, MotherVocab, FatherVocab, ReadingTime, ToddlerVocab, RT
Models used: Linear Model, Structural Equation Model (SEM), Generalized Linear Model, Confirmatory Factor Model, Mixed Effect Model
Key Insights:
- Toddler with mother with larger vocabulary will know 0.32 words more than the toddler with mother with smaller vocabulary, by every 1 unit increase in mother’s vocabulary and keeping all other variables constant. Slope of ReadingTime increases by 0.175596 for toddler from high SES background. That is toddlers from high SES background have high positive relationship between ReadingTime and ToddlerVocab as compared to low SES background toddlers.
- Prediction of university enrollment: SES was significant (β=2.129449, p= 2.58e-14). Hence, null of no effect is rejected. So, it can be stated that, SES measure is predictive of enrollment.ReadingTime was significant (β= 0.027840, p= 4.80e-05). Hence, null of no effect is rejected. So, it can be stated that, ReadingTime measure is predictive of enrollment.ToddlerVocab was significant (β= 0.025196, p= 0.0029). Hence, null of no effect is rejected. So, it can be stated that, ToddlerVocab measure is predictive of enrollment.It can be stated that high SES individuals are 8.41 times more likely to enroll at a university.

SEM Model output:
Structural Equation Model

Generalized Model prediction:
GLM prediction

Full GLM prediction:

Project 5: Car Price Prediction

An automobile consulting company wants to understand the factors on which the pricing of cars depends.The model prediction will be used by management to understand the pricing dynamics of a new market.

Python libraries used: LinearRegression, RFE, sm
Input: Car price, car types, car specifications
Output: Linear model that explains the most variance with significant predictors.
Key Insights: Linear model fits the best here with 90% variance explained by r-squared(0.899) and adjusted r-squared (0.896).P-values of the model is less than 0.05 which validates that predictors are statistically significant

Scatterplot of all numeric datapoints:

Residual plot:
residuals

Linear Model prediction:

Let's connect and chat! Open to anything under the sun.