How to handle a typical data science issue

Yolanda Johansson
1 min read · Jun 20, 2021

When working on a typical data science problem, such as binary classification, whether in an interview or in daily work, the general process is: exploratory data analysis, preprocessing, model building and training, model evaluation, and inference. Common Python libraries include pandas, scikit-learn, matplotlib, and seaborn.

  1. exploratory data analysis: examine distributions, histograms, correlations, and variable importance through plotting and visualization
  2. preprocessing: split the data into training and test sets, standardize features (a common requirement for many machine learning estimators implemented in scikit-learn), impute missing values, and encode categorical features
  3. model building and training: fit various models such as logistic regression, random forests, gradient-boosted trees, etc. on the training data; k-fold cross-validation and grid search can be used to tune hyperparameters and reduce overfitting (e.g. sklearn.model_selection.GridSearchCV)
  4. model evaluation: AUC (sklearn.metrics.roc_auc_score) is commonly used to evaluate classification models; its values range from 0 to 1, with 0.5 corresponding to random guessing and 1 to a perfect classifier. If the dataset is highly imbalanced, the F1-score is often preferred. See https://en.wikipedia.org/wiki/Receiver_operating_characteristic. Compute the chosen metric on the test data to determine which model performs best.
  5. inference: apply the model to new, real-world data. In a credit-scoring use case, for example, the predicted probabilities can be scaled (multiplied by 1000) to yield a credit score. Two options (joblib, pickle) can be used to save and load trained models.
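The steps above can be sketched end to end in scikit-learn. This is a minimal illustration, not a complete workflow: it uses the library's built-in breast cancer dataset as a stand-in for real data (an assumption; in practice you would load your own DataFrame with pandas), and a logistic regression as the example model.

```python
# Minimal end-to-end sketch: split, standardize, grid-search a model,
# evaluate with AUC, and persist the result with joblib.
import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# stand-in for real data (binary classification)
X, y = load_breast_cancer(return_X_y=True)

# 2. preprocessing: hold out a test set; scaling happens inside the pipeline
#    so it is fit only on the training folds (avoids data leakage)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# 3. model building and training: 5-fold cross-validated grid search
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
grid = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]},
                    cv=5, scoring="roc_auc")
grid.fit(X_train, y_train)

# 4. model evaluation: AUC on the held-out test set
auc = roc_auc_score(y_test, grid.predict_proba(X_test)[:, 1])
print(f"test AUC: {auc:.3f}")

# 5. inference: save the fitted model, reload it, and score new data
joblib.dump(grid.best_estimator_, "model.joblib")
model = joblib.load("model.joblib")
scores = model.predict_proba(X_test)[:, 1]
```

Wrapping the scaler and estimator in a `Pipeline` keeps preprocessing and the model as one object, so the saved file can be loaded and applied to raw feature rows directly.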
