Notes on the Kaggle Titanic Stacking Model

2017-08-17

While reading through Kaggle kernels for the Titanic challenge, I noticed that many of them use SVM, RandomForest, LogisticRegression, and similar models. What makes this particular kernel interesting is that it builds one model from six different learners:
Introduction to Ensembling/Stacking in Python Using data from Titanic: Machine Learning from Disaster
At Level 1 it uses RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier, and SVM; at Level 2 it uses XGBoost, trained on the predictions of the Level 1 learners. I sketched the overall flow of the model to make it easier to understand, because the raw source is hard to parse quickly. The author uses classes to keep the notebook code clean, which also makes it easier to modify and organize later.
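The two-level flow can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the kernel's actual code: it uses synthetic data from make_classification instead of the Titanic features, default or placeholder hyperparameters, scikit-learn's cross_val_predict for the out-of-fold predictions, and a GradientBoostingClassifier as a stand-in for XGBoost at Level 2 so that only scikit-learn is required.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the Titanic features (assumption for illustration).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Level 1: the five base learners; hyperparameters are placeholders.
level1 = {
    "rf": RandomForestClassifier(n_estimators=50, random_state=0),
    "et": ExtraTreesClassifier(n_estimators=50, random_state=0),
    "ada": AdaBoostClassifier(n_estimators=50, random_state=0),
    "gb": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "svc": SVC(),
}

# Out-of-fold predictions: each training row is predicted by a model
# that never saw that row, so Level 2 does not learn from leaked labels.
train_meta = np.column_stack(
    [cross_val_predict(model, X_train, y_train, cv=5) for model in level1.values()]
)

# For the test set, refit each learner on all training rows and predict.
test_meta = np.column_stack(
    [model.fit(X_train, y_train).predict(X_test) for model in level1.values()]
)

# Level 2: stand-in for XGBoost, trained on the Level 1 predictions.
level2 = GradientBoostingClassifier(n_estimators=100, random_state=0)
level2.fit(train_meta, y_train)
accuracy = level2.score(test_meta, y_test)
print(f"stacked accuracy on held-out data: {accuracy:.3f}")
```

Each column of train_meta and test_meta holds one base learner's predictions, so the Level 2 model sees five features per passenger. Refitting on the full training set for the test predictions is a simplification chosen here; other stacking recipes average the per-fold test predictions instead.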

Pandas cut and qcut Functions

2017-08-05

When we have continuous numerical values, we can discretize them using cut and qcut. The cut function bins values by numeric intervals, while qcut bins them by quantiles. In other words, when given only a number of bins, cut produces bins of equal width, while qcut produces bins that each hold roughly the same number of values.

The cut function

Suppose we have the ages of a group of people:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32, 101]
If we want to discretize this list into “18 to 25”, “25 to 35”, “35 to 60”, and “60 and above”, we can use the cut function:
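A minimal sketch of this binning, with a few assumptions made for illustration: the open upper edge (np.inf) for "60 and above", the label strings, and the variable names are not from the original notes. The second half contrasts equal-width cut with qcut on the same ages.

```python
import numpy as np
import pandas as pd

ages = pd.Series([20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32, 101])

# Explicit bin edges. By default intervals are closed on the right,
# so the bins are (18, 25], (25, 35], (35, 60], (60, inf].
edges = [18, 25, 35, 60, np.inf]
labels = ["18 to 25", "25 to 35", "35 to 60", "60 and above"]
by_edges = pd.cut(ages, bins=edges, labels=labels)
edge_counts = by_edges.value_counts().sort_index().tolist()
print(edge_counts)  # [5, 3, 3, 2]

# Passing an integer instead asks cut for that many equal-width bins.
equal_width = pd.cut(ages, 4)
width_counts = equal_width.value_counts().sort_index().tolist()
print(width_counts)  # [9, 2, 1, 1]

# qcut with an integer splits at quantiles, so the counts are balanced.
by_quantile = pd.qcut(ages, 4)
quantile_counts = by_quantile.value_counts().sort_index().tolist()
print(quantile_counts)  # [4, 3, 3, 3]
```

With the default right=True, an age of exactly 25 falls into "18 to 25"; passing right=False would make the intervals closed on the left instead. The single outlier at 101 stretches the equal-width bins, which is why most ages land in the first one, while qcut keeps the groups nearly even.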