Pandas cut and qcut Functions

When we have continuous numerical values, we can discretize them using cut and qcut. The cut function bins values by numeric intervals, while qcut bins them by sample quantiles. In other words, when given a number of bins, cut produces bins of equal width, while qcut produces bins that each hold roughly the same number of values.
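The difference is easiest to see on skewed data. The following is a minimal sketch, assuming a small made-up list named values with one large outlier; it counts how many values land in each of four bins under both functions.

```python
import pandas as pd

# Hypothetical sample: eleven small values and one large outlier
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 100]

# cut with an integer: four bins of equal width spanning min..max
by_width = pd.cut(values, 4)

# qcut with an integer: four bins with edges at the sample quartiles
by_count = pd.qcut(values, 4)

# Count the values per bin, keeping the bins in their natural order
width_counts = pd.Series(by_width).value_counts(sort=False).tolist()
count_counts = pd.Series(by_count).value_counts(sort=False).tolist()

print(width_counts)  # almost everything falls into the first equal-width bin
print(count_counts)  # each quantile bin holds the same number of values
```

With equal-width bins the outlier stretches the range, so most values crowd into one bin; the quantile bins adapt their edges to the data instead.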

Suppose we have the ages of a group of people:
import numpy as np
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32, 101]
If we want to discretize this list into “18 to 25”, “25 to 35”, “35 to 60”, and “60 to 100”, we can use the cut function:

bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats
output:
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (60, 100], (35, 60], (35, 60], (25, 35], NaN]
Length: 13
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

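The first list shows which bin each age falls into. Each interval is open on the left and closed on the right, so 25 belongs to (18, 25]. Values outside the bins, such as 101, become NaN. cats is a Categorical with two useful attributes: categories holds the bins themselves, and codes holds the integer bin index of each value, with -1 marking NaN (older pandas versions called this attribute labels):

cats.codes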
output:
array([ 0,  0,  0,  1,  0,  0,  2,  1,  3,  2,  2,  1, -1], dtype=int8)
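As a small sketch of how these attributes can be used together, the snippet below rebuilds cats from the same ages and bins and then uses the -1 code to find the ages that fell outside every bin. The variable name outside is only illustrative.

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32, 101]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)

print(cats.categories)  # the four intervals, as an IntervalIndex
print(cats.codes)       # integer bin index per age; -1 marks NaN

# Ages that did not fall into any bin (code -1)
outside = [age for age, code in zip(ages, cats.codes) if code == -1]
print(outside)
```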

We can also assign labels to each range, for example:

group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels=group_names)
output:
[Youth, Youth, Youth, YoungAdult, Youth, ..., Senior, MiddleAged, MiddleAged, YoungAdult, NaN]
Length: 13
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]
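The age 101 is still NaN here because the last bin stops at 100. One possible way to get a true "60 and above" group, sketched below, is to use infinity as the last edge. This is a variation on the example rather than part of the original code.

```python
import numpy as np
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32, 101]
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']

# Assumption for this sketch: replace the upper edge 100 with infinity
open_bins = [18, 25, 35, 60, np.inf]

labeled = pd.cut(ages, open_bins, labels=group_names)
print(labeled)

# Number of ages that still fall outside every bin
n_missing = int(pd.isna(labeled).sum())
print(n_missing)
```

Note that ages of 18 or below would still become NaN, because the first interval is open on the left.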
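
The qcut function is used the same way, but it chooses the bin edges from the sample quantiles:

data = np.random.randn(1000)  # 1000 draws from a standard normal distribution
cats = pd.qcut(data, 4)  # Bin by quartiles; equivalent to passing [0, .25, .5, .75, 1.]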
cats
output:
[(0.624, 3.928], (-0.691, -0.0144], (-0.691, -0.0144], (-0.0144, 0.624], (0.624, 3.928], ..., (-0.0144, 0.624], (-0.0144, 0.624], [-2.949, -0.691], (-0.0144, 0.624], (0.624, 3.928]]
Length: 1000
Categories (4, object): [[-2.949, -0.691] < (-0.691, -0.0144] < (-0.0144, 0.624] < (0.624, 3.928]]
pd.Series(cats).value_counts()  # Count the number of values in each bin
output:
(0.624, 3.928]       250 
(-0.0144, 0.624]     250 
(-0.691, -0.0144]    250 
[-2.949, -0.691]     250 
dtype: int64

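You’ll notice that every bin holds the same number of values, because qcut places the bin edges at the sample quartiles. If you don’t want quartiles, you can pass your own list of quantiles: any increasing sequence of values between 0 and 1 will work, for example [0, 0.1, 0.5, 0.9, 1.].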
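As a rough illustration of custom quantiles, the sketch below bins 1000 normally distributed values at the 10th, 50th, and 90th percentiles. The seeded generator is an assumption added here so that the run is repeatable.

```python
import numpy as np
import pandas as pd

# Seeded generator (an assumption of this sketch) for repeatable data
rng = np.random.default_rng(0)
data = rng.standard_normal(1000)

# Custom quantiles: bottom 10%, next 40%, next 40%, top 10%
quantiles = [0, 0.1, 0.5, 0.9, 1.0]
cats = pd.qcut(data, quantiles)

# Count the values per bin, keeping the bins in their natural order
counts = pd.Series(cats).value_counts(sort=False).tolist()
print(counts)
```

The bins are no longer equally populated; their sizes follow the gaps between consecutive quantiles, so the two outer bins each hold about a tenth of the data.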