Natural Breaks in Data

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque ante quam, ultrices nec vulputate et, egestas et quam. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quisque vestibulum, nunc in finibus condimentum.

This post tries to give a clear picture of what this obscure handy tool is all about.

The Jenks optimization method, also called the Jenks natural breaks classification method, is one of the data clustering methods designed to determine the best arrangement of values into different classes. But before going any further, let’s look at what “Natural Breaks” mean.

Natural Breaks: “Natural breaks” are the best way to split up ranges. Best ranges imply the ranges where like areas are grouped together. This method minimizes the variation within each range, so the areas within each range are as close as possible in value to each other.

Intuition: The Jenks natural breaks algorithm, just like K-means, assigns data to one of K groups such that the within-group distances are minimized. Also just like K-means, one must select K prior to running the algorithm.

Why is it not a good idea to do it manually: It’s usually impractical as there will be an overwhelming number of different ways to set ranges and inaccurate as it destroys the objective display of data. Of the few patterns the user can test, the “prettiest” pattern will almost certainly be selected, but that has nothing to do with the correct display of the data.

Algorithm under the hood:

Let’s look at an example to understand how the algorithm works. Say our list of values is [4, 5, 9, 10] and we need to find the best ranges from that.

Step 1: Calculate the “sum of squared deviations for array mean” (SDAM).

list = [4, 5, 9, 10]
mean = 7  #(4 + 5 + 9 + 10) / 4
SDAM = (4-7)^2 + (5-7)^2 + (9-7)^2 + (10-7)^2 = 9 + 4 + 4 + 9 = 26

Step 2: For each range combination, calculate “sum of squared deviations for class means” (SDCM_ALL), and find the smallest one. SDCM_ALL is similar to SDAM but uses class means and deviations.

"""
For [4][5,9,10]
SDCM_ALL = (4-4)^2+(5-8)^2+(9-8)^2+(10-8)^2 = 0 + 9 + 1 + 4 = 14For [4,5][9,10]
SDCM_ALL = (4-4.5)^2+(5-4.5)^2+(9-9.5)^2+(10-9.5)^2 = 0.25 + 0.25 + 0.25 + 0.25 = 1.For [4,5,9][10]
SDCM_ALL = (4-6)^2+(5-6)^2+(9-6)^2+(10-10)^2 = 4 + 1 + 9 + 0 = 14.
"""

Observe that the middle one is having the lowest SDCM implying minimum variance.

Step 3: As a final summary measure, calculate a “goodness of variance fit” (GVF), defined as (SDAM — SCDM) / SDAM. GVF ranges from 1 (perfect fit) to 0 (awful fit).

"""
GVF for [4,5][9,10] is (26 - 1) / 26 = 25 / 26 = 0.96 
GVF for the other 2 ranges is (26 - 14) / 26 = 12 / 26 = 0.46
"""

GVF for [4,5][9,10] is the highest indicating that this combination is the best ranges for the list[4, 5, 9, 10] which makes sense intuitively.

Things to know: It’s a data-intensive algorithm. Take an example of splitting 254 items into 6 ranges. there are 8,301,429,675 possible range combinations. It might take the computer a little while to test so many combinations. So it’s usually a better practice to start with a low number of ranges, and only increase the number to a large number if needed.

Usage in code:

pip install jenks

Below is a function to calculate the Goodness of Variance Fit given an array of values to classify and the number of classes selected:

from jenks import jenks
import numpy as np
def goodness_of_variance_fit(array, classes):
    # get the break points
    classes = jenks(array, classes)

    # do the actual classification
    classified = np.array([classify(i, classes) for i in array])

    # max value of zones
    maxz = max(classified)

    # nested list of zone indices
    zone_indices = [[idx for idx, val in enumerate(classified) if zone + 1 == val] for zone in range(maxz)]

    # sum of squared deviations from array mean
    sdam = np.sum((array - array.mean()) ** 2)

    # sorted polygon stats
    array_sort = [np.array([array[index] for index in zone]) for zone in zone_indices]

    # sum of squared deviations of class means
    sdcm = sum([np.sum((classified - classified.mean()) ** 2) for classified in array_sort])

    # goodness of variance fit
    gvf = (sdam - sdcm) / sdam

    return gvf

def classify(value, breaks):
    for i in range(1, len(breaks)):
        if value < breaks[i]:
            return i
    return len(breaks) - 1

For example, consider you decide the GVF should be at least .8, then you could increment the number of classes until the GVF is satisfied:

gvf = 0.0
nclasses = 2
while gvf < .8:
    gvf = goodness_of_variance_fit(array, nclasses)
    nclasses += 1

Try not to become a man of success, but rather try to become a man of value.

Far far away, behind the word mountains, far from the countries Vokalia and Consonantia, there live the blind texts. Separated they live in Bookmarksgrove right at the coast of the Semantics, a large language ocean.

Even the all-powerful Pointing has no control about the blind texts it is an almost unorthographic life One day however a small line of blind text by the name of Lorem Ipsum decided to leave for the far World of Grammar.

Natural Breaks in Data

Best Features Provided

Usage with Code

Scorching Hot Design

Algorithm under the hood:

Try not to become a man of success, but rather try to become a man of value.

Try not to become a man of success, but rather try to become a man of value.

The Castle