Natural Breaks in Data

This post aims to give a clear picture of what this obscure but handy tool is all about.

The Jenks optimization method, also called the Jenks natural breaks classification method, is a data clustering method designed to determine the best arrangement of values into different classes. But before going any further, let’s look at what “natural breaks” means.

Natural Breaks: “Natural breaks” are the split points that divide a set of values into ranges so that similar values end up grouped together. The method minimizes the variation within each range, so the items within each range are as close as possible in value to each other.

Intuition: The Jenks natural breaks algorithm, just like K-means, assigns data to one of K groups such that the within-group distances are minimized. Also just like K-means, one must select K prior to running the algorithm.

Why it is not a good idea to do it manually: It is usually impractical, because there is an overwhelming number of different ways to set the ranges, and inaccurate, because it undermines the objective display of the data. Of the few patterns the user can test, the “prettiest” one will almost certainly be selected, but that has nothing to do with the correct display of the data.

Algorithm under the hood:

Let’s look at an example to understand how the algorithm works. Say our list of values is [4, 5, 9, 10] and we need to split it into the two best ranges.

Step 1: Calculate the “sum of squared deviations from the array mean” (SDAM).

list = [4, 5, 9, 10]
mean = 7  #(4 + 5 + 9 + 10) / 4
SDAM = (4-7)^2 + (5-7)^2 + (9-7)^2 + (10-7)^2 = 9 + 4 + 4 + 9 = 26
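The same arithmetic can be checked with a few lines of Python. This is only a minimal sketch: the function name `sdam` is chosen here for illustration and does not come from any particular library.

```python
def sdam(values):
    """Sum of squared deviations from the array mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


# Mean is 7, so the squared deviations are 9 + 4 + 4 + 9.
print(sdam([4, 5, 9, 10]))  # 26.0
```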

Step 2: For each range combination, calculate the “sum of squared deviations from the class means” (SDCM_ALL), and find the smallest one. SDCM_ALL is computed like SDAM, except that each value’s deviation is measured from the mean of its own class rather than from the mean of the whole list.

"""
For [4][5,9,10]
SDCM_ALL = (4-4)^2+(5-8)^2+(9-8)^2+(10-8)^2 = 0 + 9 + 1 + 4 = 14
For [4,5][9,10]
SDCM_ALL = (4-4.5)^2+(5-4.5)^2+(9-9.5)^2+(10-9.5)^2 = 0.25 + 0.25 + 0.25 + 0.25 = 1
For [4,5,9][10]
SDCM_ALL = (4-6)^2+(5-6)^2+(9-6)^2+(10-10)^2 = 4 + 1 + 9 + 0 = 14
"""

Observe that the middle split has the lowest SDCM_ALL, which means its classes have the least internal variance.
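The exhaustive search over range combinations can be sketched as follows. This is a brute-force illustration of the idea, assuming the values are sorted and each class is a contiguous run; the helper names `sdcm` and `contiguous_splits` are made up for this example.

```python
from itertools import combinations


def sdcm(classes):
    """Sum of squared deviations of each value from its own class mean."""
    total = 0.0
    for cls in classes:
        mean = sum(cls) / len(cls)
        total += sum((v - mean) ** 2 for v in cls)
    return total


def contiguous_splits(sorted_values, k):
    """Yield every way to cut a sorted list into k non-empty contiguous classes."""
    n = len(sorted_values)
    # Choose k - 1 cut positions among the n - 1 gaps between neighbours.
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0, *cuts, n)
        yield [sorted_values[a:b] for a, b in zip(bounds, bounds[1:])]


values = sorted([4, 5, 9, 10])
for classes in contiguous_splits(values, 2):
    print(classes, sdcm(classes))

best = min(contiguous_splits(values, 2), key=sdcm)
print("best:", best)  # best: [[4, 5], [9, 10]]
```

Trying every split like this is fine for four values, but the number of candidates grows very quickly with the size of the list and the number of classes.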

Step 3: As a final summary measure, calculate the “goodness of variance fit” (GVF), defined as (SDAM - SDCM) / SDAM. GVF ranges from 1 (perfect fit) to 0 (awful fit).

"""
GVF for [4,5][9,10] is (26 - 1) / 26 = 25 / 26 = 0.96 
GVF for the other 2 ranges is (26 - 14) / 26 = 12 / 26 = 0.46
"""

GVF for [4,5][9,10] is the highest, indicating that this combination gives the best ranges for the list [4, 5, 9, 10], which makes sense intuitively.
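For readers who want to reproduce these numbers, here is a small sketch of the GVF formula. The function names `sum_sq_dev` and `gvf` are illustrative only, and the sketch assumes the values are not all identical (otherwise SDAM is 0 and the division fails).

```python
def sum_sq_dev(values):
    """Sum of squared deviations of the values from their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def gvf(values, classes):
    """Goodness of variance fit: (SDAM - SDCM) / SDAM."""
    sdam = sum_sq_dev(values)
    sdcm = sum(sum_sq_dev(cls) for cls in classes)
    return (sdam - sdcm) / sdam


values = [4, 5, 9, 10]
print(round(gvf(values, [[4, 5], [9, 10]]), 2))  # 0.96
print(round(gvf(values, [[4], [5, 9, 10]]), 2))  # 0.46
print(round(gvf(values, [[4, 5, 9], [10]]), 2))  # 0.46
```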

Things to know: It’s a computation-intensive algorithm. Take the example of splitting 254 items into 6 ranges: there are 8,301,429,675 possible range combinations. It might take the computer a little while to test so many combinations, so it is usually better practice to start with a low number of ranges and increase it only if needed.
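The figure above can be reproduced by counting the ways to place 5 breaks in the 253 gaps between 254 sorted items. The sketch below assumes every range must be non-empty and contiguous; `count_splits` is a name made up for this example, and `math.comb` requires Python 3.8 or later.

```python
import math


def count_splits(n_items, n_classes):
    """Number of ways to cut n sorted items into n_classes non-empty contiguous ranges."""
    return math.comb(n_items - 1, n_classes - 1)


print(count_splits(4, 2))    # 3
print(count_splits(254, 6))  # 8301429675
```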
