This post tries to give a clear picture of what this obscure handy tool is all about.

The **Jenks optimization method**, also called the **Jenks natural breaks classification method**, is one of the data clustering methods designed to determine the best arrangement of values into different classes. But before going any further, let’s look at what “Natural Breaks” mean.

** Natural Breaks: “**Natural breaks” are the best way to split up ranges. Best ranges imply the ranges where like areas are grouped together. This method minimizes the variation within each range, so the areas within each range are as close as possible in value to each other.

** Intuition**: The Jenks natural breaks algorithm, just like K-means, assigns data to one of K groups such that the within-group distances are minimized. Also just like K-means, one must select K prior to running the algorithm.

** Why is it not a good idea to do it manually: **It’s usually

**impractical**as there will be an overwhelming number of different ways to set ranges and

**inaccurate**as it destroys the objective display of data

**.**Of the few patterns the user can test, the “prettiest” pattern will almost certainly be selected, but that has nothing to do with the correct display of the data.

*Algorithm under the hood:*

*Algorithm under the hood:*

Let’s look at an example to understand how the algorithm works. Say our list of values is [4, 5, 9, 10] and we need to find the best ranges from that.

*Step 1:** Calculate the “sum of squared deviations for array mean” (SDAM).*

```
list = [4, 5, 9, 10]
mean = 7 #(4 + 5 + 9 + 10) / 4
SDAM = (4-7)^2 + (5-7)^2 + (9-7)^2 + (10-7)^2 = 9 + 4 + 4 + 9 = 26
```

*Step 2:** For each range combination, calculate “sum of squared deviations for class means” (SDCM_ALL), and find the smallest one. SDCM_ALL is similar to SDAM but uses class means and deviations.*

""" For [4][5,9,10] SDCM_ALL = (4-4)^2+(5-8)^2+(9-8)^2+(10-8)^2 = 0 + 9 + 1 + 4 = 14For [4,5][9,10] SDCM_ALL = (4-4.5)^2+(5-4.5)^2+(9-9.5)^2+(10-9.5)^2 = 0.25 + 0.25 + 0.25 + 0.25 = 1.For [4,5,9][10] SDCM_ALL = (4-6)^2+(5-6)^2+(9-6)^2+(10-10)^2 = 4 + 1 + 9 + 0 = 14. """

Observe that the middle one is having the lowest *SDCM implying minimum variance.*

*Step 3:** As a final summary measure, calculate a “goodness of variance fit” (GVF), defined as (SDAM — SCDM) / SDAM. GVF ranges from 1 (perfect fit) to 0 (awful fit).*

```
"""
GVF for [4,5][9,10] is (26 - 1) / 26 = 25 / 26 = 0.96
GVF for the other 2 ranges is (26 - 14) / 26 = 12 / 26 = 0.46
"""
```

GVF for [4,5][9,10] is the highest indicating that this combination is the best ranges for the list[4, 5, 9, 10] which makes sense intuitively.

** Things to know**: It’s a data-intensive algorithm. Take an example of splitting 254 items into 6 ranges. there are 8,301,429,675 possible range combinations. It might take the computer a little while to test so many combinations. So it’s usually a better practice to start with a low number of ranges, and only increase the number to a large number if needed.