Why Kaggle is Great

by | Nov 20, 2019 | Data-Stories, StoriesFeat

What is Kaggle?

Kaggle is a data science learning site that uses ‘competitions’ as a vehicle for ‘reinforcement learning’ by people rather than machines. Although reinforcement learning is a valid and important machine learning construct in its own right, here the ‘gaming’ element, with its rewards, serves to reinforce your own learning path.

Kaggle has been around since 2010. Since Google acquired it in 2017, giving itself access to young, up-and-coming problem solvers in data science, Kaggle has not slowed down. Although I am not in the ‘young and up-and-coming’ category, I am completely self-taught and a budding amateur data scientist, due in large part to my hanging around Kaggle.

Kaggle follows the open source ethos very well. Datasets and completed competition submissions are shared publicly. Users submit predictions and findings to competitions and provide their ‘proof’ in the form of ‘kernels’, which are essentially Jupyter Notebooks hosted on Kaggle. There is a leaderboard, with gaming-style ranks assigned to winners, runners-up and other participants. Winning multiple competitions will put you in the top percentile categories of the community.
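To make ‘submission’ concrete, here is a minimal sketch of what a competition typically scores: a CSV of predictions. It assumes a Titanic-style getting-started competition that expects PassengerId and Survived columns; the file names and the placeholder predictions are my own illustration, not anything from the post.

import pandas as pd

test = pd.read_csv("test.csv")            # competition test set (file name is an assumption)
predictions = [0] * len(test)             # placeholder predictions; swap in a real model's output

submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],   # ID column the competition expects
    "Survived": predictions,              # predicted target
})
submission.to_csv("submission.csv", index=False)   # this file is what you upload, or submit from a kernel

The leaderboard score is computed from that file against a hidden answer key, which is what keeps the ‘game’ honest.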

Some of Kaggle’s competitions get very serious, carry large cash prizes and actually help society solve big problems with machine learning. Joining the community and tackling some of the basic challenges will build some data science ‘muscle memory’ and get you going, as it did for me.

Datasets

A lot of competitions are inspired by the acquisition of a dataset that lends itself to machine learning applications. Kaggle often preprocesses datasets submitted for competitions, but you may find that many datasets need cleaning, aggregation, scaling and transformation before they can be fed to a model. When companies submit datasets, they too are often looking for a solution that converts unstructured data into structured data. Check out some of the datasets hanging around Kaggle.

According to the survey (more on it below), respondents say that they spend 15% of their time cleaning data and 11% gathering it.
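Here is a minimal sketch of the kind of hygiene, aggregation, transformation and scaling a raw dataset often needs before modeling. The file and column names are hypothetical, my own illustration rather than anything from a specific Kaggle dataset.

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("raw_dataset.csv")   # hypothetical raw dataset

# Hygiene: drop exact duplicates and fill missing numeric values
df = df.drop_duplicates()
df["age"] = df["age"].fillna(df["age"].median())

# Transformation: turn a text category into model-ready dummy columns
df = pd.get_dummies(df, columns=["city"], drop_first=True)

# Aggregation: total spend per customer, joined back onto each row
total_spend = df.groupby("customer_id")["amount"].sum().rename("total_spend")
df = df.join(total_spend, on="customer_id")

# Scaling: put numeric features on a comparable scale before modeling
numeric_cols = ["age", "amount", "total_spend"]
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])

None of this is glamorous, but as the survey numbers above suggest, it is where a large share of the actual work goes.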

The Kaggle Community

One of the things that makes Kaggle ‘great’ is its user community. The insights from the most recent user survey, for 2018, are gleaned from 23,859 responses. The results are quite informative when you see who is using which tools, what their focus is, and the age, gender, profession and other attributes of Kaggle users.

The charts made by Paul Mooney, a Developer Advocate at Kaggle, illustrate a few noteworthy data points:


# of Respondents per Age Group – the 25–29 group was the largest, at roughly 6,000.
# of Respondents per Gender – males responded disproportionately: 19,430 vs. roughly 4,000 females.
# of Respondents per Country – the U.S. was highest, then India, then China: US – 4,716, India – 4,417, China – 1,644 (notably, the UK had 702).
# of Respondents per Job Title – Students were the most common (5,253) and Data Engineers the least common (737).
# of Respondents per IDE – Jupyter was the most used (~14,000), followed by RStudio (8,503), PyCharm (~7,000), Sublime (~6,000) and Notepad further down.
# of Respondents per Most Commonly Used Programming Language – Python wins: 8,180 vs. 2,046 for R.
# of Respondents per Recommended Language to Learn First – Python again: 14,181 vs. 2,341 for R.
# of Respondents per Machine Learning Library – Scikit-learn wins: 12,249, ahead of TensorFlow (9,900), Keras (8,136) and PyTorch (3,186).
# of Respondents per Most Commonly Used Data Visualization Library – Matplotlib – 6,787, ggplot – 2,877, Seaborn – 1,344, Plotly – 540. (Plotly has a steep learning curve, but all of the plots in this blurb were made with it – worth learning; see the sketch after this list.)
# of Respondents per Years Using Machine Learning Methods – under 1 year: 6,271, 2–3 years: ~2,000, 5–10 years: 895.
# of Respondents per Most Commonly Used Data Type – Numerical: 3,588, Tabular: 2,680, Text: 2,005, Time-Series: 1,664, Image: 1,635.
# of Respondents per % of Time Engaged in Various Tasks – Cleaning: 15%, Model Building: 13%, Data Gathering: 11%, Visualizing: 9%, Finding and Communicating Insights: 7.5%, Model Production: 6%.
# of Respondents per Importance of Various Social Issues (Fairness/Bias, Explainability, Reproducibility) – Very important: ~10,200, Slightly important: 4,525, Not at all important: 545. Yikes!
# of Respondents per ML Media Source – the Kaggle forums and Medium posts were the top two (see the chart below).
# of Respondents per % of ML/DS Education from Various Sources – Self-taught: 19%, Online courses: 12%, University: 11.9%, Work: 11%.
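As promised above, here is a minimal Plotly sketch (my own, not Paul Mooney's original code) that recreates one of those charts, using the ML-library counts quoted in the list:

import plotly.graph_objects as go

libraries = ["Scikit-learn", "TensorFlow", "Keras", "PyTorch"]
respondents = [12249, 9900, 8136, 3186]   # counts quoted in the list above

fig = go.Figure(go.Bar(x=libraries, y=respondents))
fig.update_layout(
    title="2018 Kaggle Survey: Most Commonly Used ML Library",
    xaxis_title="Library",
    yaxis_title="# of Respondents",
)
fig.show()   # renders the interactive chart in a notebook or browser

A dozen lines gets you an interactive chart, which is a big part of why Plotly is worth the climb.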

[Interactive Plotly charts: 2018 Kaggle User Survey Charts, made by Paul Mooney]