Stories interesting to Practitioners of Machine Learning and A.I.
What is Kaggle?
Kaggle is a data science learning site that uses ‘competitions’ as a vehicle for ‘reinforcement learning’ by people rather than machines. Reinforcement learning is a valid and important machine learning construct in its own right, but gamification with rewards also tends to reinforce one’s own learning path.
Kaggle has been around since 2009. Since Google acquired it in 2017, gaining access to young, up-and-coming data science talent, Kaggle has not slowed. Although I am not in the ‘young and up-and-coming’ category, I am entirely self-taught and a budding amateur data scientist, due in large part to hanging around Kaggle.
Kaggle follows the open source metaphor very well. Datasets and completed competition submissions are shared publicly. Users submit predictions and findings to competitions and provide their ‘proof’ in the form of ‘kernels’, which are essentially Jupyter Notebooks hosted on Kaggle. There is a leaderboard, with game-style ranks assigned to winners, runners-up and other participants. Winning multiple competitions will place you in the community’s top percentiles.
Some of Kaggle’s competitions get very serious, carry large financial awards and genuinely help society solve big problems with machine learning. Joining the community and tackling some of the basic challenges will build your data science ‘muscle memory’ and get you going, as it did for me.
Many competitions are inspired by the acquisition of a dataset that lends itself to machine learning applications. Kaggle often preprocesses datasets submitted for competitions, but you will find that many datasets still need cleaning, aggregation, scaling and transformation before they can be fed to a model. Companies that submit datasets are often themselves looking for a solution that converts unstructured data into structured data. Check out some of the datasets hanging around Kaggle.
According to Kaggle’s 2018 user survey (more on this below), respondents spend 15% of their time cleaning data and 11% gathering it.
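To make that concrete, here is a minimal sketch of the kind of cleaning, transformation and scaling pass a raw Kaggle dataset often needs before modeling. The column names and values are hypothetical, not from any particular competition:

```python
import pandas as pd

# A tiny stand-in for a raw competition dataset with missing values
# (columns here are hypothetical, purely for illustration)
df = pd.DataFrame({
    "age": [22, None, 35, 41],
    "country": ["US", "India", None, "China"],
    "income": [52000.0, 61000.0, None, 87000.0],
})

# Hygiene: fill missing numerics with the median, categoricals with a flag
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())
df["country"] = df["country"].fillna("Unknown")

# Transformation: one-hot encode the categorical column
df = pd.get_dummies(df, columns=["country"])

# Scaling: standardize numeric features to zero mean, unit variance
for col in ["age", "income"]:
    df[col] = (df[col] - df[col].mean()) / df[col].std()

print(df.head())
```

The same steps scale up to real competition data; libraries such as scikit-learn wrap them in reusable transformers, but seeing them spelled out in pandas makes clear where the cleaning time actually goes.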
The Kaggle Community
One of the things that makes Kaggle great is its user community. The insights from the 2018 user survey are gleaned from 23,859 responses. The results are quite informative: they show who is using which tools, what their focus is, and the age, gender, profession and other attributes of Kaggle users.
The charts, made by Paul Mooney, Developer Advocate at Kaggle, illustrate a few noteworthy data points:
| Survey question (respondents per…) | Figures |
|---|---|
| Age group – highest was the 25–29 group | ~6,000 |
| Gender – male disproportionately outnumbered female | 19.43k vs. 4k |
| Country – U.S. highest, then India, then China (notable: UK was 702) | US: 4,716; India: 4,417; China: 1,644 |
| Job title – Students highest, Data Engineers lowest | Student: 5,253; Data Engineer: 737 |
| IDE – Jupyter highest, then RStudio, PyCharm, Sublime | Jupyter: ~14k; RStudio: 8,503; PyCharm: ~7k; Sublime: ~6k |
| Most commonly used programming language – Python wins | Python: 8,180; R: 2,046 |
| Recommended language to learn first – Python again | Python: 14,181; R: 2,341 |
| Machine learning library – scikit-learn wins | scikit-learn: 12,249; TensorFlow: 9,900; Keras: 8,136; PyTorch: 3,186 |
| Most commonly used data visualization library (Plotly has a steep learning curve, but all of the plots in this blurb were done with it – worth learning) | Matplotlib: 6,787; ggplot: 2,877; Seaborn: 1,344; Plotly: 540 |
| Years using machine learning methods | < 1 yr: 6,271; 2–3 yrs: ~2,000; 5–10 yrs: 895 |
| Most commonly used data type – numerical, tabular, text, time series, image | Numerical: 3,588; Tabular: 2,680; Text: 2,005; Time series: 1,664; Image: 1,635 |
| % of time engaged in various tasks – gathering, cleaning, visualizing, model building, model production, finding and communicating insights | Cleaning: 15%; Model building: 13%; Gathering: 11%; Visualizing: 9%; Insights: 7.5%; Production: 6% |
| Importance of various social issues – fairness/bias, explainability, reproducibility – Yikes! | Very important: 10.2k; Slightly important: 4,525; Not at all: 545 |
| ML media source – Kaggle forums and Medium posts were the top two | Check out the chart below |
| % of ML/DS education from various sources | Self-taught: 19%; Online courses: 12%; University: 11.9%; Work: 11% |