We see data. There are many different ways to visualize or crunch it.
Datasets, along with techniques for munging, scraping, structuring, storing and retrieving them, are the fuel for our practice. Share yours and we will share ours, naturally. You probably already have or have used some of these. As we acquire or become familiar with more, we will post them here. Each dataset also comes with a visual that is consistent with its structure and typical use cases.
Most Up-voted Kaggle Datasets
Kagglers are very active in data science. Here are the top-10 datasets as voted by the community. We’ll try to keep this updated on a regular basis. If you do something cool with one of these, let us know.
Date range for ratings: 10/1/19 – 10/12/19
#1 | Credit Card Fraud Dataset
Banks and credit card companies need a smart way to discern between authorized and fraudulent transactions.
The dataset is based on mining and fraud detection of anonymized EU credit card transactions from 2013.
Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) collaborated to collect and analyze it.
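Fraud is rare, which is what makes this dataset tricky. The sketch below is ours, not part of the dataset. It assumes the class balance reported on the dataset's Kaggle page (492 frauds among 284,807 transactions) and shows why plain accuracy is a poor yardstick here. The helper name `accuracy_and_recall` is hypothetical.

```python
def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall on the positive class) for 0/1 labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    positives = sum(y_true)
    recall = true_pos / positives if positives else 0.0
    return correct / len(y_true), recall


# Assumed class balance, taken from the dataset's Kaggle description.
N_TOTAL, N_FRAUD = 284_807, 492
y_true = [1] * N_FRAUD + [0] * (N_TOTAL - N_FRAUD)

# A naive "model" that never flags a transaction as fraud.
y_pred = [0] * N_TOTAL

acc, rec = accuracy_and_recall(y_true, y_pred)
print(f"accuracy={acc:.4f} recall={rec:.1f}")
```

The do-nothing model scores above 99.8% accuracy while catching no fraud at all, so recall or precision on the fraud class is usually the more honest metric.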
#4 | European Soccer Database
This is a monster of a dataset that required a sophisticated crawler to acquire upwards of 25,000 matches and 10,000 players across 11 European countries.
It spans 2008-2016 seasons with player attributes sourced from EA Sports FIFA video games.
Also includes betting odds from 10 providers.
#7 | FIFA 19 complete player dataset
This is similar to the 2008-2016 European Soccer data, except that it covers only 2019 stats.
#10 | Suicide Rates 1985 to 2016
Suicide prevention has been an art more than a science up to now. Finding the signals that relate to increased suicide rates has been statistically challenging. This data compares socio-economic info with suicide rates by year and country.
The dataset, which spans 1985-2016, has 234 Kaggle kernels dedicated to it.
#2 | Heart Disease UCI
Using 14 of the original 76 attributes, this dataset from Cleveland has become the primary machine-learning database for heart disease. The target is an integer grade for the presence of heart disease, from 0 (not present) to 4 (definitely present).
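Many experiments with this data collapse the 0 to 4 grade into a simple yes/no target. Here is a minimal sketch of that step. The function name `to_binary_target` is our own, and we assume the grade arrives as a plain integer.

```python
def to_binary_target(grade):
    """Map the 0-4 disease grade to 0 (absent) or 1 (present)."""
    if not 0 <= grade <= 4:
        raise ValueError(f"grade must be between 0 and 4, got {grade}")
    return int(grade > 0)


grades = [0, 1, 2, 3, 4]
print([to_binary_target(g) for g in grades])
```

Whether to binarize at all is a modeling choice: the five-level grade keeps severity information that the yes/no target throws away.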
#5 | Wine Reviews
The wine dataset strives to identify wines as a Sommelier would. The data comes from winemag.com reviews.
The objective of the acquirer was to create a model that can identify the variety, winery, and location of a wine based on a description.
For text-based prediction, the dataset offers a rich corpus for identifying wines the way a taster would, only without actually tasting them.
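To make the idea of predicting a variety from a description concrete, here is a small sketch using a bag-of-words naive Bayes classifier. This is our choice of technique, not necessarily what the dataset's author used, and the four training descriptions are invented for illustration rather than taken from winemag.com.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and strip basic punctuation from each word."""
    words = (w.strip(".,;:!?").lower() for w in text.split())
    return [w for w in words if w]


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy descriptions, for illustration only.
train = [
    ("Crisp citrus and green apple with bright acidity", "Riesling"),
    ("Lime, petrol and honey with racy acidity", "Riesling"),
    ("Dark cherry, leather and firm tannins", "Cabernet Sauvignon"),
    ("Blackcurrant and cedar with grippy tannins", "Cabernet Sauvignon"),
]
model = NaiveBayes().fit([t for t, _ in train], [v for _, v in train])
print(model.predict("Bright lime and apple, lively acidity"))
```

On the real corpus you would want far more data, a held-out test split, and probably a stronger text model, but the shape of the problem is the same.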
#8 | Global Terrorism Database
The Global Terrorism Database has collected terrorist attacks from 1970 to the present. This dataset includes information through 2017.
Thanks to the National Consortium for the Study of Terrorism and Responses to Terrorism (START), headquartered at the University of Maryland, the database contains 180,000 attacks with location, tactics, perpetrators, targets, and outcomes.
#3 | Google Play Store Apps
The Google Play Store data was scraped from pages that load dynamically with jQuery, which makes this a hard-to-get dataset. Each app takes one row containing category, rating, size, version, installs and the like.
This dataset allows the analyst to extract market insights for product developers targeting the Android market.
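As a taste of the munging involved, the sketch below totals installs per category. It assumes install counts arrive as strings such as `10,000+`. The rows are invented, and the field names `category`, `rating` and `installs` are our own placeholders, not the file's exact column headers.

```python
from collections import defaultdict


def parse_installs(raw):
    """Turn a string such as '10,000+' into the integer 10000."""
    return int(raw.replace(",", "").rstrip("+"))


def installs_by_category(rows):
    """Sum the parsed install counts for each category."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["category"]] += parse_installs(row["installs"])
    return dict(totals)


# Invented rows, one per app, for illustration only.
rows = [
    {"category": "GAME", "rating": 4.5, "installs": "1,000,000+"},
    {"category": "GAME", "rating": 4.1, "installs": "50,000+"},
    {"category": "TOOLS", "rating": 3.9, "installs": "10,000+"},
]
print(installs_by_category(rows))
```

Since a value like `10,000+` is a lower bound rather than an exact count, totals built this way are best read as rough floors.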
#6 | TMDB 5000 Movie Dataset
Having been asked to remove the IMDb dataset that was previously posted on Kaggle, the acquirer turned to TMDb, which does have an open API.
From this dataset, one could model the success or rating of a film based on information about the crew, cast, budget, revenue and popularity. There are 20 features to make predictions from.
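A first pass at modeling success could be as simple as a one-feature least-squares line predicting revenue from budget. The sketch below is only that, with invented numbers (in millions) standing in for real films. The function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented budgets and revenues, for illustration only.
budgets = [10.0, 20.0, 30.0, 40.0]
revenues = [25.0, 45.0, 65.0, 85.0]

slope, intercept = fit_line(budgets, revenues)
predicted = slope * 50.0 + intercept  # revenue estimate for a 50M budget
print(slope, intercept, predicted)
```

Real film data is much noisier than this toy set, so cast, crew and popularity features would need to be added before the predictions mean much.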
#9 | Trending YouTube Statistics
The YouTube Trending Video Statistics represents a slice of pop-culture writ live online.
Using the API, Mitchell J made the effort to capture the number of views, shares, comments and likes daily in order to determine how videos are trending.
One can also examine or use the code that was written to gather the data.
There are 487 kernels using the dataset.
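With daily snapshots in hand, one simple way to see what is trending is to look at day-over-day gains. The sketch below assumes each video has a list of cumulative view counts, one per day. The video ids and numbers are invented, and the helper names are our own.

```python
def daily_gains(views):
    """Return the change in cumulative views between consecutive days."""
    return [later - earlier for earlier, later in zip(views, views[1:])]


def fastest_growing(snapshots):
    """Return the video id with the largest gain on the most recent day."""
    return max(snapshots, key=lambda vid: daily_gains(snapshots[vid])[-1])


# Invented cumulative view counts per day, for illustration only.
snapshots = {
    "video_a": [1000, 1500, 2100],
    "video_b": [200, 900, 4000],
}
print(daily_gains(snapshots["video_a"]))
print(fastest_growing(snapshots))
```

The same pattern works for likes or comments, as long as the counts are captured on a regular daily schedule.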
A collection of datasets with category, description and links, to save some time searching. These datasets have been used by beginners and professional practitioners alike.
We’ll also share machine-learning models that are reloadable into your projects, notebooks and scripts.
If you know of an interesting dataset that you or an associate has worked with and it is open-sourced, please let us know and we will add it to the collection.
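For readers wondering what "reloadable" looks like in practice, here is a minimal sketch using Python's built-in `pickle` module. It is only one option: deep-learning frameworks usually ship their own save formats, and we are not claiming our shared models use pickle. The toy model here is just a dictionary of parameters, and the file name is made up.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, path):
    """Serialize a model object to disk."""
    with open(path, "wb") as fh:
        pickle.dump(model, fh)


def load_model(path):
    """Load a model back. Only unpickle files from sources you trust."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# A stand-in "model": fitted parameters stored in a plain dict.
model = {"slope": 2.0, "intercept": 5.0}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_model.pkl"  # hypothetical file name
    save_model(model, path)
    restored = load_model(path)

print(restored)
```

Unpickling can execute arbitrary code, so treat a downloaded model file with the same care as a downloaded script.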
Well-worn datasets used for learning
|Titanic||Titanic Dataset||Titanic: Machine Learning from Disaster. Predicting survival rates on the Titanic liner sinking|
|Iris||Iris Dataset||3 species of the Iris flower. Predict species from the size of different parts of the flower.||150 Observations||With categorical classification target|
|Breast Cancer Wisconsin (Diagnostic)||Breast Cancer Dataset||Biopsy data used to characterize cellular state||569 Observations|
|NYC AirBnB||NYC AirBnB Dataset||Listing activity and metrics in NYC for 2019.||49,000 Observations||Numerical and Categorical Data|
|GitHub Awesome Datasets Repo||Awesome Datasets on GitHub||High-quality, topic-centric public data sources||Hundreds of datasets||All kinds|
|Fer2013||Facial Emotion Recognition Challenge Dataset||30,000+ emotion-labeled face images. Part of the Kaggle challenge in Representation Learning: Facial Expression Recognition Challenge||92 MB||Image arrays|
|Trump Photos||Trump Photos||A set of JPG images of Trump's face||All different sizes||Deep Learning Emotion Detection|
|16 Trump Photos for Emotion Project||trump_16_faces2.zip||16 cropped, color Trump facial expressions for the emotion-detection project||All different sizes||Deep Learning projects|
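The "image arrays" in Fer2013 are not image files. As far as we know, each face is stored as one long string of space-separated grayscale values for a 48x48 image, so treat that layout as an assumption and check it against your copy. The sketch below turns such a string into rows of pixels. The function name `pixels_to_grid` is hypothetical.

```python
def pixels_to_grid(pixel_string, side=48):
    """Split a space-separated pixel string into `side` rows of `side` ints."""
    values = [int(v) for v in pixel_string.split()]
    if len(values) != side * side:
        raise ValueError(f"expected {side * side} pixels, got {len(values)}")
    return [values[r * side:(r + 1) * side] for r in range(side)]


# A tiny 2x2 example so the result is easy to read.
print(pixels_to_grid("0 255 128 64", side=2))
```

From here the nested list can be handed to an array library of your choice for a CNN such as the emotion-detection model.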
Machine Learning Models To Share
|Emotion Detection CNN||Facial Emotion Detection CNN Model||Priya Dwivedi's Face_and_emotion_detection model built as a deep-learning CNN||Facial rec, emotion rec, deep learning|