Accessing Datasets in the Cloud from Google Colab Notebooks
Most of us have discovered Google Colab by now, but we also have to deal with its tendency to push us toward other Google services in order to get a fluid experience.
I have found getting data from Google Drive to be fairly straightforward, but it sometimes adds one or two steps too many when wrangling data.
```python
from google.colab import drive
drive.mount('/content/drive')
```

```
Go to this URL in a browser: https://accounts.google.com/o/oauth2/big-long-url

Enter your authorization code: ··········
Mounted at /content/drive
```
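Once the drive is mounted, files are read with ordinary paths under the mount point. A minimal sketch of that step, where `datasets/titanic_data.csv` is a hypothetical location inside your Drive:

```python
import os
import pandas as pd

def read_drive_csv(relpath, mount_point='/content/drive/My Drive'):
    """Read a CSV that lives on the mounted Google Drive.

    relpath is relative to the Drive root; adjust it to wherever
    your file actually sits.
    """
    return pd.read_csv(os.path.join(mount_point, relpath))

# Example (run after drive.mount('/content/drive')):
# titanic = read_drive_csv('datasets/titanic_data.csv')
```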
In putting together helpful posts about machine learning, wrangling, and visualization, I have found that GitHub is the best place to maintain files, data, and code where I can keep track of them and share them continually. I also store smaller datasets right on this site and access them with the raw content URL.
Obtaining the actual raw link to a file on GitHub makes it possible to use pandas' pd.read_csv() to pull it into a notebook on Colab. This lets me skip re-uploading to Drive, mounting the drive, and then working out the correct path on the drive before reading it with read_csv().
The following method uses wget to fetch the raw file from GitHub and save it to a local directory on your Colab instance.
```python
import pandas as pd

!wget https://raw.githubusercontent.com/thoughtsociety/ads_track4/master/datasets/titanic_data.csv
!ls -al titanic_data.csv

titanic = pd.read_csv('titanic_data.csv')
titanic.head()
```

```
100%[===================>]  73.41K  --.-KB/s  in 0.03s

   Survived  Pclass                                               Name  Sex   Age  Siblings/Spouse aboard  Parch            Ticket     Fare Cabin Embarked
0         0       3                            Braund, Mr. Owen Harris    1  22.0                       1      0         A/5 21171   7.2500   NaN        S
1         1       1  Cumings, Mrs. John Bradley (Florence Briggs T...    0  38.0                       1      0          PC 17599  71.2833   C85        C
2         1       3                             Heikkinen, Miss. Laina    0  26.0                       0      0  STON/O2. 3101282   7.9250   NaN        S
3         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)    0  35.0                       1      0            113803  53.1000  C123        S
4         0       3                           Allen, Mr. William Henry    1  35.0                       0      0            373450   8.0500   NaN        S
```
The URL can point to anything else, such as a zipped dataset, which you could fetch and then unzip in a cell.
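One way to sketch that fetch-and-unzip step without shell commands is plain Python with urllib and zipfile. The URL below is a placeholder; substitute the raw link to your own zipped dataset:

```python
import io
import urllib.request
import zipfile

def fetch_and_unzip(url, dest='.'):
    """Download a zip archive and extract its contents into dest.

    Returns the list of file names inside the archive.
    """
    with urllib.request.urlopen(url) as resp:
        data = resp.read()
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        zf.extractall(dest)
        return zf.namelist()

# Example (hypothetical URL):
# fetch_and_unzip('https://raw.githubusercontent.com/your-user/your-repo/master/datasets/data.zip')
```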
If your file is already in final form, a .csv for instance, you can read it directly with pd.read_csv(URL) and you are done.
```python
titanic_file_path = 'https://raw.githubusercontent.com/thoughtsociety/wrangling/master/datasets/titanic_data.csv'
titanic = pd.read_csv(titanic_file_path)
titanic.head()
```
Knowing how to do this on Colab lets you work easily with cloud-based notebooks, which are turning out to be a great resource for data science and machine learning practitioners.