MLPractitioner.com

Accessing Datasets in the Cloud from Google Colab Notebooks

Most of us have discovered Google Colab by now, but we have to deal with its tendency to push us toward other Google services in order to get a fluid experience.

I have found getting data from Google Drive to be fairly straightforward, but it can still be one or two steps too many for wrangling data.

from google.colab import drive
drive.mount('/content/gdrive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/ big-long-url

Enter your authorization code: ··········
Mounted at /content/gdrive
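Once the drive is mounted, files on it behave like ordinary local paths under the mount point. A minimal sketch; the folder and filename below are assumptions for illustration, not a real path:

```python
import pandas as pd

def load_drive_csv(path):
    """Read a CSV from the mounted Drive; the mount point prefixes the path."""
    return pd.read_csv(path)

# Hypothetical Drive layout -- adjust to wherever your file actually lives:
# titanic = load_drive_csv('/content/gdrive/My Drive/datasets/titanic_data.csv')
```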

In putting together helpful posts about machine learning, wrangling, and visualization, I have found that GitHub is the best place to maintain files, data, and code where I can keep track of them and continually share them. I also store smaller datasets right on this site and access them with the content URL.

Obtaining the actual raw link to a file on GitHub makes it possible to use pandas' pd.read_csv() to load it into a notebook on Colab. This lets me skip re-uploading to Drive, mounting the drive, and then finding the correct Drive path to pass to read_csv().

The following method uses wget to fetch the raw file from GitHub and save it to a local directory on your Colab instance.

import pandas as pd

!wget https://raw.githubusercontent.com/thoughtsociety/ads_track4/master/datasets/titanic_data.csv

titanic_data.csv    100%[===================>]  73.41K  --.-KB/s    in 0.03s

!ls -al

titanic = pd.read_csv('titanic_data.csv')
titanic.head()

   Survived  Pclass                                               Name  Sex   Age  Siblings/Spouse aboard  Parch            Ticket     Fare Cabin Embarked
0         0       3                            Braund, Mr. Owen Harris    1  22.0                       1      0         A/5 21171   7.2500   NaN        S
1         1       1  Cumings, Mrs. John Bradley (Florence Briggs T...    0  38.0                       1      0          PC 17599  71.2833   C85        C
2         1       3                             Heikkinen, Miss. Laina    0  26.0                       0      0  STON/O2. 3101282   7.9250   NaN        S
3         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)    0  35.0                       1      0            113803  53.1000  C123        S
4         0       3                           Allen, Mr. William Henry    1  35.0                       0      0            373450   8.0500   NaN        S

The URL can be anything else, such as

!wget https://mlpractitioner.com/wp-content/uploads/titanic_data.csv

That is also one way you could fetch a zipped dataset and subsequently unzip it in a cell.
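That zipped-dataset workflow can be sketched in pure Python as well. The URL and the archive member name here are hypothetical placeholders, not real files:

```python
import io
import zipfile
import urllib.request

import pandas as pd

def read_zipped_csv(raw_bytes, member):
    """Extract one CSV member from zip archive bytes and load it into a DataFrame."""
    with zipfile.ZipFile(io.BytesIO(raw_bytes)) as zf:
        with zf.open(member) as f:
            return pd.read_csv(f)

# In a Colab cell you would first fetch the archive bytes, e.g.:
# raw = urllib.request.urlopen('https://example.com/datasets/titanic_data.zip').read()
# titanic = read_zipped_csv(raw, 'titanic_data.csv')
```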

If your file is already in final form, a .csv for instance, you can read it directly with pd.read_csv(URL) and you are done:

titanic_file_path = 'https://raw.githubusercontent.com/thoughtsociety/wrangling/master/datasets/titanic_data.csv'

titanic = pd.read_csv(titanic_file_path)
titanic.head()
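If you would rather avoid shell escapes entirely, the wget step itself can also be done in pure Python with the standard library. A sketch; pass it the same raw GitHub link used above:

```python
import urllib.request

import pandas as pd

def fetch_csv(url, dest):
    """Download a raw CSV to a local file, then load it with pandas."""
    urllib.request.urlretrieve(url, dest)
    return pd.read_csv(dest)

# Same raw GitHub URL as the wget example above:
# titanic = fetch_csv(
#     'https://raw.githubusercontent.com/thoughtsociety/ads_track4/master/datasets/titanic_data.csv',
#     'titanic_data.csv',
# )
```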

Knowing how to do this on Colab lets you work easily with cloud-based notebooks, which are turning out to be a great resource for data science and machine learning practitioners.

Nov 27, 2019 | Data-Stories, StoriesFeat