Data & Files Intermediate
Working with data is central to any data science workflow. This lesson covers the main ways to get data into and out of Google Colab: local file uploads, Google Drive mounting, reading from URLs, Kaggle datasets, Google Sheets, and BigQuery.
Uploading Files
Upload files from your local machine directly to the Colab runtime:
from google.colab import files
# Upload files from local machine
uploaded = files.upload()
# Access uploaded files
import pandas as pd
for filename in uploaded.keys():
    print(f"Uploaded: {filename}, Size: {len(uploaded[filename])} bytes")
    if filename.endswith('.csv'):
        df = pd.read_csv(filename)
        print(df.head())
Important: Files uploaded directly to the runtime are stored in
/content/ and are temporary. They are deleted when the runtime disconnects. For persistent storage, use Google Drive.
Google Drive Mounting
Mount your Google Drive to access files persistently:
from google.colab import drive
# Mount Google Drive
drive.mount('/content/drive')
# List files in your Drive
import os
drive_path = '/content/drive/MyDrive/'
print(os.listdir(drive_path))
# Read a file from Drive
df = pd.read_csv('/content/drive/MyDrive/datasets/sales_data.csv')
print(df.shape)
# Save results back to Drive
df.to_csv('/content/drive/MyDrive/results/output.csv', index=False)
# Force remount if needed
drive.mount('/content/drive', force_remount=True)
Reading from URLs
Load data directly from the web:
import pandas as pd
# Read CSV from URL
url = "https://raw.githubusercontent.com/datasets/iris/master/data/iris.csv"
df = pd.read_csv(url)
print(df.head())
# Download any file from URL
!wget https://example.com/data/large_file.zip
!unzip large_file.zip
# Using gdown for Google Drive public links
!pip install gdown
!gdown https://drive.google.com/uc?id=YOUR_FILE_ID
# Clone a GitHub repository
!git clone https://github.com/username/repo.git
Kaggle Dataset Integration
Import datasets directly from Kaggle:
# Step 1: Upload your Kaggle API key (kaggle.json)
from google.colab import files
files.upload() # Upload kaggle.json
# Step 2: Set up Kaggle credentials
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
# Step 3: Download a dataset
!kaggle datasets download -d heptapod/titanic
!unzip titanic.zip
# Step 4: Download competition data
!kaggle competitions download -c titanic
!unzip titanic.zip -d titanic_data/
Google Sheets Integration
Read and write data from Google Sheets:
from google.colab import auth
auth.authenticate_user()
import gspread
from google.auth import default
creds, _ = default()
gc = gspread.authorize(creds)
# Open a spreadsheet by URL
spreadsheet = gc.open_by_url('https://docs.google.com/spreadsheets/d/YOUR_SHEET_ID')
worksheet = spreadsheet.sheet1
# Read all data into a DataFrame
import pandas as pd
data = worksheet.get_all_records()
df = pd.DataFrame(data)
print(df.head())
# Write data back to a sheet
worksheet.update([df.columns.tolist()] + df.values.tolist(), 'A1')  # gspread 6+: values first, then range
BigQuery Access
Query Google BigQuery datasets directly from Colab:
from google.colab import auth
auth.authenticate_user()
from google.cloud import bigquery
client = bigquery.Client(project='your-project-id')
# Run a SQL query
query = """
SELECT name, SUM(number) as total
FROM `bigquery-public-data.usa_names.usa_1910_2013`
GROUP BY name
ORDER BY total DESC
LIMIT 10
"""
df = client.query(query).to_dataframe()
print(df)
# Or use the magic commands: load the extension once, then run %%bigquery in its own cell
%load_ext google.cloud.bigquery
%%bigquery df_result
SELECT * FROM `bigquery-public-data.samples.shakespeare` LIMIT 10
Saving and Downloading Outputs
from google.colab import files
# Save a DataFrame and download
df.to_csv('results.csv', index=False)
files.download('results.csv')
# Save a plot and download
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
plt.savefig('plot.png', dpi=150)
files.download('plot.png')
# Save model weights and download (assumes `model` is a trained torch.nn.Module)
import torch
torch.save(model.state_dict(), 'model_weights.pth')
files.download('model_weights.pth')
Working with Large Datasets
Tips for handling large datasets in Colab:
- Use chunks: Read CSV files in chunks with pd.read_csv(file, chunksize=10000)
- Parquet format: Use .parquet instead of .csv for faster I/O and smaller files
- Google Drive: Store large datasets on Drive to avoid re-uploading each session
- Data generators: Use TensorFlow/PyTorch data loaders that stream data in batches
- Compression: Compress datasets with gzip or zip to save storage and speed up transfers
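The chunked-reading tip can be sketched as follows. This is a minimal example, not part of the lesson's code: the in-memory buffer is a hypothetical stand-in for a large CSV on disk, and in practice you would pass a file path instead.

```python
import io
import pandas as pd

# Stand-in for a large CSV on disk; in practice pass a file path.
csv_buffer = io.StringIO("value\n" + "\n".join(str(i) for i in range(100)))

# Only `chunksize` rows are held in memory at a time,
# so arbitrarily large files can be aggregated incrementally.
total = 0
for chunk in pd.read_csv(csv_buffer, chunksize=25):
    total += chunk["value"].sum()

print(total)  # 4950 (sum of 0..99)

# Note: pd.read_csv also reads compressed files (e.g. data.csv.gz) directly,
# which combines the chunking and compression tips.
```

The same loop works unchanged on a gigabyte-scale file, since each iteration touches only one chunk.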
Data Persistence Between Sessions
Persistence Strategy:
- Google Drive: Mount Drive and save important files there. They persist across sessions
- Runtime files (/content/): Temporary. Lost when the runtime disconnects
- Installed packages: Must be reinstalled each session. Put !pip install commands at the top of the notebook
- Model checkpoints: Save to Google Drive during training to avoid losing progress
Lilly Tech Systems