Data & Files Intermediate
Working with data is central to any data science workflow. This lesson covers the main ways to get data into and out of Google Colab: local file uploads, Google Drive mounting, reading from URLs, Kaggle datasets, Google Sheets, and BigQuery.
Uploading Files
Upload files from your local machine directly to the Colab runtime:
from google.colab import files
# Upload files from local machine
uploaded = files.upload()
# Access uploaded files
import pandas as pd
for filename in uploaded.keys():
    print(f"Uploaded: {filename}, Size: {len(uploaded[filename])} bytes")
    if filename.endswith('.csv'):
        df = pd.read_csv(filename)
        print(df.head())
Important: Files uploaded directly to the runtime are stored in
/content/ and are temporary. They are deleted when the runtime disconnects. For persistent storage, use Google Drive.
Google Drive Mounting
Mount your Google Drive to access files persistently:
from google.colab import drive
# Mount Google Drive
drive.mount('/content/drive')
# List files in your Drive
import os
drive_path = '/content/drive/MyDrive/'
print(os.listdir(drive_path))
# Read a file from Drive
df = pd.read_csv('/content/drive/MyDrive/datasets/sales_data.csv')
print(df.shape)
# Save results back to Drive
df.to_csv('/content/drive/MyDrive/results/output.csv', index=False)
# Force remount if needed
drive.mount('/content/drive', force_remount=True)
Reading from URLs
Load data directly from the web:
import pandas as pd
# Read CSV from URL
url = "https://raw.githubusercontent.com/datasets/iris/master/data/iris.csv"
df = pd.read_csv(url)
print(df.head())
# Download any file from URL
!wget https://example.com/data/large_file.zip
!unzip large_file.zip
# Using gdown for Google Drive public links
!pip install gdown
!gdown https://drive.google.com/uc?id=YOUR_FILE_ID
# Clone a GitHub repository
!git clone https://github.com/username/repo.git
Kaggle Dataset Integration
Import datasets directly from Kaggle:
# Step 1: Upload your Kaggle API key (kaggle.json)
from google.colab import files
files.upload() # Upload kaggle.json
# Step 2: Set up Kaggle credentials
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
# Step 3: Download a dataset
!kaggle datasets download -d heptapod/titanic
!unzip titanic.zip
# Step 4: Download competition data
!kaggle competitions download -c titanic
!unzip titanic.zip -d titanic_data/
Google Sheets Integration
Read and write data from Google Sheets:
from google.colab import auth
auth.authenticate_user()
import gspread
from google.auth import default
creds, _ = default()
gc = gspread.authorize(creds)
# Open a spreadsheet by URL
spreadsheet = gc.open_by_url('https://docs.google.com/spreadsheets/d/YOUR_SHEET_ID')
worksheet = spreadsheet.sheet1
# Read all data into a DataFrame
import pandas as pd
data = worksheet.get_all_records()
df = pd.DataFrame(data)
print(df.head())
# Write data back to a sheet
worksheet.update([df.columns.tolist()] + df.values.tolist(), 'A1')  # gspread 6+: values first, then range
BigQuery Access
Query Google BigQuery datasets directly from Colab:
from google.colab import auth
auth.authenticate_user()
from google.cloud import bigquery
client = bigquery.Client(project='your-project-id')
# Run a SQL query
query = """
SELECT name, SUM(number) as total
FROM `bigquery-public-data.usa_names.usa_1910_2013`
GROUP BY name
ORDER BY total DESC
LIMIT 10
"""
df = client.query(query).to_dataframe()
print(df)
# Or use the magic commands: load the extension once, then run %%bigquery in its own cell
%load_ext google.cloud.bigquery
%%bigquery df_result
SELECT * FROM `bigquery-public-data.samples.shakespeare` LIMIT 10
Saving and Downloading Outputs
from google.colab import files
# Save a DataFrame and download
df.to_csv('results.csv', index=False)
files.download('results.csv')
# Save a plot and download
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
plt.savefig('plot.png', dpi=150)
files.download('plot.png')
# Save model weights and download (assumes `model` is a trained torch.nn.Module)
import torch
torch.save(model.state_dict(), 'model_weights.pth')
files.download('model_weights.pth')
Working with Large Datasets
Tips for handling large datasets in Colab:
- Use chunks: Read CSV files in chunks with pd.read_csv(file, chunksize=10000)
- Parquet format: Use .parquet instead of .csv for faster I/O and smaller files
- Google Drive: Store large datasets on Drive to avoid re-uploading each session
- Data generators: Use TensorFlow/PyTorch data loaders that stream data in batches
- Compression: Compress datasets with gzip or zip to save storage and speed up transfers
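The chunked-reading tip can be sketched as follows. This is a minimal example, not part of the lesson's code: the in-memory buffer is a hypothetical stand-in for a large CSV on disk, and in practice you would pass a file path instead.

```python
import io
import pandas as pd

# Stand-in for a large CSV on disk; in practice pass a file path.
csv_buffer = io.StringIO("value\n" + "\n".join(str(i) for i in range(100)))

# Only `chunksize` rows are held in memory at a time,
# so arbitrarily large files can be aggregated incrementally.
total = 0
for chunk in pd.read_csv(csv_buffer, chunksize=25):
    total += chunk["value"].sum()

print(total)  # 4950 (sum of 0..99)

# Note: pd.read_csv also reads compressed files (e.g. data.csv.gz) directly,
# which combines the chunking and compression tips.
```

The same loop works unchanged on a gigabyte-scale file, since each iteration touches only one chunk.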
Data Persistence Between Sessions
Persistence Strategy:
- Google Drive: Mount Drive and save important files there. They persist across sessions
- Runtime files (/content/): Temporary. Lost when the runtime disconnects
- Installed packages: Must be reinstalled each session. Put !pip install commands at the top of the notebook
- Model checkpoints: Save to Google Drive during training to avoid losing progress
Lilly Tech Systems