Python, PySpark, Kaggle and Colab

Chemistry to handle big datasets

I was trying to work with a dataset of all Brazilian voters, a CSV file of more than 600 MB. I tried for a while on my local machine, and it was a waste of time: I tried Pandas, but my machine kept freezing. Then I remembered that making a product is like chemistry, and chemistry is like magic: you take something from here, something from there, and boom, it works perfectly.

You need an account on Kaggle and on Google Colab. From there, I followed the steps below:

. Generate your API key on Kaggle under Account settings. This downloads a JSON file (kaggle.json).

. Create a folder on Google Drive and save your kaggle.json there (the paths below assume the folder is named Kaggle).

. On Google Colab, start coding. First, mount your Google Drive:

from google.colab import drive
drive.mount('/content/gdrive')

. Point the Kaggle client at the folder that holds kaggle.json by setting the KAGGLE_CONFIG_DIR environment variable.

import os
os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/My Drive/Kaggle"
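
If the download step later fails with an authentication error, a quick sanity check on the config folder can save some time. The sketch below is only an illustration: the helper name `kaggle_config_ok` is made up for this post, and it assumes the token file is named kaggle.json and contains the `username` and `key` fields.

```python
import json
import os


def kaggle_config_ok(config_dir):
    """Return True if config_dir holds a kaggle.json with 'username' and 'key'."""
    path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(path):
        return False
    try:
        with open(path, encoding="utf-8") as fh:
            config = json.load(fh)
    except (OSError, json.JSONDecodeError):
        # Unreadable or malformed file: treat it as an invalid config
        return False
    # Assumption: the token is a JSON object with non-empty 'username' and 'key'
    return isinstance(config, dict) and all(config.get(k) for k in ("username", "key"))
```

In the notebook, `kaggle_config_ok(os.environ['KAGGLE_CONFIG_DIR'])` should return `True` once the Drive is mounted and the file is in place.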

. Change to the Kaggle folder on Google Drive.

%cd /content/gdrive/My Drive/Kaggle

. Download the dataset to the folder.

!kaggle datasets download -d felipelisboa/br-voters-2022

. Unzip the dataset and delete the zip file.

!unzip \*.zip && rm *.zip
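
The same unzip-and-delete step can also be done in plain Python, which may be handy outside a notebook where the `!` shell shortcut is not available. This is a minimal sketch using the standard `zipfile` module; the function name `extract_and_remove` is hypothetical.

```python
import glob
import os
import zipfile


def extract_and_remove(folder):
    """Extract every .zip in folder into folder, then delete the archive."""
    extracted = []
    for archive in sorted(glob.glob(os.path.join(folder, "*.zip"))):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
        # Only reached if extraction did not raise, so a broken zip is kept
        os.remove(archive)
    return extracted
```

Calling `extract_and_remove('.')` from the Kaggle folder returns the names of the extracted files.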

. Install PySpark and start cleaning and transforming the dataset. I don't know if what follows is the right way, but it is the way I learned in some PySpark courses I took :)

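!pip install pyspark
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session for this notebook
spark = SparkSession.builder.appName("voters").getOrCreate()
# The file is semicolon-separated and its first line holds the column names
df = spark.read.csv('perfil_eleitorado_ATUAL.csv', header=True, sep=';')
# Cache the DataFrame so later operations reuse it instead of re-reading the CSV
df = df.cache()
# Print the first 20 rows
df.show()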
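
To double-check the `sep=';'` and `header=True` choices without loading the whole file, you can peek at the first few lines with the standard library. This is just a sketch: `peek_csv` is a hypothetical helper, and the `latin-1` default encoding is a guess about this file, so change it if the accented characters look wrong.

```python
import csv
from itertools import islice


def peek_csv(path, sep=";", n=3, encoding="latin-1"):
    """Return (header, first n rows) while reading only the start of the file."""
    # Assumption: latin-1 is only a guess at the encoding of the file
    with open(path, newline="", encoding=encoding) as fh:
        reader = csv.reader(fh, delimiter=sep)
        header = next(reader, [])       # first line holds the column names
        rows = list(islice(reader, n))  # stop after n data rows
    return header, rows
```

For example, `header, rows = peek_csv('perfil_eleitorado_ATUAL.csv')` gives the column names and three sample rows. If `header` comes back as a single long string, the separator is probably not `;`.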

From here, accessing the dataset works just as it would on a local machine.

spark = SparkSession.builder.appName("voters").getOrCreate()
df = spark.read.csv('perfil_eleitorado_ATUAL.csv', header=True, sep=';')

Inspired by: Vidhya