I was trying to handle a dataset of all Brazilian voters: more than 600 MB in a single CSV file. I spent a while trying to work with it on my local machine... it was a waste of time. I tried Pandas, but my machine kept getting stuck. Then I remembered that building a product is like chemistry, like magic... take something from here, something from there, and boom... it works perfectly.
You need an account on Kaggle and on Google Colab. From there, follow the steps below:
. Generate your API key on Kaggle under Account settings; this downloads a json file.
. Create a folder on Google Drive to save your kaggle.json.
. On Google Colab, let's start coding. Mount the Google Drive content:
from google.colab import drive
# Colab will ask you to authorize access to your Drive
drive.mount('/content/gdrive')
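If you want to make sure the mount worked, listing the Drive root is a quick sanity check:
!ls "/content/gdrive/My Drive"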
. Point the notebook environment at the json config:
import os
# Tell the Kaggle CLI where to look for kaggle.json
os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/My Drive/Kaggle"
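To confirm the Kaggle client will actually find the credentials, you can check that the file is really there before moving on (the path assumes the Kaggle folder created above):
import os
assert os.path.exists('/content/gdrive/My Drive/Kaggle/kaggle.json'), 'kaggle.json not found'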
. Access the Kaggle folder on Google Drive:
%cd /content/gdrive/My Drive/Kaggle
. Download the dataset to the folder:
!kaggle datasets download -d felipelisboa/br-voters-2022
. Unzip the dataset:
!unzip \*.zip && rm *.zip
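The CSV should now be extracted in the current folder; a quick listing confirms the file name and size (600+ MB in this case):
!ls -lh *.csv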
. Install PySpark and start cleaning and transforming the dataset. I don't know if the following is the right way, but it's the way I learned in some PySpark courses I took :) A small transform sketch follows the code.
!pip install pyspark
from pyspark.sql import SparkSession
# Create (or reuse) the Spark session
spark = SparkSession.builder.appName("voters").getOrCreate()
# The file has a header row and uses ';' as separator
df = spark.read.csv('perfil_eleitorado_ATUAL.csv', header=True, sep=';')
# Cache so later actions don't re-read the 600+ MB CSV
df = df.cache()
df.show()
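As an example of the clean-and-transform part, here is a minimal sketch that aggregates voter counts by state. The column names (SG_UF for state, QT_ELEITORES_PERFIL for the voter count) are assumptions about this file's layout; run df.printSchema() first and adjust to whatever your download actually contains:
from pyspark.sql import functions as F
# Inspect the real column names before trusting the ones below
df.printSchema()
# Assumed columns: SG_UF (state) and QT_ELEITORES_PERFIL (voter count)
voters_by_state = (
    df.withColumn('QT_ELEITORES_PERFIL', F.col('QT_ELEITORES_PERFIL').cast('int'))
      .groupBy('SG_UF')
      .agg(F.sum('QT_ELEITORES_PERFIL').alias('total_voters'))
      .orderBy(F.desc('total_voters'))
)
voters_by_state.show(27)  # Brazil has 26 states plus the Federal District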
Accessing the dataset after this setup is just like using it from your local machine:
spark = SparkSession.builder.appName("voters").getOrCreate()
df = spark.read.csv('perfil_eleitorado_ATUAL.csv', header=True, sep=';')
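One extra thing I'd suggest: once the DataFrame is loaded, writing it back to Drive as Parquet makes every later session much faster than re-parsing the CSV. A minimal sketch (the output path is just an example):
# Write once, then reload instantly in future sessions
df.write.mode('overwrite').parquet('/content/gdrive/My Drive/Kaggle/voters.parquet')
df = spark.read.parquet('/content/gdrive/My Drive/Kaggle/voters.parquet')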
Inspired by: Vidhya