kaggle dataset connector #63

Open
wants to merge 2 commits into
base: master
30 changes: 30 additions & 0 deletions ds_kaggle_dataset/README.md
@@ -0,0 +1,30 @@
This library downloads datasets from Kaggle.

## Parameters

```--kaggle_dataset_name``` - string, required. The name of the Kaggle dataset, in the `<owner>/<dataset-name>` form used by the Kaggle CLI.

```--target_path``` - string (default = /cnvrg/output). The path to save the dataset files to.

```--cnvrg_dataset``` - string (default = None). If provided, the files downloaded from Kaggle are uploaded straight into this cnvrg dataset. If the dataset does not exist, a new one is created.

```--file_name``` - string (default = None). If provided, the library downloads only this file from the dataset.
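
As a rough sketch of how these parameters map onto the Kaggle CLI call, the helper below builds the command as an argument list. The function name `build_download_command` is hypothetical and not part of this library; the `--unzip`, `-p` and `-f` flags are the ones `main.py` passes to `kaggle datasets download`.

```python
from typing import List, Optional


def build_download_command(kaggle_dataset_name: str,
                           target_path: str = '/cnvrg/output',
                           file_name: Optional[str] = None) -> List[str]:
    """Build the `kaggle datasets download` command as an argument list."""
    command = ['kaggle', 'datasets', 'download', kaggle_dataset_name, '--unzip']
    if target_path:
        # -p sets the directory the files are written to
        command += ['-p', target_path]
    if file_name and file_name.lower() != 'none':
        # -f restricts the download to a single file of the dataset
        command += ['-f', file_name]
    return command
```

A list built this way can be passed to `subprocess.run(command, check=True)`, which avoids shell quoting problems when a path or file name contains spaces.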

## Finding the dataset name
To find a dataset name, open the dataset's page on Kaggle and copy the `<owner>/<dataset-name>` part from the end of the URL.
Paste it into the `kaggle_dataset_name` field.
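
The copy step can also be done programmatically. The sketch below is illustrative only: `dataset_slug_from_url` is a hypothetical helper, and it assumes the dataset URL ends in `<owner>/<dataset-name>`, optionally preceded by a `datasets` path segment.

```python
from urllib.parse import urlparse


def dataset_slug_from_url(url: str) -> str:
    """Return the '<owner>/<dataset-name>' part of a Kaggle dataset URL."""
    parts = [p for p in urlparse(url).path.split('/') if p]
    # Some dataset URLs carry a leading 'datasets' segment; drop it if present.
    if parts and parts[0] == 'datasets':
        parts = parts[1:]
    if len(parts) < 2:
        raise ValueError(f'Not a Kaggle dataset URL: {url}')
    return '/'.join(parts[:2])
```

The returned string is the value to pass as `--kaggle_dataset_name`.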

## Authentication

To get your Kaggle API credentials, open your Kaggle user profile and, under "Account", click "Create API Token".

This will download a "kaggle.json" file with your credentials.

Environment variables are the recommended authentication method. This library expects the following environment variables:

* `KAGGLE_KEY` - The Kaggle API key
* `KAGGLE_USERNAME` - The Kaggle API username

The environment variables can be stored securely in the project secrets in cnvrg.
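
For local testing outside cnvrg, a minimal sketch like the one below can copy the values from the downloaded "kaggle.json" into the two variables and report any that are still missing. Both function names are hypothetical, and the sketch assumes the file holds `username` and `key` fields.

```python
import json
import os
from typing import List, MutableMapping

REQUIRED_VARS = ('KAGGLE_USERNAME', 'KAGGLE_KEY')


def export_kaggle_json(path: str,
                       env: MutableMapping[str, str] = os.environ) -> None:
    """Copy the credentials from a kaggle.json file into env."""
    with open(path) as f:
        creds = json.load(f)
    env['KAGGLE_USERNAME'] = creds['username']
    env['KAGGLE_KEY'] = creds['key']


def missing_kaggle_vars(env: MutableMapping[str, str] = os.environ) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_kaggle_vars()` before starting a download gives an early, readable error instead of a failure inside the Kaggle CLI.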


23 changes: 23 additions & 0 deletions ds_kaggle_dataset/library.yml
@@ -0,0 +1,23 @@
---
title: Kaggle Dataset Connector
command: python3 main.py
description: " - Download datasets from Kaggle"
version: 1.0.12
docker_images: cnvrg
compute: small
data_type: csv
icon: python

arguments:
  kaggle_dataset_name:
    value: ''
    type: categorical
  target_path:
    value: '/cnvrg/output'
    type: categorical
  cnvrg_dataset:
    value: 'None'
    type: categorical
  file_name:
    value: 'None'
    type: categorical
45 changes: 45 additions & 0 deletions ds_kaggle_dataset/main.py
@@ -0,0 +1,45 @@
import os
import argparse

parser = argparse.ArgumentParser(description="""Kaggle Dataset Connector""")
# Flag names match the arguments declared in library.yml and documented in README.md
parser.add_argument('--kaggle_dataset_name', action='store', dest='dataset_name', required=True, help="""--- The name of the dataset ---""")

parser.add_argument('--target_path', action='store', dest='dataset_path', default='/cnvrg/output', help="""--- The path to save the dataset files to ---""")

parser.add_argument('--cnvrg_dataset', action='store', dest='cnvrg_dataset', required=False, default='None', help="""--- the name of the cnvrg dataset to store in ---""")

parser.add_argument('--file_name', action='store', dest='file_name', required=False, default='None', help="""--- If a single file is needed then this is the name of the file ---""")

parser.add_argument('--project_dir', action='store', dest='project_dir', help="""--- For internal use by cnvrg.io ---""")

parser.add_argument('--output_dir', action='store', dest='output_dir', help="""--- For internal use by cnvrg.io ---""")

args = parser.parse_args()
dataset_name = args.dataset_name
dataset_path = args.dataset_path
file_name = args.file_name
cnvrg_dataset = args.cnvrg_dataset

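# Build the Kaggle CLI command; --unzip extracts the downloaded archive
download_command = f'kaggle datasets download {dataset_name} --unzip'

if dataset_path:
    download_command += f' -p {dataset_path}'
if file_name.lower() != 'none':
    download_command += f' -f {file_name}'

print(f'Downloading dataset {dataset_name} to {dataset_path}')
exit_code = os.system(download_command)
if exit_code != 0:
    # Stop here so that a failed download is not followed by an upload
    raise SystemExit(f'Kaggle download failed with exit status {exit_code}')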

if cnvrg_dataset.lower() != 'none':
    from cnvrgp import Cnvrg
    cnvrg = Cnvrg()
    ds = cnvrg.datasets.get(cnvrg_dataset)
    try:
        # reload() raises if the dataset does not exist yet
        ds.reload()
    except Exception:
        print('The provided Dataset was not found')
        print(f'Creating a new dataset named {cnvrg_dataset}')
        ds = cnvrg.datasets.create(name=cnvrg_dataset)
    print('Uploading files to Cnvrg dataset')
    ds.put_files(paths=[dataset_path])

Empty file added ds_kaggle_dataset/prerun.sh
Empty file.
3 changes: 3 additions & 0 deletions ds_kaggle_dataset/requirements.txt
@@ -0,0 +1,3 @@
--extra-index-url https://test.pypi.org/simple/
cnvrg-new
kaggle==1.5.12