Datasets

Effective collaboration on datasets is a key ingredient of any data science project. dstack.ai offers APIs and services to upload datasets, track their revisions, and share them securely within teams (or publicly if needed).

Uploading datasets and visualizations to dstack.ai is done via the dstack package, available for both Python and R. The package can be used from Jupyter notebooks, RMarkdown, and Python and R scripts and applications. Learn how to install the dstack package.

Pushing single datasets

Here's an example of code that pushes a single dataset to dstack.ai:

Python
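import pandas as pd
import numpy as np
from dstack import push_frame
# Build a 6x4 frame of random values indexed by six consecutive dates
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
# Push the frame under the stack name "static_dataset_example"
push_frame("static_dataset_example", df, "static dataset")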
R
library(ggplot2)
library(dstack)
data("midwest", package = "ggplot2")
push_frame("simple", midwest, "My first dataset")

The API supports datasets of any size, so the same call works for small datasets and for very large ones.

Pushing multiple datasets

In some cases, you want to push multiple datasets at once and associate each with its own parameter value. Suppose you'd like to publish multiple datasets on players, one for every value of the parameter College:

Python
import pandas as pd
from dstack import create_frame
df = pd.read_csv("player_data.csv").dropna()
frame = create_frame("player_data")
# Summary dataset: the ten colleges with the most players
pdf = df['college'].value_counts().rename_axis('college').reset_index(name='players').head(10)
frame.commit(pdf, "Top 10 colleges by number of players", { "College": "Top 10 colleges" })
# One dataset per college, each tagged with its own College parameter value
for college in df["college"].unique():
    players = df.loc[df["college"] == college]
    frame.commit(players, f"Players from {college}", { "College": college })
# Send all committed datasets in a single push
frame.push()
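
Before pushing, it can help to check locally which datasets the loop above will produce. The following is a minimal sketch that uses pandas only and does not call dstack; the sample data and the helper name `split_by_parameter` are hypothetical and exist only for illustration. It builds the same mapping from College parameter value to dataset that the commits above describe.

```python
import pandas as pd

# Hypothetical stand-in for player_data.csv; only the "college" column matters here.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy", "Dee", "Eli"],
    "college": ["Duke", "UCLA", "Duke", "Kansas", "UCLA"],
})

def split_by_parameter(frame, column, top_n=10):
    """Return {parameter value: DataFrame}, mirroring the commits in the example."""
    # Summary table: how many rows each value of `column` has, largest first.
    summary = (frame[column].value_counts()
               .rename_axis(column)
               .reset_index(name="players")
               .head(top_n))
    parts = {f"Top {top_n} colleges": summary}
    # One subset per distinct value, in order of first appearance.
    for value, group in frame.groupby(column, sort=False):
        parts[value] = group
    return parts

parts = split_by_parameter(df, "college")
for value, part in parts.items():
    print(value, len(part))
```

Each key of `parts` corresponds to one value of the College parameter, and each value is the frame that would be committed under it.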

Once the dataset is pushed to dstack.ai, it can be accessed at a URL made up of the username configured with the dstack package and the stack name specified for the frame: https://<username>/<stackname>

All datasets pushed to dstack.ai follow the privacy settings of the registered profile, which make all submitted data either public or private by default. You can also override this setting for individual datasets, making them public or private, or sharing them only with selected users. Learn more on sharing and collaboration.

Pulling datasets

Suppose you would like to use a dataset that you published earlier or that someone else published. In this case, you have two options for obtaining it:

  1. Download it from dstack.ai as a CSV file

  2. Fetch the dataset from Python or R using the dstack package:

Python
import pandas as pd
from dstack import pull
df = pd.read_csv(pull("/<username>/<stackname>"))
df.head()
R
library(dstack)
df <- read.csv(dstack::pull("/<username>/<stackname>"))
head(df)
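
Because the dataset arrives as CSV, column types such as dates are not restored automatically by `pd.read_csv`. The sketch below illustrates this with pandas only: an in-memory string stands in for the pulled file, and its content and column names are made up for illustration. Whether the original index is present in the pulled file depends on how the frame was serialized, which is not specified here.

```python
import io
import pandas as pd

# Stand-in for the file a pull would provide; the content is made up for illustration.
csv_text = "date,A,B\n2013-01-01,0.5,1.5\n2013-01-02,-0.2,0.3\n"

# Without hints, read_csv leaves the date column as plain strings.
plain = pd.read_csv(io.StringIO(csv_text))

# parse_dates and index_col turn the date column into a DatetimeIndex.
restored = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], index_col="date")

print(plain.dtypes)
print(restored.index)
```

The same `parse_dates` and `index_col` arguments can be passed when reading a pulled dataset, provided the relevant column exists in the file.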

Note that to pull a dataset associated with a specific parameter value, you have to pass that parameter as an argument to the pull function. Here's an example:

Python
import pandas as pd
from dstack import pull
df = pd.read_csv(pull("/cheptsov/player_data", College = "Top 10 colleges"))
df.head()
R
library(dstack)
df <- read.csv(dstack::pull("/cheptsov/player_data", College = "Top 10 colleges"))
head(df)

The dstack package is compatible with pandas.core.frame.DataFrame, data.frame, data.table, and tibble.