Data Sources
This article explains what data sources are and how to create and use them.
Data sources are a type of workload asset and represent a location where data is actually stored. They may represent a remote data location, such as NFS, Git, or S3, or a Kubernetes local resource, such as PVC, ConfigMap, HostPath, or Secret.
This configuration simplifies the mapping of the data into the workload’s file system and handles the mounting process during workload creation for reading and writing. These data sources are reusable and can be easily integrated and used by AI practitioners while submitting workloads across various scopes.
Data sources table¶
The data sources table can be found under Workload manager in the Run:ai platform.
The data sources table provides a list of all the data sources defined in the platform and allows you to manage them.
The data sources table comprises the following columns:
Column | Description |
---|---|
Data source | The name of the data source |
Description | A description of the data source |
Type | The type of data source connected – e.g., S3 bucket, PVC, or others |
Status | The current lifecycle phase and condition of the data source |
Scope | The scope of the data source within the organizational tree. Click the scope name to view the organizational tree diagram |
Kubernetes name | The unique Kubernetes name of the data source as it appears in the cluster |
Workload(s) | The list of existing workloads that use the data source |
Template(s) | The list of workload templates that use the data source |
Created by | The user who created the data source |
Creation time | The timestamp for when the data source was created |
Cluster | The cluster that the data source is associated with |
Data sources status¶
The following table describes the data sources' condition and whether they were created successfully for the selected scope.
Status | Description |
---|---|
No issues found | No issues were found while creating the data source |
Issues found | Issues were found while propagating the data source credentials |
Issues found | The data source couldn’t be created at the cluster |
Creating… | The data source is being created |
No status / “-” | When the data source’s scope is an account, the current version of the cluster is not up to date, or the asset is not a cluster-syncing entity, the status can’t be displayed |
Customizing the table view¶
- Filter - Click ADD FILTER, select the column to filter by, and enter the filter values
- Search - Click SEARCH and type the value to search by
- Sort - Click a column header to sort the table by that column
- Column selection - Click COLUMNS and select the columns to display in the table
- Download table - Click MORE and then click ‘Download as CSV’
- Refresh - Click REFRESH to update the table with the latest data
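Once the table has been exported with ‘Download as CSV’, the file can be processed offline. The following is a minimal Python sketch that lists the data sources reporting issues. It assumes that the CSV header row uses the column names shown in the table above (`Data source`, `Status`); the export format is not described in this article, so adjust the names to match your file. The function name `data_sources_with_issues` and the sample rows are hypothetical.

```python
import csv
import io


def data_sources_with_issues(csv_text):
    """Return the names of data sources whose Status is 'Issues found'.

    Assumption (not confirmed by this article): the exported CSV has
    'Data source' and 'Status' header columns.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row["Data source"]
        for row in reader
        if (row.get("Status") or "").strip() == "Issues found"
    ]


# Hypothetical export content, for illustration only.
sample = (
    "Data source,Type,Status\n"
    "team-nfs,NFS,No issues found\n"
    "models-s3,S3 bucket,Issues found\n"
)
print(data_sources_with_issues(sample))
```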
Adding a new data source¶
To create a new data source:
- Click +NEW DATA SOURCE
- Select the data source type from the list. Follow the step-by-step guide for each data source type:
NFS¶
A Network File System (NFS) volume is a Kubernetes volume type used for sharing storage in the cluster among different pods. Like a PVC, an NFS volume’s content is preserved beyond the lifecycle of a single pod. However, unlike PVCs, which abstract storage management, NFS provides a method for network-based file sharing. The NFS volume can be pre-populated with data and can be mounted for writing by multiple pods simultaneously. In Run:ai, an NFS-type data source is an abstraction that maps directly to a Kubernetes NFS volume. This integration allows multiple workloads under various scopes to mount and use the NFS data source.
- Select the cluster under which to create this data source
- Select a scope
- Enter a name for the data source. The name must be unique.
- Optional: Provide a description of the data source
- Set the data origin
- Enter the NFS server (host name or host IP)
- Enter the NFS path
- Set the data target location
- Container path
- Optional: Restrictions
- Prevent data modification - When enabled, the data will be mounted with read-only permissions
- Click CREATE DATA SOURCE
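To illustrate how the fields above relate to a Kubernetes NFS volume, here is a minimal Python sketch that builds the `volumes` and `volumeMounts` entries of a pod spec. The `nfs` volume fields (`server`, `path`, `readOnly`) and the mount fields (`mountPath`, `readOnly`) are standard Kubernetes fields. How Run:ai generates the pod spec internally is not described in this article, so the helper name `nfs_volume_and_mount` and all example values are hypothetical.

```python
import json


def nfs_volume_and_mount(name, server, path, container_path, read_only=False):
    """Build Kubernetes volume and volumeMount entries for an NFS share.

    name            -> data source name (used as the volume name)
    server / path   -> data origin (NFS server and NFS path)
    container_path  -> data target location inside the container
    read_only       -> corresponds to 'Prevent data modification'
    """
    volume = {
        "name": name,
        "nfs": {"server": server, "path": path, "readOnly": read_only},
    }
    mount = {"name": name, "mountPath": container_path, "readOnly": read_only}
    return volume, mount


# Hypothetical values, for illustration only.
volume, mount = nfs_volume_and_mount(
    name="team-datasets",
    server="nfs.example.internal",
    path="/exports/datasets",
    container_path="/data",
    read_only=True,
)
print(json.dumps({"volumes": [volume], "volumeMounts": [mount]}, indent=2))
```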
PVC¶
A Persistent Volume Claim (PVC) is a Kubernetes concept used for managing storage in the cluster, which can be provisioned by an administrator or dynamically by Kubernetes using a StorageClass. PVCs allow users to request specific sizes and access modes (for example, read-write once or read-only many). Run:ai ensures that data remains consistent and accessible across various scopes and workloads, beyond the lifecycle of individual pods, which is useful when working with the large datasets typically associated with AI projects.
- Select the cluster under which to create this data source
- Select a scope
- Enter a name for the data source. The name must be unique.
- Optional: Provide a description of the data source
- Select PVC:
- Existing PVC - Use this option to create a PVC-type data source based on an existing PVC in the cluster
- Select a PVC from the list (the list is empty if no PVCs were created in advance)
- New PVC - Creates a new PVC in the cluster. New PVCs are not added to the Existing PVCs list.
When you create a PVC-type data source with the ‘New PVC’ option, the PVC is created in the cluster immediately, even if no workload has requested it.
- Select the storage class
- None - Proceed without defining a storage class
- Custom storage class - This option applies when selecting a storage class based on existing storage classes.
To add new storage classes to the storage class list, and for additional information, check Kubernetes storage classes
- Select the access mode(s) (multiple modes can be selected)
- Read-write by one node - The volume can be mounted as read-write by a single node.
- Read-only by many nodes - The volume can be mounted as read-only by many nodes.
- Read-write by many nodes - The volume can be mounted as read-write by many nodes.
- Set the claim size and its units
- Select the volume mode
- File system (default) - allows the volume to be mounted as a filesystem, enabling the usage of directories and files.
- Block - exposes the volume as a block storage, which can be formatted or used by applications directly without a filesystem.
- Set the data target location
- Container path
- Optional: Prevent data modification - When enabled, the data will be mounted with read-only permissions.
- Click CREATE DATA SOURCE
After the data source is created, check its status to monitor its proper creation across the selected scope.
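The options above correspond to fields of a standard Kubernetes PersistentVolumeClaim. The following Python sketch shows one way the UI choices could map to a PVC manifest. The Kubernetes field names and values (`accessModes`, `storageClassName`, `volumeMode`, `ReadWriteOnce`, and so on) are standard; the mapping tables, the helper name `pvc_manifest`, and the example values are assumptions for illustration and are not taken from the Run:ai implementation.

```python
import json

# UI labels from the steps above mapped to Kubernetes access mode values.
ACCESS_MODES = {
    "Read-write by one node": "ReadWriteOnce",
    "Read-only by many nodes": "ReadOnlyMany",
    "Read-write by many nodes": "ReadWriteMany",
}

# UI labels mapped to Kubernetes volumeMode values.
VOLUME_MODES = {"File system": "Filesystem", "Block": "Block"}


def pvc_manifest(name, access_modes, size, storage_class=None,
                 volume_mode="File system"):
    """Build a PersistentVolumeClaim manifest as a plain dict."""
    spec = {
        "accessModes": [ACCESS_MODES[mode] for mode in access_modes],
        "resources": {"requests": {"storage": size}},
        "volumeMode": VOLUME_MODES[volume_mode],
    }
    # 'None' in the UI: leave storageClassName unset, so Kubernetes
    # applies the cluster's default storage class, if one is defined.
    if storage_class is not None:
        spec["storageClassName"] = storage_class
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": spec,
    }


# Hypothetical values, for illustration only.
manifest = pvc_manifest("training-data", ["Read-write by many nodes"], "100Gi")
print(json.dumps(manifest, indent=2))
```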
S3 Bucket¶
The S3 bucket data source enables the mapping of a remote S3 bucket into the workload’s file system. Similar to a PVC, this mapping remains accessible across different workload executions, extending beyond the lifecycle of individual pods. However, unlike PVCs, data stored in an S3 bucket resides remotely, which may lead to decreased performance during the execution of heavy machine learning workloads. As part of the Run:ai connection to the S3 bucket, you can create credentials in order to access and map private buckets.
- Select the cluster under which to create this data source
- Select a scope
- Enter a name for the data source. The name must be unique.
- Optional: Provide a description of the data source
- Set the data origin
- Set the S3 service URL
- Select the credentials
- None - for public buckets
- Credential names - This option is relevant for private buckets based on existing credentials that were created for the scope.
To add new credentials to the credentials list, and for additional information, check the Credentials article.
- Enter the bucket name
- Set the data target location
- Container path
- Click CREATE DATA SOURCE
After a private data source is created, check its status to monitor its proper creation across the selected scope.
Git¶
A Git-type data source is a Run:ai integration that enables code to be copied from a Git branch into a dedicated folder in the container. It is mainly used to provide the workload with the latest version of a code repository. To access private repositories, you can add predefined credentials to the data source mapping.
- Select the cluster under which to create this data source
- Select a scope
- Enter a name for the data source. The name must be unique.
- Optional: Provide a description of the data source
- Set the data origin
- Set the Repository URL
- Set the Revision (branch, tag, or hash). If left empty, 'HEAD' (the latest revision) is used
- Select the credentials
- None - for public repositories
- Credential names - This option applies to private repositories based on existing credentials that were created for the scope.
To add new credentials to the credentials list, and for additional information, check the Credentials article.
- Set the data target location
- Container path
- Click CREATE DATA SOURCE
After a private data source is created, check its status to monitor its proper creation across the selected scope.
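The effect of the Repository URL, Revision, and container path fields can be approximated with plain Git commands. The Python sketch below only builds the command lists and does not run them. The mechanism Run:ai actually uses to copy the repository is not described in this article, so treat this as an illustration of the equivalent manual steps; the helper name `git_fetch_commands` and the example URL are hypothetical.

```python
def git_fetch_commands(repo_url, target_dir, revision=None):
    """Return the git commands that place a repository revision in target_dir.

    With no revision, the clone is left on the remote's default HEAD.
    With a revision (branch, tag, or commit hash), a checkout follows.
    """
    commands = [["git", "clone", repo_url, target_dir]]
    if revision:
        commands.append(["git", "-C", target_dir, "checkout", revision])
    return commands


# Hypothetical values, for illustration only.
for command in git_fetch_commands(
    "https://example.com/org/project.git", "/code", revision="v1.2.0"
):
    print(" ".join(command))
```

To execute the commands, each list can be passed to `subprocess.run(command, check=True)` on a machine where Git is installed.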
Host path¶
A Host path volume is a Kubernetes concept that enables mounting a host path file or a directory on the workload’s file system. Like a PVC, the host path volume’s data persists across workloads under various scopes. It also enables data serving from the hosting node.
- Select the cluster under which to create this data source
- Select a scope
- Enter a name for the data source. The name must be unique.
- Optional: Provide a description of the data source
- Set the data origin
- Host path
- Set the data target location
- Container path
- Optional: Prevent data modification - When enabled, the data will be mounted with read-only permissions.
- Click CREATE DATA SOURCE
ConfigMap¶
A ConfigMap data source is a Run:ai abstraction for the Kubernetes ConfigMap concept. A ConfigMap is used mainly to store non-confidential data that can be mounted on the workload container. The data is usually represented as key-value pairs (e.g., environment variables, command-line arguments, etc.). ConfigMaps allow you to decouple environment-specific configuration from your container images, so that your applications are easily portable. ConfigMaps must be created on the cluster before they can be used within the Run:ai system.
- Select the cluster under which to create this data source
- Select a scope
- Enter a name for the data source. The name must be unique.
- Optional: Provide a description of the data source
- Set the data origin
- Select the ConfigMap name (The list is empty if no existing ConfigMaps were created in advance).
- Set the data target location
- Container path
- Click CREATE DATA SOURCE
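When Kubernetes mounts a ConfigMap as a volume, each key appears as a file under the mount path, and the file content is the key's value. Assuming the ConfigMap data source is mounted this way at the container path, a workload could read its configuration as in the following Python sketch. The helper name `read_mounted_config` and the `LOG_LEVEL` key are hypothetical, and a temporary directory stands in for the real mount so that the example runs anywhere.

```python
import os
import tempfile


def read_mounted_config(container_path):
    """Read each file under the mount path into a {key: value} dict."""
    config = {}
    for entry in sorted(os.listdir(container_path)):
        # Kubernetes keeps bookkeeping entries prefixed with '..' in the
        # mount directory; they are not ConfigMap keys.
        if entry.startswith(".."):
            continue
        full_path = os.path.join(container_path, entry)
        if os.path.isfile(full_path):
            with open(full_path, encoding="utf-8") as handle:
                config[entry] = handle.read()
    return config


# Simulate a mounted ConfigMap with one key (illustration only).
with tempfile.TemporaryDirectory() as mount_dir:
    with open(os.path.join(mount_dir, "LOG_LEVEL"), "w", encoding="utf-8") as f:
        f.write("debug")
    loaded = read_mounted_config(mount_dir)

print(loaded)
```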
Secret¶
A secret-type data source enables the mapping of a credential into the workload’s file system. Credentials are workload assets that simplify working with Kubernetes Secrets. They mask sensitive access information, such as passwords, tokens, and access keys, which is necessary for gaining access to various resources.
- Select the cluster under which to create this data source
- Select a scope
- Enter a name for the data source. The name must be unique.
- Optional: Provide a description of the data source
- Set the data origin
- Select the credentials
To add new credentials, and for additional information, check the Credentials article.
- Set the data target location
- Container path
- Click CREATE DATA SOURCE
After the data source is created, check its status to monitor its proper creation across the selected scope.
Note
It is also possible to add data sources directly when creating a specific workspace, training, or inference workload.
Editing a data source¶
To edit a data source:
- Select the data source from the table
- Click Rename to provide it with a new name
- Click Copy & Edit to make any changes to the data source
Deleting a data source¶
To delete a data source:
- Select the data source you want to delete
- Click DELETE
- Confirm you want to delete the data source
Note
It is not possible to delete a data source that is being used by an existing workload or template.
Using API¶
To view the available actions, go to the Data sources API reference.