
Slice arrays

We saw how LaminDB allows you to query & search across artifacts & collections using registries: Query & search registries.

Let us now look at the following case:

# get a lookup object for labels
ulabels = ln.ULabel.lookup()
# query a parquet artifact annotated with the "setosa" label
df = ln.Artifact.filter(ulabels=ulabels.setosa, suffix=".parquet").first().load()
# query all observations in the DataFrame matching "setosa"
df_setosa = df.loc[df.iris_organism_name == ulabels.setosa.name]

Because the artifact was validated, querying the DataFrame is guaranteed to succeed!

Such within-collection queries are also possible for cloud-backed collections using DuckDB, TileDB, zarr, HDF5, parquet, and other storage backends.
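For example, a parquet artifact can be queried with SQL via DuckDB. A minimal sketch, assuming artifact references a parquet file and that the duckdb package is installed:

import duckdb

# cache the cloud artifact locally, then run SQL over the cached file
local_path = artifact.cache()
duckdb.query(f"SELECT * FROM read_parquet('{local_path}') LIMIT 5").df()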

In this notebook, we show how to subset an AnnData object and slice generic HDF5, zarr, and parquet datasets accessed in the cloud.

Let us create a remote instance for testing.

!lamin login testuser1
!lamin init --storage s3://lamindb-ci/test-arrays
 logged in with email testuser1@lamin.ai (uid: DzTjkKse)
! updating local SQLite & locking cloud SQLite (sync back & unlock: lamin disconnect)
 connected lamindb: testuser1/test-arrays

Import lamindb and track this notebook.

import lamindb as ln

ln.track("hsRyWJggf2Ca")
 connected lamindb: testuser1/test-arrays
 loaded Transform('hsRyWJggf2Ca0001'), re-started Run('LTicXyby...') at 2025-07-29 20:12:17 UTC
 notebook imports: lamindb==1.10.0

We’ll need some test data:

ln.Artifact("s3://lamindb-ci/test-arrays/pbmc68k.h5ad").save()
ln.Artifact("s3://lamindb-ci/test-arrays/testfile.hdf5").save()
 returning existing artifact with same hash: Artifact(uid='yAWLm86wv6LUv9N30000', is_latest=True, key='pbmc68k.h5ad', suffix='.h5ad', otype='AnnData', size=638484, hash='-QNUPBbAug3jFmmk3fsOQA', branch_id=1, space_id=1, storage_id=1, run_id=1, created_by_id=2, created_at=2025-07-25 11:01:04 UTC); to track this artifact as an input, use: ln.Artifact.get()
 returning existing artifact with same hash: Artifact(uid='dNXAfh0W3ZBO60ZK0000', is_latest=True, key='testfile.hdf5', suffix='.hdf5', size=1400, hash='UCWPjJkhzBjO97rtuo_8Yg', branch_id=1, space_id=1, storage_id=1, run_id=1, created_by_id=2, created_at=2025-07-25 11:01:04 UTC); to track this artifact as an input, use: ln.Artifact.get()
Artifact(uid='dNXAfh0W3ZBO60ZK0000', is_latest=True, key='testfile.hdf5', suffix='.hdf5', size=1400, hash='UCWPjJkhzBjO97rtuo_8Yg', branch_id=1, space_id=1, storage_id=1, run_id=1, created_by_id=2, created_at=2025-07-25 11:01:04 UTC)

Note that it is also possible to register Hugging Face paths. For this, the huggingface_hub package needs to be installed.
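If the package is missing, install it first (assuming pip):

!pip install huggingface_hub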

We register a folder of parquet files as a single artifact.

ln.Artifact("hf://datasets/Koncopd/lamindb-test/sharded_parquet").save()
 returning existing artifact with same hash: Artifact(uid='4HRzX1CgL1hdj5dx0000', is_latest=True, key='sharded_parquet', suffix='', size=42767, hash='oj6I3nNKj_eiX2I1q26qaw', n_files=11, branch_id=1, space_id=1, storage_id=2, run_id=2, created_by_id=1, created_at=2025-07-25 15:12:33 UTC); to track this artifact as an input, use: ln.Artifact.get()
Artifact(uid='4HRzX1CgL1hdj5dx0000', is_latest=True, key='sharded_parquet', suffix='', size=42767, hash='oj6I3nNKj_eiX2I1q26qaw', n_files=11, branch_id=1, space_id=1, storage_id=2, run_id=2, created_by_id=1, created_at=2025-07-25 15:12:33 UTC)

We also register a collection of individual parquet files.

artifact_shard1 = ln.Artifact(
    "hf://datasets/Koncopd/lamindb-test/sharded_parquet/louvain=0/947eee0b064440c9b9910ca2eb89e608-0.parquet"
).save()
artifact_shard2 = ln.Artifact(
    "hf://datasets/Koncopd/lamindb-test/sharded_parquet/louvain=1/947eee0b064440c9b9910ca2eb89e608-0.parquet"
).save()

ln.Collection(
    [artifact_shard1, artifact_shard2], key="sharded_parquet_collection"
).save()
 returning existing artifact with same hash: Artifact(uid='6hvTc8Te0DyFA6fQ0000', is_latest=True, key='sharded_parquet/louvain=0/947eee0b064440c9b9910ca2eb89e608-0.parquet', suffix='.parquet', size=4084, hash='o85GE0uiksFmcUy4f4_BCw', branch_id=1, space_id=1, storage_id=2, run_id=2, created_by_id=1, created_at=2025-07-25 15:12:33 UTC); to track this artifact as an input, use: ln.Artifact.get()
 returning existing artifact with same hash: Artifact(uid='xzODSboJ1NWDhzxP0000', is_latest=True, key='sharded_parquet/louvain=1/947eee0b064440c9b9910ca2eb89e608-0.parquet', suffix='.parquet', size=3915, hash='oBriQgo0Z8YrYvkG8UabFb', branch_id=1, space_id=1, storage_id=2, run_id=2, created_by_id=1, created_at=2025-07-25 15:12:33 UTC); to track this artifact as an input, use: ln.Artifact.get()
! returning existing collection with same hash: Collection(uid='uaQ9AOjHaR3aiP1B0000', is_latest=True, key='sharded_parquet_collection', hash='XavO_EEZSi-shT6uJGFHHA', branch_id=1, space_id=1, created_by_id=1, run_id=2, created_at=2025-07-25 15:12:33 UTC); if you intended to query to track this collection as an input, use: ln.Collection.get()
Collection(uid='uaQ9AOjHaR3aiP1B0000', is_latest=True, key='sharded_parquet_collection', hash='XavO_EEZSi-shT6uJGFHHA', branch_id=1, space_id=1, created_by_id=1, run_id=2, created_at=2025-07-25 15:12:33 UTC)

AnnData

An h5ad artifact stored on S3:

artifact = ln.Artifact.get(key="pbmc68k.h5ad")
artifact.path
S3QueryPath('s3://lamindb-ci/test-arrays/pbmc68k.h5ad')
adata = artifact.open()

The returned object is an AnnDataAccessor, an AnnData object backed in the cloud:

adata
AnnDataAccessor object with n_obs × n_vars = 70 × 765
  constructed for the AnnData object pbmc68k.h5ad
    obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
    obsm: ['X_pca', 'X_umap']
    obsp: ['connectivities', 'distances']
    uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
    var: ['highly_variable', 'index', 'n_counts']
    varm: ['PCs']

Without subsetting, the AnnDataAccessor object references underlying lazy h5 or zarr arrays:

adata.X
<HDF5 dataset "X": shape (70, 765), type "<f4">
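Because .X here is a lazy h5py dataset, standard indexing reads only the requested slice into memory (a small sketch; the slice bounds are arbitrary):

# read only the first 5 rows of X; everything else stays in the cloud
first_rows = adata.X[:5]
first_rows.shape  # (5, 765)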

You can subset it like a normal AnnData object:

obs_idx = adata.obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    adata.obs.percent_mito <= 0.05
)
adata_subset = adata[obs_idx]
adata_subset
AnnDataAccessorSubset object with n_obs × n_vars = 35 × 765
  obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
  obsm: ['X_pca', 'X_umap']
  obsp: ['connectivities', 'distances']
  uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
  var: ['highly_variable', 'index', 'n_counts']
  varm: ['PCs']

Subsets load arrays into memory upon direct access:

adata_subset.X
array([[-0.326, -0.191,  0.499, ..., -0.21 , -0.636, -0.49 ],
       [ 0.811, -0.191, -0.728, ..., -0.21 ,  0.604, -0.49 ],
       [-0.326, -0.191,  0.643, ..., -0.21 ,  2.303, -0.49 ],
       ...,
       [-0.326, -0.191, -0.728, ..., -0.21 ,  0.626, -0.49 ],
       [-0.326, -0.191, -0.728, ..., -0.21 , -0.636, -0.49 ],
       [-0.326, -0.191, -0.728, ..., -0.21 , -0.636, -0.49 ]],
      shape=(35, 765), dtype=float32)

To load the entire subset into memory as an actual AnnData object, use to_memory():

adata_subset.to_memory()
AnnData object with n_obs × n_vars = 35 × 765
    obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain'
    var: 'n_counts', 'highly_variable'
    uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'

SpatialData

It is also possible to access AnnData objects inside SpatialData tables:

artifact = ln.Artifact.using("laminlabs/lamindata").get(
    key="visium_aligned_guide_min.zarr"
)

access = artifact.open()
 mapped: Artifact(uid='bjH534dxVi1drmLZ0001')
access
SpatialDataAccessor object
  constructed for the SpatialData object bjH534dxVi1drmLZ.zarr
    with tables: ['table']
access.tables
Accessor for the SpatialData attribute tables
  with keys: ['table']

This gives you the same AnnDataAccessor object as for a normal AnnData.

table = access.tables["table"]

table
AnnDataAccessor object with n_obs × n_vars = 37 × 18085
  constructed for the AnnData object table
    obs: ['_index', 'array_col', 'array_row', 'clone', 'dataset', 'in_tissue', 'region', 'spot_id']
    obsm: ['spatial']
    uns: ['spatial', 'spatialdata_attrs']
    var: ['feature_types', 'gene_ids', 'genome', 'symbols']

You can subset it and read it into memory as an actual AnnData object:

table_subset = table[table.obs["clone"] == "diploid"]

table_subset
AnnDataAccessorSubset object with n_obs × n_vars = 31 × 18085
  obs: ['_index', 'array_col', 'array_row', 'clone', 'dataset', 'in_tissue', 'region', 'spot_id']
  obsm: ['spatial']
  uns: ['spatial', 'spatialdata_attrs']
  var: ['feature_types', 'gene_ids', 'genome', 'symbols']
adata = table_subset.to_memory()

Generic HDF5

Let us query a generic HDF5 artifact:

artifact = ln.Artifact.get(key="testfile.hdf5")

And get a backed accessor:

backed = artifact.open()

The returned object exposes the underlying connection in .connection and the h5py.File or zarr.Group in .storage:

backed
BackedAccessor(connection=<File-like object S3FileSystem, lamindb-ci/test-arrays/testfile.hdf5>, storage=<HDF5 file "testfile.hdf5" (mode r)>)
backed.storage
<HDF5 file "testfile.hdf5" (mode r)>
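The storage object behaves like a regular h5py.File, so the usual h5py API applies (a sketch; the available keys depend on the file's contents):

# list the groups & datasets at the root of the file
list(backed.storage.keys())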

Parquet

A dataframe stored as a folder of sharded parquet files:

artifact = ln.Artifact.get(key="sharded_parquet")
artifact.path.view_tree()
11 sub-directories & 11 files with suffixes '.parquet'
hf://datasets/Koncopd/lamindb-test/sharded_parquet
├── louvain=0/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=1/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=10/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=2/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=3/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=4/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=5/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=6/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=7/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=8/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
└── louvain=9/
    └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
backed = artifact.open()

This returns a pyarrow dataset.

backed
<pyarrow._dataset.FileSystemDataset at 0x7f79e051fca0>
backed.head(5).to_pandas()
cell_type n_genes percent_mito
index
CGTTATACAGTACC-8 CD4+/CD45RO+ Memory 1034 0.010163
AGATATTGACCACA-1 CD4+/CD45RO+ Memory 1078 0.012831
GCAGGGCTGTATGC-8 CD8+/CD45RA+ Naive Cytotoxic 1055 0.012287
TTATGGCTGGCAAG-2 CD4+/CD25 T Reg 1236 0.023963
CACGACCTGGGAGT-7 CD4+/CD25 T Reg 1010 0.016620
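Because this is a pyarrow dataset, filters can be pushed down so that only matching data is materialized. A minimal sketch, using the columns shown in the preview above:

import pyarrow.dataset as ds

# filter on n_genes before materializing a table
backed.to_table(
    filter=ds.field("n_genes") > 1200,
    columns=["cell_type", "n_genes"],
).to_pandas()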

It is also possible to open a collection of cloud artifacts.

collection = ln.Collection.get(key="sharded_parquet_collection")
backed = collection.open()
backed
<pyarrow._dataset.FileSystemDataset at 0x7f79e051c5e0>
backed.to_table().to_pandas()
cell_type n_genes percent_mito
index
CGTTATACAGTACC-8 CD4+/CD45RO+ Memory 1034 0.010163
AGATATTGACCACA-1 CD4+/CD45RO+ Memory 1078 0.012831
GCAGGGCTGTATGC-8 CD8+/CD45RA+ Naive Cytotoxic 1055 0.012287
TTATGGCTGGCAAG-2 CD4+/CD25 T Reg 1236 0.023963
CACGACCTGGGAGT-7 CD4+/CD25 T Reg 1010 0.016620
AATCTCACTCAGTG-3 CD4+/CD45RO+ Memory 1183 0.016056
CTAGTTTGGCTTAG-4 CD4+/CD45RO+ Memory 1002 0.018922
ACGCCGGAAGCCTA-6 CD8+/CD45RA+ Naive Cytotoxic 1292 0.018315
CTGACCACCATGGT-4 CD8+/CD45RA+ Naive Cytotoxic 1559 0.024427
AGTTAAACAAACAG-1 CD19+ B 1005 0.019806
CTACGCACAGGGTG-3 CD4+/CD45RO+ Memory 1053 0.012073
CAGACAACAAAACG-7 CD4+/CD25 T Reg 1109 0.012702
GAGGGTGACCTATT-1 CD4+/CD25 T Reg 1003 0.012971
TGACTGGAACCATG-7 Dendritic cells 1277 0.012961
ACGACCCTGTCTGA-3 Dendritic cells 1074 0.017466
GTTATGCTACCTCC-3 CD14+ Monocytes 1201 0.016839
GTGTCAGATCTACT-6 CD14+ Monocytes 1014 0.025417
AAGAACGAACTCTT-6 CD14+ Monocytes 1067 0.019530
TACTCTGACGTAGT-1 Dendritic cells 1118 0.012069
TAAGCTCTTCTGGA-4 CD14+ Monocytes 1059 0.021497

By default, Artifact.open() and Collection.open() use pyarrow to lazily open dataframes. Polars can also be used by passing engine="polars". Note that .open(engine="polars") returns a context manager that yields a polars LazyFrame.

with collection.open(engine="polars") as lazy_df:
    display(lazy_df.collect().to_pandas())
sys:1: CategoricalRemappingWarning: Local categoricals have different encodings, expensive re-encoding is done to perform this merge operation. Consider using a StringCache or an Enum type if the categories are known in advance
cell_type n_genes percent_mito index
0 CD4+/CD45RO+ Memory 1034 0.010163 CGTTATACAGTACC-8
1 CD4+/CD45RO+ Memory 1078 0.012831 AGATATTGACCACA-1
2 CD8+/CD45RA+ Naive Cytotoxic 1055 0.012287 GCAGGGCTGTATGC-8
3 CD4+/CD25 T Reg 1236 0.023963 TTATGGCTGGCAAG-2
4 CD4+/CD25 T Reg 1010 0.016620 CACGACCTGGGAGT-7
5 CD4+/CD45RO+ Memory 1183 0.016056 AATCTCACTCAGTG-3
6 CD4+/CD45RO+ Memory 1002 0.018922 CTAGTTTGGCTTAG-4
7 CD8+/CD45RA+ Naive Cytotoxic 1292 0.018315 ACGCCGGAAGCCTA-6
8 CD8+/CD45RA+ Naive Cytotoxic 1559 0.024427 CTGACCACCATGGT-4
9 CD19+ B 1005 0.019806 AGTTAAACAAACAG-1
10 CD4+/CD45RO+ Memory 1053 0.012073 CTACGCACAGGGTG-3
11 CD4+/CD25 T Reg 1109 0.012702 CAGACAACAAAACG-7
12 CD4+/CD25 T Reg 1003 0.012971 GAGGGTGACCTATT-1
13 Dendritic cells 1277 0.012961 TGACTGGAACCATG-7
14 Dendritic cells 1074 0.017466 ACGACCCTGTCTGA-3
15 CD14+ Monocytes 1201 0.016839 GTTATGCTACCTCC-3
16 CD14+ Monocytes 1014 0.025417 GTGTCAGATCTACT-6
17 CD14+ Monocytes 1067 0.019530 AAGAACGAACTCTT-6
18 Dendritic cells 1118 0.012069 TACTCTGACGTAGT-1
19 CD14+ Monocytes 1059 0.021497 TAAGCTCTTCTGGA-4

Yet another way to open several parquet files as a single dataset is to call .open() directly on a query set.

backed = ln.Artifact.filter(suffix=".parquet").open()
! this query set is unordered, consider using `.order_by()` first to avoid opening the artifacts in an arbitrary order
backed
<pyarrow._dataset.FileSystemDataset at 0x7f79dc49a680>
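To open the artifacts in a deterministic order, order the query set first as the warning suggests (a sketch; ordering by creation time is one possible choice):

backed = ln.Artifact.filter(suffix=".parquet").order_by("created_at").open()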
# clean up test instance
!lamin delete --force test-arrays
╭─ Error ──────────────────────────────────────────────────────────────────────╮
│ 's3://lamindb-ci/test-arrays/.lamindb' contains 3 objects - delete them      │
│ prior to deleting the storage location                                       │
╰──────────────────────────────────────────────────────────────────────────────╯