biom.table.Table.from_hdf5

classmethod Table.from_hdf5(h5grp, ids=None, axis='sample', parse_fs=None, subset_with_metadata=True)

Parse an HDF5 formatted BIOM table

If ids is provided, only the samples/observations listed in ids (depending on the value of axis) will be loaded

The expected structure of this group is below. A few basic definitions, N is the number of observations and M is the number of samples. Data are stored in both compressed sparse row (for observation oriented operations) and compressed sparse column (for sample oriented operations).

Parameters:
h5grpa h5py Group or an open h5py File

The object to load from

idsiterable

The sample/observation ids of the samples/observations that we need to retrieve from the hdf5 biom table

axis‘sample’, ‘observation’, optional

The axis to subset on

parse_fsdict, optional

Specify custom parsing functions for metadata fields. This dict is expected to be {‘metadata_field’: function}, where the function signature is (object) corresponding to a single row in the associated metadata dataset. The return from this function an object as well, and is the parsed representation of the metadata.

subset_with_metadatabool, optional

When subsetting (i.e., ids is not None), whether to also parse the metadata. By default, the metadata are also subset. The reason for exposing this functionality is that, for large tables, there exists a very large overhead for this metadata manipulation.

Returns:
biom.Table

A BIOM Table object

Raises:
ValueError

If ids are not a subset of the samples or observations ids present in the hdf5 biom table If h5grp is not a HDF5 file or group

See also

Table.to_hdf5

Notes

The expected HDF5 group structure is below. An example of an HDF5 file in DDL can be found here [1].

  • ./id : str, an arbitrary ID # noqa

  • ./type : str, the table type (e.g, OTU table) # noqa

  • ./format-url : str, a URL that describes the format # noqa

  • ./format-version : two element tuple of int32, major and minor # noqa

  • ./generated-by : str, what generated this file # noqa

  • ./creation-date : str, ISO format # noqa

  • ./shape : two element tuple of int32, N by M # noqa

  • ./nnz : int32 or int64, number of non zero elems # noqa

  • ./observation : Group # noqa

  • ./observation/ids : (N,) dataset of str or vlen str # noqa

  • ./observation/matrix : Group # noqa

  • ./observation/matrix/data : (nnz,) dataset of float64 # noqa

  • ./observation/matrix/indices : (nnz,) dataset of int32 # noqa

  • ./observation/matrix/indptr : (M+1,) dataset of int32 # noqa

  • ./observation/metadata : Group # noqa

  • [./observation/metadata/foo] : Optional, (N,) dataset of any valid HDF5 type in index order with IDs. # noqa

  • ./observation/group-metadata : Group # noqa

  • [./observation/group-metadata/foo] : Optional, (?,) dataset of group metadata that relates IDs # noqa

  • [./observation/group-metadata/foo.attrs[‘data_type’]] : attribute of the foo dataset that describes contained type (e.g., newick) # noqa

  • ./sample : Group # noqa

  • ./sample/ids : (M,) dataset of str or vlen str # noqa

  • ./sample/matrix : Group # noqa

  • ./sample/matrix/data : (nnz,) dataset of float64 # noqa

  • ./sample/matrix/indices : (nnz,) dataset of int32 # noqa

  • ./sample/matrix/indptr : (N+1,) dataset of int32 # noqa

  • ./sample/metadata : Group # noqa

  • [./sample/metadata/foo] : Optional, (M,) dataset of any valid HDF5 type in index order with IDs. # noqa

  • ./sample/group-metadata : Group # noqa

  • [./sample/group-metadata/foo] : Optional, (?,) dataset of group metadata that relates IDs # noqa

  • [./sample/group-metadata/foo.attrs[‘data_type’]] : attribute of the foo dataset that describes contained type (e.g., newick) # noqa

The ‘?’ character on the dataset size means that it can be of arbitrary length.

The expected structure for each of the metadata datasets is a list of atomic type objects (int, float, str, …), where the index order of the list corresponds to the index order of the relevant axis IDs. Special metadata fields have been defined, and they are stored in a specific way. Currently, the available special metadata fields are:

  • taxonomy: (N, ?) dataset of str or vlen str

  • KEGG_Pathways: (N, ?) dataset of str or vlen str

  • collapsed_ids: (N, ?) dataset of str or vlen str

References

Examples

>>> from biom.table import Table
>>> from biom.util import biom_open
>>> with biom_open('rich_sparse_otu_table_hdf5.biom') as f 
>>>     t = Table.from_hdf5(f) 

Parse a hdf5 biom table subsetting observations >>> from biom.util import biom_open # doctest: +SKIP >>> from biom.parse import parse_biom_table >>> with biom_open(‘rich_sparse_otu_table_hdf5.biom’) as f # doctest: +SKIP >>> t = Table.from_hdf5(f, ids=[“GG_OTU_1”], … axis=’observation’) # doctest: +SKIP