Fundamentals

RandBLAS has a polymorphic free-function API. We have spent a significant amount of effort on minimizing the number of RandBLAS–specific datastructures needed in order to achieve that polymorphism.

We have a bunch of functions that aren’t documented on this website. If such a function looks useful, you should feel free to use it. If you end up doing that, and you care about your code’s compatibility with future versions of RandBLAS, then please let us know by filing a quick GitHub issue.

Preliminaries

Dense sketching, with Gaussians et al.

DenseDist : a distribution over matrices with i.i.d., mean-zero, variance-one entries

enum class RandBLAS::ScalarDist : char

Names for mean-zero variance-one distributions on the reals.

Values:

enumerator Gaussian: The Gaussian distribution with mean 0 and variance 1.

enumerator Uniform: The uniform distribution over \({[-r, r],}\) where \({r := \sqrt{3}}\) provides for a variance of 1.

struct DenseDist

A distribution over matrices whose entries are iid mean-zero variance-one random variables. This type conforms to the SketchingDistribution concept.

Public Functions

inline DenseDist(int64_t n_rows, int64_t n_cols, ScalarDist family = ScalarDist::Gaussian, Axis major_axis = Axis::Long)

Arguments passed to this function are used to initialize members of the same names. The members \({\texttt{dim_major},}\) \({\texttt{dim_minor},}\) \({\texttt{isometry_scale},}\) and \({\texttt{natural_layout}}\) are automatically initialized to be consistent with these arguments.

This constructor will raise an error if \({\min\{\texttt{n_rows}, \texttt{n_cols}\} \leq 0.}\)

Public Members

const int64_t n_rows: Matrices drawn from this distribution have this many rows.

const int64_t n_cols: Matrices drawn from this distribution have this many columns.

const Axis major_axis

This member affects whether samples from this distribution have their entries filled row-wise or column-wise. While there is no statistical difference between these two filling orders, there are situations where one order or the other might be preferred.

For more information, see the DenseDist::natural_layout and the section of the RandBLAS tutorial on updating sketches.

const int64_t dim_major

Defined as

\[\begin{split}\texttt{dim_major} = \begin{cases} \,\min\{ \texttt{n_rows},\, \texttt{n_cols} \} &\text{ if }~~ \texttt{major_axis} = \texttt{Short} \\ \max\{ \texttt{n_rows},\,\texttt{n_cols} \} & \text{ if } ~~\texttt{major_axis} = \texttt{Long} \end{cases}.\end{split}\]

const int64_t dim_minor: Defined as \({\texttt{n_rows} + \texttt{n_cols} - \texttt{dim_major}.}\) This is just whichever of \({(\texttt{n_rows}, \texttt{n_cols})}\) wasn’t identified as \({\texttt{dim_major}.}\)

const double isometry_scale: A sketching operator sampled from this distribution should be multiplied by this constant in order for sketching to preserve norms in expectation.

const ScalarDist family: The distribution on \(\mathbb{R}\) for entries of operators sampled from this distribution.

const blas::Layout natural_layout

The fill order (row major or column major) implied by major_axis, n_rows, and n_cols, according to the following table.

\(\texttt{major_axis} = \texttt{Long}\)

\(\texttt{major_axis} = \texttt{Short}\)

\(\texttt{n_rows} > \texttt{n_cols}\)

column major

row major

\(\texttt{n_rows} \leq \texttt{n_cols}\)

row major

column major

If you want to sample a dense sketching operator represented as buffer in a layout different than the one given here, then a change-of-layout has to be performed explicitly.

DenseSkOp : a sample from a DenseDist

template<typename T, typename RNG = DefaultRNG> struct DenseSkOp

A sample from a distribution over matrices whose entries are iid mean-zero variance-one random variables. This type conforms to the SketchingOperator concept.

Public Types

using distribution_t = DenseDist : Type alias.

using state_t = RNGState<RNG>: Type alias.

using scalar_t = T : Real scalar type used in matrix representations of this operator.

Public Functions

inline DenseSkOp(DenseDist dist, const state_t &seed_state)

Arguments passed to this function are used to initialize members of the same names. \({\texttt{own_memory}}\) is initialized to true, \({\texttt{buff}}\) is initialized to nullptr, and \({\texttt{layout}}\) is initialized to \({\texttt{dist.natural_layout}.}\) The \({\texttt{next_state}}\) member is computed automatically from \({\texttt{dist}}\) and \({\texttt{next_state}.}\)

Although \({\texttt{own_memory}}\) is initialized to true, RandBLAS will not attach memory to \({\texttt{buff}}\) unless fill_dense(DenseSkOp &S) is called.

If a RandBLAS function needs an explicit representation of this operator and yet \({\texttt{buff}}\) is null, then RandBLAS will construct a temporary explicit representation of this operator and delete that representation before returning.

Public Members

const DenseDist dist: The distribution from which this operator is sampled; this member specifies the number of rows and columns of this operator.

const state_t seed_state: The state that should be passed to the RNG when the full operator needs to be sampled from scratch.

const state_t next_state: The state that should be used in the next call to a random sampling function whose output should be statistically independent from properties of this operator.

const int64_t n_rows: Alias for dist.n_rows.

const int64_t n_cols: Alias for dist.n_cols.

bool own_memory

If true, then RandBLAS has permission to allocate and attach memory to this operator’s \({\texttt{buff}}\) member. If true at destruction time, then delete [] will be called on \({\texttt{buff}}\) if it is non-null.

RandBLAS only writes to this member at construction time.

T *buff = nullptr

Reference to an array that holds the explicit representation of this operator as a dense matrix.

If non-null this must point to an array of length at least \({\texttt{dist.n_cols * dist.n_rows},}\) and this array must contain the random samples from \({\texttt{dist}}\) implied by \({\texttt{seed_state}.}\) See DenseSkOp::layout for more information.

const blas::Layout layout: The storage order for \({\texttt{buff}.}\) The leading dimension of \({\operatorname{mat}(\texttt{buff})}\) when reading from \({\texttt{buff}}\) is \({\texttt{dist.dim_major}.}\)

template<typename DenseSkOp> void RandBLAS::fill_dense(DenseSkOp &S)

If \({\texttt{S.own_memory}}\) is true then we enter an allocation stage. If \({\texttt{S.buff}}\) is equal to \({\texttt{nullptr}}\) then it is redirected to the start of an new array (allocated with new []) of length \({\texttt{S.n_rows * S.n_cols}.}\)

After the allocation stage, we check \({\texttt{S.buff}}\) and we raise an error if it’s null.

If \({\texttt{S.buff}}\) is are non-null, then we’ll assume it has length at least \({\texttt{S.n_rows * S.n_cols}.}\) We’ll proceed to populate \({\texttt{S.buff}}\) with the data for the explicit representation of \({\texttt{S}.}\) On exit, one can encode a BLAS-style representation of \({\texttt{S}}\) with the tuple

\[(\texttt{S.layout},~\texttt{S.n_rows},~\texttt{S.n_cols},~\texttt{S.buff},~\texttt{S.dist.dim_major})\]

In BLAS parlance, the last element of this tuple would be called the “leading dimension” of \(\texttt{S}.\)

template<typename T, typename RNG = DefaultRNG> RNGState<RNG> RandBLAS::fill_dense_unpacked(blas::Layout layout, const DenseDist &D, int64_t n_rows, int64_t n_cols, int64_t ro_s, int64_t co_s, T *buff, const RNGState<RNG> &seed)

This function provides the underlying implementation of fill_dense(DenseSkOp &S). Unlike fill_dense(DenseSkOp &S), this function can be used to generate explicit representations of submatrices of dense sketching operators.

Formally, if \({\mathbf{S}}\) were sampled from \({\mathcal{D}}\) with \({\texttt{seed_state=seed}}\) , then on exit we’d have

\[\operatorname{mat}(\mathtt{buff})_{ij} = \mathbf{S}_{(i+\mathtt{ro\_s})(\mathtt{co\_s} + j)}\]

where \(\operatorname{mat}(\cdot)\) reads from \(\mathtt{buff}\) in \(\mathtt{layout}\) order.

If \({\texttt{layout != dist.natural_layout}}\) then this function internally allocates \({\texttt{n_rows * n_cols}}\) words of workspace, and deletes this workspace before returning.

Sparse sketching, with CountSketch et al.

SparseDist : a distribution over structured sparse matrices

struct SparseDist

A distribution over matrices with structured sparsity. Depending on parameter choices, one can obtain distributions described in the literature as SJLTs, OSNAPs, hashing embeddings, CountSketch, row or column sampling, or LESS-Uniform distributions. All members of a SparseDist are const.

Public Functions

inline SparseDist(int64_t n_rows, int64_t n_cols, int64_t vec_nnz = 4, Axis major_axis = Axis::Short)

Arguments passed to this function are used to initialize members of the same names. The members \({\texttt{dim_major},}\) \({\texttt{dim_minor},}\) \({\texttt{isometry_scale},}\) and \({\texttt{full_nnz}}\) are automatically initialized to be consistent with these arguments.

This constructor will raise an error if \({\min\{\texttt{n_rows}, \texttt{n_cols}\} \leq 0}\) or if \({\texttt{vec_nnz}}\) does not respect the bounds documented for the \({\texttt{vec_nnz}}\) member.

Public Members

const int64_t n_rows: Matrices drawn from this distribution have this many rows; must be greater than zero.

const int64_t n_cols: Matrices drawn from this distribution have this many columns; must be greater than zero.

const Axis major_axis

Operators sampled from this distribution are constructed by taking independent samples from a suitable distribution \({\mathcal{V}}\) over sparse vectors. This distribution is always over \({\mathbb{R}^k,}\) where \({k = \texttt{dim_major}.}\) The structural properties of \({\mathcal{V}}\) depend heavily on whether we’re short-axis major or long-axis major.

To be explicit, let’s say that \({\mathbf{x}}\) is a sample from \({\mathcal{V}.}\)

If \({\texttt{major_axis} = \texttt{Short}}\) , then \({\mathbf{x}}\) has exactly \({\texttt{vec_nnz}}\) nonzeros, and the locations of those nonzeros are chosen uniformly without replacement from \({\{0,\ldots,k-1\}.}\) The values of the nonzeros are sampled independently and uniformly from +/- 1.

If \({\texttt{major_axis} = \texttt{Long}}\) , then \({\mathbf{x}}\) has at most \({\texttt{vec_nnz}}\) nonzero entries. The locations of the nonzeros are determined by sampling uniformly with replacement from \({\{0,\ldots,k-1\}.}\) If index \({j}\) occurs in the sample \({\ell}\) times, then \({\mathbf{x}_j}\) will equal \({\sqrt{\ell}}\) with probability 1/2 and \({-\sqrt{\ell}}\) with probability 1/2.

const int64_t dim_major

Defined as

\[\begin{split}\texttt{dim_major} = \begin{cases} \,\min\{ \texttt{n_rows},\, \texttt{n_cols} \} &\text{ if }~~ \texttt{major_axis} = \texttt{Short} \\ \max\{ \texttt{n_rows},\,\texttt{n_cols} \} & \text{ if } ~~\texttt{major_axis} = \texttt{Long} \end{cases}.\end{split}\]

const int64_t dim_minor: Defined as \({\texttt{n_rows} + \texttt{n_cols} - \texttt{dim_major}.}\) This is just whichever of \({(\texttt{n_rows},, \texttt{n_cols})}\) wasn’t identified as \({\texttt{dim_major}.}\)

const double isometry_scale: An operator sampled from this distribution should be multiplied by this constant in order for sketching to preserve norms in expectation.

const int64_t vec_nnz: This constrains the number of nonzeros in each major-axis vector. It’s subject to the bounds \({1 \leq \texttt{vec_nnz} \leq \texttt{dim_major}.}\) See this tutorial page for advice on how to set this member.

const int64_t full_nnz: An upper bound on the number of structural nonzeros that can appear in an operator sampled from this distribution. Computed automatically as \({\texttt{full_nnz} = \texttt{vec_nnz} * \texttt{dim_minor}.}\)

SparseSkOp : a sample from a SparseDist

template<typename T, typename RNG = DefaultRNG, SignedInteger sint_t = int64_t> struct SparseSkOp

A sample from a distribution over structured sparse matrices with either independent rows or independent columns. This type conforms to the SketchingOperator concept.

Public Types

using distribution_t = SparseDist : Type alias.

using state_t = RNGState<RNG>: Type alias.

using scalar_t = T : Real scalar type used for nonzeros in matrix representations of this operator.

using index_t = sint_t : Signed integer type used in index arrays for sparse matrix representations of this operator.

Public Functions

inline SparseSkOp(SparseDist dist, const state_t &seed_state)

Standard constructor. Arguments passed to this function are used to initialize members of the same names. own_memory is initialized to true, nnz is initialized to -1, and (vals, rows, cols) are each initialized to nullptr. next_state is computed automatically from dist and seed_state.

Although own_memory is initialized to true, RandBLAS will not attach memory to (vals, rows, cols) unless fill_sparse(SparseSkOp &S) is called.

If a RandBLAS function needs an explicit representation of this operator and yet nnz < 0, then RandBLAS will construct a temporary explicit representation of this operator and delete that representation before returning.

inline SparseSkOp(SparseDist dist, const state_t &seed_state, const state_t &next_state, int64_t nnz, T *vals, sint_t *rows, sint_t *cols): Expert constructor. Arguments passed to this function are used to initialize members of the same names. own_memory is initialized to false.

Public Members

const SparseDist dist: The distribution from which this operator is sampled.

const state_t seed_state: The state passed to random sampling functions when the full operator needs to be sampled from scratch.

const state_t next_state: The state that should be used in the next call to a random sampling function whose output should be statistically independent from properties of this operator.

const int64_t n_rows: Alias for dist.n_rows. Automatically initialized in all constructors.

const int64_t n_cols: Alias for dist.n_cols. Automatically initialized in all constructors.

bool own_memory

If true, then RandBLAS has permission to allocate and attach memory to this operator’s reference members (S.rows, S.cols, and S.vals). If true at destruction time, then delete [] will be called on each of this operator’s non-null reference members.

RandBLAS only writes to this member at construction time.

int64_t nnz: The number of structural nonzeros in this operator. Negative values are a flag that the operator’s explicit representation hasn’t been sampled yet.

T *vals

Reference to an array that holds the values of this operator’s structural nonzeros.

If non-null, this must point to an array of length at least dist.full_nnz.

sint_t *rows

Reference to an array that holds the row indices for this operator’s structural nonzeros.

If non-null, this must point to an array of length at least dist.full_nnz.

sint_t *cols

Reference to an array that holds the column indices for this operator’s structural nonzeros.

If non-null, this must point to an array of length at least dist.full_nnz.

template<typename SparseSkOp> void RandBLAS::fill_sparse(SparseSkOp &S)

If \({\texttt{S.own_memory}}\) is true then we enter an allocation stage. This stage inspects the reference members of \({\texttt{S}}\) . Any reference member that’s equal to \({\texttt{nullptr}}\) is redirected to the start of a new array (allocated with new []) of length \({\texttt{S.dist.full_nnz}.}\)

After the allocation stage, we inspect the reference members of \({\texttt{S}}\) and we raise an error if any of them are null.

If all reference members are are non-null, then we’ll assume each of them has length at least \({\texttt{S.dist.full_nnz}.}\) We’ll proceed to populate those members (and \({\texttt{S.nnz}}\) ) with the data for the explicit representation of \({\texttt{S}.}\) On exit, \({\texttt{S}}\) can be equivalently represented by

RandBLAS::COOMatrix mat(S.n_rows, S.n_cols, S.nnz, S.vals, S.rows, S.cols);

template<typename T, typename sint_t, typename state_t> state_t RandBLAS::fill_sparse_unpacked_nosub(const SparseDist &D, int64_t &nnz, T *vals, sint_t *rows, sint_t *cols, const state_t &seed_state)

This function is the underlying implementation of fill_sparse(SparseSkOp &S). It has no allocation stage and it skips checks for null pointers.

On entry, \({(\mathtt{vals},\mathtt{rows},\mathtt{cols})}\) are arrays of length at least \({{\mathcal{D}\mathtt{.full\_nnz}}.}\) On exit, the first \({\texttt{nnz}}\) entries of these arrays contain the data for a COO sparse matrix representation of the SparseSkOp defined by \({(\mathcal{D},\texttt{seed_state)}.}\)

Note: the “nosub” suffix in this function name reflects how it has no ability to sample submatrices of sparse sketching operators. A future release of RandBLAS will add a function called “fill_sparse_unpacked()” with capabilities that are completely analogous to fill_dense_unpacked().

The unifying (C++20) concepts

SketchingDistribution

template<typename SkDist> concept SketchingDistribution

#include <base.hh>

Mathematical description

Matrices sampled from sketching distributions in RandBLAS are mean-zero and have covariance matrices that are proportional to the identity.

Formally, if \(\mathcal{D}\) is a distribution over \(r \times c\) matrices and \(\mathbf{S}\) is a sample from \(\mathcal{D}\), then \(\mathbb{E}\mathbf{S} = \mathbf{0}_{r \times c}\) and

\begin{gather} \theta^2 \cdot \mathbb{E}\left[ \mathbf{S}^T\mathbf{S} \right]=\mathbf{I}_{c \times c}& \nonumber \\ \,\phi^2 \cdot \mathbb{E}\left[ \mathbf{S}{\mathbf{S}}^T\, \right]=\mathbf{I}_{r \times r}& \nonumber \end{gather}

hold for some \(\theta > 0\) and \(\phi > 0\).

The isometry scale of the distribution is \(\alpha := \theta\) if \(c \geq r\) and \(\alpha := \phi\) otherwise. If you want to sketch in a way that preserves squared norms in expectation, then you should sketch with a scaled sample \(\alpha \mathbf{S}\) rather than the sample itself.

Programmatic description

A variable \(\texttt{D}\) of a type that conforms to the \(\texttt{SketchingDistribution}\) concept has the following attributes.

	type	description
\(\texttt{D.n_rows}\)	\(\texttt{const int64_t}\)	samples from \(\texttt{D}\) have this many rows
\(\texttt{D.n_cols}\)	\(\texttt{const int64_t}\)	samples from \(\texttt{D}\) have this many columns
\(\texttt{D.isometry_scale}\)	\(\texttt{const double}\)	See above.

Note that the isometry scale is always stored in double precision; this has no bearing on the precision of sketching operators that are sampled from a \(\texttt{SketchingDistribution}\).

Notes

RandBLAS has two SketchingDistribution types: DenseDist and SparseDist. These types have members called called “major_axis,” “dim_major,” and “dim_minor.” These members have similar semantic roles across the two classes, but their precise meanings differ significantly.

\(\texttt{SKOP::distribution_t}\)	A type conforming to the SketchingDistribution concept.
\(\texttt{SKOP::state_t}\)	A template instantiation of RNGState.
\(\texttt{SKOP::scalar_t}\)	Real scalar type used in matrix representations of \(\texttt{SKOP}\text{s}.\)

\(\texttt{S.dist}\)	\(\texttt{const distribution_t}\)	Distribution from which this operator is sampled.
\(\texttt{S.n_rows}\)	\(\texttt{const int64_t}\)	An alias for \(\texttt{S.dist.n_rows}.\)
\(\texttt{S.n_cols}\)	\(\texttt{const int64_t}\)	An alias for \(\texttt{S.dist.n_cols}.\)
\(\texttt{S.seed_state}\)	\(\texttt{const state_t}\)	RNGState used to construct an explicit representation of \(\texttt{S}\).
\(\texttt{S.next_state}\)	\(\texttt{const state_t}\)	An RNGState that can be used in a call to a random sampling routine whose output should be statistically independent from \(\texttt{S}.\)
\(\texttt{S.own_memory}\)	\(\texttt{bool}\)	A flag used to indicate whether internal functions have permission to attach memory to \(\texttt{S},\) and whether the destructor of \(\texttt{S}\) has the responsibility to delete any memory that’s attached to \(\texttt{S}.\)