Fundamentals
RandBLAS has a polymorphic free-function API. We have spent a significant amount of effort on minimizing the number of RandBLAS–specific datastructures needed in order to achieve that polymorphism.
We have a bunch of functions that aren’t documented on this website. If such a function looks useful, you should feel free to use it. If you end up doing that, and you care about your code’s compatibility with future versions of RandBLAS, then please let us know by filing a quick GitHub issue.
Preliminaries
The Axis enum
-
enum class RandBLAS::Axis : char
Sketching operators are only “useful” for dimension reduction if they’re non-square.
The larger dimension of a sketching operator has a different semantic role than the small dimension. This enum provides a way for us to refer to the larger or smaller dimension in a way that’s agnostic to whether the sketching operator is wide or tall.
For a wide matrix, its short-axis vectors are its columns, and its long-axis vectors are its rows.
For a tall matrix, its short-axis vectors are its rows, and its long-axis vectors are its columns.
Values:
-
enumerator Short
-
enumerator Long
-
enumerator Short
RNGState
-
template<typename RNG = DefaultRNG>
struct RNGState A representation of the state of a counter-based random number generator (CBRNG) defined in Random123. The representation consists of two arrays: the counter and the key. The arrays’ types are statically sized, small (typically of length 2 or 4), and can be distinct from one another.
The template parameter RNG is a CBRNG type in defined in Random123. We’ve found that Philox-based CBRNGs work best for our purposes, but we also support Threefry-based CBRNGS.
Public Types
Dense sketching, with Gaussians et al.
DenseDist : a distribution over matrices with i.i.d., mean-zero, variance-one entries
-
enum class RandBLAS::ScalarDist : char
We support two distributions for dense sketching operators: those whose entries are iid Gaussians or iid uniform over a symmetric interval.
Values:
-
enumerator Gaussian
Indicates the Gaussian distribution with mean 0 and variance 1.
-
enumerator Uniform
Indicates the uniform distribution over [-r, r] where r := sqrt(3) is the radius that provides for a variance of 1.
-
enumerator Gaussian
-
struct DenseDist
A distribution over matrices whose entries are iid mean-zero variance-one random variables. This type conforms to the SketchingDistribution concept.
Public Functions
-
inline DenseDist(int64_t n_rows, int64_t n_cols, ScalarDist family = ScalarDist::Gaussian, Axis major_axis = Axis::Long)
Arguments passed to this function are used to initialize members of the same names. The members \({\texttt{dim_major},}\) \({\texttt{dim_minor},}\) \({\texttt{isometry_scale},}\) and \({\texttt{natural_layout}}\) are automatically initialized to be consistent with these arguments.
This constructor will raise an error if \({\min\{\texttt{n_rows}, \texttt{n_cols}\} \leq 0.}\)
Public Members
-
const int64_t n_rows
Matrices drawn from this distribution have this many rows.
-
const int64_t n_cols
Matrices drawn from this distribution have this many columns.
-
const Axis major_axis
This member affects whether samples from this distribution have their entries filled row-wise or column-wise. While there is no statistical difference between these two filling orders, there are situations where one order or the other might be preferred.
For more information, see the DenseDist::natural_layout and the section of the RandBLAS tutorial on updating sketches.
-
const int64_t dim_major
Defined as
\[\begin{split}\texttt{dim_major} = \begin{cases} \,\min\{ \texttt{n_rows},\, \texttt{n_cols} \} &\text{ if }~~ \texttt{major_axis} = \texttt{Short} \\ \max\{ \texttt{n_rows},\,\texttt{n_cols} \} & \text{ if } ~~\texttt{major_axis} = \texttt{Long} \end{cases}.\end{split}\]
-
const int64_t dim_minor
Defined as \({\texttt{n_rows} + \texttt{n_cols} - \texttt{dim_major}.}\) This is just whichever of \({(\texttt{n_rows}, \texttt{n_cols})}\) wasn’t identified as \({\texttt{dim_major}.}\)
-
const double isometry_scale
A sketching operator sampled from this distribution should be multiplied by this constant in order for sketching to preserve norms in expectation.
-
const ScalarDist family
The distribution on \(\mathbb{R}\) for entries of operators sampled from this distribution.
-
const blas::Layout natural_layout
The fill order (row major or column major) implied by major_axis, n_rows, and n_cols, according to the following table.
\(\texttt{major_axis} = \texttt{Long}\)
\(\texttt{major_axis} = \texttt{Short}\)
\(\texttt{n_rows} > \texttt{n_cols}\)
column major
row major
\(\texttt{n_rows} \leq \texttt{n_cols}\)
row major
column major
If you want to sample a dense sketching operator represented as buffer in a layout different than the one given here, then a change-of-layout has to be performed explicitly.
-
inline DenseDist(int64_t n_rows, int64_t n_cols, ScalarDist family = ScalarDist::Gaussian, Axis major_axis = Axis::Long)
DenseSkOp : a sample from a DenseDist
-
template<typename T, typename RNG = r123::Philox4x32>
struct DenseSkOp A sample from a distribution over matrices whose entries are iid mean-zero variance-one random variables. This type conforms to the SketchingOperator concept.
Public Types
Public Functions
-
inline DenseSkOp(DenseDist dist, const state_t &seed_state)
Arguments passed to this function are used to initialize members of the same names. \({\texttt{own_memory}}\) is initialized to true, \({\texttt{buff}}\) is initialized to nullptr, and \({\texttt{layout}}\) is initialized to \({\texttt{dist.natural_layout}.}\) The \({\texttt{next_state}}\) member is computed automatically from \({\texttt{dist}}\) and \({\texttt{next_state}.}\)
Although \({\texttt{own_memory}}\) is initialized to true, RandBLAS will not attach memory to \({\texttt{buff}}\) unless fill_dense(DenseSkOp &S) is called.
If a RandBLAS function needs an explicit representation of this operator and yet \({\texttt{buff}}\) is null, then RandBLAS will construct a temporary explicit representation of this operator and delete that representation before returning.
Public Members
-
const DenseDist dist
The distribution from which this operator is sampled; this member specifies the number of rows and columns of this operator.
-
const state_t seed_state
The state that should be passed to the RNG when the full operator needs to be sampled from scratch.
-
const state_t next_state
The state that should be used in the next call to a random sampling function whose output should be statistically independent from properties of this operator.
-
const int64_t n_rows
Alias for dist.n_rows.
-
const int64_t n_cols
Alias for dist.n_cols.
-
bool own_memory
If true, then RandBLAS has permission to allocate and attach memory to this operator’s \({\texttt{buff}}\) member. If true at destruction time, then delete [] will be called on \({\texttt{buff}}\) if it is non-null.
RandBLAS only writes to this member at construction time.
-
T *buff = nullptr
Reference to an array that holds the explicit representation of this operator as a dense matrix.
If non-null this must point to an array of length at least \({\texttt{dist.n_cols * dist.n_rows},}\) and this array must contain the random samples from \({\texttt{dist}}\) implied by \({\texttt{seed_state}.}\) See DenseSkOp::layout for more information.
-
const blas::Layout layout
The storage order for \({\texttt{buff}.}\) The leading dimension of \({\operatorname{mat}(\texttt{buff})}\) when reading from \({\texttt{buff}}\) is \({\texttt{dist.dim_major}.}\)
-
inline DenseSkOp(DenseDist dist, const state_t &seed_state)
-
template<typename DenseSkOp>
void RandBLAS::fill_dense(DenseSkOp &S) If \({\texttt{S.own_memory}}\) is true then we enter an allocation stage. If \({\texttt{S.buff}}\) is equal to \({\texttt{nullptr}}\) then it is redirected to the start of an new array (allocated with
new []) of length \({\texttt{S.n_rows * S.n_cols}.}\)After the allocation stage, we check \({\texttt{S.buff}}\) and we raise an error if it’s null.
If \({\texttt{S.buff}}\) is are non-null, then we’ll assume it has length at least \({\texttt{S.n_rows * S.n_cols}.}\) We’ll proceed to populate \({\texttt{S.buff}}\) with the data for the explicit representation of \({\texttt{S}.}\) On exit, one can encode a BLAS-style representation of \({\texttt{S}}\) with the tuple
\[(\texttt{S.layout},~\texttt{S.n_rows},~\texttt{S.n_cols},~\texttt{S.buff},~\texttt{S.dist.dim_major})\]In BLAS parlance, the last element of this tuple would be called the “leading dimension” of \(\texttt{S}.\)
-
template<typename T, typename RNG = r123::Philox4x32>
RNGState<RNG> RandBLAS::fill_dense_unpacked(blas::Layout layout, const DenseDist &D, int64_t n_rows, int64_t n_cols, int64_t ro_s, int64_t co_s, T *buff, const RNGState<RNG> &seed) -
This function provides the underlying implementation of fill_dense(DenseSkOp &S). Unlike fill_dense(DenseSkOp &S), this function can be used to generate explicit representations of submatrices of dense sketching operators.
Formally, if \({\mathbf{S}}\) were sampled from \({\mathcal{D}}\) with \({\texttt{seed_state=seed}}\) , then on exit we’d have
\[\operatorname{mat}(\mathtt{buff})_{ij} = \mathbf{S}_{(i+\mathtt{ro\_s})(\mathtt{co\_s} + j)}\]where \(\operatorname{mat}(\cdot)\) reads from \(\mathtt{buff}\) in \(\mathtt{layout}\) order.
If \({\texttt{layout != dist.natural_layout}}\) then this function internally allocates \({\texttt{n_rows * n_cols}}\) words of workspace, and deletes this workspace before returning.Full parameter descriptions
- layout
blas::Layout::RowMajor or blas::Layout::ColMajor
The storage order for \(\operatorname{mat}(\mathtt{buff})\) on exit. The leading dimension is the smallest value permitted for a matrix of dimensions (n_rows, n_cols) in the given layout. I.e., it’s n_rows if layout == ColMajor and n_cols if layout == RowMajor.
Note that since the entries of \(\mathtt{buff}\) are sampled iid from a common distribution, the value of \(\mathtt{layout}\) is unlikely to have mathematical significance. However, the value of \(\mathtt{layout}\) can affect this function’s efficiency. For best efficiency we recommend \(\texttt{layout=}\mathcal{D}{}\texttt{.natural_layout}.\) If a different value of \(\mathtt{layout}\) is used, then this function will internally allocate extra memory for an out-of-place layout change.
- D
A DenseDist object.
A distribution over random matrices of shape (D.n_rows, D.n_cols).
- n_rows
A positive integer.
The number of rows in \(\operatorname{mat}(\mathtt{buff}).\)
- n_cols
A positive integer.
The number of columns in \(\operatorname{mat}(\mathtt{buff}).\)
- ro_s
A nonnegative integer.
The row offset for \(\operatorname{mat}(\mathtt{buff})\) as a submatrix of \(\mathbf{S}.\)
We require that \(\mathtt{ro\_s} + \mathtt{n\_rows}\) is at most D.n_rows.
- co_s
A nonnegative integer.
The column offset for \(\operatorname{mat}(\mathtt{buff})\) as a submatrix of \(\mathbf{S}.\)
We require that \(\mathtt{co\_s} + \mathtt{n\_cols}\) is at most D.n_cols.
- buff
Buffer of type T.
Length must be at least \(\mathtt{n\_rows} \cdot \mathtt{n\_cols}.\)
- seed
A CBRNG state
Used to define \(\mathbf{S}\) as a sample from \(\mathcal{D}.\)
Sparse sketching, with CountSketch et al.
SparseDist : a distribution over structured sparse matrices
-
struct SparseDist
A distribution over matrices with structured sparsity. Depending on parameter choices, one can obtain distributions described in the literature as SJLTs, OSNAPs, hashing embeddings, CountSketch, row or column sampling, or LESS-Uniform distributions. All members of a SparseDist are const.
Public Functions
-
inline SparseDist(int64_t n_rows, int64_t n_cols, int64_t vec_nnz = 4, Axis major_axis = Axis::Short)
Arguments passed to this function are used to initialize members of the same names. The members \({\texttt{dim_major},}\) \({\texttt{dim_minor},}\) \({\texttt{isometry_scale},}\) and \({\texttt{full_nnz}}\) are automatically initialized to be consistent with these arguments.
This constructor will raise an error if \({\min\{\texttt{n_rows}, \texttt{n_cols}\} \leq 0}\) or if \({\texttt{vec_nnz}}\) does not respect the bounds documented for the \({\texttt{vec_nnz}}\) member.
Public Members
-
const int64_t n_rows
Matrices drawn from this distribution have this many rows; must be greater than zero.
-
const int64_t n_cols
Matrices drawn from this distribution have this many columns; must be greater than zero.
-
const Axis major_axis
Operators sampled from this distribution are constructed by taking independent samples from a suitable distribution \({\mathcal{V}}\) over sparse vectors. This distribution is always over \({\mathbb{R}^k,}\) where \({k = \texttt{dim_major}.}\) The structural properties of \({\mathcal{V}}\) depend heavily on whether we’re short-axis major or long-axis major.
To be explicit, let’s say that \({\mathbf{x}}\) is a sample from \({\mathcal{V}.}\)
If \({\texttt{major_axis} = \texttt{Short}}\) , then \({\mathbf{x}}\) has exactly \({\texttt{vec_nnz}}\) nonzeros, and the locations of those nonzeros are chosen uniformly without replacement from \({\{0,\ldots,k-1\}.}\) The values of the nonzeros are sampled independently and uniformly from +/- 1.
If \({\texttt{major_axis} = \texttt{Long}}\) , then \({\mathbf{x}}\) has at most \({\texttt{vec_nnz}}\) nonzero entries. The locations of the nonzeros are determined by sampling uniformly with replacement from \({\{0,\ldots,k-1\}.}\) If index \({j}\) occurs in the sample \({\ell}\) times, then \({\mathbf{x}_j}\) will equal \({\sqrt{\ell}}\) with probability 1/2 and \({-\sqrt{\ell}}\) with probability 1/2.
-
const int64_t dim_major
Defined as
\[\begin{split}\texttt{dim_major} = \begin{cases} \,\min\{ \texttt{n_rows},\, \texttt{n_cols} \} &\text{ if }~~ \texttt{major_axis} = \texttt{Short} \\ \max\{ \texttt{n_rows},\,\texttt{n_cols} \} & \text{ if } ~~\texttt{major_axis} = \texttt{Long} \end{cases}.\end{split}\]
-
const int64_t dim_minor
Defined as \({\texttt{n_rows} + \texttt{n_cols} - \texttt{dim_major}.}\) This is just whichever of \({(\texttt{n_rows},, \texttt{n_cols})}\) wasn’t identified as \({\texttt{dim_major}.}\)
-
const double isometry_scale
An operator sampled from this distribution should be multiplied by this constant in order for sketching to preserve norms in expectation.
-
const int64_t vec_nnz
This constrains the number of nonzeros in each major-axis vector. It’s subject to the bounds \({1 \leq \texttt{vec_nnz} \leq \texttt{dim_major}.}\) See this tutorial page for advice on how to set this member.
-
const int64_t full_nnz
An upper bound on the number of structural nonzeros that can appear in an operator sampled from this distribution. Computed automatically as \({\texttt{full_nnz} = \texttt{vec_nnz} * \texttt{dim_minor}.}\)
-
inline SparseDist(int64_t n_rows, int64_t n_cols, int64_t vec_nnz = 4, Axis major_axis = Axis::Short)
SparseSkOp : a sample from a SparseDist
-
template<typename T, typename RNG = r123::Philox4x32, SignedInteger sint_t = int64_t>
struct SparseSkOp A sample from a distribution over structured sparse matrices with either independent rows or independent columns. This type conforms to the SketchingOperator concept.
Public Types
-
using distribution_t = SparseDist
Type alias.
Public Functions
-
inline SparseSkOp(SparseDist dist, const state_t &seed_state)
Standard constructor. Arguments passed to this function are used to initialize members of the same names. own_memory is initialized to true, nnz is initialized to -1, and (vals, rows, cols) are each initialized to nullptr. next_state is computed automatically from dist and seed_state.
Although own_memory is initialized to true, RandBLAS will not attach memory to (vals, rows, cols) unless fill_sparse(SparseSkOp &S) is called.
If a RandBLAS function needs an explicit representation of this operator and yet nnz < 0, then RandBLAS will construct a temporary explicit representation of this operator and delete that representation before returning.
Public Members
-
const SparseDist dist
The distribution from which this operator is sampled.
-
const state_t seed_state
The state passed to random sampling functions when the full operator needs to be sampled from scratch.
-
const state_t next_state
The state that should be used in the next call to a random sampling function whose output should be statistically independent from properties of this operator.
-
const int64_t n_rows
Alias for dist.n_rows. Automatically initialized in all constructors.
-
const int64_t n_cols
Alias for dist.n_cols. Automatically initialized in all constructors.
-
bool own_memory
If true, then RandBLAS has permission to allocate and attach memory to this operator’s reference members (S.rows, S.cols, and S.vals). If true at destruction time, then delete [] will be called on each of this operator’s non-null reference members.
RandBLAS only writes to this member at construction time.
-
int64_t nnz
The number of structural nonzeros in this operator. Negative values are a flag that the operator’s explicit representation hasn’t been sampled yet.
-
T *vals
Reference to an array that holds the values of this operator’s structural nonzeros.
If non-null, this must point to an array of length at least dist.full_nnz.
-
using distribution_t = SparseDist
-
template<typename SparseSkOp>
void RandBLAS::fill_sparse(SparseSkOp &S) If \({\texttt{S.own_memory}}\) is true then we enter an allocation stage. This stage inspects the reference members of \({\texttt{S}}\) . Any reference member that’s equal to \({\texttt{nullptr}}\) is redirected to the start of a new array (allocated with
new []) of length \({\texttt{S.dist.full_nnz}.}\)After the allocation stage, we inspect the reference members of \({\texttt{S}}\) and we raise an error if any of them are null.
If all reference members are are non-null, then we’ll assume each of them has length at least \({\texttt{S.dist.full_nnz}.}\) We’ll proceed to populate those members (and \({\texttt{S.nnz}}\) ) with the data for the explicit representation of \({\texttt{S}.}\) On exit, \({\texttt{S}}\) can be equivalently represented by
RandBLAS::COOMatrix mat(S.n_rows, S.n_cols, S.nnz, S.vals, S.rows, S.cols);
-
template<typename T, typename sint_t, typename state_t>
state_t RandBLAS::fill_sparse_unpacked_nosub(const SparseDist &D, int64_t &nnz, T *vals, sint_t *rows, sint_t *cols, const state_t &seed_state) -
This function is the underlying implementation of fill_sparse(SparseSkOp &S). It has no allocation stage and it skips checks for null pointers.
On entry, \({(\mathtt{vals},\mathtt{rows},\mathtt{cols})}\) are arrays of length at least \({{\mathcal{D}\mathtt{.full\_nnz}}.}\) On exit, the first \({\texttt{nnz}}\) entries of these arrays contain the data for a COO sparse matrix representation of the SparseSkOp defined by \({(\mathcal{D},\texttt{seed_state)}.}\)
The unifying (C++20) concepts
SketchingDistribution
-
template<typename SkDist>
concept SketchingDistribution - #include <base.hh>
Mathematical description
Matrices sampled from sketching distributions in RandBLAS are mean-zero and have covariance matrices that are proportional to the identity.
Formally, if \(\mathcal{D}\) is a distribution over \(r \times c\) matrices and \(\mathbf{S}\) is a sample from \(\mathcal{D}\), then \(\mathbb{E}\mathbf{S} = \mathbf{0}_{r \times c}\) and
\begin{gather} \theta^2 \cdot \mathbb{E}\left[ \mathbf{S}^T\mathbf{S} \right]=\mathbf{I}_{c \times c}& \nonumber \\ \,\phi^2 \cdot \mathbb{E}\left[ \mathbf{S}{\mathbf{S}}^T\, \right]=\mathbf{I}_{r \times r}& \nonumber \end{gather}hold for some \(\theta > 0\) and \(\phi > 0\).
The isometry scale of the distribution is \(\alpha := \theta\) if \(c \geq r\) and \(\alpha := \phi\) otherwise. If you want to sketch in a way that preserves squared norms in expectation, then you should sketch with a scaled sample \(\alpha \mathbf{S}\) rather than the sample itself.
Programmatic description
A variable \(\texttt{D}\) of a type that conforms to the \(\texttt{SketchingDistribution}\) concept has the following attributes.
type
description
\(\texttt{D.n_rows}\)
\(\texttt{const int64_t}\)
samples from \(\texttt{D}\) have this many rows
\(\texttt{D.n_cols}\)
\(\texttt{const int64_t}\)
samples from \(\texttt{D}\) have this many columns
\(\texttt{D.isometry_scale}\)
\(\texttt{const double}\)
See above.
Note that the isometry scale is always stored in double precision; this has no bearing on the precision of sketching operators that are sampled from a \(\texttt{SketchingDistribution}\).
Notes
RandBLAS has two SketchingDistribution types: DenseDist and SparseDist. These types have members called called “major_axis,” “dim_major,” and “dim_minor.” These members have similar semantic roles across the two classes, but their precise meanings differ significantly.
SketchingOperator
-
template<typename SKOP>
concept SketchingOperator - #include <base.hh>
A type \({\texttt{SKOP}}\) that conforms to the SketchingOperator concept has three member types.
\(\texttt{SKOP::distribution_t}\)
A type conforming to the SketchingDistribution concept.
\(\texttt{SKOP::state_t}\)
A template instantiation of RNGState.
\(\texttt{SKOP::scalar_t}\)
Real scalar type used in matrix representations of \(\texttt{SKOP}\text{s}.\)
And an object \(\texttt{S}\) of type \(\texttt{SKOP}\) has the following instance members.
\(\texttt{S.dist}\)
\(\texttt{const distribution_t}\)
Distribution from which this operator is sampled.
\(\texttt{S.n_rows}\)
\(\texttt{const int64_t}\)
An alias for \(\texttt{S.dist.n_rows}.\)
\(\texttt{S.n_cols}\)
\(\texttt{const int64_t}\)
An alias for \(\texttt{S.dist.n_cols}.\)
\(\texttt{S.seed_state}\)
\(\texttt{const state_t}\)
RNGState used to construct an explicit representation of \(\texttt{S}\).
\(\texttt{S.next_state}\)
\(\texttt{const state_t}\)
An RNGState that can be used in a call to a random sampling routine whose output should be statistically independent from \(\texttt{S}.\)
\(\texttt{S.own_memory}\)
\(\texttt{bool}\)
A flag used to indicate whether internal functions have permission to attach memory to \(\texttt{S},\) and whether the destructor of \(\texttt{S}\) has the responsibility to delete any memory that’s attached to \(\texttt{S}.\)
RandBLAS only has two SketchingOperator types: DenseSkOp and SparseSkOp. These types have several things in common that aren’t enforced by the SketchingOperator concept. Most notably, they have constructors of the following form.
SKOP(distribution_t dist, state_t seed_state) : dist(dist), seed_state(seed_state), next_state(/* type-specific function of state and dist */), n_rows(dist.n_rows), n_cols(dist.n_cols), own_memory(true) /* type-specific initializers */ { };