add a simple stacked version of MQAR that can be solved by a 1-layer sequence mixer (no short convs needed for the shift); add a continuous model (i.e., one that operates in embedding space rather than on discrete tokens); add more loss functions for training (dot-product CE, MSE)
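
For reference, a minimal sketch of how the two losses could look when the model emits a vector in embedding space. This is an illustration under assumptions, not the code in this change: it reads "dot-product CE" as cross-entropy over logits formed by dotting the predicted vector with each vocabulary embedding, and "MSE" as mean squared error against the target embedding. The names `dot_product_ce`, `mse`, and `vocab_embeddings` are hypothetical, and plain Python is used only to keep the example dependency-free.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def dot_product_ce(pred: Vector, vocab_embeddings: Sequence[Vector], target_idx: int) -> float:
    """Cross-entropy where logits are dot products with each vocabulary embedding.

    Assumption: this is one plausible reading of "dot-product CE".
    """
    logits = [dot(pred, e) for e in vocab_embeddings]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    # Negative log-probability of the target token.
    return log_z - logits[target_idx]


def mse(pred: Vector, target: Vector) -> float:
    """Mean squared error between the predicted and target embeddings."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


if __name__ == "__main__":
    # Toy two-token vocabulary with orthonormal embeddings (made-up values).
    vocab = [[1.0, 0.0], [0.0, 1.0]]
    prediction = [1.0, 0.0]
    print(dot_product_ce(prediction, vocab, target_idx=0))
    print(mse(prediction, vocab[0]))
```

The CE variant only needs the prediction to score highest against the correct embedding, while MSE asks it to match the target embedding exactly, which is a stricter objective.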