(I don’t use CS so Di5 is a spare pin; I only use MOSI.) I had to tweak the SPI parameters (clock edge, hold time and so on) to get the timing even. With these settings, when you send a series of bytes the clock is steady and continuous, even between bytes.
At 400kHz, each SPI bit is 2.5us. To control the Christmas lights, each data bit is 30us: either 10us high/20us low (a 0 bit) or 20us high/10us low (a 1 bit). So to send a bit:
1111 1111 0000 0000 sends 20us high, 10us low (and an extra 10us low)
To make the bit and byte boundaries line up, I map two light bits into 3 bytes of SPI: each 30us light bit is 12 SPI bits at 2.5us each, so two of them fill exactly 24 SPI bits.
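Here is a rough sketch of that two-bits-into-three-bytes mapping, written in Python for illustration rather than in the managed code I actually run. It assumes the SPI port shifts each byte out MSB first at 2.5us per SPI bit; the names `ONE`, `ZERO`, `encode_pair` and `encode_bits` are made up for this example.

```python
# Sketch only: assumes MSB-first SPI at 2.5 us per SPI bit (400 kHz),
# so 12 SPI bits span one 30 us light bit.
ONE = 0xFF0   # 8 high SPI bits (20 us) followed by 4 low (10 us)
ZERO = 0xF00  # 4 high SPI bits (10 us) followed by 8 low (20 us)


def encode_pair(first, second):
    """Pack two light bits into 24 SPI bits, returned as 3 bytes."""
    word = ((ONE if first else ZERO) << 12) | (ONE if second else ZERO)
    return bytes([(word >> 16) & 0xFF, (word >> 8) & 0xFF, word & 0xFF])


def encode_bits(bits):
    """Encode an even-length sequence of 0/1 light bits as SPI bytes."""
    if len(bits) % 2:
        raise ValueError("need an even number of bits to fill whole bytes")
    out = bytearray()
    for i in range(0, len(bits), 2):
        out += encode_pair(bits[i], bits[i + 1])
    return bytes(out)


if __name__ == "__main__":
    # A 1 followed by a 0 becomes the SPI bytes FF 0F 00.
    print(encode_bits([1, 0]).hex())
```

There are only four possible bit pairs, so the same idea works as a four-entry lookup table instead of shifting and masking for every pair.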
It works remarkably well, except encoding the data at the bit level in managed code is a little sloooow.