Step one would be to design and implement a high-speed interconnect. You want this as wide as possible (meaning as many data lines as possible), and you want it to be as fast as possible.
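To give a rough feel for what "fast" means here, below is a minimal ping-pong microbenchmark sketch using mpi4py (MPI being the usual message-passing layer on clusters like these). The message size and iteration count are arbitrary illustrations, not tuned values:

```python
# Ping-pong sketch: two ranks bounce a buffer back and forth to estimate
# point-to-point latency and bandwidth over the interconnect.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

SIZE = 1 << 20   # 1 MiB payload (illustrative, not a tuned value)
ITERS = 100
buf = np.zeros(SIZE, dtype=np.uint8)

comm.Barrier()
start = MPI.Wtime()
for _ in range(ITERS):
    if rank == 0:
        comm.Send(buf, dest=1, tag=0)    # send payload to peer
        comm.Recv(buf, source=1, tag=0)  # wait for the echo back
    elif rank == 1:
        comm.Recv(buf, source=0, tag=0)
        comm.Send(buf, dest=0, tag=0)
elapsed = MPI.Wtime() - start

if rank == 0:
    # Each iteration moves SIZE bytes out and SIZE bytes back.
    gbs = (2 * SIZE * ITERS) / elapsed / 1e9
    print(f"avg round trip: {elapsed / ITERS * 1e6:.1f} us, ~{gbs:.2f} GB/s")
```

You'd launch something like this across two nodes with `mpirun -np 2 python pingpong.py` and watch the numbers change as the interconnect gets wider and faster.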
Step two would be to design a high-speed shared storage architecture. You want the different nodes to be able to talk to this storage simultaneously, so you have to manage read and write locks, handle deadlock conditions, and so on.
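As one illustration of the deadlock part of that problem, here's a toy Python sketch (with made-up resource names, nothing tied to a real storage system) of the classic discipline of acquiring locks in a single global order, so that circular waits can never form:

```python
# Deadlock avoidance via global lock ordering: every writer acquires the
# locks it needs in one canonical (sorted) order, so no cycle of waits
# can ever form between two writers.
import threading

locks = {name: threading.Lock() for name in ("block_A", "block_B")}

def update(resources, fn):
    """Acquire all needed locks in sorted order, do the write, release."""
    ordered = sorted(resources)  # the global ordering is the whole trick
    for name in ordered:
        locks[name].acquire()
    try:
        fn()
    finally:
        for name in reversed(ordered):
            locks[name].release()

# Two writers touch the same pair of blocks in opposite "logical" order;
# because acquisition is always sorted, they cannot deadlock each other.
t1 = threading.Thread(target=update,
                      args=(["block_A", "block_B"], lambda: print("writer 1")))
t2 = threading.Thread(target=update,
                      args=(["block_B", "block_A"], lambda: print("writer 2")))
t1.start(); t2.start(); t1.join(); t2.join()
```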
Step three would be to design the software system that divides up work among the different nodes and manages the nodes themselves (bringing them online, taking them offline, and handling hardware and software failures).
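Here's a deliberately toy sketch of that coordinator's core loop, with a random stand-in for a real heartbeat/health check: hand tasks to whichever nodes are still healthy, and requeue the work when a node dies mid-task:

```python
# Toy scheduler loop: distribute tasks to online nodes, take a failed
# node offline, and requeue its task for someone else to pick up.
from collections import deque
import random

tasks = deque(range(10))                               # work units
nodes = {"node1": True, "node2": True, "node3": True}  # True = online

def node_alive(node):
    # Stand-in for a real heartbeat/health check.
    return random.random() > 0.1   # ~10% chance the node has failed

while tasks:
    task = tasks.popleft()
    online = [n for n, up in nodes.items() if up]
    if not online:
        raise RuntimeError("no nodes left online")
    node = online[0]
    if node_alive(node):
        print(f"{node} completed task {task}")
    else:
        nodes[node] = False       # take the failed node offline
        tasks.appendleft(task)    # requeue the task for another node
        print(f"{node} failed; requeued task {task}")
```

Real cluster managers are enormously more sophisticated than this, but the shape of the problem (scheduling, failure detection, requeueing) is the same.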
If you ever read Slashdot during the heyday of the “imagine a Beowulf cluster of these!” meme/joke, then you might know that the gist of the joke is that just because something has a lot of processing power doesn’t mean that 1) it’s suitable for massively parallel computing, or that 2) it’s easy to slap a bunch of computing devices together into an HPC cluster.
I’m not an HPC/supercomputer expert, but I did have a friend who worked for the company responsible for some of the largest clusters available at the time, and I can say that a LOT of work went into the “tying them together” pieces of the puzzle.