I'm not sure about papers. Some of the low-level field arithmetic in libsecp256k1 was inspired by techniques used in certain curve25519/ed25519 implementations, but it has certainly evolved from there, with many optimizations by several contributors.
I came up with the bracket notation, and it isn't an optimization; just a concise way of of writing down the data flow to allow (humans) to reason about the correctness of the algorithm.