I only mentioned FHE to cement the notion with "overkill" on the math, it certainly wouldn't be the best approach at an SFE delegation solution.
By some very rough number I estimate that state of the art in "general" SFE would probably put the cpu performance on the order of tens of milliseconds per signature, at best. (I haven't actually built a circuit for it, so I could be way off, this is just based on a guestimate of gate/layer counts based on my experience with similar circuit topology on contemporary processors.) Considering you get 64 rounds per signature I think 1kh/s is not so lofty of an initial goal.
My main concern is that this low performance is assumed with generalized approaches, but there might be (probably is, from my understanding) a special case transform of the puzzle into a secured "delegated work" puzzle which would be reasonably efficient to evaluate, at least within the same order as signing directly. In other words, an "application specific SFE" scheme might vastly outperform our notions of resource requirements from generalized approaches.
SFE refers to some network protocol where there are several parties each of which has some private information and they want to compute public function based on this information without revealing this information to each other. Is this what do you mean by SFE? In our case only pool has private information - private key (and random k) which it doesn't want to share, in this case SFE is trivial - pool should compute signature on its own and share it which is not suitable for our case.
In FHE you have encrypted inputs and encrypted outputs, but to proceed further with the mining miner must decrypt the output which probably implies that he can also decrypt the input which defeats the whole purpose.
Do you have some more specific description of a your scheme?
Address is always recoverable from the signature. This fact is also used to make SpreadCoin transactions smaller.
This simply isn't true in general, but might be true in the particular implementation. I'm not sure yet, but a simple brute force address generation should find one quickly and easily, or something like crypto-mini-sat should find one slowly and with some work. I'll let you know of an example that doesn't extract if I do find the time to find one, and also by which route I found it.

Ok, you can try but you won't find it.

Avoiding GPUs/ASICs is not a goal because it is probably not possible. I don't see how can you get any gains by ignoring nonce, this will only make mining slower because by iterating on r values you will have more work to do. You can use the same method that you describe but with nonce iteration; it can save you memory bandwidth but it won't be as fast as X11 because there is simply more work to do.
I didn't make clear that the optimization to iterate k instead of nonce applies to any processing element, it eliminates the repetition of nonce increment and initial hash. Basically, this coin is somewhat unique in that it might as well not even have a nonce field in the block header at all, since the signature field can serve the same purpose. (Since you are looking for any place to reduce block history size, you may actually even want to make this change?) I haven't tried any mining of the coin, but I might mine just to try comparing performance of this change.
Incrementing nonce is easier (faster) than changing k. Although changing k instead of nonce eliminates the need to recompute hash used for signature so it may or may not be faster.
My point was basically that, because of this optimization, the thing that might initially make this algorithm appear to be particularly "gpu hard" is a bit misleading, and the algorithm is not really any more "gpu hard" than x11 itself. Yes, it will always be slightly slower than "just" x11 on any processing element, because of the added signature step, but it is only some added cycles and not more memory hardness. The performance *advantage* from parallelism of gpu or fpga should be roughly the same as that of x11 itself.
You forgot about SHA-2 hashing of the whole block. It will be slower and not just "slightly". By the same reason CPU mining is slower. And I never said that this algorithm is GPU hard.