SFE refers to some network protocol where there are several parties each of which has some private information and they want to compute public function based on this information without revealing this information to each other. Is this what do you mean by SFE?
No, that would be a general description of Secure Multiparty Computation (SMC) not Secure Function Evaluation (SFE) which is a specific, 2-party, subset case. SFE simply refers to the ability to find enc(y)=F(x) given only some encrypted enc(x) of plaintext x and a correspondingly transformed F. The evaluation of F certainly learns enc(y) and possibly learns y (depending upon the scheme employed) but does not learn anything about x. As such, it can not learn anything about F except its logical topology.
Yao's Garbled Circuits are the canonical example of SFE, and do not necessarily require interaction protocols.
In the case of an iterative puzzle, SFE can be done offline by giving the evaluating party the ability to selectively choose some inputs like nonce, knowing their representation, "on behalf of" the issuer. The solver can independently re-evaluate the circuit as many times as he'd like choosing new inputs.
In our case only pool has private information - private key (and random k) which it doesn't want to share, in this case SFE is trivial - pool should compute signature on its own and share it which is not suitable for our case.
Right, that would not be SFE. The point is that the pool can construct a "program" that others can run that will sign headers without the servers further involvement, and without disclosing the private key, but will not sign non-header data.
In FHE you have encrypted inputs and encrypted outputs, but to proceed further with the mining miner must decrypt the output which probably implies that he can also decrypt the input which defeats the whole purpose.
The encryption of the input and output do not necessarily relate. Some inputs could be encrypted with one key and some with another. Outputs could be encrypted with not just these keys but possibly third and fourth and fifth keys, and so on. Some might be encrypted with both, so both parties could learn some result. (This may put some constraints on the construction of F.) This principle is one way that SFE can be extended into more general SMC.
Do you have some more specific description of a your scheme?
It would probably be best to have the pool generate a circuit per user, per block. This circuit would take in just a nonce value as "worker determinable" The circuit would compute the header hash and a deterministic k, and then produce the encrypted signature bits as output. The pool would give the output mapping from pairs of encrypted bit values to bit states so the worker could know the output without needing to decrypt it. The output could then be inserted back into the block header for use in the 64 rounds of hashwholeblock hash by the miner.
Implementation would be straightforward on something like FairplayBI, if you don't mind Java. (Personally, I'd go for something less straightforward and much more efficient, but most importantly not Java.)
Ok, you can try but you won't find it.

I've probably missed something obvious and you're probably right. We'll see.
Incrementing nonce is easier (faster) than changing k.
Err, what? If you increment nonce (and I mean above the mask) you need a new k anyway.
Although changing k instead of nonce eliminates the need to recompute hash used for signature so it may or may not be faster.
Right, that is precisely what I meant. You don't ever strictly need more than one hash as input to signing.
You forgot about SHA-2 hashing of the whole block. It will be slower and not just "slightly". By the same reason CPU mining is slower.
Not sure I follow what you mean here. The SHA is not iterated, so it is again just more cycles not more memory scheduling. The factor of performance advantage is dominated by memory bandwidth, so the relative gain should be largely unaffected. In other words, both the cpu and the gpu need to spend approximately the same extra effort there, so it shouldn't have much impact on the overall performance ratio.
And I never said that this algorithm is GPU hard.
I never said that you did, but you do seem to have it as a goal to try to keep CPUs competitive, so I was just discussing some details toward such an end.