If it were me, I would implement this as RS232 serial comms, and then try to go about seeing if someone wanted to write a generic audio-to-serial device driver that emulated a serial port at the kernel level. Then the separation of duties in the code is appropriate, and the audio overhead is totally optional.
Having worked with modems extensively, and currently running a business that still supports 1200 baud dialup modems to this day (I wrote 100% of the code for that) I think this is going to be super difficult to get working right and reliably, with decent error detection and correction robust enough to give satisfactory results in varying real-world conditions.
The most recent generation (post-1990) of dialup modems will do things like periodically sweep the entire frequency range of a telephone connection to find out what portions of the audio spectrum are available on a particular call, and cram all of the data in the usable spectrum. That's the result of years of the innovation of multiple modem companies competing to make the fastest modem - something you're not going to be able to implement starting from scratch.
It's like trying to reinvent the entire 1980's and early 90's worth of modem technology all over again, but with unprecedented new constraints: audio hardware with wide signal ranges built to a loose standard, user-settable volume and equalizer controls, and audio jacks which were never meant to be cross-connected. And, for it to all be re-invented by one guy, rather than a whole industry.
Audio inputs on computers are rarely meant to be directly connected to the audio outputs of other computers. In most cases, the audio outputs are at a level for driving headphones, and this level will grossly overdrive any computer audio input, given that a headphone jack is meant to provide power to a pair of headphones, but inputs are meant to talk to microphones or line-level inputs (signals that are weaker by orders of magnitude than headphone output). The receiver will hear nothing but clipped static. A user can sort of get away with it for audio applications by turning the output volume down really low, but not if you're asking the user to do something that makes them unable to hear the signal, like not hook up speakers, and that's assuming the user is qualified to know what's the appropriate sound level for a data signal meant to be decoded by a computer.
Then you have to deal with things like ground loops, and crappy unshielded sound cards that pick up all kinds of unwanted noise from CPU that limit your bandwidth to nowhere even approaching 56Kbits/sec on random unpredictable occasions and... suddenly, this project has become more complicated than the whole of Armory itself!
By the time you start to write the troubleshooting guide for this, and try to teach your user how to figure out what line level his inputs and outputs are so his two machines can talk, you'll reluctantly fold this audio idea and go with something like RS232.
tl;dr: I'll bet you 20 BTC this plain isn't going to work even remotely as well as RS232 will - 20BTC you can add to the bounty just in case it does!