In practice: no, not if you need fast 3D.
Performing even a small step in software always incurs a performance penalty: all the textures involved have to be moved to main memory, the step has to be performed by the CPU, and then everything has to be moved back. The GPU cannot do any further work on those textures in the meantime and will probably sit idle for the duration.
Now remember that a GPU has a pretty long pipeline, and that you usually have to flush that pipeline before you can move a render target to CPU space, and you'll see why this is pretty much infeasible for many 3D operations. Doing a whole contiguous chunk of work in software can be viable, but tightly interleaving many software and hardware operations may easily end up slower than full software rendering.
Geometry or vertex shaders may work, since they sit early in the pipeline. Every drawing command starts at the CPU (the application), is processed a little in the driver (still on the CPU), and is then passed on to the GPU. That driver-side preprocessing can perform additional steps before handing off to the GPU without incurring extra copies.
(Note that there are exceptions, e.g. vertex shaders with texture lookups where the texture was rendered to beforehand.)
Yep. In general you can shift the point where processing passes from CPU to GPU (albeit with a performance penalty) but going back and forth between CPU and GPU is almost always a Bad Thing.
CPUs are fast when working on data in system memory; GPUs are fast when working on data in video memory. Asking the CPU to work on something in video memory results in truly awful delays, 10x-50x slower than you would expect from just doing the work on the CPU.
Mixing GPU and CPU processing is a bit less painful on IGPs with shared memory, because (a) CPU access to "video memory" is faster and (b) GPU access to "video memory" is slower relative to GPUs with dedicated video memory, but this does not generalize to discrete GPUs at all.
It is certainly possible to write a driver which could do some back-and-forth processing efficiently, but it would require that the entire stack be designed up-front to deal with those cases, and would litter complexity all through the stack. In a proprietary driver this is sometimes possible, since you only have to support a single vendor and have access to future hardware plans, but for an open source driver this seems impractical.
So... forcing geometry shading on by default when doing vertex processing on the CPU ("SW TCL") could work, assuming the memory manager could be directed to keep vertex textures in system memory, but anything else is probably a non-starter.
The same applies to video processing, by the way - doing the front part of the decode stack on the CPU and the rest on the GPU works well, but any other split tends to be very slow.
PS: Sorry... pipeline... *ducks and runs*
The problem is that geometry shaders come after vertex shaders in the pipeline, so you can't easily "preprocess the geometry shader work" and pass the rest to the GPU.
That was my reason for suggesting that GS be exposed by default when doing vertex processing on the CPU - that would give a "somewhat accelerated" compromise which might both perform OK and be easy to implement & use.