Because with the form:
$C(s) = K_d s + K_p$
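you have an improper transfer function: the numerator has degree 1 and the denominator has degree 0. In process engineering, you always need proper transfer functions, where the degree of the numerator is less than or equal to the degree of the denominator, because an improper one cannot be realized physically.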
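As a quick illustration of the degree condition, here is a minimal sketch that checks properness from polynomial coefficients. The helper names `degree` and `is_proper` and the numeric gains are hypothetical, chosen only for this example; coefficients are assumed to be listed in descending powers of $s$.

```python
def degree(coeffs):
    """Degree of a polynomial given coefficients in descending powers of s."""
    for i, c in enumerate(coeffs):
        if c != 0:
            return len(coeffs) - 1 - i
    return -1  # convention used here for the zero polynomial


def is_proper(num, den):
    """True if deg(num) <= deg(den)."""
    return degree(num) <= degree(den)


# Hypothetical gains and filter time constant, for illustration only.
K_D, K_P, TAU = 2.0, 5.0, 0.1

ideal_pd = ([K_D, K_P], [1.0])          # (K_d s + K_p) / 1
filtered_pd = ([K_D, K_P], [TAU, 1.0])  # (K_d s + K_p) / (tau s + 1)

print("ideal PD proper?   ", is_proper(*ideal_pd))
print("filtered PD proper?", is_proper(*filtered_pd))
```

The ideal PD form fails the check because its numerator has degree 1 against a constant denominator, while adding a first-order denominator makes the degrees equal.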
A workaround is to introduce an "approximate derivative", which is the second form:
$C(s) = \frac{K_d s + K_p}{\tau s + 1}$
Here $\tau$ is a filter time constant that should be kept small: if $\tau$ becomes too large, it changes the dynamics of the transfer function significantly, and therefore the overall closed-loop response. The added pole also helps reduce the amount of noise amplified by the derivative term (it acts as a low-pass filter, see PID controllers on Wikipedia).
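To see the low-pass effect numerically, the sketch below evaluates $|C(j\omega)|$ for both forms at a few frequencies. The gain values and the function names are assumptions made for this example, not values from any particular plant.

```python
# Hypothetical gains and filter time constant, for illustration only.
K_D, K_P, TAU = 2.0, 5.0, 0.01


def ideal_pd_gain(omega):
    """|C(j*omega)| for the ideal form C(s) = K_d s + K_p."""
    s = 1j * omega
    return abs(K_D * s + K_P)


def filtered_pd_gain(omega):
    """|C(j*omega)| for the filtered form C(s) = (K_d s + K_p)/(tau s + 1)."""
    s = 1j * omega
    return abs((K_D * s + K_P) / (TAU * s + 1))


for omega in (1.0, 1e2, 1e4, 1e6):
    print(f"omega={omega:>9.0f} rad/s  "
          f"ideal={ideal_pd_gain(omega):14.2f}  "
          f"filtered={filtered_pd_gain(omega):8.2f}")
```

The ideal gain grows without bound as $\omega$ increases, so high-frequency measurement noise is amplified more and more. The filtered gain levels off near $K_d/\tau$, and a larger $\tau$ lowers that ceiling at the cost of moving the extra pole closer to the frequencies that matter for the closed loop.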
In practice, however, derivative gains are rarely used; PI controllers are the preferred implementation.
The causality issue arises because the ideal PD controller has more zeros than poles (one zero and no poles), which makes it non-causal; see this discussion.