The difficulty of training artificial recurrent neural networks has to do with their complexity.
One of the simplest ways to explain why recurrent neural networks are hard to train is that they are not feedforward neural networks.
In feedforward neural networks, signals move in only one direction: from an input layer, through one or more hidden layers, to the output layer of the system.
By contrast, recurrent neural networks have more complex signal movements. Classed as “feedback” networks, they contain “loops” in which values computed at one step are fed back into the network at the next step. Experts associate these loops with the memory of recurrent neural networks.
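The contrast can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each "layer" is a single scalar unit with a tanh activation, and the names feedforward_pass, recurrent_pass, w_in, w_rec and w_out are hypothetical, not taken from any library.

```python
import math

def feedforward_pass(x, w_in, w_out):
    """One-way pass: input -> hidden -> output. Nothing is kept between calls."""
    hidden = math.tanh(w_in * x)
    return w_out * hidden

def recurrent_pass(xs, w_in, w_rec, w_out):
    """Process a sequence; the hidden value is fed back in at every step."""
    hidden = 0.0  # assumed initial state
    outputs = []
    for x in xs:
        # The loop: the previous hidden value re-enters the computation.
        hidden = math.tanh(w_in * x + w_rec * hidden)
        outputs.append(w_out * hidden)
    return outputs

# The same input gives different outputs at different steps,
# because the recurrent network carries state forward.
outputs = recurrent_pass([1.0, 1.0], w_in=0.5, w_rec=0.8, w_out=1.0)
```

In this sketch the feedforward function returns the same value every time it sees the same input, while the recurrent function's second output differs from its first even though both inputs are 1.0.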
In addition, there's another type of complexity affecting recurrent neural networks. One excellent example of this is in the field of natural language processing.
In sophisticated natural language processing, the neural network needs to be able to remember things and to take inputs in context. Suppose a program needs to analyze or predict a word within a sentence. The system may, for example, evaluate a fixed-length window of five words. That means the neural network has to have an input for each of these words, along with the ability to “remember” or train on the context of these words. For these and similar reasons, recurrent neural networks typically have hidden loops and feedback connections.
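As a rough illustration of the fixed-length idea, the sketch below slices a sentence into runs of five consecutive words. The function name context_windows and the sample sentence are made up for this example, and a real system would typically convert the words to numeric vectors before feeding them to a network.

```python
def context_windows(words, size=5):
    """Return every run of `size` consecutive words in the sentence."""
    return [words[i:i + size] for i in range(len(words) - size + 1)]

sentence = "the cat sat on the mat today".split()
windows = context_windows(sentence)

for window in windows:
    print(window)
```

Each window could then be presented to the network one word at a time, so that the hidden state carries the context of the earlier words forward to the later ones.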
Experts lament that these complications make it difficult to train the networks. One of the most common ways to explain this is by citing the exploding and vanishing gradient problem. Essentially, as the error signal is passed back through a large number of steps, it is repeatedly multiplied by the weights of the network, so the gradients either explode or vanish.
Neural network pioneer Geoff Hinton explains this phenomenon on the web by saying that because the backward pass is linear, small weights will cause the gradients to shrink exponentially and large weights will cause them to explode.
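A minimal sketch of the effect, assuming a single scalar recurrent weight and a purely linear backward pass, so that the gradient is simply multiplied by that weight once per time step. The function name backpropagated_gradient and the chosen values are illustrative assumptions; real networks use weight matrices and nonlinear activations, but the exponential behavior is the same in spirit.

```python
def backpropagated_gradient(w_rec, steps, grad=1.0):
    """Scale a gradient by the recurrent weight once per time step."""
    for _ in range(steps):
        grad *= w_rec
    return grad

# After 50 steps the gradient is w_rec ** 50.
vanishing = backpropagated_gradient(0.5, steps=50)  # shrinks toward zero
exploding = backpropagated_gradient(1.5, steps=50)  # grows very large

print(f"weight 0.5: {vanishing:.3e}")
print(f"weight 1.5: {exploding:.3e}")
```

With a weight below 1 the gradient that reaches the early time steps is too small to drive learning, and with a weight above 1 it is large enough to destabilize the update. Increasing the number of steps makes both effects more extreme.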
This problem, he continues, gets worse with long sequences and many time steps, over which the signals grow or decay. Weight initialization may help, but these challenges are built into the recurrent neural network model and will always come with its design. Essentially, some of the more complex types of neural networks defy our ability to manage them easily. We can create a practically unlimited amount of complexity, but predictability and scalability challenges often grow along with it.