This article compares the implementation details of the original Transformer and GPT. These tricks are critical to performance but are not always explained in the papers.
When predicting PEAD (post-earnings-announcement drift) from earnings call transcripts, I found that Transformers (deep learning models) have a larger performance lead on extreme data points (those at the tails of the distribution).