> I would check whether the stride and the width are in fact the same and
> then use the single memcpy in that case. Only when they differ use the
> slow path.

Done - I use memcpy_cpi, which does this internally!
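
For reference, the check under discussion looks roughly like the sketch below. The helper name copy_plane and its parameters are made up for illustration; this is not the actual memcpy_cpi code, only a minimal version of the fast-path/slow-path split described above, assuming strides are given in bytes and are never negative.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not the real memcpy_cpi): copy `height` rows of
 * `bytes_per_line` bytes each from src to dst. Strides are in bytes and
 * assumed to be >= bytes_per_line. */
static void copy_plane(unsigned char *dst, const unsigned char *src,
                       size_t bytes_per_line, size_t height,
                       size_t dst_stride, size_t src_stride)
{
    if (height == 0 || bytes_per_line == 0)
        return;

    if (dst_stride == src_stride && src_stride == bytes_per_line) {
        /* Fast path: no padding between rows, so the whole plane is one
         * contiguous block and a single memcpy is enough. */
        memcpy(dst, src, bytes_per_line * height);
        return;
    }

    /* Slow path: at least one side has padding, so copy row by row and
     * advance each pointer by its own stride. */
    for (size_t y = 0; y < height; y++) {
        memcpy(dst, src, bytes_per_line);
        dst += dst_stride;
        src += src_stride;
    }
}
```

With a 3-byte-wide source padded to a 4-byte stride, copy_plane(dst, src, 3, 3, 3, 4) takes the slow path and packs the rows tightly into dst; when all three values match, the call collapses to one memcpy.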