The cheapest way to transpose a matrix in a computer is to simply change the labelling of the locations. To understand this, we should first understand how a computer typically stores a vector and a matrix.
Suppose we had a column vector with n entries, and the computer stores the entries in memory slots 67 to 67+n-1. Then, when you ask the computer for the ith entry, it does not go to memory slot 67 and count i slots, it simply goes to slot 67+i (assuming 0-indexing) directly. This is different from how you or I do it when looking at a vector written on a page---we have to count some number of slots from the beginning (or end!) to get to the entry we want.
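A minimal sketch of this addressing rule, using a dictionary as a toy model of RAM (the starting slot 67 is just the number from the text, and a real machine would use a pointer rather than a dictionary):

```python
BASE = 67     # illustrative starting slot, as in the text
memory = {}   # toy model of RAM: slot number -> value

def store_vector(v):
    """Put the entries of v in consecutive slots starting at BASE."""
    for i, x in enumerate(v):
        memory[BASE + i] = x

def vector_entry(i):
    """Fetch the ith entry with one address computation -- no counting."""
    return memory[BASE + i]

store_vector([10, 20, 30])
```

The point is that `vector_entry` costs the same no matter how large `i` is: one addition, one lookup.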
Suppose we had a matrix with n rows and m columns, and the computer stores the entries in memory slots 85 to 85+mn-1. The canonical thing to do is to store the matrix in row-major order: the first row, then the second row, then the third, etc. Then, when you ask the computer for the (row,column)=(i,j)th entry, it does not count across two dimensions of a grid the way you or I would (the memory doesn't even know the data is laid out in a grid). Instead, it just goes to slot 85+im+j.
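The same slot arithmetic in code, again with a toy dictionary standing in for RAM (85 is just the illustrative base slot from the text):

```python
BASE = 85     # illustrative starting slot, as in the text
memory = {}   # toy model of RAM: slot number -> value

def store_row_major(rows):
    """Store a matrix (a list of equal-length rows) row by row from BASE."""
    m = len(rows[0])
    for i, row in enumerate(rows):
        for j, x in enumerate(row):
            memory[BASE + i * m + j] = x

def entry(i, j, m):
    """(i, j) entry of an n x m row-major matrix: one multiply, one add."""
    return memory[BASE + i * m + j]

store_row_major([[1, 2, 3],
                 [4, 5, 6]])
```

Note that the number of rows n never appears in the address computation; only the row length m does.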
Suppose you want to transpose the above matrix. The fastest way is not to move any of the entries at all. Instead, the fastest way is to change the algorithm for accessing them. It is easy to see that if you want the (i,j)th entry of the transposed matrix, you should simply look at slot 85+jm+i---we have simply switched the roles of rows and columns in the addressing system.
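So a "transpose" can be nothing more than a second accessor over the same slots. A sketch, with the same toy memory and the matrix stored row-major from slot 85:

```python
BASE = 85
memory = {}   # toy model of RAM: slot number -> value

def store_row_major(rows):
    m = len(rows[0])
    for i, row in enumerate(rows):
        for j, x in enumerate(row):
            memory[BASE + i * m + j] = x

def entry(i, j, m):
    # (i, j) of the original n x m matrix.
    return memory[BASE + i * m + j]

def transposed_entry(i, j, m):
    # (i, j) of the transpose is (j, i) of the original:
    # slot BASE + j*m + i, and no data moves anywhere.
    return memory[BASE + j * m + i]

store_row_major([[1, 2, 3],
                 [4, 5, 6]])
```

Here the transpose of the stored 2x3 matrix is the 3x2 matrix [[1,4],[2,5],[3,6]], and `transposed_entry` reads it out without touching a single slot.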
This is quite handy. But it's not always the best thing to do. There are a number of tasks where you need a whole row. Suppose you need the ith row. Then just go to memory slot 85+im, the first entry of the row, and read until you reach 85+(i+1)m-1, the last entry of the same row. Reading these entries is fast because they sit in consecutive slots. But if you need a whole column, you have to jump around memory quite a bit, because the entries of a single column are spaced at regular intervals of m slots, and that is quite slow. So, if you're going to need whole columns frequently and whole rows infrequently, you're better off moving the entries around so that the columns are stored contiguously. Thus, we need to come up with an algorithm for transposing a matrix in memory.
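The contrast is easy to see with the matrix stored in a flat Python list (a stand-in for the run of memory slots): a row is one contiguous slice, while a column is a strided walk, one entry every m slots.

```python
def read_row(a, i, m):
    """Row i of a row-major matrix stored flat in a: a contiguous slice."""
    return a[i * m : (i + 1) * m]

def read_col(a, j, n, m):
    """Column j: every mth slot -- n scattered reads, poor locality."""
    return a[j : n * m : m]

flat = [1, 2, 3,
        4, 5, 6]   # a 2 x 3 matrix, row-major
```

(In Python both are slices, so they cost about the same; on real hardware the strided version loses because each read lands in a different cache line.)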
Suppose you had a matrix that took up more than half your memory (for example, you're Google and you're trying to diagonalize the PageRank matrix, which is so big you have to store it on multiple computers). Then you can't just write out a copy in the transposed order, because you wouldn't have enough room for it. Suppose you've used up 99.999% of your memory and you still want to transpose a matrix. Is it possible to do so without needing more space than the matrix itself occupies (plus, say, a couple of entries' worth of scratch space)? This question is usually phrased as: "is it possible to transpose a matrix in place?" or "is it possible to transpose a matrix in O(1) space?" The problem is this: every time you write a value into a memory slot, you must first rescue the entry currently stored there into your scratch space; otherwise you fail, having overwritten a piece of the matrix that you cannot get back. But rescuing that piece dictates your next move, because you simply don't have enough space to keep many rescued pieces around. If you just moved something from memory slot 331 to slot 743, your next immediate step is to move the old entry of 743 someplace else.
For square matrices, this is really easy, because the size of a row in memory at the beginning will be the same as the size of the row after transposition (because it is square, the columns have the same number of slots as the rows). So, if you were moving the entry at memory slot 331 to slot 743, then just move the entry from 743 to slot 331. Do this for a well-chosen half of the matrix (the upper triangle is a fine choice) and you will have accomplished the task.
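For the square case the swap-the-upper-triangle recipe takes only a few lines. A sketch, with the matrix stored flat and row-major in a Python list:

```python
def transpose_square(a, n):
    """Transpose an n x n row-major matrix, stored flat in list a, in place."""
    for i in range(n):
        for j in range(i + 1, n):   # upper triangle only
            # Python's tuple swap hides the single temporary slot.
            a[i * n + j], a[j * n + i] = a[j * n + i], a[i * n + j]

mat = [1, 2,
       3, 4]
transpose_square(mat, 2)
# mat is now [1, 3, 2, 4]
```

Each swap uses one entry of scratch space, and each off-diagonal pair is touched exactly once, so this is O(1) space as promised.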
However, things become significantly more complicated when the matrix is not square, because you can't just swap pairs of entries. It's conceivable that every unforced move could force you to move many of the entries in turn. Interestingly, Wikipedia has an article dedicated to the solution of this problem ("In-place matrix transposition"). The REALLY interesting part is that the problem is solved and the solution is nice.
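One standard approach from that article is cycle-following. For an n x m row-major matrix of size nm, the entry at flat index k = im+j must move to jn+i, which works out to (kn) mod (nm-1) for 0 < k < nm-1 (the first and last entries stay put). The moves therefore decompose into cycles, and each cycle can be walked rescuing one entry at a time, exactly the forced-move chain described above. A sketch (the leader-finding trick, processing each cycle only from its smallest index, is one common way to avoid extra bookkeeping space):

```python
def transpose_in_place(a, n, m):
    """Transpose an n x m row-major matrix stored flat in a, in O(1) space.

    The permutation sends flat index k to (k * n) mod (n*m - 1); since
    n*m is congruent to 1 mod (n*m - 1), its inverse is multiplication
    by m. We walk each cycle once, rescuing a single entry at a time.
    """
    size = n * m
    if size <= 2:
        return
    last = size - 1
    for start in range(1, last):
        # Skip this cycle unless start is its smallest index.
        k = (start * n) % last
        while k > start:
            k = (k * n) % last
        if k < start:
            continue
        # Walk the cycle backwards: src(k) = (k * m) mod last is the
        # entry that must land in slot k.
        tmp = a[start]          # the one rescued entry
        k = start
        while True:
            s = (k * m) % last
            if s == start:
                a[k] = tmp
                break
            a[k] = a[s]
            k = s

mat = [0, 1, 2,
       3, 4, 5]               # 2 x 3, row-major
transpose_in_place(mat, 2, 3)  # now 3 x 2, row-major
# mat is now [0, 3, 1, 4, 2, 5]
```

The subtle cost is in finding the cycle leaders: with only O(1) space you can't remember which slots you've visited, so each candidate start has to re-trace its cycle, which is why the truly efficient algorithms on the Wikipedia page are cleverer than this sketch.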