1.) First, the client tells the server whether it wants to cache images, and which faceset it wants.
1A.) In addition, when the client first connects, it tells the server which image set it wants to use. The server then provides the image data/checksum data for the set the client requested.

1.1) There are two main cases: either the client is caching images, or it isn't.

1.1.1) If the client is caching, there are two ways this may work. In the first, before an image is used, the server sends the image number, checksum, and image name to the client. The client then checks whether it has an image that matches, and if so, uses it.

1.1.1A) If the client doesn't find a match, it sends a request to the server for image X (this is a bit simplistic - if the client has an image with the right name but a different checksum, it could choose to keep its own image).

In this case (cache images, but don't get all the data before play), a user may briefly see the ? images during play. This is because the client needs to display something, but doesn't have the proper image yet.
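
The caching flow in 1.1.1 and 1.1.1A can be sketched roughly as follows. This is only an illustration: the class, method names, and the b"?" placeholder are hypothetical and are not taken from the real client code. It assumes the cache is keyed by image name and that a hit requires a matching checksum.

```python
from dataclasses import dataclass


@dataclass
class CachedImage:
    checksum: int
    data: bytes


class ClientImageCache:
    """Hypothetical client-side image cache, keyed by image name."""

    def __init__(self):
        self.by_name = {}        # image name -> CachedImage
        self.face_to_image = {}  # server image number -> image bytes
        self.pending = set()     # image numbers we have asked the server for

    def on_face_info(self, num, checksum, name):
        """Handle the (number, checksum, name) triple sent before first use.

        Returns the image number to request from the server,
        or None if the cached copy matches and can be used directly.
        """
        cached = self.by_name.get(name)
        if cached is not None and cached.checksum == checksum:
            self.face_to_image[num] = cached.data
            return None
        # No match: remember the request; the caller sends it to the server.
        self.pending.add(num)
        return num

    def on_image_data(self, num, checksum, name, data):
        """Store image data received from the server in response to a request."""
        self.by_name[name] = CachedImage(checksum, data)
        self.face_to_image[num] = data
        self.pending.discard(num)

    def image_for(self, num, placeholder=b"?"):
        """Return the image to draw; the placeholder stands in for the ? image."""
        return self.face_to_image.get(num, placeholder)
```

Until on_image_data() is called for a pending image number, image_for() returns the placeholder, which corresponds to the brief ? images described above.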

1.1.2) If the client isn't caching, the server sends the image data to the client before the client needs the face. The server tracks that it has sent image X, so it only needs to send each image once.
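
A minimal sketch of the "send once" bookkeeping in 1.1.2, assuming the server keeps one set of already-sent image numbers per connected client. The names ClientSession, ensure_face, send, and load_image are hypothetical and chosen for illustration only.

```python
class ClientSession:
    """Hypothetical per-client state kept by the server."""

    def __init__(self, send):
        self.send = send         # callable(num, data) that writes to the client
        self.faces_sent = set()  # image numbers already sent to this client

    def ensure_face(self, num, load_image):
        """Send image `num` if this client has not received it yet.

        Returns True if the image was sent now, False if it was sent earlier.
        """
        if num in self.faces_sent:
            return False
        self.send(num, load_image(num))
        self.faces_sent.add(num)
        return True
```

The server would call ensure_face() just before sending anything that refers to the face, so the image data always arrives before the client needs it.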

1.2) The second case is the downloadallimages (or something like that) option. Here the client requests the checksums for a group of images at once, and the server replies with the checksums for that group, which is faster and more efficient than handling images one at a time.
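
The grouped checksum exchange in 1.2 might look something like the sketch below. It assumes images are numbered from 0 and that checksums can be compared per image number; the batch size, function names, and dictionary layout are illustrative assumptions, not the actual protocol.

```python
def chunk_ranges(total, batch_size):
    """Yield inclusive (start, stop) image-number ranges covering 0..total-1."""
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total) - 1


def images_to_download(server_checksums, local_checksums):
    """Return the image numbers whose local copy is missing or out of date.

    Both arguments map image number -> checksum.
    """
    return sorted(
        num
        for num, checksum in server_checksums.items()
        if local_checksums.get(num) != checksum
    )
```

The client would request one range at a time, compare the returned checksums against its cache, and then ask only for the images listed by images_to_download().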

2.) If the user gets the crossfire-images archive and installs it, the client starts out with a bunch of images, so if it caches images, it won't need to download them.