We should probably get the terminology right.
On an image sensor, “pixels” do not capture light. A “pixel” exists only in the reconstructed image.
A “photosite” is what responds to light, and it takes more than one photosite to create a full-color pixel. Nor is a pixel built from equal numbers of the three primary colors: a typical Bayer pattern uses two green photosites for every red and every blue one.
www.cambridgeincolour.com
The physical size of a photosite directly determines the minimum noise of that site: bigger photosites collect more photons, which means less noise per site.
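To put rough numbers on that, here is a quick Python sketch using photon shot noise only (Poisson statistics). The photon counts are invented for illustration, and real sensors add read noise and dark current on top:

```python
import math

# Shot-noise-only model: a photosite collecting N photons sees Poisson
# noise of sqrt(N), so its signal-to-noise ratio is N / sqrt(N) = sqrt(N).
# The photon counts below are made up; read noise and dark current
# are deliberately ignored.

def shot_noise_snr(photons):
    return photons / math.sqrt(photons)

small = 1_000     # a small photosite collects fewer photons
big = 4 * small   # 4x the area -> roughly 4x the photons, same scene

print(round(shot_noise_snr(small), 1))  # 31.6
print(round(shot_noise_snr(big), 1))    # 63.2 -- 4x the area, 2x the SNR
```

So quadrupling the light-gathering area only doubles the SNR, but the direction is clear: bigger photosites, less noise.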
Because cameras now all use extensive post-processing, there is only a rough relationship between the size of an image sensor's photosites and the sensor's noise vs. temperature. Processing is a game changer. But in general, as far as the sensor itself goes, smaller photosites = higher noise (pre-processing).
RE: film vs. digital: a photosite responds linearly to light. Twice the light, twice the charge on the sensor. Eyes respond non-linearly to light, as all senses do to their corresponding stimuli. Twice the light is not perceived as twice as bright, and twice the sound pressure is not perceived as twice the loudness. Human senses compress their response, which gives them an incredibly wide dynamic range. Digital image sensors by themselves do not do this. But film sort of does: film does not have a linear response to light.
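A tiny Python sketch of that difference. The perception curve here is a simple power law with an exponent of about 0.33, a Stevens-style approximation assumed purely for illustration, not a calibrated model:

```python
# A photosite is linear: twice the photons means twice the charge.
def sensor_charge(photons, gain=1.0):
    return gain * photons

# Human brightness perception is compressive: doubling the luminance
# does not double the perceived brightness. The 0.33 exponent is an
# illustrative Stevens-style power law, not a measured value.
def perceived_brightness(luminance, exponent=0.33):
    return luminance ** exponent

print(sensor_charge(200) / sensor_charge(100))  # 2.0 -- exactly doubles

ratio = perceived_brightness(200) / perceived_brightness(100)
print(round(ratio, 2))  # ~1.26 -- well under 2.0; perception compresses
```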
You might be going to “advantage: Film”, but hang on. IF you have enough digital information, you can apply a correction curve (gamma) to the raw data and get that same type of compression. More bits are an advantage here, because if you look at how many digital “levels” exist per stop, half of the available levels are used in the very top stop before the highlights clip. That's on the raw sensor data. But the processed output (the JPEG preview you see on the back screen, along with its histogram) is already gamma-corrected, so you never see the data exactly that way until you process a raw file, and even then, some assumptions have already been made for you.
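You can see the “half the levels in the top stop” point with a few lines of Python, assuming an idealized linear 12-bit raw scale (actual raw formats vary in bit depth and black-level offset):

```python
# Idealized linear 12-bit raw scale: 4096 levels, each stop down
# halves the signal, so each stop spans half as many levels.
bits = 12
total_levels = 2 ** bits  # 4096

levels_per_stop = []
upper = total_levels
for stop in range(1, 6):          # the 5 stops just below clipping
    lower = upper // 2
    levels_per_stop.append(upper - lower)  # levels spanning this stop
    upper = lower

print(levels_per_stop)  # [2048, 1024, 512, 256, 128]
```

The very top stop gets 2048 of the 4096 levels, and the deep shadows get almost nothing, which is exactly why more bits help once a gamma curve is applied.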
This is where the Expose To The Right (ETTR) idea comes from: deliberately over-expose and use more of the available data. It's not wrong, but it also isn't done much.
So when it comes to the dynamic range of cameras and film, we are comparing apples and oranges unless we look carefully at the final, corrected image. Which, BTW, we absolutely cannot see in most comparisons of the two, especially in the video posted by the OP in post #1. Post-processing is a very big and important part of digital imaging; it cannot and should not be ignored or discounted. Add to that the fact that you can't practically scan film's full dynamic range without excessive trouble and expense.