A thermal camera that produces 24x32-pixel images is probably insufficient to train a model that also detects who is in the room. As you can see in the sample images in the notebook, in most cases the camera only picks up dim halos, and it gets a better view of the body shape only if someone is standing relatively close (within 1.5 meters). Since you would also need to cover all the possible variants (someone standing, walking, sitting etc.), the data you’ve got is clearly insufficient. With more samples you can probably train a model that detects how many people are in the room, but to detect who’s in the room the best bet is probably a combination of a thermal camera, used as a trigger, and an optical camera, which would only activate once the thermal camera detects people’s presence. The model trained in the notebook should suffice for the thermal “trigger”, while the people recognition model is probably best built with convolutional layers and trained on at least a few thousand images, to take into account all the combinations of lighting, people’s position etc.
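The two-stage setup described above could be wired together roughly as in the sketch below. It is only an illustration under stated assumptions: `presence_detected` is a hypothetical warm-pixel threshold standing in for the model trained in the notebook, and `capture_optical` and `recognize` are hypothetical callables for the optical camera and the people recognition model. The ambient temperature, temperature delta and pixel count are made-up values, not measurements.

```python
from typing import Callable, List, Optional

# A thermal frame: 24 rows x 32 columns of temperatures in degrees Celsius.
ThermalFrame = List[List[float]]


def presence_detected(
    frame: ThermalFrame,
    ambient_c: float = 22.0,  # assumed room temperature (illustrative)
    delta_c: float = 3.0,     # assumed minimum excess over ambient (illustrative)
    min_pixels: int = 8,      # assumed minimum number of warm pixels (illustrative)
) -> bool:
    """Hypothetical stand-in for the trained thermal model.

    Counts the pixels that are noticeably warmer than the room. In a real
    setup this function would call the model trained in the notebook.
    """
    warm = sum(1 for row in frame for t in row if t - ambient_c >= delta_c)
    return warm >= min_pixels


def identify_people(
    thermal_frame: ThermalFrame,
    capture_optical: Callable[[], object],
    recognize: Callable[[object], List[str]],
) -> Optional[List[str]]:
    """Run the optical stage only when the thermal trigger fires.

    Returns None if nobody is detected, so the optical camera is never
    activated for an empty room.
    """
    if not presence_detected(thermal_frame):
        return None
    image = capture_optical()  # hypothetical optical camera grab
    return recognize(image)    # hypothetical people recognition model


if __name__ == "__main__":
    # Empty room: every pixel at the assumed ambient temperature.
    empty_room = [[22.0] * 32 for _ in range(24)]

    # Same room with a small warm blob, roughly a person standing close.
    occupied_room = [row[:] for row in empty_room]
    for r in range(10, 14):
        for c in range(10, 14):
            occupied_room[r][c] = 30.0

    def fake_camera() -> str:
        return "optical-frame"

    def fake_recognizer(image: object) -> List[str]:
        return ["someone"]

    print(identify_people(empty_room, fake_camera, fake_recognizer))
    print(identify_people(occupied_room, fake_camera, fake_recognizer))
```

Keeping the trigger and the recognizer behind plain callables means the threshold stand-in can be swapped for the notebook’s model, and the fake camera and recognizer for real ones, without touching the gating logic.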