In recent years, a vast amount of visual information has been generated and stored through computers, television, the Internet, surveillance cameras, and digital libraries. Viewers, however, are typically interested only in meaningful events, such as sports highlights or abnormal activities in surveillance footage. Delivering such events quickly requires recognizing them automatically. Video event recognition begins by decomposing a video into smaller segments called shots. To characterize a shot efficiently, we extract temporal-scale-invariant key frames, termed temporal interest points (TIPs), which correspond to instants of motion discontinuity that attract attention in the human visual system. Using the image features (color, edge, and motion) contained in the TIPs, we can recognize and categorize meaningful events, for example golf highlights and indoor human activities. This work could be employed in content-based video retrieval systems on the Internet, and should be useful to beginners and professionals who study scale-space theory in computer vision, as well as to anyone interested in video indexing and retrieval.
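To make the idea of key frames at motion-discontinuous instants concrete, the following is a minimal illustrative sketch and not the method developed in this work. It assumes each frame of a shot has already been reduced to a single motion-magnitude value (for instance, a mean optical-flow magnitude), smooths that signal with a Gaussian at one temporal scale `sigma`, and keeps the frames where the absolute temporal derivative peaks. The function names, the `sigma` and `threshold` parameters, and the toy signal are all hypothetical.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel truncated at about 3 * sigma."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(signal, sigma):
    """Gaussian-smooth a per-frame signal, clamping indices at the borders."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for t in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = min(max(t + k - radius, 0), n - 1)
            acc += w * signal[idx]
        out.append(acc)
    return out


def motion_discontinuities(motion, sigma=1.0, threshold=0.1):
    """Return frame indices where the smoothed motion changes most abruptly.

    Assumption for illustration only: a discontinuity is a local maximum
    of the absolute central-difference derivative that reaches `threshold`.
    """
    smoothed = smooth(motion, sigma)
    n = len(smoothed)
    deriv = [0.0] * n
    for t in range(1, n - 1):
        deriv[t] = abs(smoothed[t + 1] - smoothed[t - 1]) / 2.0
    picks = []
    for t in range(1, n - 1):
        is_peak = deriv[t] > deriv[t - 1] and deriv[t] >= deriv[t + 1]
        if is_peak and deriv[t] >= threshold:
            picks.append(t)
    return picks


# Toy shot: no motion for 10 frames, then steady motion for 10 frames.
toy_motion = [0.0] * 10 + [1.0] * 10
candidate_frames = motion_discontinuities(toy_motion, sigma=1.0)
```

On the toy signal, the only candidate lies at the step between frames 9 and 10. Repeating the detection over several values of `sigma` and keeping the instants that persist across scales is one simple way to approximate the temporal-scale invariance mentioned above, again as an assumption rather than a description of the actual procedure.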