I have to implement a system that analyses the image data captured by
a video camera mounted in the window of a shop.
(The captured images are in greyscale format.)

I need to perform basic activity recognition, person tracking and
person recognition.

The activity recognition aims to determine the level of interest
in the items displayed in the shop window. There are two types of
activities that need to be recognized: (1) a person walking past the
shop and (2) a person looking at the shop window. By counting the
number of persons who stop and look at the window for some time, the
level of interest in the shop can then be derived.

The person tracking and recognition aims to determine how many
different persons pass in front of the shop. This information can
again be used to determine the level of interest in the shop (a person
may be returning to get another look at the items on display) as well
as to identify possible criminal intent (“scoping the place”).

Therefore, the system needs to be able to track and identify all
persons in the scene and label their overall activity.
The activity needs to be labeled based on the cumulative information
about each tracked person. Specifically, each new person entering the
scene is given the “person walking” activity label. If they stop in
front of the shop, their activity label changes to “person looking at
the shop window”. The system receives its input as a sequence of
images.
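
To make the labelling rule concrete, here is a rough sketch of what I
have in mind, in Python. It assumes a tracker already gives me one
centroid per person per frame. The class name ActivityLabeler and the
thresholds max_step and min_still_frames are placeholders I made up,
not part of the requirements, and "stopped" is my own interpretation:
the centroid moves by at most max_step pixels for min_still_frames
consecutive frames.

```python
import math

WALKING = "person walking"
LOOKING = "person looking at the shop window"


class ActivityLabeler:
    """Keeps the cumulative activity label for one tracked person."""

    def __init__(self, max_step=2.0, min_still_frames=10):
        # Both thresholds are assumed values and would need tuning.
        self.max_step = max_step                  # pixels per frame
        self.min_still_frames = min_still_frames  # consecutive frames
        self.last_pos = None
        self.still_frames = 0
        self.label = WALKING  # every new person starts as "walking"

    def update(self, centroid):
        """Feed the person's (x, y) centroid for the next frame."""
        if self.last_pos is not None:
            step = math.dist(centroid, self.last_pos)
            if step <= self.max_step:
                self.still_frames += 1
            else:
                self.still_frames = 0
        self.last_pos = centroid
        # Once a person has stopped long enough, the overall label
        # stays "looking" even if they walk away afterwards.
        if self.still_frames >= self.min_still_frames:
            self.label = LOOKING
        return self.label
```

I would keep one ActivityLabeler per tracked identity and call
update() once per frame. I am not sure whether the label should revert
to “person walking” when the person moves on, so this sketch keeps it
fixed once it has changed.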

The system should be able to do the following:
a) build a suitable average frame from a given sequence of images
b) clearly specify how many persons are present in each of the test
frames, as well as:
   i) the position and identity of each person (using a bounding box)
   and the label of their activity
   ii) whether any of the persons has been seen before in the
   sequence
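
For (a) and the counting part of (b), this is the kind of thing I was
considering, as a minimal sketch using NumPy only. It assumes the
frames are 2-D greyscale arrays of equal size, that a pixel-wise mean
is an acceptable "average frame", and that each person shows up as one
connected blob after thresholding. The function names and the
threshold and min_area values are my own guesses.

```python
from collections import deque

import numpy as np


def average_frame(frames):
    """Pixel-wise mean of a sequence of greyscale frames (H x W)."""
    stack = np.stack([np.asarray(f, dtype=np.float64) for f in frames])
    return stack.mean(axis=0)


def foreground_mask(frame, background, threshold=30.0):
    """True where the frame differs from the background by > threshold."""
    diff = np.abs(np.asarray(frame, dtype=np.float64) - background)
    return diff > threshold


def bounding_boxes(mask, min_area=4):
    """Boxes (top, left, bottom, right), inclusive, one per 4-connected
    foreground blob with at least min_area pixels."""
    h, w = mask.shape
    seen = np.zeros_like(mask, dtype=bool)
    boxes = []
    for r0, c0 in zip(*np.nonzero(mask)):
        if seen[r0, c0]:
            continue
        # Breadth-first flood fill over one blob.
        seen[r0, c0] = True
        queue = deque([(r0, c0)])
        top, left, bottom, right, area = r0, c0, r0, c0, 0
        while queue:
            r, c = queue.popleft()
            area += 1
            top, bottom = min(top, r), max(bottom, r)
            left, right = min(left, c), max(right, c)
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if (0 <= nr < h and 0 <= nc < w
                        and mask[nr, nc] and not seen[nr, nc]):
                    seen[nr, nc] = True
                    queue.append((nr, nc))
        if area >= min_area:
            boxes.append((int(top), int(left), int(bottom), int(right)))
    return boxes
```

The number of boxes per frame would then be the person count, but
this breaks down when people overlap or cast strong shadows, and it
does nothing for identity or for recognising someone who returns.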

So I need some advice on how I can achieve this.

Thanks