Hello folks,

Consider a hypothetical situation in which one has to statistically analyze a large Mercurial or Git repository. By large I mean 10, 20, 30 or however many years' worth of commits. People have done this before but, to the best of my knowledge, only on a small section of a repository, with the intent of analyzing recent activity.

It looks to me like analyzing a repository statistically could, for instance, help identify critical sections of it. My assumption might be wrong, but subsystems getting a lot of activity are either seeing new development, being repaired, or maybe something else interesting is happening. Maybe, maybe not. Given the above, it would probably be a very interesting adventure to analyze some of the huge repositories at my disposal.
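To make that concrete, here is a minimal Perl sketch of what I have in mind, under a couple of assumptions of my own: a local git clone, git on the PATH, and top-level directories standing in for "subsystems". It tallies changed-file entries per directory across the whole history as a crude activity measure.

  #!/usr/bin/perl
  # Rough hot-spot survey: count changed-file entries per top-level
  # directory over the entire history (a proxy for subsystem activity).
  use strict;
  use warnings;

  my %changes_per_dir;

  # --name-only lists the files touched by each commit; the empty
  # --pretty=format: suppresses the usual commit headers.
  open my $log, '-|', 'git', 'log', '--all', '--name-only', '--pretty=format:'
      or die "cannot run git log: $!";

  while (my $path = <$log>) {
      chomp $path;
      next unless length $path;
      my ($top) = split m{/}, $path;    # first path component
      $changes_per_dir{$top}++;
  }
  close $log;

  # Most active directories first.
  for my $dir (sort { $changes_per_dir{$b} <=> $changes_per_dir{$a} }
               keys %changes_per_dir) {
      printf "%8d  %s\n", $changes_per_dir{$dir}, $dir;
  }

Note this counts file changes rather than distinct commits, so a commit touching many files in one directory is weighted more heavily; whether that is the right measure is part of what I would like input on.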

My preferred development languages for this adventure would be Perl and R: Perl for data extraction and R for analysis, including graphing. Not being very familiar with these two languages, I don't have a clear picture of their capabilities. Given the large amounts of data that will probably be involved, it would be great if someone had some input on this.
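My rough plan for the Perl-to-R handoff (an assumption on my part, not something I've benchmarked) is to have Perl stream git log once and write one CSV row per commit, which R can then load with read.csv(). Something like:

  #!/usr/bin/perl
  # Dump one CSV row per commit: hash, author, unix epoch, files changed,
  # insertions, deletions. File name and fields are just my first guess.
  use strict;
  use warnings;

  # The 'C|' prefix distinguishes commit header lines from --numstat lines.
  my $fmt = '--pretty=format:C|%H|%ae|%at';

  open my $log, '-|', 'git', 'log', '--all', '--numstat', $fmt
      or die "cannot run git log: $!";
  open my $csv, '>', 'commits.csv' or die "cannot write commits.csv: $!";
  print {$csv} "hash,author,epoch,files,insertions,deletions\n";

  my ($hash, $author, $epoch, $files, $ins, $del);
  my $flush = sub {
      print {$csv} join(',', $hash, $author, $epoch, $files, $ins, $del), "\n"
          if defined $hash;
  };

  while (my $line = <$log>) {
      chomp $line;
      if ($line =~ /^C\|([0-9a-f]+)\|([^|]*)\|(\d+)$/) {
          $flush->();                          # finish the previous commit
          ($hash, $author, $epoch) = ($1, $2, $3);
          ($files, $ins, $del) = (0, 0, 0);
      }
      elsif ($line =~ /^(\d+|-)\t(\d+|-)\t/) { # numstat: added, deleted, path
          $files++;
          $ins += $1 unless $1 eq '-';         # '-' marks binary files
          $del += $2 unless $2 eq '-';
      }
  }
  $flush->();                                  # and the last commit
  close $log;
  close $csv;

A Mercurial equivalent would presumably lean on hg log --template, but I haven't worked that out yet.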

Also, instead of dealing with text files, it would probably be great to access the blobs/databases directly. This would also probably speed things up, but my familiarity with VCS internals is not at expert level yet. Is this possible? Could someone offer some input on this?
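My current understanding, which may well be wrong, is that git's object database is a content-addressed store rather than a conventional database, and that the usual routes in are either CPAN bindings such as Git::Raw (a libgit2 wrapper) or git's own plumbing commands, which are meant for machine consumption. As an example of the latter, a sketch like this surveys the object store without going through any porcelain text formats (it assumes a git new enough to support --batch-all-objects):

  #!/usr/bin/perl
  # Summarise the object database by object type, using plumbing only.
  use strict;
  use warnings;

  open my $check, '-|',
      'git', 'cat-file', '--batch-check', '--batch-all-objects'
      or die "cannot run git cat-file: $!";

  my (%count_by_type, %bytes_by_type);
  while (my $line = <$check>) {
      # Each line looks like "<sha> <type> <size>".
      my (undef, $type, $size) = split ' ', $line;
      $count_by_type{$type}++;
      $bytes_by_type{$type} += $size;
  }
  close $check;

  printf "%-7s %10d objects %14d bytes\n",
         $_, $count_by_type{$_}, $bytes_by_type{$_}
      for sort keys %count_by_type;

Whether this kind of direct access is actually faster than parsing git log output for per-commit statistics is exactly the sort of thing I'd appreciate input on.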

Last but not least, any input will be considered worthwhile.

Regards

:-)