What is the exact meaning of the counts output by pga list?

I have selected some projects from PGA.
The output of pga list describes each project along with several counts. For example, here is the record for the small project CVE-pocs:

{"url":"git://github.com/0x36/CVE-pocs.git","sivaFilenames":["06aa275d3a98e44cd5c1af8a72f1c3c220689ecf.siva"],"size":18680,"license":"","langs":["C","Markdown"],"langsByteCount":[29528,64],"langsLinesCount":[1024,3],"langsFilesCount":[6,1],"emptyLinesCount":[208,0],"codeLinesCount":[561,2],"commentLinesCount":[249,0],"fileCount":7,"commitsCount":8,"branchesCount":2,"forkCount":0,"stars":72}

If I am interested in source files in C programming language which is the 1st in the list of languages, I can read there are 1024 lines in 6 C source files, 561 lines of C code, 249 lines of comments and 208 empty lines.
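To make that reading explicit, here is a minimal Python sketch that pairs each language with its counts. It assumes the count arrays are index-aligned with langs, which is how I read the record; the helper name per_language is my own and not part of pga.

```python
import json

# The CVE-pocs record as printed by pga list (copied from above).
record = json.loads("""
{"url":"git://github.com/0x36/CVE-pocs.git",
 "sivaFilenames":["06aa275d3a98e44cd5c1af8a72f1c3c220689ecf.siva"],
 "size":18680,"license":"","langs":["C","Markdown"],
 "langsByteCount":[29528,64],"langsLinesCount":[1024,3],
 "langsFilesCount":[6,1],"emptyLinesCount":[208,0],
 "codeLinesCount":[561,2],"commentLinesCount":[249,0],
 "fileCount":7,"commitsCount":8,"branchesCount":2,"forkCount":0,"stars":72}
""")

# Per-language arrays; assumed to share the ordering of "langs".
PER_LANG_KEYS = [
    "langsByteCount", "langsLinesCount", "langsFilesCount",
    "emptyLinesCount", "codeLinesCount", "commentLinesCount",
]

def per_language(rec):
    """Map each language to the counts stored at the same index."""
    return {
        lang: {key: rec[key][i] for key in PER_LANG_KEYS}
        for i, lang in enumerate(rec["langs"])
    }

stats = per_language(record)
print(stats["C"])
```

For C this gives 1024 lines in 6 files, 561 code lines, 249 comment lines and 208 empty lines, matching the numbers quoted above.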
I can verify this by unpacking the siva file with siva and cloning the resulting bare repository with git:

siva unpack siva/latest/06/06aa275d3a98e44cd5c1af8a72f1c3c220689ecf.siva bare
git clone bare worktree
cd worktree; git checkout `git branch -a | grep HEAD`

For a larger project such as mongodb, there are more counts and several siva files:

{"url":"git://github.com/mongodb/mongo.git","sivaFilenames":["1ac1389bda227bb64773bf10ba30107b3ed35231.siva","e346dd9c28af6491423a16e177b944da26d0b6e6.siva","e73188b5512c82290a4070af4afddac20d0b981e.siva"],"size":777410373,"license":"AGPL-3.0-only:0.683,AGPL-3.0-or-later:0.683,Apache-2.0:0.981,ECL-2.0:0.875,GPL-3.0-only:0.628,GPL-3.0-or-later:0.628,deprecated_AGPL-3.0:0.683,deprecated_GPL-3.0:0.628,deprecated_GPL-3.0+:0.628","langs":["Assembly","Batchfile","C","C++","CMake","CSS","CSV","DTrace","Diff","Dockerfile","Emacs Lisp","GDB","Gnuplot","Go","Gradle","Graphviz (DOT)","HTML","INI","JSON","Java","JavaScript","Lua","M4Sugar","Makefile","Markdown","Meson","Objective-C","PHP","Pascal","Perl","PowerShell","Protocol Buffer","Python","R","RPM Spec","Roff","Ruby","Rust","SQL","SVG","Shell","TOML","Text","Unix Assembly","Vim script","XML","YAML","reStructuredText"],"langsByteCount":[7372,75671,50858172,132565464,20613,75706,49219,1377,633389,517,1551,308,710,31119284,7009,1745,341779,3233,140741,300440,12516132,1647,506518,914782,511502,5389,192036,42516,26071,472586,82,4745,8523576,1064,179248,439323,9997,2281,75622,253487,2322113,3289,4536906,93020,651,526437,1284795,79993],"langsLinesCount":[340,2432,1270267,3637995,678,3165,558,35,17471,21,52,12,25,717986,281,87,7424,148,5020,8838,330249,88,13795,15963,11093,149,4683,1145,846,15297,6,227,230783,39,2758,16108,373,68,1388,504,78362,141,75954,5056,32,10138,37406,2408],"langsFilesCount":[2,31,1757,11946,7,13,9,1,28,1,1,1,1,1197,4,4,18,31,60,49,3187,2,39,87,104,2,70,7,1,17,1,1,1047,1,11,16,3,2,3,1,236,2,171,14,1,28,239,11],"emptyLinesCount":[235,0,76729,249985,74,621,0,0,0,0,0,0,0,29666,0,0,1097,0,1,1029,51271,14,0,1216,2371,0,0,48,79,1652,3,0,31622,0,0,0,48,12,0,0,0,28,0,0,0,5,2656,0],"codeLinesCount":[4372,0,683669,1281829,409,2425,0,0,0,0,0,0,0,620014,0,0,6194,0,4686,4965,223878,69,0,4863,8528,0,0,1065,428,11256,3,0,116323,0,0,0,280,47,0,0,0,84,0,0,0,1199,29939,0],"commentLinesCount":[604,0,231210,252412,163,111,0,0,0
,0,0,0,0,66449,0,0,116,0,0,2795,52068,3,0,780,0,0,0,25,338,2372,0,0,55460,0,0,0,42,7,0,0,0,27,0,0,0,14,3838,0],"fileCount":22072,"commitsCount":46044,"branchesCount":3212,"forkCount":5,"stars":16463}

The C programming language is the 3rd in the list of languages, so I read that there are 1,270,267 lines in 1,757 C source files, 683,669 lines of C code, 231,210 lines of comments and 76,729 empty lines.

But I can only extract 297 files in total (11, 286 and 0 files respectively) from the 3 siva files.

What is the exact meaning of fileCount?
Are missing files referenced in other branches? (there is more than 1 branch with HEAD reference)

Thank you for your help

Hello Olivier,

Happy to see you are interested in PGA!

TL;DR: fileCount is the number of files in the HEAD of the main repository. It does not include files found in the HEAD of forks.

So, let’s get into it. The information printed by pga list is read from an index created by pga-create index, a tool you will find here. If we look at the output for mongodb, we see that there are 5 forks of this repository. This is why, if you run pga siva list on each of the 3 Siva files and then grep for refs/heads/HEAD, you will find 6 references: one for the mongodb repo and one per fork. Similarly, if you run pga siva dump on each of the 3 Siva files, you will end up with:

[romain@moon] ~ $ ls  | grep 0169
0169ecdc-3c1b-fbb8-8b06-4b36632d0a30
0169ed06-7b28-e3cc-a0c4-0ea7f6979333
0169ed06-7b30-4f26-9d99-a0f34e665000
0169ed08-ddb4-8107-ff86-0dbf76d1d2f3
0169ed0c-7059-8f9b-57ee-3bf7fa954cef
0169ed10-6362-fd90-a22b-34c0aa248957

This command extracts the data contained in each commit that a HEAD reference points to. In this case, the one of interest is 0169ed06-7b30-4f26-9d99-a0f34e665000, as it is the one created from the HEAD of mongodb/mongo.git. If you count the files in it, you will find 21,276, a number close to the fileCount attribute of the index, which is 22,072. You may have noticed that if you sum the number of files from the langsFilesCount attribute, you obtain a third number: 20,465.

Let me explain:

  • In order to classify files, we rely on src-d/enry. Although good, this tool is not flawless, and in some cases it may be unable to classify a file. This is why there are fewer files in langsFilesCount than in the repository we extracted.
  • As for why ~800 files appear to be missing compared to fileCount, there may be multiple reasons, such as the removal of certain large unidentified files or data loss during the creation of the dataset. We try to improve the quality of the dataset at each iteration, and hopefully when v3 comes out you will be less inconvenienced.
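The arithmetic behind these three numbers can be checked with a short Python sketch. The list is copied from the langsFilesCount attribute of the mongodb record in the question; the variable names are only illustrative.

```python
# langsFilesCount for mongodb/mongo, copied from the pga list output.
langs_files_count = [
    2, 31, 1757, 11946, 7, 13, 9, 1, 28, 1, 1, 1, 1, 1197, 4, 4,
    18, 31, 60, 49, 3187, 2, 39, 87, 104, 2, 70, 7, 1, 17, 1, 1,
    1047, 1, 11, 16, 3, 2, 3, 1, 236, 2, 171, 14, 1, 28, 239, 11,
]

file_count = 22072            # fileCount attribute of the index
extracted_head_files = 21276  # files counted in the dumped HEAD of mongodb/mongo

# Files that were assigned a language.
classified = sum(langs_files_count)

# Files present in the extracted HEAD but absent from the language counts.
unclassified = extracted_head_files - classified

# Files counted in the index but not found in the extracted HEAD.
missing = file_count - extracted_head_files

print(classified, unclassified, missing)
```

The last difference is the "~800 files" gap mentioned in the second bullet.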

The pga siva unpack command, which you used to create the bare repository, should extract the contents of all files from the Siva (without duplication) and put them in the specified directory. This means that all versions of each file, independent of the commit, will be unpacked. By extracting its HEAD commit from the git data, or any other commit, you should be able to obtain the appropriate repository, minus errors on our part.
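If you prefer to count the files of one particular commit directly in the unpacked bare repository, without checking out a worktree, a sketch along these lines may help. It only relies on the standard git ls-tree command. The function names are hypothetical, and the reference you pass in is an assumption on my side: use whichever name pga siva list printed for the HEAD you care about.

```python
import subprocess

def parse_ls_tree(output):
    """Split NUL-separated `git ls-tree -z --name-only` output into paths."""
    return [path for path in output.split("\0") if path]

def files_at(repo_dir, ref):
    """List every file path in the tree of `ref` inside `repo_dir`.

    `repo_dir` may be a bare repository; `ref` is any commit-ish,
    for example one of the HEAD references stored in the Siva file.
    """
    result = subprocess.run(
        ["git", "-C", repo_dir, "ls-tree", "-r", "-z", "--name-only", ref],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_ls_tree(result.stdout)

# Example call (not executed here, the reference name is a placeholder):
#   paths = files_at("bare", "<reference printed by pga siva list>")
#   print(len(paths))
```

Since this lists a single tree, the count covers only that commit and not every file version stored in the Siva file.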

I hope this helps,
Cheers,
Romain

PS: We are currently creating a v2 of the pga CLI, which enables the download of Parquet files containing UASTs of the files in the HEAD commit. We will update the documentation accordingly soon.