How We Counted GitHub Users
Stratified random sampling across GitHub's full numeric ID space — 7 strata, bootstrap confidence intervals, unbiasedness verified.
Sampling Approach
GitHub assigns sequential numeric IDs to every user and organization. As of February 2026, the maximum observed ID (the frontier) is 262,206,000. Not all IDs are valid: some belong to accounts that have been deleted or suspended, and others were never assigned.
We divide the ID space into 7 strata and randomly sample within each, checking via the GitHub API whether each ID resolves to a real account. Each stratum's validity rate, multiplied by the stratum's size and summed across strata, estimates the total valid population.
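The sampling loop described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline actually used: the function names, the inclusive (low, high) range representation, and the `is_valid` callback (a stand-in for whatever API lookup decides that an ID resolves to a real account) are all assumptions made for the example.

```python
import random
from typing import Callable, List, Sequence, Tuple

Stratum = Tuple[int, int]  # hypothetical representation: inclusive (low_id, high_id)


def sample_validity_rates(
    strata: Sequence[Stratum],
    sample_sizes: Sequence[int],
    is_valid: Callable[[int], bool],  # assumed stand-in for the API check
    seed: int = 0,
) -> List[float]:
    """Draw IDs uniformly without replacement in each stratum; return hit rates."""
    rng = random.Random(seed)
    rates = []
    for (low, high), n in zip(strata, sample_sizes):
        ids = rng.sample(range(low, high + 1), n)
        hits = sum(1 for account_id in ids if is_valid(account_id))
        rates.append(hits / n)
    return rates


def estimate_total(strata: Sequence[Stratum], rates: Sequence[float]) -> float:
    """Scale each stratum's validity rate by its size and sum across strata."""
    return sum((high - low + 1) * p for (low, high), p in zip(strata, rates))
```

Passing the validity check in as a callback keeps the sketch testable without network access; a real run would wrap an authenticated API request with rate limiting and retries.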
Strata Definitions
The ID space is partitioned into 7 non-overlapping ranges, each sampled in proportion to its size.
Mathematical Framework
Five equations — from estimator design to confidence interval to unbiasedness proof.
Stratified Estimator
Total valid users = each stratum's estimated count, summed:
Here M_h is the size of stratum h and p̂_h is the validity rate observed in the sample from stratum h.
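Written out with the symbols M_h and p̂_h, the estimator takes a form like the one below. The name \hat{N} for the estimated total and the explicit index range 1 to 7 are notational assumptions chosen to match the seven strata.

```latex
\hat{N} \;=\; \sum_{h=1}^{7} M_h \, \hat{p}_h
```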
Variance
Analytical variance of the stratified estimator:
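One standard textbook form of this variance is shown below as a sketch. It assumes independent simple random samples of size n_h within each stratum and ignores the finite population correction, which is negligible when n_h is much smaller than M_h. The exact variant used here (for example an n_h - 1 denominator, or an explicit correction factor) is not stated, so treat the expression as illustrative.

```latex
\widehat{\operatorname{Var}}\bigl(\hat{N}\bigr)
  \;=\; \sum_{h=1}^{7} M_h^{2} \, \frac{\hat{p}_h \,\bigl(1 - \hat{p}_h\bigr)}{n_h}
```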
Proportional Allocation
Each stratum's sample size is proportional to its share of the total ID space:
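In symbols, proportional allocation amounts to n_h = n · M_h / M, where n is the total sample size and M is the sum of all stratum sizes (both names are assumptions for this example). Because the exact shares are rarely integers, some rounding rule is needed; the sketch below uses largest-remainder rounding, which is one reasonable choice and not necessarily the one used here.

```python
from typing import List, Sequence


def proportional_allocation(stratum_sizes: Sequence[int], total_samples: int) -> List[int]:
    """Split total_samples across strata in proportion to stratum size.

    Uses integer arithmetic plus largest-remainder rounding so the
    allocations always sum exactly to total_samples.
    """
    total = sum(stratum_sizes)
    floors = []
    remainders = []
    for m in stratum_sizes:
        q, r = divmod(total_samples * m, total)  # exact share = q + r / total
        floors.append(q)
        remainders.append(r)
    leftover = total_samples - sum(floors)
    # Hand the leftover samples to the strata with the largest fractional parts.
    order = sorted(range(len(floors)), key=lambda h: remainders[h], reverse=True)
    for h in order[:leftover]:
        floors[h] += 1
    return floors
```

For example, stratum sizes of 100, 200, and 700 with a budget of 10 samples give allocations of 1, 2, and 7.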
Bootstrap 95% CI
Resample with replacement 1,000 times within each stratum, recomputing the total estimate each time. The CI runs from the 2.5th to the 97.5th percentile of the resulting bootstrap distribution:
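A stratified percentile bootstrap along these lines might look like the following sketch. The function name, the 0/1 outcome lists, and the nearest-rank percentile convention are assumptions for illustration; other percentile conventions would shift the interval endpoints slightly.

```python
import random
from typing import List, Sequence, Tuple


def bootstrap_ci(
    stratum_sizes: Sequence[int],
    outcomes: Sequence[Sequence[int]],  # outcomes[h]: 0/1 validity flags sampled in stratum h
    n_boot: int = 1000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for the stratified total, resampling within strata."""
    rng = random.Random(seed)
    totals: List[float] = []
    for _ in range(n_boot):
        total = 0.0
        for m, obs in zip(stratum_sizes, outcomes):
            # Resample this stratum's observations with replacement.
            resample = rng.choices(obs, k=len(obs))
            total += m * sum(resample) / len(resample)
        totals.append(total)
    totals.sort()
    # Nearest-rank indices for the 2.5th and 97.5th percentiles (integer arithmetic).
    lo_idx = (25 * n_boot) // 1000
    hi_idx = max((975 * n_boot) // 1000 - 1, 0)
    return totals[lo_idx], totals[hi_idx]
```

Resampling each stratum separately keeps the per-stratum sample sizes fixed in every replicate, which mirrors the stratified design.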
Unbiasedness
The sample proportion is an unbiased estimator of the true proportion, so the composite estimator is unbiased:
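The argument can be spelled out by linearity of expectation. In the sketch below, p_h denotes the true validity rate in stratum h and N the true number of valid accounts; both symbols are introduced here for the derivation and assume uniform random sampling within each stratum.

```latex
\mathbb{E}\bigl[\hat{p}_h\bigr] = p_h
\quad\Longrightarrow\quad
\mathbb{E}\bigl[\hat{N}\bigr]
  = \sum_{h=1}^{7} M_h \, \mathbb{E}\bigl[\hat{p}_h\bigr]
  = \sum_{h=1}^{7} M_h \, p_h
  = N
```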