Research Methodology

How We Counted GitHub Users

Stratified random sampling across GitHub's full numeric ID space — 7 strata, bootstrap confidence intervals, unbiasedness verified.

Sampling Approach

GitHub assigns sequential numeric IDs to every user and organization. As of February 2026, the maximum observed ID (the frontier) is 262,206,000. Not all IDs are valid: some belong to accounts that have been deleted or suspended, and others were never assigned.

We divide the ID space into 7 strata and randomly sample within each, checking via the GitHub API whether each ID resolves to a real account. Multiplying each stratum's validity rate by its size, then summing across strata, estimates the total valid population.
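The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline actually used: `estimate_valid_total` and its parameters are hypothetical names, `is_valid` stands in for the GitHub API lookup (no network call is made here), and sampling without replacement is an assumption.

```python
import random
from typing import Callable, Sequence, Tuple


def estimate_valid_total(
    strata: Sequence[Tuple[int, int]],
    samples_per_stratum: Sequence[int],
    is_valid: Callable[[int], bool],
    rng: random.Random,
) -> float:
    """Estimate how many IDs across all strata resolve to real accounts.

    strata holds inclusive (low, high) ID bounds. is_valid is a
    hypothetical stand-in for the API check of a single ID.
    """
    total = 0.0
    for (low, high), n in zip(strata, samples_per_stratum):
        size = high - low + 1
        # Assumption: draw n distinct IDs uniformly at random within the stratum.
        ids = rng.sample(range(low, high + 1), n)
        p_hat = sum(1 for i in ids if is_valid(i)) / n  # stratum validity rate
        total += size * p_hat  # estimated valid count in this stratum
    return total
```

Passing the validity check in as a callable keeps the estimator testable offline: a stub such as `lambda i: i % 5 != 0` simulates an 80% validity rate without touching the API.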

Random samples: 16,000
Validity rate: 81.8%
Ground-truth IDs: 8,000
Bootstrap iterations: 1,000

Strata Definitions

The ID space is partitioned into 7 independent ranges, sampled in proportion to their size.

ID range · Size · Era
1 – 10M · 10M · Earliest accounts
10M – 50M · 40M · 2012–2015 growth era
50M – 100M · 50M · 2016–2018 expansion
100M – 150M · 50M · 2019–2020 boom
150M – 200M · 50M · 2021–2022 growth
200M – 250M · 50M · 2023–2024 era
250M – now · live (grows with the frontier) · Latest accounts (2025+)

Mathematical Framework

Five equations — from estimator design to confidence interval to unbiasedness proof.

Stratified Estimator

Total valid users = each stratum's estimated count, summed:

M_h = stratum size · p̂_h = validity rate in stratum h
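In symbols, the description above matches the standard stratified estimator of a population total. The names \hat{N}, v_h (valid IDs found in stratum h), and n_h (IDs sampled in stratum h) are notation assumed here for illustration; only M_h and \hat{p}_h come from the legend.

```latex
\hat{N} = \sum_{h=1}^{7} M_h \, \hat{p}_h ,
\qquad
\hat{p}_h = \frac{v_h}{n_h}
```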

Variance

Analytical variance of the stratified estimator:
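A textbook form of this variance, assuming independent simple random samples in each stratum and ignoring the finite-population correction (negligible when n_h is tiny relative to M_h), is shown below. Whether the original analysis used n_h or n_h - 1 in the denominator is not stated, so treat this as a sketch with n_h as assumed notation for the stratum sample size.

```latex
\widehat{\operatorname{Var}}(\hat{N})
  = \sum_{h=1}^{7} M_h^{2} \, \frac{\hat{p}_h \,(1 - \hat{p}_h)}{n_h}
```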

Proportional Allocation

Each stratum's sample size is proportional to its share of the total ID space:
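Proportional allocation is conventionally written n_h = n · M_h / M, where n is the total sample size and M the total number of IDs. The sketch below works this out for the figures quoted on this page. The function name is hypothetical, rounding to the nearest integer is an assumption, and the size of the last stratum is inferred as the February 2026 frontier minus 250M.

```python
def proportional_allocation(stratum_sizes, total_samples):
    """Return per-stratum sample sizes n_h = n * M_h / M, rounded to integers."""
    total_ids = sum(stratum_sizes)
    return [round(total_samples * m / total_ids) for m in stratum_sizes]


FRONTIER = 262_206_000  # maximum observed ID quoted above
sizes = [
    10_000_000,              # 1 - 10M
    40_000_000,              # 10M - 50M
    50_000_000,              # 50M - 100M
    50_000_000,              # 100M - 150M
    50_000_000,              # 150M - 200M
    50_000_000,              # 200M - 250M
    FRONTIER - 250_000_000,  # 250M - now (assumed: frontier minus 250M)
]

allocation = proportional_allocation(sizes, 16_000)
print(allocation)  # [610, 2441, 3051, 3051, 3051, 3051, 745]
```

With these inputs the rounded allocations happen to sum to exactly 16,000. Independent rounding does not guarantee that in general, so a production version would need a tie-breaking rule.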

Bootstrap 95% CI

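Resample with replacement 1,000 times within each stratum, recomputing the total estimate each time. The 95% CI is the range between the 2.5th and 97.5th percentiles of the resulting bootstrap distribution: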
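A percentile bootstrap of this kind might look like the following sketch. The function name, the 0/1 encoding of validity outcomes, the fixed seed, and the simple index-based percentile rule are all assumptions made for illustration, not details taken from the original analysis.

```python
import random


def bootstrap_ci(stratum_sizes, outcomes, iterations=1000, seed=0):
    """Percentile bootstrap 95% CI for the stratified total.

    outcomes[h] is a list of 0/1 validity results observed in stratum h.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(iterations):
        total = 0.0
        for size, results in zip(stratum_sizes, outcomes):
            # Resample this stratum's outcomes with replacement, same sample size.
            resample = rng.choices(results, k=len(results))
            total += size * sum(resample) / len(resample)
        estimates.append(total)
    estimates.sort()
    # Simple index-based percentiles (assumed rule; other conventions exist).
    lower = estimates[int(0.025 * (iterations - 1))]
    upper = estimates[int(0.975 * (iterations - 1))]
    return lower, upper
```

Resampling within each stratum separately, rather than pooling all samples, preserves the stratified design in every bootstrap replicate.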

Unbiasedness

The sample proportion is an unbiased estimator of the true proportion, so the composite estimator is unbiased:
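The argument can be spelled out with linearity of expectation. Here p_h denotes the true validity rate in stratum h and N the true number of valid accounts; both symbols are notation assumed for this sketch, and M_h is treated as a known constant.

```latex
\mathbb{E}[\hat{N}]
  = \sum_{h=1}^{7} M_h \, \mathbb{E}[\hat{p}_h]
  = \sum_{h=1}^{7} M_h \, p_h
  = N
```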