Research Methodology

How We Counted GitHub Users

Stratified random sampling across GitHub's full numeric ID space — 7 strata, bootstrap confidence intervals, unbiasedness verified.

Sampling Approach

GitHub assigns sequential numeric IDs to every user and organization. As of February 2026, the maximum observed ID (the frontier) is 262,206,000. Not all IDs are valid: some belong to accounts that have been deleted or suspended, and others were never assigned.

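We divide the ID space into 7 strata and randomly sample within each, checking via the GitHub API whether each ID resolves to a real account. Each stratum's validity rate, multiplied by the stratum's size, estimates the number of valid accounts in that stratum; summing across strata estimates the total valid population.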
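The per-stratum check can be sketched roughly as follows. This is a minimal illustration, not the pipeline actually used: the helper names `id_exists_via_api` and `estimate_validity_rate` are hypothetical, the choice of the REST endpoint `GET /user/{id}` is an assumption about how IDs were resolved, and sampling without replacement is likewise assumed.

```python
import random
import urllib.error
import urllib.request
from typing import Callable


def id_exists_via_api(user_id: int) -> bool:
    """Return True if the GitHub REST API resolves this numeric ID.

    Assumption: uses GET https://api.github.com/user/{id}. Unauthenticated
    calls are tightly rate-limited, so a real run would need a token,
    retries, and backoff (all omitted here).
    """
    url = f"https://api.github.com/user/{user_id}"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        # Rate limits and server errors say nothing about validity.
        raise


def estimate_validity_rate(
    lo: int,
    hi: int,
    n: int,
    is_valid: Callable[[int], bool],
    rng: random.Random,
) -> float:
    """Sample n distinct IDs uniformly from [lo, hi]; return the valid fraction."""
    ids = rng.sample(range(lo, hi + 1), n)
    hits = sum(1 for i in ids if is_valid(i))
    return hits / n
```

Passing the checker in as a parameter keeps the sampling logic testable offline: a stand-in such as `lambda i: i % 5 != 0` can replace `id_exists_via_api` when no network access is available.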

Random samples: 16,000
Validity rate: 81.8%
Ground-truth IDs: 8,000
Bootstrap iterations: 1,000

Strata Definitions

The ID space is partitioned into 7 non-overlapping ranges, each sampled in proportion to its size.

Stratum · ID range · Size · Era
F1 · 1 – 10M · 10M IDs · Earliest accounts
F2 · 10M – 50M · 40M IDs · 2012–2015 growth era
F3 · 50M – 100M · 50M IDs · 2016–2018 expansion
F4 · 100M – 150M · 50M IDs · 2019–2020 boom
F5 · 150M – 200M · 50M IDs · 2021–2022 growth
F6 · 200M – 250M · 50M IDs · 2023–2024 era
F7 · 250M – now · live size (grows with the frontier) · Latest accounts (2025+)

Mathematical Framework

Five equations — from estimator design to confidence interval to unbiasedness proof.

1. Stratified Estimator

Total valid users = each stratum's estimated count, summed:

M_h = stratum size · p̂_h = validity rate in stratum h
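In symbols, the sum described above can be written as follows. The label \hat{N} for the estimated total is chosen here for illustration; the index h runs over the 7 strata.

```latex
\hat{N} = \sum_{h=1}^{7} M_h \, \hat{p}_h
```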

2. Variance

Analytical variance of the stratified estimator:
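A standard plug-in form is shown below, assuming simple random sampling within each stratum, independence across strata, and no finite-population correction. Here n_h denotes the number of IDs sampled in stratum h. The exact variant used (for example, n_h − 1 in the denominator) is not stated, so this should be read as one common choice rather than the definitive formula.

```latex
\widehat{\operatorname{Var}}(\hat{N}) = \sum_{h=1}^{7} M_h^{2} \, \frac{\hat{p}_h \, (1 - \hat{p}_h)}{n_h}
```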

3. Proportional Allocation

Each stratum's sample size n_h is proportional to its share of the total ID space: n_h = n · M_h / M, where n is the total sample size and M is the sum of all stratum sizes.
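As a rough illustration, proportional allocation over the seven strata can be computed as below. The stratum sizes come from the strata definitions, with F7 sized against the February 2026 frontier of 262,206,000. The function name `proportional_allocation` and the largest-remainder rounding rule are assumptions made for this sketch; how fractional sample sizes were actually rounded is not stated.

```python
# Stratum sizes in IDs. F7 is the frontier (262,206,000) minus 250M.
STRATUM_SIZES = {
    "F1": 10_000_000,
    "F2": 40_000_000,
    "F3": 50_000_000,
    "F4": 50_000_000,
    "F5": 50_000_000,
    "F6": 50_000_000,
    "F7": 12_206_000,
}


def proportional_allocation(sizes: dict[str, int], n: int) -> dict[str, int]:
    """Split n samples across strata in proportion to stratum size.

    Uses integer arithmetic for the floors, then hands the leftover
    samples to the strata with the largest remainders so the
    allocation sums to exactly n.
    """
    total = sum(sizes.values())
    alloc = {h: n * m // total for h, m in sizes.items()}
    leftover = n - sum(alloc.values())
    by_remainder = sorted(sizes, key=lambda h: (n * sizes[h]) % total, reverse=True)
    for h in by_remainder[:leftover]:
        alloc[h] += 1
    return alloc


allocation = proportional_allocation(STRATUM_SIZES, 16_000)
print(allocation)
```

Under these assumptions, 16,000 samples split into 610 for F1, 2,441 for F2, 3,051 each for F3 through F6, and 745 for F7.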

4. Bootstrap 95% CI

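Resample with replacement within each stratum 1,000 times, recomputing the stratified estimate each time. The 95% CI spans the 2.5th to 97.5th percentiles of the resulting bootstrap distribution.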
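A minimal sketch of a stratified percentile bootstrap follows. The function name `bootstrap_ci`, the data layout (one list of 0/1 validity outcomes per stratum), and the two toy strata in the usage lines are all hypothetical; they only illustrate the resampling scheme, not the real data.

```python
import random
import statistics


def bootstrap_ci(
    sizes: dict[str, int],
    outcomes: dict[str, list[int]],
    iters: int = 1000,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile 95% CI for the stratified total.

    outcomes[h] holds one 0/1 entry per sampled ID in stratum h
    (1 = the ID resolved to a real account).
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(iters):
        total = 0.0
        for h, obs in outcomes.items():
            # Resample with replacement inside the stratum only.
            resample = rng.choices(obs, k=len(obs))
            total += sizes[h] * sum(resample) / len(resample)
        estimates.append(total)
    # n=40 yields cut points at 2.5%, 5%, ..., 97.5%.
    cuts = statistics.quantiles(estimates, n=40)
    return cuts[0], cuts[-1]


# Toy data: two made-up strata, not the real GitHub samples.
toy_sizes = {"A": 1000, "B": 2000}
toy_outcomes = {"A": [1] * 80 + [0] * 20, "B": [1] * 50 + [0] * 50}
low, high = bootstrap_ci(toy_sizes, toy_outcomes)
print(low, high)
```

Resampling inside each stratum, rather than pooling all samples, preserves the per-stratum sample sizes in every bootstrap replicate, which matches the stratified design.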

5. Unbiasedness

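Within each stratum, the sample proportion is an unbiased estimator of the true proportion. Because the composite estimator is a weighted sum of these proportions with fixed weights M_h, linearity of expectation makes it unbiased as well: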
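Spelled out, the argument takes one line. Here p_h denotes the true validity rate in stratum h and N the true number of valid accounts; both symbols are introduced here for illustration, and the stratum sizes M_h are treated as known constants.

```latex
\mathbb{E}[\hat{N}]
  = \sum_{h=1}^{7} M_h \, \mathbb{E}[\hat{p}_h]
  = \sum_{h=1}^{7} M_h \, p_h
  = N
```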

Estimator Properties