Creative usages of bisect/binary search

I gave a talk at the ljubljana meetup in July 2026 about bisect, also known as binary search.

The talk started with presenting a number guessing game, which is at whatsmynumber (go play!)

In this post I’ll show some creative usages of bisect, and specifically how to implement them in python.

\(\sqrt{y}\)

For positive whole square numbers, like 1, 4, and 3025, the square root of \(x\) is between \(0\) and \(x+1\).

We want to use python’s bisect.bisect_left, which gets a list-like object, an entry to put in, a lower bound and an upper bound.

Let’s transform the question “what is the square root of 3025?” into “what is the smallest number \(x\) such that \(x^2\ge 3025\)?”

Here’s how we answer that second question in python using bisect.bisect_left:

import bisect

class Sqrt:
  def __getitem__(self, x):
    return x**2 >= 3025

bisect.bisect_left(Sqrt(), True, 0, 3025+1)

Our list-like Sqrt() object, when called with [x], answers our new question. We want to find the smallest \(x\) for which Sqrt()[x] is True, and that’s what we’re telling biesct_left with the second argument True. We know the answer is between 0 and 3025+1, which are the next two arguments.

Running the above gives the correct answer of \(55\).

Optimizing number of threads

Say we have a large number of json files to process, and we want to find the optimal number of threads to do so.

We expect the throughput, meaning the number of processed files per second, to grow when we add a second thread, a third third, and so on, until there are too many threads, and then it gets worse with each new thread.

This phenomenon is sometimes called “thread contention”. The threads are fighting for the same limited resources, in this case cpu time.

We can transform the question “what’s the optimal number of threads?” into “what is the least number of threads at which point adding another one is worse?”

import bisect
import time
import functools
from concurrent.futures import ThreadPoolExecutor

pretend_json_files = range(1_000_000)

def pretend_process_file(filepath):
    time.sleep(0.2) # pretend file read
    y = sum(range(1_000_000))  # pretend json processing

@functools.cache
def throughput_of(n_threads, number_files=100):
    with ThreadPoolExecutor(n_threads) as pool:
        t1 = time.perf_counter()
        list(pool.map(pretend_process_file, pretend_json_files[:number_files]));
        t2 = time.perf_counter()
        return number_files/(t2 - t1)

class IsWorse:
    def __getitem__(self, n):
        return throughput_of(n + 1) < throughput_of(n)

# exponential search
n = 1
while not IsWorse()[n]:
    n *= 2

n = bisect.bisect_left(IsWorse(), True, 1, n)

Our processing function is pretend_process_files. We have a function to measure the processing throughout given a number of threads, throughput_of, and we cache its results so that don’t compute times for a given number of threads more than once.

Our list-like IsWorse() object checks where adding one more thread makes the throughput worse.

In order to use bisect we need an upper bound on the number of threads, and we don’t have one!

You can say the number cores the machine has, but this makes sense if processing the json is mostly cpu work. If processing is dominated by reading from disk or network, having more threads is often better. So we do a quick exponential search for an upper bound.

Finally, we call bisect_left to hone in on the actual number of threads.

We can note that throughput isn’t really deterministic, making this code too simplistic. A more sophisticated question could be “is adding one more thread faster in a statistically significant measure?”

Compressing files into 200MB chunks

For data pipelines that I write I like having tar.gz files (*) of batches of data files I’m processing.

For example, if I’m processing a million images for product classification.

It’s nice to have the files around the same size, say 200MB. How do we make that happen?

Bisect!

import bisect
import random
import tarfile

files = [(f"image{i}.jpg", random.randbytes(32768)) for i in range(10000)]

def files_to_targz(files):
    buf = io.BytesIO()
    with tarfile.open(mode="w:gz", fileobj=buf) as tar:
        for name, data in files:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

class archive_over_200MB:
    def __getitem__(self, n):
        return len(files_to_targz(files[:n])) > 200*2**20

bisect.bisect_left(archive_over_200MB(), True, 0, len(files)+1)

(*) actually, I used to like tar.gz, but it doesn’t support random access. zip files do and I’ve moved to mostly using them instead.

What’s my number?

We can put whatever we want in our list-like object and bisect as we please.

And so, I have chosen a number, it’s at drorspei.com/whatsmynumber.

How do we do this with bisect_left?

import httpx
import bisect

token = httpx.get(f"https://drorspei.com/whatsmynumber/get_token", timeout=10).json()['token']

class guess_over_n:
    def __getitem__(self, x):
        resp = httpx.get(f"https://drorspei.com/whatsmynumber/guess?token={token}&n={x}")
        assert resp.status_code == 200
        assert 'no guesses left, refresh!' not in resp.text
        return 'too large' in resp.text

i = 1
while not guess_over_n()[2**i]:
    i += 1

bisect.bisect_left(guess_over_n(), True, 0, 2**i)

But does this find my number in time?

Good luck!