An exploration of Benford's Law for Arch Linux package upgrades

yuszuv/benford

Benford's Law

A beginner's journey to using AI for data exploration.

State of the art as of: 2025/03/14

Motivation

Inspired by watching pacman upgrade logs and by a fantastic TV series about maths, I tried out what Cursor IDE can do and whether it could help me prove (or refute) that "Benford's Law" holds for the packages' file sizes.

In a first session, the AI returned amazing plots ... that contained a major bug: the script also recognized the download speed (MiB/s) as a file size. I, in turn, noticed this by running an AI-generated Jupyter notebook.

This notebook can be used to produce two insightful plots.

Note: I did the part with the logarithmic scale myself; Cursor failed dramatically on that.
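The notebook itself is not reproduced here. As a minimal sketch of the kind of comparison such an analysis rests on, the snippet below computes the leading-digit frequencies of a list of sizes and prints them next to the Benford expectation log10(1 + 1/d). The function names and the sample sizes are hypothetical and chosen only for illustration; they are not taken from benford.py or from a real pacman.log.

```python
import math
from collections import Counter


def benford_expected(digit: int) -> float:
    """Probability of `digit` (1-9) being the leading digit under Benford's Law."""
    return math.log10(1 + 1 / digit)


def leading_digit(value: float) -> int:
    """First significant digit of a positive number."""
    # In scientific notation the first character is the first significant digit.
    return int(format(value, ".15e")[0])


def leading_digit_frequencies(values):
    """Observed relative frequency of each leading digit 1-9."""
    digits = [leading_digit(v) for v in values if v > 0]
    if not digits:
        raise ValueError("need at least one positive value")
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


# Hypothetical sizes, for illustration only (not taken from a real log).
sample_sizes = [1.2, 13.5, 104.0, 2.7, 19.9, 310.0, 1.05, 45.0, 7.3]
observed = leading_digit_frequencies(sample_sizes)
for d in range(1, 10):
    print(d, f"observed={observed[d]:.2f}", f"expected={benford_expected(d):.2f}")
```

With only a handful of values the observed frequencies will be noisy; a real log with many upgrades is needed before the comparison says anything.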

Jupyter Notebook

As Cursor IDE told me:

  1. First install Jupyter Notebook if you haven't already: pip install notebook numpy matplotlib

  2. Then launch Jupyter Notebook by running: jupyter notebook (not by trying to run the .ipynb file directly)

  3. This will open your web browser with the Jupyter interface. Navigate to where you saved your notebook's .ipynb file and click on it to open.

Other findings

Cursor IDE prompts

In a previous chat (whose history I have unfortunately deleted) I explored a pacman.log to reason about whether "Benford's Law" holds for the sizes of Arch Linux package upgrades. On the way I created (in that order) benford.py, histogram.py and histogram.go. At the end, I chatted with you about why the law probably doesn't hold for small and very large file sizes.

help me to turn my results into a jupyter notebook.

returned more or less the complete notebook here

there's a bug in the script. the regular expression also matches the download speed (the column next to the size). correct this

pattern = r'(\d+(?:\.\d+)?)\s*(MiB|KiB)\s+[^\n]*$'

use a look ahead assertion instead

pattern = r'(\d+(?:\.\d+)?)\s*(MiB|KiB)(?=\s+\d+(?:\.\d+)?\s*MiB/s)'

use a negative look ahead assertion to match any but "MiB" followed by "/s"

pattern = r'(\d+(?:\.\d+)?)\s*(MiB|KiB)(?!\s+MiB/s)'

did you mean pattern = r'(\d+(?:\.\d+)?)\s*(MiB|KiB)(?!/s)'?

returned a praise of me
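To see what the final pattern with `(?!/s)` changes, here is a small sketch that runs it next to a naive pattern without the lookahead. The sample line is hypothetical; the actual column layout of the pacman output is an assumption.

```python
import re

# Naive pattern: any number followed by MiB or KiB, including the "MiB" in "MiB/s".
NAIVE = re.compile(r'(\d+(?:\.\d+)?)\s*(MiB|KiB)')

# Pattern with the negative lookahead: reject a unit directly followed by "/s".
FIXED = re.compile(r'(\d+(?:\.\d+)?)\s*(MiB|KiB)(?!/s)')

# Hypothetical line with a size column followed by a speed column.
line = "example-package  12.5 MiB  3.4 MiB/s 00:03"

naive_matches = NAIVE.findall(line)  # picks up both the size and the speed
fixed_matches = FIXED.findall(line)  # picks up the size only

print("naive:", naive_matches)
print("fixed:", fixed_matches)
```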

add this negative look ahead assertion to the go version

sizeRegex := regexp.MustCompile(`(\d+(?:\.\d+)?)\s*(MiB|KiB)(?!/s)`), which doesn't work

this returns: panic: regexp: Compile((\d+(?:\.\d+)?)\s*(MiB|KiB)(?!\/s)): error parsing regexp: invalid or unsupported Perl syntax: (?!

   sizeRegex := regexp.MustCompile(`(\d+(?:\.\d+)?)\s*(MiB|KiB)`)
   speedRegex := regexp.MustCompile(`/s`)

   ...

       if len(matches) >= 3 && !speedRegex.MatchString(matches[0]) {
   ...
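One thing to watch in the two-regex approach: as far as the excerpt shows, `matches[0]` holds only the text matched by `sizeRegex`, and that text can never contain `/s`, so the extra check may not filter anything. A lookahead-free alternative is to capture an optional trailing `/s` in a third group and skip matches where it is present. The sketch below shows the idea in Python, to stay consistent with the notebook; the pattern uses no lookahead, so it should carry over to Go's `regexp` package, where one would presumably test the third submatch instead. The function name and sample line are hypothetical.

```python
import re

# The optional third group captures a trailing "/s"; no lookahead is needed.
SIZE_OR_SPEED = re.compile(r'(\d+(?:\.\d+)?)\s*(MiB|KiB)(/s)?')


def extract_sizes(line):
    """Return (number, unit) pairs for sizes, skipping download speeds."""
    sizes = []
    for number, unit, per_second in SIZE_OR_SPEED.findall(line):
        if per_second:
            # The unit was followed by "/s", so this is a speed, not a size.
            continue
        sizes.append((float(number), unit))
    return sizes


# Hypothetical line with a size column followed by a speed column.
print(extract_sizes("example-package  12.5 MiB  3.4 MiB/s 00:03"))
```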
