public-law/open-gov-crawlers

Open-gov spiders written in Python

Crawlers and parsers for extracting legal glossary and regulation data from official government sources. This repository powers Public.Law’s free legal dictionary and statute archive. Each source has a dedicated spider, parser, and test module.
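
The repository is laid out roughly like this (paths taken from the Contributing section below):

public_law/
    spiders/        one Scrapy spider per source
    parsers/        one parser per source
tests/
    public_law/
        parsers/    one test module per parser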

Jurisdiction        Name                                          Source code                Dataset
Australia           Family, domestic and sexual violence...      parser | spider | tests    json
Australia           IP Glossary                                  parser | spider | tests    json
Canada              Dept. of Justice Legal Glossaries            parser | spider | tests    json
Canada              Glossary of Parliamentary Terms for...       parser | spider | tests    json
Intergovernmental   Rome Statute                                 parser | spider | tests    json
Ireland             Glossary of Legal Terms                      parser | spider | tests    json
New Zealand         Glossary                                     parser | spider | tests    json
USA                 US Courts Glossary                           parser | spider | tests    json
USA                 USCIS Glossary                               parser | spider | tests    json
USA / Georgia       Attorney General Opinions                    parser | spider | tests
USA / Oregon        Oregon Administrative Rules                  parser | spider | tests

The Ireland glossary parser is the best example of our coding style. See the wiki for a technical explanation of our parsing strategy.
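
For a flavor of that style, here is a minimal sketch of a glossary parser written as a pure function. The names, the dataclass, and the <dt>/<dd> markup are all hypothetical; the real modules live under public_law/parsers/:

from dataclasses import dataclass

from parsel import Selector  # the selector library Scrapy itself uses


@dataclass(frozen=True)
class GlossaryEntry:
    phrase: str
    definition: str


def parse_glossary(html: str) -> list[GlossaryEntry]:
    """Pure function: HTML text in, structured entries out."""
    selector = Selector(text=html)
    return [
        GlossaryEntry(phrase=phrase.strip(), definition=defn.strip())
        for phrase, defn in zip(
            selector.css("dt::text").getall(),  # assumes <dt>/<dd> term pairs
            selector.css("dd::text").getall(),
        )
    ]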

Example: Oregon Administrative Rules Parser

Each spider retrieves HTML pages and outputs well-formed JSON that mirrors the source's structure. First, we can see which spiders are available:

$ scrapy list

aus_ip_glossary
can_doj_glossaries
int_rome_statute
...

Then we can run one of the spiders:

$ scrapy crawl --overwrite-output tmp/output.json usa_or_regs

This produces:

{
  "date_accessed": "2019-03-21",
  "chapters": [
    {
      "kind": "Chapter",
      "db_id": "36",
      "number": "101",
      "name": "Oregon Health Authority, Public Employees' Benefit Board",
      "url": "https://secure.sos.state.or.us/oard/displayChapterRules.action?selectedChapter=36",
      "divisions": [
        {
          "kind": "Division",
          "db_id": "1",
          "number": "1",
          "name": "Procedural Rules",
          "url": "https://secure.sos.state.or.us/oard/displayDivisionRules.action?selectedDivision=1",
          "rules": [
            {
              "kind": "Rule",
              "number": "101-001-0000",
              "name": "Notice of Proposed Rule Changes",
              "url": "https://secure.sos.state.or.us/oard/view.action?ruleNumber=101-001-0000",
              "authority": [
                "ORS 243.061 - 243.302"
              ],
              "implements": [
                "ORS 183.310 - 183.550",
                "192.660",
                "243.061 - 243.302",
                "292.05"
              ],
              "history": "PEBB 2-2009, f. 7-29-09, cert. ef. 8-1-09<br>PEBB 1-2009(Temp), f. &amp; cert. ef. 2-24-09 thru 8-22-09<br>PEBB 1-2004, f. &amp; cert. ef. 7-2-04<br>PEBB 1-1999, f. 12-8-99, cert. ef. 1-1-00",
              }
            ]
          }
        ]
      }
    ]
  }

(etc.)

The Wiki explains the JSON strategy.
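
Downstream code can walk this structure directly. A minimal sketch (note that Scrapy's JSON feed export normally wraps items in a top-level list, so we unwrap defensively):

import json

with open("tmp/output.json") as f:
    data = json.load(f)

# The feed may be a list of items; the example above shows a single item.
item = data[0] if isinstance(data, list) else data

for chapter in item["chapters"]:
    for division in chapter["divisions"]:
        for rule in division["rules"]:
            print(rule["number"], rule["name"])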

Development Environment Notes

Python 3.10

I'm using asdf to manage Python versions because its Homebrew distribution is more up-to-date than pyenv's.

Poetry for dependency management

Before I start working, I enter the virtual environment:

poetry shell

It's also good practice to make sure the current dependencies are installed:

poetry install

Pytest for testing

The whole test suite runs with a single command:

pytest

Other tools

  • Java is required by the Python Tika package, which launches a local Java-based Tika server behind the scenes (see the sketch after this list).
  • Pylance/Pyright for type-checking
  • Black for formatting
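
A minimal sketch of Tika in use, with a hypothetical file path (the first call starts the Java Tika server, which is why Java must be installed):

from tika import parser

parsed = parser.from_file("tmp/some-opinion.pdf")  # hypothetical path
print(parsed["content"])  # the extracted plain text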

Contributing

To add a new glossary crawler:

  1. Pick a source and add a new spider under public_law/spiders/ (a sketch of the new files follows this list).
  2. Write a parser in public_law/parsers/ that extracts terms and metadata.
  3. Add a test case under tests/public_law/parsers/.
  4. Run the spider: scrapy crawl --overwrite-output tmp/output.json <spider_name>.
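
As a rough sketch of steps 1 and 3, with hypothetical module, class, and fixture names throughout (model the real files on the Ireland modules mentioned above):

# public_law/spiders/usa/example_glossary.py (hypothetical)
from scrapy import Spider
from scrapy.http import HtmlResponse

from public_law.parsers.usa.example_glossary import parse_glossary  # hypothetical module


class ExampleGlossary(Spider):
    name = "usa_example_glossary"
    start_urls = ["https://example.gov/legal-glossary"]  # hypothetical source

    def parse(self, response: HtmlResponse):
        # Delegate the real work to the parser module.
        yield from parse_glossary(response.text)


# tests/public_law/parsers/usa/test_example_glossary.py (hypothetical)
from pathlib import Path

from public_law.parsers.usa.example_glossary import parse_glossary  # hypothetical module


def test_finds_entries():
    html = Path("tests/fixtures/usa/example-glossary.html").read_text()  # hypothetical fixture
    assert len(parse_glossary(html)) > 0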

Need help? Just ask in GitHub Issues or ping @robb.
