public-law/open-gov-crawlers

Open-gov spiders written in Python

Crawlers and parsers for extracting legal glossary and regulation data from official government sources. This repository powers Public.Law’s free legal dictionary and statute archive. Each source has a dedicated spider, parser, and test module.
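
The repository is laid out roughly like this (paths taken from the Contributing section below):

public_law/
    spiders/        one Scrapy spider per source
    parsers/        one parser per source
tests/
    public_law/
        parsers/    one test module per parser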

Jurisdiction        Name                                          Source code                Dataset
Australia           Family, domestic and sexual violence...      parser | spider | tests    json
Australia           IP Glossary                                  parser | spider | tests    json
Canada              Dept. of Justice Legal Glossaries            parser | spider | tests    json
Canada              Glossary of Parliamentary Terms for...       parser | spider | tests    json
Intergovernmental   Rome Statute                                 parser | spider | tests    json
Ireland             Glossary of Legal Terms                      parser | spider | tests    json
New Zealand         Glossary                                     parser | spider | tests    json
USA                 US Courts Glossary                           parser | spider | tests    json
USA                 USCIS Glossary                               parser | spider | tests    json
USA / Georgia       Attorney General Opinions                    parser | spider | tests
USA / Oregon        Oregon Administrative Rules                  parser | spider | tests

The Ireland glossary parser is the best example of our coding style. See the wiki for a technical explanation of our parsing strategy.
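
For a flavor of that style, here is a minimal sketch of a glossary parser written as a pure function. The names, the dataclass, and the <dt>/<dd> markup are all hypothetical; the real modules live under public_law/parsers/:

from dataclasses import dataclass

from parsel import Selector  # the selector library Scrapy itself uses


@dataclass(frozen=True)
class GlossaryEntry:
    phrase: str
    definition: str


def parse_glossary(html: str) -> list[GlossaryEntry]:
    """Pure function: HTML text in, structured entries out."""
    selector = Selector(text=html)
    return [
        GlossaryEntry(phrase=phrase.strip(), definition=defn.strip())
        for phrase, defn in zip(
            selector.css("dt::text").getall(),  # assumes <dt>/<dd> term pairs
            selector.css("dd::text").getall(),
        )
    ]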

Example: Oregon Administrative Rules Parser

Each spider retrieves HTML pages and outputs well-formed JSON that mirrors the source's structure. First, we can see which spiders are available:

$ scrapy list

aus_ip_glossary
can_doj_glossaries
int_rome_statute
...

Then we can run one of the spiders:

$ scrapy crawl --overwrite-output tmp/output.json usa_or_regs

This produces:

{
  "date_accessed": "2019-03-21",
  "chapters": [
    {
      "kind": "Chapter",
      "db_id": "36",
      "number": "101",
      "name": "Oregon Health Authority, Public Employees' Benefit Board",
      "url": "https://secure.sos.state.or.us/oard/displayChapterRules.action?selectedChapter=36",
      "divisions": [
        {
          "kind": "Division",
          "db_id": "1",
          "number": "1",
          "name": "Procedural Rules",
          "url": "https://secure.sos.state.or.us/oard/displayDivisionRules.action?selectedDivision=1",
          "rules": [
            {
              "kind": "Rule",
              "number": "101-001-0000",
              "name": "Notice of Proposed Rule Changes",
              "url": "https://secure.sos.state.or.us/oard/view.action?ruleNumber=101-001-0000",
              "authority": [
                "ORS 243.061 - 243.302"
              ],
              "implements": [
                "ORS 183.310 - 183.550",
                "192.660",
                "243.061 - 243.302",
                "292.05"
              ],
              "history": "PEBB 2-2009, f. 7-29-09, cert. ef. 8-1-09<br>PEBB 1-2009(Temp), f. &amp; cert. ef. 2-24-09 thru 8-22-09<br>PEBB 1-2004, f. &amp; cert. ef. 7-2-04<br>PEBB 1-1999, f. 12-8-99, cert. ef. 1-1-00",
              }
            ]
          }
        ]
      }
    ]
  }

(etc.)

The Wiki explains the JSON strategy.
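
Downstream code can walk this structure directly. A minimal sketch (note that Scrapy's JSON feed export normally wraps items in a top-level list, so we unwrap defensively):

import json

with open("tmp/output.json") as f:
    data = json.load(f)

# The feed may be a list of items; the example above shows a single item.
item = data[0] if isinstance(data, list) else data

for chapter in item["chapters"]:
    for division in chapter["divisions"]:
        for rule in division["rules"]:
            print(rule["number"], rule["name"])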

Development Environment Notes

Python 3.10

I'm using asdf to manage Python versions because its Homebrew distribution is more up-to-date than pyenv's.

Poetry for dependency management

Before I start working, I enter the virtual environment:

poetry shell

It's also good practice to make sure the current dependencies are installed:

poetry install

Pytest for testing

The whole test suite runs with a single command:

pytest

Other tools

  • Java is required by the Python Tika package, which launches a local Java-based Tika server behind the scenes (see the sketch after this list).
  • Pylance/Pyright for type-checking
  • Black for formatting
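
A minimal sketch of Tika in use, with a hypothetical file path (the first call starts the Java Tika server, which is why Java must be installed):

from tika import parser

parsed = parser.from_file("tmp/some-opinion.pdf")  # hypothetical path
print(parsed["content"])  # the extracted plain text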

Contributing

To add a new glossary crawler:

  1. Pick a source and add a new spider under public_law/spiders/ (a sketch of the new files follows this list).
  2. Write a parser in public_law/parsers/ that extracts terms and metadata.
  3. Add a test case under tests/public_law/parsers/.
  4. Run the spider: scrapy crawl --overwrite-output tmp/output.json <spider_name>.
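
As a rough sketch of steps 1 and 3, with hypothetical module, class, and fixture names throughout (model the real files on the Ireland modules mentioned above):

# public_law/spiders/usa/example_glossary.py (hypothetical)
from scrapy import Spider
from scrapy.http import HtmlResponse

from public_law.parsers.usa.example_glossary import parse_glossary  # hypothetical module


class ExampleGlossary(Spider):
    name = "usa_example_glossary"
    start_urls = ["https://example.gov/legal-glossary"]  # hypothetical source

    def parse(self, response: HtmlResponse):
        # Delegate the real work to the parser module.
        yield from parse_glossary(response.text)


# tests/public_law/parsers/usa/test_example_glossary.py (hypothetical)
from pathlib import Path

from public_law.parsers.usa.example_glossary import parse_glossary  # hypothetical module


def test_finds_entries():
    html = Path("tests/fixtures/usa/example-glossary.html").read_text()  # hypothetical fixture
    assert len(parse_glossary(html)) > 0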

Need help? Just ask in GitHub Issues or ping @robb.
