Open standards for data schemas and serialization formats #161
Taking a first stab at sussing out what the various tools and libraries are attempting to solve. Will update as I learn more.
|
During the past few sprints, we've been working on converting this pilot's data to XBRL, using the schemas posted here: http://fedspendingtransparency.github.io/data-exchange-standard/. The learning curve is steep, and the open-source XBRL ecosystem is not ideally robust. The open-source XBRL GUI we're playing with is Arelle; the Python library we're using to create classes based on the schema files is generateDS, which we had to install manually (the PyPI version wasn't current, and pip install from Bitbucket didn't work)--it seems a bit janky. Neither of these tools appears to have a large community or to be actively updated, so I'd be reluctant to rely on them. Is XBRL an example of an open standard that's too complicated to use without commercial tools? What tools/best practices do others in the open source community use for converting data to XML/XBRL? |
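For basic structural validation of an XML instance against the posted XSDs, one open-source option is lxml's XMLSchema support. A rough sketch (file names are hypothetical, and this only checks XML structure against the schema, not XBRL-specific semantics like calculation linkbases):

```python
from lxml import etree

# Compile the XSD into a validator (file name is hypothetical).
schema = etree.XMLSchema(etree.parse("data-act-schema.xsd"))

# Parse a submission instance and check it against the schema.
instance = etree.parse("submission.xml")

if schema.validate(instance):
    print("structurally valid")
else:
    # The error log lists each violation with its location in the instance.
    for error in schema.error_log:
        print(error.line, error.message)
```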
This is great, and I'm glad that someone is thinking of this! I have a few initial thoughts to add in here now. I apologize for how scattered these messages are, but this is a surprisingly broad topic. I almost wonder if there is a better format we should be using for discussing this topic in public. Anyhow, some points for now in no particular order:
As a former data journalist, I have my own biases here, but it strikes me that heavily convoluted schemas tend to favor the first audience and disadvantage the second. And I feel they tend to disadvantage the third as well, because they make it harder for news organizations and other civic groups (treasury.io, for instance) to make complicated information more understandable to the general public.
|
One other thing that occurs to me literally after I submitted this last comment: even the most comprehensive set of validations (whether expressed as machine-readable schema relations or programmer-written post-loading tests) will often miss caveats/issues about how the data is collected or how it should be interpreted by end users:
I'm sure there are other cases, but I guess the tl;dr point I want to make is that even the most automated validation mechanisms will not remove the necessity of creating documentation for this data and the requirement of users to read it. In some cases, automated validation might make things worse by making people assume that programming errors are caught automatically... |
It seems logical to me to consider the overall manageability of a schema in terms of what needs to be done to make it usable to the (inevitably different) groups of users who benefit directly or indirectly from the schema's use. As @harrisj points out, not all users will be "domain experts" in terms of process or technology. Also, whether a schema is "open source" or not may be secondary to whether or not the schema is known and understood widely enough to support (or require) the development of maintainable tools for its management and use. I've seen instances where the lack of tools to support schema adoption has significantly slowed standards adoption, and I hope that is not the case here. |
I don't see OpenSpending in that list. Their approach is basically:
It's currently being overhauled (a victim of its own success, they're reaching the limits of the current architecture), so you could consider helping work towards the new version. |
@harrisj Thanks for weighing in. Yeah, this is a big topic to tackle in a GitHub issue 😁 Your audience breakdown is particularly helpful. And to your point that "heavily convoluted" schemas favor some groups of users over others, it seems that the opposite is not true. In other words, if a simpler schema + explicit validations could accommodate the second and third groups, the first group doesn't lose anything. I added csv to the table up here. You're right--it should be there. |
@ddmcd Good point about a lack of tools slowing a schema's adoption--thank you. Do you have a preferred way of expressing a data standard/schema? |
@stevage Thanks for the pointer to OpenSpending. The use of .csvs as the lowest common denominator for transmitting data plus JSON for structure is intriguing. The CSV + structure is similar to the approach we're using in this project's prototype, except that the mapping is done internally before data is transmitted, so there's no external insight into how it was done. Some great stuff here to think about. |
FWIW, the W3C has been working on CSV standards for the web for a while. I haven't dug too deeply into the metadata stuff, so I can't speak to its usefulness here. But I would like to register my support for CSV generally, because it's the most "backward-compatible" format (and still one of the most compact representations) out there. |
Extensible is a confusing column for me, because that could mean a variety of things. I'm assuming it refers to the ability of users to add their own additional columns beyond what is provided in the base data file. It seems like there are some caveats for what is there:
I guess I have one other question about these formats and their integer/numeric types and date types. Do they use binary encodings? Text with some sort of explicit type declaration? Or just plain text with no explicit typing? |
@bsweger I think that restricting yourself to the use of only open source tools for working with the data is a challenge. Open source is great where it fits, but falls flat where it doesn't - which is often in areas without a large/broad enough user base to have an interest in building and maintaining the open source tools. I don't think the community in this space is that large, and I think the community doesn't really care whether a tool is open source or proprietary so long as they have a tool for the job. I too have spent time looking for open source tools related to XBRL and have found few/none. My next thought, still just a thought, was to see if I could co-opt an open source, XML-ingesting ETL tool for the purpose.

One other thing you may want to consider in your analysis is the availability of learning materials and opportunities in the ecosystem. XBRL, for example, is backed by a global organization with conferences and such. There is an entire track devoted to academia, so there is research. I'm not extremely XBRL-fluent myself, having learned mostly from my interactions related to the DATA Act. It's essentially just an extended XML schema though - right? With a governance process to manage it.

Beyond the technical, something can be said for using a standard already in use by other similar entities (government or financial) for similar purposes, because all that use builds an ecosystem (open is nice, but getting the job done is better). For example, consider automated compliance rules for regulators. Sure, the rules could be hand-coded over and over in code to check the data... but I'd think building the rules into the schema would be more efficient than putting them in code when it comes to re-use of the rule elsewhere. I'm thinking rule re-use and easy broad adoption, not code re-use. Rule re-use enables broad adoption and agreement on the rules through some sort of governance and compliance mechanism.

A final thought: changing the model is often less impactful from an IT perspective than changing the code that runs the model (so long as the code truly uses the model and metadata vs. hard coding). Again, I'm still diving into the topic in my spare time and just sharing my thoughts. I wish I had more time to dig deeper! I look forward to the outcome! |
I don't know much about it yet, but kwalify applied to YAML or JSON seems to meet: Defining Structure/Type: y |
I think NIEM would be another option to consider. It's less technical and more process - but something already well-known in the government data exchange space. I'm again leaning towards the governance structure and process around the exchange protocol being more important than the technical details under the hood. |
At Open Knowledge we've been working on very simple open source methods for shipping around data under the heading of "Data Packages" for some years now:
|
@catherinedevlin @HerschelC @rgrp above table has been updated with your suggestions--thank you! Let me know if I got anything wrong, and apologies for the somewhat awkward format of jamming all of this into a GitHub thread. The input is much appreciated! |
You might also take a look at Microformats: http://microformats.org/. |
Also, RDF (http://www.w3.org/RDF/) and RDFa (http://rdfa.info/). |
You might want to look at the Open Geospatial Consortium (OGC), http://www.opengeospatial.org/, which is a standards body specializing in open standards for geospatial data. A list of the current OGC standards can be found at http://www.opengeospatial.org/standards/is My personal opinion on open standards is that it is not an open standard unless there is an open source reference implementation of the standard. OGC has several standards, but a few that might be useful to look at are the GeoPackage data format, http://www.geopackage.org/spec/, the Web Map Service (WMS), the Catalogue Service (CSW), and the Sensor Observation Service (SOS). |
Having dealt with managing/converting hundreds of data formats and services over the last few years, I see the notable advantages that @harrisj and @stevage mentioned. I think the thread is hard to digest because Open Source tools are not being distinguished from Open Data Formats/Services/API standards. Can someone clarify/delineate the discussion? These are different creatures... Focusing on the data itself (i.e. standards/specs/formats), I would highly advise against using an XML-based standard and would caution against the use of CSV as a data exchange format:

Avoid eXtensible Markup Language (XML)
XML is really NOT that extensible, because it derives data relationships from nesting data in a hierarchy (i.e. indentation). So adding new fields or remapping the hierarchy isn't impossible - but it becomes very impractical for these reasons:
In other words, there is too much rigidity in the structure/data types - rigidity is nice if things never change, but they always will.

Avoid delimited formats (CSV) for Data Exchange
Data formats where data is stored as text with a special character (:/,/;/tab/|/etc.) delimiting each value, like CSV, are 'file' formats convenient for download but are NOT useful/practical for data services/exchanges/APIs, because the whole/contiguous file needs to be downloaded before it can be opened (i.e. you cannot stream portions or individual data entries separate from the whole dataset). There can also be issues maintaining the data structure (i.e. some rows have more columns or don't match the correct column heading) if values in a field contain the same characters that are supposed to delimit those values.

Use JSON as the Data Exchange format
Because everyone already does, and because it doesn't have the issues above ;) |
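As an aside on the delimiter problem mentioned above, here's a small Python sketch (standard library only, invented data) showing how naive splitting breaks on an embedded comma while a quoting-aware CSV parser does not:

```python
import csv
import io

# One row whose "recipient" value contains the delimiter, protected by quotes.
raw = 'award_id,recipient,amount\nA-0001,"Acme, Inc.",125000\n'

# Naive split on commas: the embedded comma yields four fields instead of three.
print(raw.splitlines()[1].split(","))   # ['A-0001', '"Acme', ' Inc."', '125000']

# Quoting-aware parsing keeps the columns aligned with the header.
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0]["recipient"])             # Acme, Inc.
```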
While a well documented binary format may not be human readable, it sure saves on bandwidth and storage! |
@JJediny asked...
You're right--there's a lot happening in this thread. A focus on the open data formats/services/standards rather than tools is the intent. Thanks for your great points! |
ICA Commission on Geoinformation Infrastructures and Standards |
I would argue that NIEM (not NEIM) is Human-Friendly (and should be "y") given a rich set of requirements around developing conformant information exchanges (i.e. IEPDs) by focusing on required technical schemas AND business documentation artifacts for reuse. Reuse is only promoted via a human-friendly understanding of the exchange. The NIEM model itself has open source tools that makes searching model content easy and offers type and element definitions to aid in understanding content meaning. Lastly, a UML Profile for NIEM exists to aid business users with understanding content relationships. Data Serialization should be "y" as XML instances are created from NIEM-conformant XML schemas with many implementation examples across industry, federal, state, local and international governments. |
I suggest renaming this issue to: Open standards for data schemas and serialization formats
When talking about open data standards, it's usually assumed that the standard specifies not only the vocabulary but also the schema and the serialization format, even if multiple serialization formats are supported for a given schema. If you look at the schemas on schema.org you can see some good examples of this breakdown. These schemas are made out of a broad vocabulary of terms (often reusing established vocabularies like Dublin Core) which are then assembled into different schemas. Look at the Government Service schema as an example, and at the bottom of the page you'll see examples of different serialization formats to encode the schema, including JSON and HTML microdata. NIEM tries to take a similar approach to Schema.org for defining a vocabulary and schemas for use in government that are independent of serialization format. Originally the main serialization format for NIEM schemas was XML, but they can also be represented in JSON and other formats. For what it's worth, I think NIEM and the DATA Act should aspire to the simplicity of the documentation of schemas on schema.org. I don't think it's acceptable to only publish the complete schema documentation as Word documents and XSD files, and it's almost impossible to find the actual terms and schemas on the NIEM website.

The openness question with data standards is somewhat related to open source software, but it's helpful to keep them separate. It is true that many standards efforts require an open source reference implementation in order to consider it an open standard, but it's also possible to have an open standard (using the Bruce Perens definition) without there being usable (or widely used) open source tools. Proprietary data standards, on the other hand, often prevent the existence of open source tools to serve them in the first place (without reverse engineering or license fees). In government, we're typically required to use open standards (see A-119), so that shouldn't really be a question. I think the question you're asking is what open standards have robust ecosystems of developers and tools that support them, including open source options.

From what I can tell, it looks like you've already established most of the vocabulary as seen on the data elements page, but as far as defining a schema and serialization to enable machine readability and machine interoperability of the vocabulary, what we currently have is an example of how this could be done using XBRL, as seen on the data exchange page. I haven't spent much time with XBRL, but my understanding is that it is a data standard in a way that encompasses vocabulary, schema (XSD), and serialization formats (XML), and the example we have now extends XBRL with the terms that have been defined for the DATA Act vocabulary. It appears that there's a lot of activity around XBRL and ongoing updates to the standard (which dates back to 2003), but it's not clear to me whether this legacy is helpful or not. It seems like XBRL is viewed as complex, cumbersome, lacking a robust ecosystem of tools and developers, and perhaps not widely used by relevant audiences. That said, usage of XBRL in other areas may warrant XBRL-based interoperability with DATA Act data. It doesn't necessarily have to be one or the other, but it's clear that there are some issues with an XBRL-only approach. The opposite extreme in terms of simplicity and a robust ecosystem of tools and developers is the CSV.
One of the things that makes CSVs such a great format is that they're incredibly easy to use for both developers and non-developers, and they probably have the broadest range of tooling available. The downside to CSVs is that they don't provide a schema, which is needed not only for validation and ensuring interoperability, but also for helping developers understand the data. Fortunately, the recent aforementioned efforts like Tabular Data Package (hat tip @rgrp) look like they're doing a great job of addressing this by combining CSVs with a schema defined in JSON using JSON Table Schema (which builds on JSON Schema). @shawnbot also mentioned the W3C's CSV work, but the Metadata Vocabulary for Tabular Data spec is actually based on JSON Table Schema and shares authorship, and it seems that JSON Table Schema is more actively used and developed at this point.

The Tabular Data Package seems like a solid candidate for the canonical data schema and serialization, and once you have that it's fairly trivial to generate derivatives in other serialization formats, including JSON or XML. As mentioned earlier, the OpenSpending Budget Data Package Standard uses this approach, but so does the Open Contracting Data Standard. Remember the Schema.org Government Services schema I mentioned earlier? The OpenReferral project is also building on this as a standard with the Human Services Data Specification, which uses Tabular Data Package as well. While Tabular Data Package is still fairly new, it does already seem to come with good tooling (e.g. OKFN's Good Tables validation service/code), and the fact that it builds on existing, well-established ecosystems for CSVs and JSON Schema is also quite encouraging.

I do think it's critical to provide a machine-readable schema for validation that's code-independent if you want to foster a robust ecosystem of tools that support machine-to-machine interoperability. The Project Open Data schema is defined using JSON Schema, which makes it much easier for multiple codebases to stay in sync whenever changes to the schema are made. This also makes it possible for others to implement validation so we don't all have to depend on one central validation service (though it is important to provide a public one, as Data.gov does). Several agencies have used the canonical JSON Schema for the POD Schema within their own codebases, and even within Data.gov we use the same schema file with several different code bases in multiple programming languages.

Once you have the basic schema and serialization in place (e.g. Tabular Data Package), you probably want to think of more sophisticated ways to make the data available, such as a RESTful API that returns queries in JSON, but I think it's important to start with raw bulk data interoperability before we start thinking about standard ways of exposing that as an API. Fortunately, some of the components may be reusable (e.g. you could reuse the JSON Schema definitions from JSON Table Schema for your API standard definition in Swagger).

One other important consideration that I haven't seen discussed much is file size and transmission rate. Protocol Buffers are probably more valuable when bandwidth reduction, realtime transmission, and parsing efficiency are important, but the serialization format is relevant for large files even when you're not dealing with a network. Stream processors for both JSON and XML are widely available, but they're always going to be more complicated than streaming a CSV, which can just be read line by line.
Efforts like JSON Lines are trying to make this easier for JSON, but in most cases (if you're not using a special stream processor) you have to read an entire JSON or XML payload into memory before you can parse it which can be much more limiting than parsing a CSV line by line. Normalization versus de-normalization is also worth considering. JSON and XML are often used to represent de-normalized relational data whereas a Tabular Data Package would probably be normalized across multiple CSVs just like multiple tables in a relational database. As a bulk data format, my guess is that most people are going to load this data into a spreadsheet or database anyway, so I would suspect normalization is what you want. Plus, normalization reduces file size. Maybe once you're at the API stage de-normalization will be more appropriate. |
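To make the streaming point concrete, here's a small Python sketch (standard library only, hypothetical file names) contrasting a monolithic JSON array, which must be fully read and parsed before the first record is available, with JSON Lines and CSV, which can be processed record by record:

```python
import csv
import json

def process(record):
    """Placeholder for whatever per-record work is needed."""
    pass

# Monolithic JSON: the entire payload is read into memory before parsing.
with open("awards.json") as f:
    for record in json.load(f):
        process(record)

# JSON Lines: one JSON document per line, so records stream one at a time.
with open("awards.jsonl") as f:
    for line in f:
        process(json.loads(line))

# CSV behaves the same way: the reader yields one row at a time.
with open("awards.csv", newline="") as f:
    for row in csv.DictReader(f):
        process(row)
```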
I vote YES for an API that allows [1] a parameter to specify a response in either JSON or XML, and [2] a JSON response with a supported callback parameter, which should allow robust responses depending on the requesting application/framework. |
Financial data gets big quite quickly (millions/billions of transactions, time points, people, ...), so I'm pretty confused why only one format relevant to nation-scale data is considered. |
(Most of those formats seem more relevant for messaging, and poor for sharing anything sizeable or allowing compute over it.) |
+1 to what @philipashlock said. It's also pretty dependent on the data and the audience(s). Here are my general, straight-to-the-point thoughts on opening up data: |
From the Python and Hadoop side of things, I can attest that CSV and JSON alone are problematic from a tooling and fidelity-of-analysis point of view as canonical formats. JSON is acceptable as long as an official schema is published so that the JSON data can be transformed into a binary format like Avro or Parquet for analysis. For larger datasets, Parquet is attractive because it can be analyzed efficiently with Spark, Hive, Presto, Drill, or Impala without need for ETL into columnar format. Making data exportable in Avro has become an increasingly popular choice — Google BigQuery utilizes it as one of their binary export formats. So, I would recommend making JSON and a binary-format with a built-in schema (e.g. Avro) both available to data consumers. |
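To illustrate the "binary format with a built-in schema" suggestion, here's a minimal sketch of writing and reading Avro from Python, assuming the third-party fastavro library; the record layout is invented for illustration:

```python
import fastavro

# The schema is embedded in the file, so consumers get field names and types for free.
schema = fastavro.parse_schema({
    "name": "Award",
    "type": "record",
    "fields": [
        {"name": "award_id", "type": "string"},
        {"name": "obligation", "type": "double"},
        {"name": "fiscal_year", "type": "int"},
    ],
})

records = [
    {"award_id": "A-0001", "obligation": 125000.0, "fiscal_year": 2015},
    {"award_id": "A-0002", "obligation": 98000.5, "fiscal_year": 2015},
]

with open("awards.avro", "wb") as out:
    fastavro.writer(out, schema, records)        # records are checked against the schema

with open("awards.avro", "rb") as fo:
    for rec in fastavro.reader(fo):              # the reader recovers the embedded schema
        print(rec["award_id"], rec["obligation"])
```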
What are the thoughts on HDF5 as a binary format? It supports @bsweger's attributes with the exception, I guess, of "Human Friendly". One aspect of HDF5 that may be useful is that it supports object relationships that aren't strictly tree-based (i.e. the collection of HDF5 groups may contain cycles). I've been working on a JSON representation of HDF5 (and related tools). See: https://github.com/HDFGroup/hdf5-json. |
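For anyone who hasn't worked with HDF5's group/dataset/attribute model, a minimal sketch from Python using the h5py library (the names and values are invented):

```python
import h5py
import numpy as np

# Write: groups act like folders, datasets hold typed arrays, attributes hold metadata.
with h5py.File("awards.h5", "w") as f:
    grp = f.create_group("fy2015")
    grp.create_dataset("obligations", data=np.array([125000.0, 98000.5, 43210.0]))
    grp.attrs["source"] = "illustrative example"

# Read: datasets support typed, sliced access without loading unrelated data.
with h5py.File("awards.h5", "r") as f:
    print(f["fy2015/obligations"][:2])
    print(dict(f["fy2015"].attrs))
```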
@jreadey in general I've found interest / development in tools based on HDF5 has dropped off considerably in the past 5 years. Notably PyTables was put up for adoption by its creator. IMHO: the biggest issue with HDF5 is that it requires a file pointer and the ability to do random access anywhere in the file. This means you can't access data in a file without having all of it (vs. formats developed for Spark / Hadoop which all support streaming reads). |
On the other hand, I'm not sure what other binary format is poised to replace HDF5. I'm working on the file pointer aspect. :) ... a REST service that exposes PyTables-style queries. E.g., this URL: https://data.hdfgroup.org:7258/datasets/651e787e-16a6-11e5-8632-06fc179afd5e/value?query=temp%3E63&host=compound.test.data.hdfgroup.org pulls rows where the 'temp' field is greater than 63. |
@wesm have you seen the Tabular Data Package? It may help solve the fidelity issues with CSV for you (basically by providing a mechanism to add schema/type metadata around the CSV). There's currently quite a bit of work underway to provide integration of Tabular Data Package into other systems, ranging from R to BigQuery. |
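For readers new to the idea: a Tabular Data Package ships a CSV alongside a small JSON descriptor declaring each field's name and type, which tooling can use to type-check rows. A hand-rolled Python sketch of that pattern (standard library only, invented field names, not the actual Data Package libraries):

```python
import csv
from datetime import date

# A pared-down, Table Schema-style descriptor for an accompanying CSV.
descriptor = {
    "fields": [
        {"name": "award_id", "type": "string"},
        {"name": "obligation", "type": "number"},
        {"name": "action_date", "type": "date"},
    ]
}

# Map declared types to casting functions; a failed cast means the cell violates the schema.
casts = {"string": str, "number": float, "date": date.fromisoformat}

def validate(csv_path, descriptor):
    """Yield (row_number, field_name, bad_value) for cells that fail their declared type."""
    with open(csv_path, newline="") as f:
        for row_num, row in enumerate(csv.DictReader(f), start=2):  # row 1 is the header
            for field in descriptor["fields"]:
                try:
                    casts[field["type"]](row[field["name"]])
                except (KeyError, TypeError, ValueError):
                    yield row_num, field["name"], row.get(field["name"])

for problem in validate("awards.csv", descriptor):
    print(problem)
```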
For this particular case of the DATA Act reporting, there are two distinct use cases:
And the responses in this thread have really focused on the latter. It seems like a very straightforward and accessible format is required for lay-audiences (probably CSV), but also a binary representation (like protobufs or avro) is desired. And perhaps JSON because of its wide adoption and use. So, I guess the answer is, (almost) all of the above :) ? For the formats that don't include inherent type validation (CSV and JSON) I'd be curious how end users would want to consume metadata on the type and domain for different fields. |
@kaitlin I would agree with "all of the above", since it's just output of validated, high-quality data. It's simple enough to export into any format from a standardized schema. I see three general segments for the publishing: people perusing the website who may request a dump to play with in Excel, people who want bulk downloads for deeper analytics, and people who want API access in a messaging-type channel (transactional, short messages).

The truly lay people would use the built-in reporting on the website (and likely an export to Excel, so CSV is good, though .xlsx is essentially a set of zipped files which, if offered, could save on bandwidth usage). The second set I mention are people requesting data dumps. These are generally a little more advanced and would have the necessary skills to extract from whatever format you create. This isn't the lay-person audience. I'd thus suggest something most efficient for quality data download (compressed, a binary with error checking). It'd be great to offer open source packages to load this data into whatever platform to further enable analysis. The theme for this segment is efficient data transfer - compressed, with error checking from beginning to end of transmission. I'd be curious about any benefit of using one of the binary formats suggested versus using a standard text-based format (CSV, XBRL, JSON) and then gzip'ing it, similar to today. I'd lean towards simplicity. The third set of users are those who want a messaging-type interface to connect an app or dashboard to the data for querying. This of course would be best suited for JSON.

I think the design should keep in mind that financial data for public companies is published using XBRL. It makes some sense to provide this as an option for consuming data, since those in this analysis space would already have the tools to consume the government data. Again, output options should be "all of the above" if possible. The key is that the data, in whatever format, is validated BEFORE it's output. So the output format has little meaning - it's more for convenience if the data is truly of high quality. If it is not, then the consumers of the data would want something more robust to allow them to validate the data - like a schema that ensures quality of output. So much government open data now is just a push of data with little regard to quality. Financial data is easy to reconcile (all debits and credits balance to zero, right?). I'm hoping this effort will be a gold standard for data quality in government open data. The best way to ensure quality output would be to use a schema that enforces that quality, IMHO.

To answer your last question - simple reference data sets for the metadata, offered also in a plethora of formats, would be fine. This is akin to what we would use when applying DQ checks to the data in an ingest process. We'd bounce the data against the reference table to ensure it fits within the expected parameters. If not, flag it as a defect.

Finally, there are also historical changes to data to consider related to this conversation on data outputs. Some consideration should be given to how change in the data is handled over time. Directly changing it loses the value of the data as it was. Knowing what changed could itself be of value. For example, last week someone produces a report on data from 2013. This week that data is updated by the agency. I'm looking at the report that was done last week and the data doesn't match the report any more. Why?
The answer should be that X agency updated the data and I would be able to reconcile the old report based upon this change in data. Changing data without capturing the fact that it changed can lead to a perception of poor quality data. |
@kaitlin, regarding CSV and JSON, that could be aided by using the CSVW and JSON-LD specs. It doesn't solve the schema problem, but there's really no way to avoid defining a schema for this kind of output anyway. My understanding is that the CSVW group (and others) are trying to ensure that JSON-LD, CSVW, Table for the Web, etc, are aligned so I suspect that once you have the schema for one, you either have a schema for the other or are very close. I'd say +1 for including one of the interoperable standards in addition to any of the data streaming options. If nothing else, I can interact with those outputs without any special tooling if I want/need to so it's a pretty low bar compared to HDF5/ProtocolBuffers/AVRO/etc. And to make that consideration in a more language-agnostic way (python tooling is great, python is not the only language option). |
@wesm I think the PyTables comments miss the mark a little. We have done a ton of work and a lot of innovative things with PyTables since it went from being Francesc's baby to a community code. |
@scopatz that is a fair criticism. I am seeing little HDF5 interest in the BI / analytics / enterprise data management ecosystem which is where I've been focused the last few years. That doesn't really reflect what's been going on in scientific computing, though. |
Required watching for this thread:
+1 to close this ticket - as it's a bit too unwieldy to be productive - I'd suggest (politely of course) either redirecting to other open issues or creating a new issue that provides a means to be most productive for those interested in contributing - of which I hope to be one. |
@wesm Yes, I agree. The two domains are very different and require distinct technical solutions at this point in history, IMHO. |
Thank you, all, for sharing your thoughts here. It's immensely helpful to hear from people in and outside of government. Special thanks to @philipashlock for parsing the original issue into its separate components and walking through them. @JJediny agreed with your polite request! I'll summarize my take-aways and would love your thinking on how to split this discussion into more focused pieces.
This project is very much focused on the first type above, so that might be a good way to split this up. Thoughts? |
And can't resist one last question before we re-org this. @philipashlock suggests that validating data via a machine-readable schema has some advantages over a centralized validation service. How do people handle complex validations in a machine-readable schema? For example, if we want to make sure that the sum of values in three different records doesn't exceed an amount in a fourth record? Are you doing those types of validations outside the schema, or do your schemas support those? It seems like XBRL can handle those, though the implementation isn't especially straightforward. |
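One common pattern for rules like that is to keep structural/type constraints in the machine-readable schema and express cross-record arithmetic as small post-load checks in code. A minimal Python sketch of the example above (record and field names invented):

```python
def components_within_parent(records, parent_id, component_ids, field="amount"):
    """Return True if the summed component amounts do not exceed the parent record's amount."""
    by_id = {r["record_id"]: r for r in records}
    component_sum = sum(by_id[c][field] for c in component_ids)
    return component_sum <= by_id[parent_id][field]

records = [
    {"record_id": "appropriation", "amount": 1000.0},
    {"record_id": "obligation_q1", "amount": 300.0},
    {"record_id": "obligation_q2", "amount": 450.0},
    {"record_id": "obligation_q3", "amount": 200.0},
]

# 300 + 450 + 200 = 950, which does not exceed 1000, so the check passes.
assert components_within_parent(
    records, "appropriation", ["obligation_q1", "obligation_q2", "obligation_q3"]
)
```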
I think the GitHub issue could be untangled by drawing a clear line between some of the implicit/embedded concepts within the thread. A few potential ones being:
I think the best use of Open Standards is when they are explicitly named/required in contracts/task orders. If we can get to the point where we can insert a requirement for a two-way API using an Open Standard, we can ensure a greater level of integrity and reduce risk by avoiding poorly planned/executed customizations that meet the need of one customer/user at the expense of broader interoperability.

This gets back to the point about the interplay between Open Source and Open Data. I believe the best way to conceptualize the interplay is to recognize that Open Source communities develop around the perceived need for a solution, and the fastest means to the solution is the path generally taken. Most if not all Open Source projects are really just a lego creation made up of other Open Source components/libraries/etc. and/or a patchwork of multiple independent projects, themselves with their own communities. Open Source communities therefore tend to rely on existing Open Standards - not for a lack of creativity, but because that's one thing they don't need to worry about (time/money) - which makes a project more attractive to people evaluating options, as it is less new learning and more doing. Some examples of technical standards that encourage reuse and interoperability - standards that can be "required":
When such standards are not used, however, it can be an incentive for vendors/contractors to customize these aspects of a project. This means more resources wasted on reinventing a wheel, concurrently making it more costly down the road, as formats/services all have to be continually customized/maintained/remapped in the software and conceptually (training) for those that use it or those that need to consume or obtain the data. This is when issues like vendor lock-in occur - when it becomes 'harder' to solve the issue than it is to continue what's been done... With a solid Open Standard, any number of vendors/FOSS projects can take a variety of approaches within their software, so long as they don't break the standard and the data model's integrity. I think if you can ensure conformity at the data exchange level, you enrich the ecosystem for Open Source Communities & Commercial Vendors alike. |
I do think we need a simple interface for these operations which defines the basics, like https://packagist.org/packages/persis/common |
Opening this issue to track thoughts on open source data schemas/standards/formats.
General Questions
Questions for specific data standard/schema options:
These are schemas/standards/formats we know about. What are others?