Parsing: http://microformats.org { "items": [ { "type": [ "h-feed" ], "properties": { "category": [ "indieweb", "microformats2", "microformats2", "indieweb", "microformats2" ] }, "id": "content", "children": [ { "type": [ "h-entry" ], "properties": { "name": [ "How to Consume Microformats 2 Data" ], "url": [ "https:\/\/microformats.org\/2022\/02\/19\/how-to-consume-microformats-2-data", "https:\/\/microformats.org\/2022\/02\/19\/how-to-consume-microformats-2-data" ], "updated": [ "2022-02-19T11:48:15" ], "content": [ { "html": "
A (very) belated follow up to Getting Started with Microformats 2, covering the basics of consuming and using microformats 2 data. Originally posted on waterpigs.co.uk.

More and more people are using microformats 2 to mark up profiles, posts, events and other data on their personal sites, enabling developers to build applications which use this data in useful and interesting ways. Whether you want to add basic support for webmention comments to your personal site, or have ambitious plans for a structured-data-aware-social-graph-search-engine-super-feed-reader, you’re going to need a solid grasp of how to parse and handle microformats 2 data.

To turn a web page containing data marked up with microformats 2 (or classic microformats, if supported) into a canonical MF2 JSON data structure, you’ll need a parser.

At the time of writing, there are actively supported microformats 2 parsers available for the following programming languages:

Parsers for various other languages exist, but might not be actively supported or support recent changes to the parsing specification.

There are also various websites which you can use to experiment with microformats markup without having to download a library and write any code:

If there’s not currently a parser available for your language of choice, you have a few options:

Most real-world microformats data is fetched from a URL, which could potentially redirect to a different URL one or more times. The final URL in the redirect chain is called the “effective URL”. HTML often contains relative URLs, which need to be resolved against a base URL in order to be useful out of context.

If your parser has a function for “parsing microformats from a URL”, it should deal with all of this for you. If you’re making the request yourself (e.g.
to use custom caching or network settings) and then passing the response HTML and base URL to the parser, make sure to use the effective URL, not the starting URL! The parser will handle relative URL resolution, but it needs to know the correct base URL.

When parsing microformats, an HTTP request which returns a non-200 status doesn’t necessarily mean that there’s nothing to parse! For example, a

When consuming microformats 2 data, you’ll most often be fetching raw HTML from a URL, parsing it to canonical JSON, then finally processing it into a simpler, cleaned and sanitised format ready for use in your website or application. That’s three different representations of the same data. You’ll most likely end up storing the derived data somewhere for quick access, but what about the other two?

Experience shows that, over time:

Therefore, if it makes sense for your use case, I recommend archiving a copy of the original HTML as well as your derived data, leaving out the intermediate canonical JSON. That way, you can easily create scripts or background jobs to update all the derived data based on the original HTML, taking advantage of both parser improvements and improvements to your own code at the same time, without having to re-fetch potentially hundreds of potentially broken links.

As mentioned in the previous section, if you archive original HTML for re-parsing, you’ll need to additionally store the effective URL for correct relative URL resolution.

For some languages, there are already libraries (such as XRay for PHP) which will perform common cleaning and sanitisation for you.
If the assumptions with which these libraries are built suit your applications, you may be able to avoid a lot of the hard work of handling raw microformats 2 data structures!

If not, read on…

A parsed page may contain a number of microformat data structures (mf structs), in various different places.

Take a look at the parsed canonical microformats JSON for the article you’re reading right now, for example.

Each individual mf struct is guaranteed to have at least two keys,

Generally speaking,

For many common use cases (e.g. a homepage feed and profile) there are several different ways people might nest mf structs to achieve the same goals, so it’s important that your code is capable of searching the entire tree, rather than just looking at the top-level mf structs. Never assume that the microformat struct you’re looking for will be in the top-level of the

I recommend writing some functions which can traverse a mf tree and return all structs which match a filtering callback. This can then be used as a basis for writing more specific convenience functions for common tasks such as finding all microformats on a page of a particular type, or where a certain property matches a certain value.

See my microformats2 PHP functions for some working examples.

Each key in a mf struct’s

A plain string value, containing no HTML, and leaving HTML entities unescaped (e.g.

(In future examples I will leave out the encapsulating

An embedded HTML struct, containing two keys:

An img/alt struct, containing the URL of a parsed image under

A nested microformat data structure, with an additional

All properties may have more than one value. In cases where you expect a single property value (e.g.

Let’s look at the implications of each of the potential property value structures in turn.

Firstly, never assume that a property value will be a plaintext string.
Microformats publishers can nest microformats, embedded content and img/alt structures in a variety of different ways, and your consuming code should be as flexible as possible.

To partially make up for this complexity, you can always rely on the

When you start consuming microformats 2, write a function like this, and get into the habit of using it every time you want a single, plaintext value from a property:

Secondly, never assume that a particular property will contain an embedded HTML struct; this usually applies to

In Python 3.5+, that could look something like this:

In some cases, it may make sense for your application to be aware of whether a value was parsed as embedded HTML or a plain text string, and to store/treat them differently. In all other cases, always use a function like this when you’re expecting embedded HTML data.

Thirdly, when expecting an image URL, check for an img/alt structure, falling back to the plain text value (and either assuming an empty alt text or inferring an appropriate one, depending on your specific use case). Something like this could be a good starting point:

Finally, in cases where you expect a nested microformat, you might end up getting something else. This is the hardest case to deal with, and the one which depends the most on the specific data and use-case you’re dealing with. For example, if you’re expecting a nested h-card under an

The first three are general principles which can be applied to many scenarios where you expect an embedded mf struct but find something else.
The last three, however, are examples of a common trend in consuming microformats 2 data: for many common use-cases, there are well-thought-through algorithms you can use to interpret data in a standardised way.

The authorship algorithm mentioned above is one of several more-or-less formally established algorithms used to solve common problems in indieweb usages of microformats 2. Some others which are worth knowing about include:

Library implementations of these algorithms exist for some languages, although they often deviate slightly from the exact text. See if you can find one which meets your needs, and if not, write your own and share it with the community!

In addition to the formal consumption algorithms, it’s worth looking through the definitions of the microformats vocabularies you’re using (as well as testing with real-world data) and adding support for properties or publishing techniques you might not have thought of the first time around. Some examples to get you started:

Choose a Parser
Considerations During Fetching and Parsing

410 Gone response might contain a h-entry with a message explaining the deletion of whatever was there before.
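The two fetching considerations in this section (use the effective URL as the base URL, and keep the body of non-200 responses) could be sketched in Python roughly as follows. This is only an illustration using the standard library: `fetch_html` and `resolve` are hypothetical helper names, and decoding the body as UTF-8 is a simplifying assumption rather than something the article prescribes.

```python
import urllib.error
import urllib.request
from urllib.parse import urljoin


def fetch_html(url, timeout=10):
    """Fetch url and return (effective_url, status, html).

    urllib follows redirects itself; the URL attached to the final response
    is the effective URL. Error responses are returned instead of raised,
    because their bodies may still contain microformats.
    """
    request = urllib.request.Request(url, headers={"Accept": "text/html"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = response.read()
            # Assumption: the page is UTF-8; undecodable bytes are replaced.
            return response.url, response.status, body.decode("utf-8", "replace")
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses (for example 410 Gone) still carry a body.
        body = error.read()
        return error.url, error.code, body.decode("utf-8", "replace")


def resolve(effective_url, maybe_relative_url):
    """Resolve a URL found in the HTML against the effective URL."""
    return urljoin(effective_url, maybe_relative_url)
```

The effective URL returned by `fetch_html` is the value to hand to your parser as the base URL, and the one to store if you archive the HTML for later re-parsing.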
Storing Raw HTML vs Parsed Canonical JSON vs Derived Data
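As a rough sketch of the archiving approach recommended for this topic (keep the original HTML and its effective URL, store the derived data, and regenerate the derived data when parsers or your own code improve), something like the following could work. `ArchivedPage` and `rederive` are hypothetical names, and `parse` and `derive` stand in for whichever parser and clean-up code you actually use.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass
class ArchivedPage:
    effective_url: str   # needed later for relative URL resolution
    html: str            # the original response body, stored as fetched
    fetched_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    derived: Any = None  # cleaned, application-specific data


def rederive(page, parse, derive):
    """Re-run parsing and clean-up on archived HTML without re-fetching.

    parse(html, base_url) is assumed to return canonical mf2 JSON, and
    derive(canonical) to return your application's simplified data.
    """
    canonical = parse(page.html, page.effective_url)  # not stored
    page.derived = derive(canonical)
    return page
```

A background job could then loop over every stored `ArchivedPage` and call `rederive` after a parser upgrade, with no network access required.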
Navigating Microformat Structures

items is a list of top-level mf structs, each of which may contain nested mf structs either under their properties or children keys.

type and properties. type is the primary way of identifying what sort of thing that struct represents (e.g. a person, a post, an event). Structs can have more than one type if they represent multiple things at once without wanting to nest them; for example, a post detailing an event might be both a h-entry and a h-event at the same time. Structs can also have additional top-level keys such as id and lang.

type information is most useful when dealing with top-level mf structs, and mf structs nested under a children key. Nested mf structs found in properties will also have type information, but their usage is usually implied by the property name they’re found under.

items list! You need to search the whole tree.
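Searching the whole tree rather than only the top-level list can be sketched as a small recursive generator plus a filtering helper. This is a minimal illustration, not a port of any existing library: `is_mf_struct`, `walk`, `find_all` and `find_by_type` are hypothetical names, and a struct is recognised simply by having both `type` and `properties` keys.

```python
def is_mf_struct(value):
    """True if value looks like an mf struct (has type and properties)."""
    return isinstance(value, dict) and "type" in value and "properties" in value


def walk(struct):
    """Yield struct, then every mf struct nested anywhere beneath it."""
    yield struct
    # Nested structs can appear among the values of any property...
    for values in struct["properties"].values():
        for value in values:
            if is_mf_struct(value):
                yield from walk(value)
    # ...or under the optional children key.
    for child in struct.get("children", []):
        yield from walk(child)


def find_all(parsed, predicate):
    """Return every mf struct in parsed canonical JSON matching predicate."""
    return [
        struct
        for top_level in parsed.get("items", [])
        for struct in walk(top_level)
        if predicate(struct)
    ]


def find_by_type(parsed, mf_type):
    """Convenience wrapper: all structs whose type list includes mf_type."""
    return find_all(parsed, lambda struct: mf_type in struct["type"])
```

For example, `find_by_type(parsed, "h-card")` would return h-cards whether they sit at the top level, under `children`, or inside an `author` property.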
Possible Property Values

properties dict maps to a list of values for that property. Every property may map to multiple values, and those values may be a mixture of any of the following:

<)

    {
      "items": [{
        "type": ["h-card"],
        "properties": {
          "name": ["Barnaby Walters"]
        }
      }]
    }
{"items": [{"type": [•••], •••}]} for brevity, focusing on the properties key of a single mf struct.)

html, which maps to an HTML representation of the property, and value, mapping to a plain text version.

    "properties": {
      "content": [{
        "html": "<p>The content of a post, as <strong>raw HTML</strong> (or not).</p>",
        "value": "The content of a post, as raw HTML (or not)."
      }]
    }
value, and its alt text under alt.

    "properties": {
      "photo": [{
        "value": "https://example.com/profile-photo.jpg",
        "alt": "Example Person"
      }]
    }
value key containing a plaintext representation of the data contained within.

    "properties": {
      "author": [{
        "type": ["h-card"],
        "properties": {
          "name": ["Barnaby Walters"]
        },
        "value": "Barnaby Walters"
      }]
    }
name), simply take the first one you find, and in cases where you expect multiple values, use all values you consider valid. There are also some cases where it may make sense to use multiple values, but to prioritise one based on some heuristic. For example, an h-card may have multiple url values, in which case the first one is usually the “canonical” URL, and further URLs refer to external profiles.
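Because a single property can mix all four kinds of value described above, it can help to tell them apart by shape before handling them. The function below is one possible sketch; `classify_property_value` and the labels it returns are hypothetical and not part of any microformats specification.

```python
def classify_property_value(value):
    """Return a label describing which kind of property value this is."""
    if isinstance(value, str):
        return "plain"
    if isinstance(value, dict):
        # Check for a nested microformat first: it also carries a value key.
        if "type" in value and "properties" in value:
            return "microformat"
        if "html" in value:
            return "html"
        if "alt" in value:
            return "img-alt"
    return "unknown"
```

Checking for `type` and `properties` before `html` is a deliberate choice in this sketch, so that a nested microformat is never mistaken for one of the simpler structs.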
value key of nested structs to provide you with an equivalent plaintext value, regardless of what type of struct you’ve found.

    def get_first_plaintext(mf_struct, property_name):
        """Return the first value of a property as plain text, or None."""
        try:
            first_val = mf_struct['properties'][property_name][0]
            if isinstance(first_val, str):
                return first_val
            else:
                # Embedded HTML, img/alt and nested structs all have 'value'.
                return first_val['value']
        except (IndexError, KeyError):
            return None
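For properties where you expect several values (multiple url values on an h-card, say), the same idea extends to a list. The following is a hedged companion sketch; `plaintext_of` and `get_all_plaintext` are hypothetical names and are written to stand alone rather than build on the function above.

```python
def plaintext_of(value):
    """Return the plain text form of one property value, or None."""
    if isinstance(value, str):
        return value
    if isinstance(value, dict):
        # Embedded HTML, img/alt and nested structs all carry a value key.
        return value.get("value")
    return None


def get_all_plaintext(mf_struct, property_name):
    """Return every value of a property as plain text, skipping unusable ones."""
    values = mf_struct.get("properties", {}).get(property_name, [])
    texts = (plaintext_of(value) for value in values)
    return [text for text in texts if text is not None]
```

The first element of the returned list matches what a single-value lookup would give you, so a heuristic such as "first url is canonical, the rest are external profiles" can be applied directly to the list.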
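content, but is relevant anywhere your application expects embedded HTML. If you want to reliably get a value encoded as raw HTML, then you need to:

- Check whether the first property value is an embedded HTML struct (i.e. a dict with an html key). If so, take the value of the html key
- Otherwise, take the first plaintext value of the property and HTML-escape it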
    from html import escape

    def get_first_html(mf_struct, property_name):
        """Return the first value of a property as HTML, or None."""
        try:
            first_val = mf_struct['properties'][property_name][0]
            if isinstance(first_val, dict) and 'html' in first_val:
                return first_val['html']
            else:
                plaintext_val = get_first_plaintext(mf_struct, property_name)

                if plaintext_val is not None:
                    # Plain text must be escaped before being used as HTML.
                    plaintext_val = escape(plaintext_val)

                return plaintext_val
        except (IndexError, KeyError):
            return None

    def get_img_alt(mf_struct, property_name):
        """Return the first value of a property as an img/alt struct, or None."""
        try:
            first_val = mf_struct['properties'][property_name][0]
            if isinstance(first_val, dict) and 'alt' in first_val:
                return first_val
            else:
                plaintext_val = get_first_plaintext(mf_struct, property_name)

                if plaintext_val is not None:
                    # No alt text was published, so assume an empty one.
                    return {'value': plaintext_val, 'alt': ''}

                return None
        except (IndexError, KeyError):
            return None
author property, but get something else, you could use any of the following approaches:

- name property of an implied h-card structure with no other properties (and if you need a URL, you could potentially take the hostname of the effective URL, if it works in context as a useful fallback value)
- value as the photo property, the alt as the name property, and potentially even take the hostname of the photo URL to be the implied fallback url property (although that’s pushing it a bit, and in most cases it’s probably better to just leave out the url)
- value and use one of the first two approaches
- url property but no photo, you could fetch the url, look for a representative h-card (more on that in the next section) and see if it has a photo property
- author property as invalid and run the h-entry (or entire page if relevant) through the authorship algorithm
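Some of the approaches above can be combined into one small normalising function. The sketch below covers only the cases that need no network access: a plain string becomes the name of an implied h-card, an img/alt struct supplies photo and name, and an embedded HTML struct is reduced to its plain text value first. `implied_h_card` and `normalise_author` are hypothetical names, and treating every plain string as a name (rather than checking whether it is a URL) is a simplifying assumption.

```python
def implied_h_card(name=None, photo=None):
    """Build a minimal h-card struct from whatever could be inferred."""
    properties = {}
    if name:
        properties["name"] = [name]
    if photo:
        properties["photo"] = [photo]
    return {"type": ["h-card"], "properties": properties}


def normalise_author(value):
    """Turn one author property value into an h-card-shaped struct, or None."""
    if isinstance(value, str):
        # Assumption: a plain string is the author's name.
        return implied_h_card(name=value)
    if isinstance(value, dict):
        if "type" in value and "properties" in value:
            return value  # already a nested microformat
        if "alt" in value:
            # img/alt struct: URL becomes the photo, alt text the name.
            return implied_h_card(name=value["alt"], photo=value.get("value"))
        if "html" in value:
            # Embedded HTML: fall back to its plain text value.
            return implied_h_card(name=value.get("value"))
    return None
```

Cases that require fetching another page, or running the authorship algorithm, are left to the caller when this function returns None or an h-card without the properties you need.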
Know Your Algorithms and Vocabularies

- photo, see if there’s a valid logo you can use instead
- photo property and the featured property, as one or the other might be used in different scenarios
- latitude and longitude properties, a combined plaintext geo property, or an embedded h-geo. Addresses might be separate top-level properties or an embedded h-adr. There are many variations which are totally valid to publish, and your consuming code should be as liberal as possible in what it accepts.
- u-photo within the e-content, they’ll be present both in the content html key and also under the photo property. If your app shows the embedded content HTML rather than using the plaintext version, and also supports photo properties (which may also be present outside the content), you may have to sniff the presence of photos within the