···5566(** HTML5 DOM Node Types and Operations
7788- This module provides the DOM node representation used by the HTML5 parser.
99- Nodes form a tree structure representing HTML documents. The type follows
1010- the WHATWG HTML5 specification for document structure.
88+ This module provides the DOM (Document Object Model) node representation
99+ used by the HTML5 parser. The DOM is a programming interface that
1010+ represents an HTML document as a tree of nodes, where each node represents
1111+ part of the document (an element, text content, comment, etc.).
1212+1313+ {2 What is the DOM?}
1414+1515+ When an HTML parser processes markup like [<p>Hello <b>world</b></p>], it
1616+ doesn't store the text directly. Instead, it builds a tree structure in
1717+ memory:
+
+ {v
+ Document
+ └── html
+     └── body
+         └── p
+             ├── #text "Hello "
+             └── b
+                 └── #text "world"
+ v}
+
+ This tree is the DOM. Each entry in the tree is a {i node}. Programs can
+ traverse and modify this tree to read or change the document.
+
+ @see <https://html.spec.whatwg.org/multipage/dom.html>
+ WHATWG: Semantics, structure, and APIs of HTML documents
11341235 {2 Node Types}
13361437 The HTML5 DOM includes several node types, all represented by the same
1538 record type with different field usage:
16391717- - {b Element nodes}: Regular HTML elements like [<div>], [<p>], [<span>]
1818- - {b Text nodes}: Text content within elements
1919- - {b Comment nodes}: HTML comments [<!-- comment -->]
2020- - {b Document nodes}: The root node representing the entire document
2121- - {b Document fragment nodes}: A lightweight container (used for templates)
2222- - {b Doctype nodes}: The [<!DOCTYPE html>] declaration
4040+ - {b Element nodes}: HTML elements like [<div>], [<p>], [<a href="...">].
4141+ Elements are the building blocks of HTML documents. They can have
4242+ attributes and contain other nodes.
4343+4444+ - {b Text nodes}: The actual text content within elements. For example,
4545+ in [<p>Hello</p>], "Hello" is a text node that is a child of the [<p>]
4646+ element.
4747+4848+ - {b Comment nodes}: HTML comments written as [<!-- comment text -->].
4949+ Comments are preserved in the DOM but not rendered.
5050+5151+ - {b Document nodes}: The root of the entire document tree. Every HTML
5252+ document has exactly one Document node at the top.
5353+5454+ - {b Document fragment nodes}: Lightweight containers that hold a
5555+ collection of nodes without a parent. Used for efficient batch DOM
5656+ operations and [<template>] element contents.
5757+5858+ - {b Doctype nodes}: The [<!DOCTYPE html>] declaration at the start of
5959+ HTML5 documents. This declaration tells browsers to render the page
6060+ in standards mode.
6161+6262+ @see <https://html.spec.whatwg.org/multipage/dom.html#kinds-of-content>
6363+ WHATWG: Kinds of content
23642465 {2 Namespaces}
25662626- Elements can belong to different namespaces:
2727- - [None] or [Some "html"]: HTML namespace (default)
2828- - [Some "svg"]: SVG namespace for embedded SVG content
2929- - [Some "mathml"]: MathML namespace for mathematical notation
6767+ HTML5 can embed content from other XML vocabularies. Elements belong to
6868+ one of three {i namespaces}:
6969+7070+ - {b HTML namespace} ([None] or implicit): Standard HTML elements like
7171+ [<div>], [<p>], [<table>]. This is the default for all elements.
7272+7373+ - {b SVG namespace} ([Some "svg"]): Scalable Vector Graphics for drawing.
7474+ When the parser encounters an [<svg>] tag, all elements inside it
7575+ (like [<rect>], [<circle>], [<path>]) are placed in the SVG namespace.
7676+7777+ - {b MathML namespace} ([Some "mathml"]): Mathematical Markup Language
7878+ for equations. When the parser encounters a [<math>] tag, elements
7979+ inside it are placed in the MathML namespace.
8080+8181+ The parser automatically switches namespaces when entering and leaving
8282+ these foreign content islands.
30833131- The parser automatically switches namespaces when encountering [<svg>]
3232- or [<math>] elements, as specified by the HTML5 algorithm.
8484+ @see <https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inforeign>
8585+ WHATWG: Parsing foreign content
33863487 {2 Tree Structure}
35883689 Nodes form a bidirectional tree: each node has a list of children and
3737- an optional parent reference. Modification functions maintain these
3838- references automatically.
9090+ an optional parent reference. Modification functions in this module
9191+ maintain these references automatically.
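+
+ When the tree is modified only through these functions, each node has at
+ most one parent. The record fields are mutable, so code that writes to
+ [children] or [parent] directly must keep both sides consistent itself.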
3995*)
40964197(** {1 Types} *)
42984399(** Information associated with a DOCTYPE node.
441004545- In HTML5, the DOCTYPE is primarily used for quirks mode detection.
4646- Most modern HTML5 documents use [<!DOCTYPE html>] which results in
4747- all fields being [None] or the name being [Some "html"].
+ The {i document type declaration} (DOCTYPE) historically named the version
+ of HTML a document was written against. In HTML5 it only selects the
+ rendering mode, and the standard declaration is simply:
103103+104104+ {v <!DOCTYPE html> v}
105105+106106+ This minimal DOCTYPE triggers {i standards mode} (no quirks). The DOCTYPE
107107+ can optionally include a public identifier and system identifier for
108108+ legacy compatibility with SGML-based tools, but these are rarely used
109109+ in modern HTML5 documents.
110110+111111+ {b Historical context:} In HTML4 and XHTML, DOCTYPEs were verbose and
112112+ referenced DTD files. For example:
113113+ {v <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
114114+ "http://www.w3.org/TR/html4/strict.dtd"> v}
115115+116116+ HTML5 simplified this to just [<!DOCTYPE html>] because:
117117+ - Browsers never actually fetched or validated against DTDs
118118+ - The DOCTYPE's only real purpose is triggering standards mode
119119+ - A minimal DOCTYPE achieves this goal
120120+121121+ {b Field meanings:}
122122+ - [name]: The document type name, almost always ["html"] for HTML documents
123123+ - [public_id]: A public identifier (legacy); [None] for HTML5
124124+ - [system_id]: A system identifier/URL (legacy); [None] for HTML5
48125126126+ @see <https://html.spec.whatwg.org/multipage/syntax.html#the-doctype>
127127+ WHATWG: The DOCTYPE
49128 @see <https://html.spec.whatwg.org/multipage/parsing.html#the-initial-insertion-mode>
5050- The WHATWG specification for DOCTYPE handling
129129+ WHATWG: DOCTYPE handling during parsing
51130*)
52131type doctype_data = {
53132 name : string option; (** The DOCTYPE name, e.g., "html" *)
···5713658137(** Quirks mode setting for the document.
591386060- Quirks mode affects CSS layout behavior for backwards compatibility with
6161- old web content. The HTML5 parser determines quirks mode based on the
6262- DOCTYPE declaration.
139139+ {i Quirks mode} is a browser rendering mode that emulates bugs and
140140+ non-standard behaviors from older browsers (primarily Internet Explorer 5).
141141+ Modern HTML5 documents should always render in {i standards mode}
142142+ (no quirks) for consistent, predictable behavior.
143143+144144+ The HTML5 parser determines quirks mode based on the DOCTYPE declaration:
145145+146146+ - {b No_quirks} (Standards mode): The document renders according to modern
147147+ HTML5 and CSS specifications. This is triggered by [<!DOCTYPE html>].
148148+ CSS box model, table layout, and other features work as specified.
631496464- - [No_quirks]: Standards mode - full HTML5/CSS3 behavior
6565- - [Quirks]: Full quirks mode - emulates legacy browser behavior
6666- - [Limited_quirks]: Almost standards mode - limited quirks for specific cases
150150+ - {b Quirks} (Full quirks mode): The document renders with legacy browser
151151+ bugs emulated. This happens when:
152152+ {ul
153153+ {- DOCTYPE is missing entirely}
154154+ {- DOCTYPE has certain legacy public identifiers}
+ {- DOCTYPE is malformed, or its name is not [html]}}
156156+157157+ In quirks mode, many CSS properties behave differently:
158158+ {ul
159159+ {- Tables don't inherit font properties}
160160+ {- Box model uses non-standard width calculations}
161161+ {- Certain CSS selectors don't work correctly}}
+
+ - {b Limited_quirks} (Almost standards mode): A middle ground that applies
+ only a few specific quirks, primarily affecting table cell vertical
+ sizing. Triggered by the XHTML 1.0 Transitional and Frameset DOCTYPEs,
+ and by the HTML 4.01 Transitional and Frameset DOCTYPEs when they include
+ a system identifier.
166166+167167+ {b Recommendation:} Always use [<!DOCTYPE html>] at the start of HTML5
168168+ documents to ensure {b No_quirks} mode.
671696868- @see <https://quirks.spec.whatwg.org/> The Quirks Mode specification
170170+ @see <https://quirks.spec.whatwg.org/>
171171+ Quirks Mode Standard - detailed specification
172172+ @see <https://html.spec.whatwg.org/multipage/parsing.html#the-initial-insertion-mode>
173173+ WHATWG: How the parser determines quirks mode
69174*)
70175type quirks_mode = No_quirks | Quirks | Limited_quirks
71176···7317874179 All node types use the same record structure. The [name] field determines
75180 the node type:
7676- - Element: the tag name (e.g., "div", "p")
181181+ - Element: the tag name (e.g., "div", "p", "span")
77182 - Text: "#text"
78183 - Comment: "#comment"
79184 - Document: "#document"
80185 - Document fragment: "#document-fragment"
81186 - Doctype: "!doctype"
821878383- {3 Field Usage by Node Type}
188188+ {3 Understanding Node Fields}
189189+190190+ Different node types use different combinations of fields:
8419185192 {v
86193 Node Type | name | namespace | attrs | data | template_content | doctype
···92199 Document Fragment | "#document-frag" | No | No | No | No | No
93200 Doctype | "!doctype" | No | No | No | No | Yes
94201 v}
202202+203203+ {3 Element Tag Names}
204204+205205+ For element nodes, the [name] field contains the lowercase tag name.
206206+ HTML5 defines many elements with specific meanings:
207207+208208+ {b Structural elements:} [html], [head], [body], [header], [footer],
209209+ [main], [nav], [article], [section], [aside]
210210+211211+ {b Text content:} [p], [div], [span], [h1]-[h6], [pre], [blockquote]
212212+213213+ {b Lists:} [ul], [ol], [li], [dl], [dt], [dd]
214214+215215+ {b Tables:} [table], [tr], [td], [th], [thead], [tbody], [tfoot]
216216+217217+ {b Forms:} [form], [input], [button], [select], [textarea], [label]
218218+219219+ {b Media:} [img], [audio], [video], [canvas], [svg]
220220+221221+ @see <https://html.spec.whatwg.org/multipage/indices.html#elements-3>
222222+ WHATWG: Index of HTML elements
223223+224224+ {3 Void Elements}
225225+226226+ Some elements are {i void elements} - they cannot have children and have
227227+ no end tag. These include: [area], [base], [br], [col], [embed], [hr],
228228+ [img], [input], [link], [meta], [source], [track], [wbr].
229229+230230+ @see <https://html.spec.whatwg.org/multipage/syntax.html#void-elements>
231231+ WHATWG: Void elements
232232+233233+ {3 The Template Element}
234234+235235+ The [<template>] element is special: its children are not rendered
236236+ directly but stored in a separate document fragment accessible via
237237+ the [template_content] field. Templates are used for client-side
238238+ templating where content is cloned and inserted via JavaScript.
239239+240240+ @see <https://html.spec.whatwg.org/multipage/scripting.html#the-template-element>
241241+ WHATWG: The template element
95242*)
96243type node = {
97244 mutable name : string;
9898- (** Tag name for elements, or special name for other node types *)
245245+ (** Tag name for elements, or special name for other node types.
246246+247247+ For elements, this is the lowercase tag name (e.g., "div", "span").
248248+ For other node types, use the constants {!document_name},
249249+ {!text_name}, {!comment_name}, etc. *)
99250100251 mutable namespace : string option;
101101- (** Element namespace: [None] for HTML, [Some "svg"], [Some "mathml"] *)
252252+ (** Element namespace: [None] for HTML, [Some "svg"], [Some "mathml"].
253253+254254+ Most elements are in the HTML namespace ([None]). The SVG and MathML
255255+ namespaces are only used when content appears inside [<svg>] or
256256+ [<math>] elements respectively.
257257+258258+ @see <https://html.spec.whatwg.org/multipage/dom.html#elements-in-the-dom>
259259+ WHATWG: Elements in the DOM *)
102260103261 mutable attrs : (string * string) list;
104104- (** Element attributes as (name, value) pairs *)
262262+ (** Element attributes as (name, value) pairs.
263263+264264+ Attributes provide additional information about elements. Common
265265+ global attributes include:
266266+ - [id]: Unique identifier for the element
267267+ - [class]: Space-separated list of CSS class names
268268+ - [style]: Inline CSS styles
269269+ - [title]: Advisory text (shown as tooltip)
270270+ - [lang]: Language of the element's content
271271+ - [hidden]: Whether the element should be hidden
272272+273273+ Element-specific attributes include:
274274+ - [href] on [<a>]: The link destination URL
275275+ - [src] on [<img>]: The image source URL
276276+ - [type] on [<input>]: The input control type
277277+ - [disabled] on form controls: Whether the control is disabled
278278+279279+ In HTML5, attribute names are case-insensitive and are normalized
280280+ to lowercase by the parser.
281281+282282+ @see <https://html.spec.whatwg.org/multipage/dom.html#global-attributes>
283283+ WHATWG: Global attributes
284284+ @see <https://html.spec.whatwg.org/multipage/indices.html#attributes-3>
285285+ WHATWG: Index of attributes *)
105286106287 mutable children : node list;
107107- (** Child nodes in document order *)
288288+ (** Child nodes in document order.
289289+290290+ For most elements, this list contains the nested elements and text.
291291+ For void elements (like [<br>], [<img>]), this is always empty.
292292+ For [<template>] elements, the actual content is in
293293+ [template_content], not here. *)
108294109295 mutable parent : node option;
110110- (** Parent node, [None] for root nodes *)
296296+ (** Parent node, [None] for root nodes.
+
+ In a parsed document, every node except the Document node has a
+ parent. Newly created nodes have [None] until they are attached to a
+ tree. This back-reference enables traversing up the tree. *)
111300112301 mutable data : string;
113113- (** Text content for text and comment nodes *)
302302+ (** Text content for text and comment nodes.
303303+304304+ For text nodes, this contains the actual text. For comment nodes,
305305+ this contains the comment text (without the [<!--] and [-->]
306306+ delimiters). For other node types, this field is empty. *)
114307115308 mutable template_content : node option;
116116- (** Document fragment for [<template>] element contents *)
309309+ (** Document fragment for [<template>] element contents.
310310+311311+ The [<template>] element holds "inert" content that is not
312312+ rendered but can be cloned and inserted elsewhere. This field
313313+ contains a document fragment with the template's content.
314314+315315+ For non-template elements, this is [None].
316316+317317+ @see <https://html.spec.whatwg.org/multipage/scripting.html#the-template-element>
318318+ WHATWG: The template element *)
117319118320 mutable doctype : doctype_data option;
119119- (** DOCTYPE information for doctype nodes *)
321321+ (** DOCTYPE information for doctype nodes.
322322+323323+ Only doctype nodes use this field; for all other nodes it is [None]. *)
120324}
121325122326(** {1 Node Name Constants}
···126330*)
127331128332val document_name : string
129129-(** ["#document"] - name for document nodes *)
333333+(** ["#document"] - name for document nodes.
334334+335335+ The Document node is the root of every HTML document tree. It represents
336336+ the entire document and is the parent of the [<html>] element.
337337+338338+ @see <https://html.spec.whatwg.org/multipage/dom.html#document>
339339+ WHATWG: The Document object *)
130340131341val document_fragment_name : string
132132-(** ["#document-fragment"] - name for document fragment nodes *)
342342+(** ["#document-fragment"] - name for document fragment nodes.
343343+344344+ Document fragments are lightweight container nodes used to hold a
345345+ collection of nodes without a parent document. They are used:
346346+ - To hold [<template>] element contents
347347+ - As results of fragment parsing (innerHTML)
348348+ - For efficient batch DOM operations
349349+350350+ @see <https://dom.spec.whatwg.org/#documentfragment>
351351+ DOM Standard: DocumentFragment *)
133352134353val text_name : string
135135-(** ["#text"] - name for text nodes *)
354354+(** ["#text"] - name for text nodes.
355355+356356+ Text nodes contain the character data within elements. When the
357357+ parser encounters text between tags like [<p>Hello world</p>],
358358+ it creates a text node with data ["Hello world"] as a child of
359359+ the [<p>] element.
360360+361361+ Adjacent text nodes are automatically merged by the parser. *)
136362137363val comment_name : string
138138-(** ["#comment"] - name for comment nodes *)
364364+(** ["#comment"] - name for comment nodes.
365365+366366+ Comment nodes represent HTML comments: [<!-- comment text -->].
367367+ Comments are preserved in the DOM but not rendered to users.
368368+ They're useful for development notes or conditional content. *)
139369140370val doctype_name : string
141141-(** ["!doctype"] - name for doctype nodes *)
371371+(** ["!doctype"] - name for doctype nodes.
372372+373373+ The DOCTYPE node represents the [<!DOCTYPE html>] declaration.
374374+ It is always the first child of the Document node (if present).
375375+376376+ @see <https://html.spec.whatwg.org/multipage/syntax.html#the-doctype>
377377+ WHATWG: The DOCTYPE *)
142378143379(** {1 Constructors}
144380145381 Functions to create new DOM nodes. All nodes start with no parent and
146146- no children.
382382+ no children. Use {!append_child} or {!insert_before} to build a tree.
147383*)
148384149385val create_element : string -> ?namespace:string option ->
150386 ?attrs:(string * string) list -> unit -> node
151387(** Create an element node.
152388153153- @param name The tag name (e.g., "div", "p", "span")
154154- @param namespace Element namespace: [None] for HTML, [Some "svg"], [Some "mathml"]
155155- @param attrs Initial attributes as (name, value) pairs
389389+ Elements are the primary building blocks of HTML documents. Each
390390+ element represents a component of the document with semantic meaning.
391391+392392+ @param name The tag name (e.g., "div", "p", "span"). Tag names are
393393+ case-insensitive in HTML; by convention, use lowercase.
394394+ @param namespace Element namespace:
395395+ - [None] (default): HTML namespace for standard elements
396396+ - [Some "svg"]: SVG namespace for graphics elements
397397+ - [Some "mathml"]: MathML namespace for mathematical notation
398398+ @param attrs Initial attributes as [(name, value)] pairs
156399400400+ {b Examples:}
157401 {[
402402+ (* Simple HTML element *)
158403 let div = create_element "div" ()
159159- let svg = create_element "rect" ~namespace:(Some "svg") ()
160160- let link = create_element "a" ~attrs:[("href", "/")] ()
404404+405405+ (* Element with attributes *)
406406+ let link = create_element "a"
407407+ ~attrs:[("href", "https://example.com"); ("class", "external")]
408408+ ()
409409+410410+ (* SVG element *)
411411+ let rect = create_element "rect"
412412+ ~namespace:(Some "svg")
413413+ ~attrs:[("width", "100"); ("height", "50"); ("fill", "blue")]
414414+ ()
161415 ]}
416416+417417+ @see <https://html.spec.whatwg.org/multipage/dom.html#elements-in-the-dom>
418418+ WHATWG: Elements in the DOM
162419*)
163420164421val create_text : string -> node
165422(** Create a text node with the given content.
166423424424+ Text nodes contain the readable content of HTML documents. They
425425+ appear as children of elements and represent the characters that
426426+ users see.
427427+428428+ {b Note:} Text content is stored as-is. Character references like
+ [&amp;] should already be decoded to their character values.
430430+431431+ {b Example:}
167432 {[
168433 let text = create_text "Hello, world!"
+ (* To put text in a paragraph: *)
+ let p = create_element "p" ()
+ let () = append_child p text
169437 ]}
170438*)
171439172440val create_comment : string -> node
173441(** Create a comment node with the given content.
174442175175- The content should not include the comment delimiters.
443443+ Comments are human-readable notes in HTML that don't appear in
444444+ the rendered output. They're written as [<!-- comment -->] in HTML.
176445446446+ @param data The comment text (without the [<!--] and [-->] delimiters)
447447+448448+ {b Example:}
177449 {[
178178- let comment = create_comment " This is a comment "
179179- (* Represents: <!-- This is a comment --> *)
450450+ let comment = create_comment " TODO: Add navigation "
451451+ (* Represents: <!-- TODO: Add navigation --> *)
180452 ]}
453453+454454+ @see <https://html.spec.whatwg.org/multipage/syntax.html#comments>
455455+ WHATWG: HTML comments
181456*)
182457183458val create_document : unit -> node
184459(** Create an empty document node.
185460186186- Document nodes are the root of a complete HTML document tree.
461461+ The Document node is the root of an HTML document tree. It represents
462462+ the entire document and serves as the parent for the DOCTYPE (if any)
463463+ and the root [<html>] element.
464464+465465+ In a complete HTML document, the structure is:
+ {v
+ #document
+ ├── !doctype
+ └── html
+     ├── head
+     └── body
+ v}
473473+474474+ @see <https://html.spec.whatwg.org/multipage/dom.html#document>
475475+ WHATWG: The Document object
187476*)
188477189478val create_document_fragment : unit -> node
190479(** Create an empty document fragment.
191480192192- Document fragments are lightweight containers used for:
193193- - Template contents
194194- - Fragment parsing results
195195- - Efficient batch DOM operations
481481+ Document fragments are lightweight containers that can hold multiple
482482+ nodes without being part of the main document tree. They're useful for:
483483+484484+ - {b Template contents:} The [<template>] element stores its children
485485+ in a document fragment, keeping them inert until cloned
486486+487487+ - {b Fragment parsing:} When parsing HTML fragments (like innerHTML),
488488+ the result is placed in a document fragment
489489+490490+ - {b Batch operations:} Build a subtree in a fragment, then insert it
491491+ into the document in one operation for better performance
492492+493493+ @see <https://dom.spec.whatwg.org/#documentfragment>
494494+ DOM Standard: DocumentFragment
196495*)
197496198497val create_doctype : ?name:string -> ?public_id:string ->
199498 ?system_id:string -> unit -> node
200499(** Create a DOCTYPE node.
201500202202- For HTML5, use [create_doctype ~name:"html" ()] which produces
203203- [<!DOCTYPE html>].
501501+ The DOCTYPE declaration tells browsers to use standards mode for
502502+ rendering. For HTML5 documents, use:
503503+504504+ {[
505505+ let doctype = create_doctype ~name:"html" ()
506506+ (* Represents: <!DOCTYPE html> *)
507507+ ]}
508508+509509+ @param name DOCTYPE name (usually ["html"] for HTML documents)
510510+ @param public_id Public identifier (legacy, rarely needed)
511511+ @param system_id System identifier (legacy, rarely needed)
512512+513513+ {b Legacy example:}
514514+ {[
515515+ (* HTML 4.01 Strict DOCTYPE - not recommended for new documents *)
516516+ let legacy = create_doctype
517517+ ~name:"HTML"
518518+ ~public_id:"-//W3C//DTD HTML 4.01//EN"
519519+ ~system_id:"http://www.w3.org/TR/html4/strict.dtd"
520520+ ()
521521+ ]}
204522205205- @param name DOCTYPE name (usually "html")
206206- @param public_id Public identifier (legacy)
207207- @param system_id System identifier (legacy)
523523+ @see <https://html.spec.whatwg.org/multipage/syntax.html#the-doctype>
524524+ WHATWG: The DOCTYPE
208525*)
209526210527val create_template : ?namespace:string option ->
211528 ?attrs:(string * string) list -> unit -> node
212529(** Create a [<template>] element with its content document fragment.
213530214214- Template elements have special semantics: their children are not rendered
215215- directly but stored in a separate document fragment accessible via
216216- [template_content].
531531+ The [<template>] element holds inert HTML content that is not
532532+ rendered directly. The content is stored in a separate document
533533+ fragment and can be:
534534+ - Cloned and inserted into the document via JavaScript
535535+ - Used as a stamping template for repeated content
536536+ - Pre-parsed without affecting the page
537537+538538+ {b How templates work:}
539539+540540+ Unlike normal elements, a [<template>]'s children are not rendered.
541541+ Instead, they're stored in the [template_content] field. This means:
542542+ - Images inside won't load
543543+ - Scripts inside won't execute
544544+ - The content is "inert" until explicitly activated
545545+546546+ {b Example:}
547547+ {[
548548+ let template = create_template () in
549549+ let div = create_element "div" () in
550550+ let text = create_text "Template content" in
551551+ append_child div text;
552552+ (* Add to template's content fragment, not children *)
553553+ match template.template_content with
554554+ | Some fragment -> append_child fragment div
555555+ | None -> ()
556556+ ]}
217557218558 @see <https://html.spec.whatwg.org/multipage/scripting.html#the-template-element>
219219- The HTML5 template element specification
559559+ WHATWG: The template element
220560*)
221561222562(** {1 Node Type Predicates}
223563224224- Functions to test what type of node you have.
564564+ Functions to test what type of node you have. Since all nodes use the
565565+ same record type, these predicates check the [name] field to determine
566566+ the actual node type.
225567*)
226568227569val is_element : node -> bool
228570(** [is_element node] returns [true] if the node is an element node.
229571230230- Elements are nodes with HTML tags like [<div>], [<p>], etc.
572572+ Elements are HTML tags like [<div>], [<p>], [<a>]. They are
573573+ identified by having a tag name that doesn't match any of the
574574+ special node name constants.
231575*)
232576233577val is_text : node -> bool
234234-(** [is_text node] returns [true] if the node is a text node. *)
578578+(** [is_text node] returns [true] if the node is a text node.
579579+580580+ Text nodes contain the character content within elements.
581581+ They have [name = "#text"]. *)
235582236583val is_comment : node -> bool
237237-(** [is_comment node] returns [true] if the node is a comment node. *)
584584+(** [is_comment node] returns [true] if the node is a comment node.
585585+586586+ Comment nodes represent HTML comments [<!-- ... -->].
587587+ They have [name = "#comment"]. *)
238588239589val is_document : node -> bool
240240-(** [is_document node] returns [true] if the node is a document node. *)
590590+(** [is_document node] returns [true] if the node is a document node.
591591+592592+ The document node is the root of the DOM tree.
593593+ It has [name = "#document"]. *)
241594242595val is_document_fragment : node -> bool
243243-(** [is_document_fragment node] returns [true] if the node is a document fragment. *)
596596+(** [is_document_fragment node] returns [true] if the node is a document fragment.
597597+598598+ Document fragments are lightweight containers.
599599+ They have [name = "#document-fragment"]. *)
244600245601val is_doctype : node -> bool
246246-(** [is_doctype node] returns [true] if the node is a DOCTYPE node. *)
602602+(** [is_doctype node] returns [true] if the node is a DOCTYPE node.
603603+604604+ DOCTYPE nodes represent the [<!DOCTYPE>] declaration.
605605+ They have [name = "!doctype"]. *)
247606248607val has_children : node -> bool
249249-(** [has_children node] returns [true] if the node has any children. *)
608608+(** [has_children node] returns [true] if the node has any children.
609609+610610+ Note: For [<template>] elements, this checks the direct children list,
611611+ not the template content fragment. *)
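The name-based dispatch described for these predicates can be sketched as follows. This is a minimal illustration over a simplified stand-in record (only `name` and `children`), not the module's actual implementation; the `special_names` list is a hypothetical helper built from the constants documented above.

```ocaml
(* Simplified stand-in for the module's node record. *)
type node = { name : string; children : node list }

(* Hypothetical helper: the reserved names documented for non-element nodes. *)
let special_names =
  [ "#document"; "#document-fragment"; "#text"; "#comment"; "!doctype" ]

let is_text n = n.name = "#text"
let is_comment n = n.name = "#comment"
let is_document n = n.name = "#document"

(* An element is any node whose name is not one of the reserved names. *)
let is_element n = not (List.mem n.name special_names)

let has_children n = n.children <> []

let () =
  let text = { name = "#text"; children = [] } in
  let p = { name = "p"; children = [ text ] } in
  assert (is_element p && not (is_element text));
  assert (is_text text && has_children p)
```

Because the node kind is encoded in a string, an element created with a reserved name would be misclassified; a variant type would rule that out at the cost of a different record layout.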
250612251613(** {1 Tree Manipulation}
252614253615 Functions to modify the DOM tree structure. These functions automatically
254254- maintain parent/child references.
616616+ maintain parent/child references, ensuring the tree remains consistent.
255617*)
256618257619val append_child : node -> node -> unit
258620(** [append_child parent child] adds [child] as the last child of [parent].
259621260622 The child's parent reference is updated to point to [parent].
623623+ If the child already has a parent, it is first removed from that parent.
624624+625625+ {b Example:}
626626+ {[
627627+ let body = create_element "body" () in
628628+ let p = create_element "p" () in
629629+ let text = create_text "Hello!" in
630630+ append_child p text;
631631+ append_child body p
+ (* Result:
+ body
+ └── p
+     └── #text "Hello!"
+ *)
637637+ ]}
261638*)
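The reparenting behaviour described for `append_child` (detach from the old parent, then attach to the new one) might look like the sketch below. It assumes a reduced record with only `name`, `children`, and `parent`; the helper `make` is hypothetical, and the real module may implement this differently.

```ocaml
(* Simplified stand-in for the module's node record. *)
type node = {
  name : string;
  mutable children : node list;
  mutable parent : node option;
}

(* Hypothetical constructor for the sketch. *)
let make name = { name; children = []; parent = None }

(* Detach [child] from [parent]; nodes are compared by physical identity. *)
let remove_child parent child =
  if not (List.memq child parent.children) then raise Not_found;
  parent.children <- List.filter (fun c -> c != child) parent.children;
  child.parent <- None

(* Append [child], first detaching it from any previous parent. *)
let append_child parent child =
  (match child.parent with
   | Some old -> remove_child old child
   | None -> ());
  parent.children <- parent.children @ [ child ];
  child.parent <- Some parent

let () =
  let a = make "div" and b = make "section" and p = make "p" in
  append_child a p;
  append_child b p; (* moves p from a to b *)
  assert (a.children = []);
  assert (List.memq p b.children);
  assert (match p.parent with Some q -> q == b | None -> false)
```

Physical equality (`==`, `List.memq`) is used on purpose: structural equality on nodes would follow the `parent` back-references and may not terminate on a cyclic structure.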
262639263640val insert_before : node -> node -> node -> unit
264641(** [insert_before parent new_child ref_child] inserts [new_child] before
265642 [ref_child] in [parent]'s children.
266643267267- @raise Not_found if [ref_child] is not a child of [parent]
644644+ @param parent The parent node
645645+ @param new_child The node to insert
646646+ @param ref_child The existing child to insert before
647647+648648+ Raises [Not_found] if [ref_child] is not a child of [parent].
649649+650650+ {b Example:}
651651+ {[
652652+ let ul = create_element "ul" () in
653653+ let li1 = create_element "li" () in
654654+ let li3 = create_element "li" () in
655655+ append_child ul li1;
656656+ append_child ul li3;
657657+ let li2 = create_element "li" () in
658658+ insert_before ul li2 li3
659659+ (* Result: ul contains li1, li2, li3 in that order *)
660660+ ]}
268661*)
269662270663val remove_child : node -> node -> unit
271664(** [remove_child parent child] removes [child] from [parent]'s children.
272665273666 The child's parent reference is set to [None].
667667+668668+ Raises [Not_found] if [child] is not a child of [parent].
274669*)
275670276671val insert_text_at : node -> string -> node option -> unit
277672(** [insert_text_at parent text before_node] inserts text content.
278673279674 If [before_node] is [None], appends at the end. If the previous sibling
280280- is a text node, the text is merged into it. Otherwise, a new text node
281281- is created.
675675+ is a text node, the text is merged into it (text nodes are coalesced).
676676+ Otherwise, a new text node is created.
282677283678 This implements the HTML5 parser's text insertion algorithm which
284284- coalesces adjacent text nodes.
679679+ ensures adjacent text nodes are always merged, matching browser behavior.
680680+681681+ @see <https://html.spec.whatwg.org/multipage/parsing.html#appropriate-place-for-inserting-a-node>
+ WHATWG: Appropriate place for inserting a node
285683*)
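The coalescing rule described for `insert_text_at` can be sketched as below, again over a simplified record. Raising `Not_found` when `before_node` is not a child is an assumption made for this sketch; the documentation above does not say how the real function handles that case. The constructors `text_node` and `element` are hypothetical.

```ocaml
(* Simplified stand-in for the module's node record. *)
type node = {
  name : string;
  mutable data : string;
  mutable children : node list;
  mutable parent : node option;
}

let text_node data = { name = "#text"; data; children = []; parent = None }
let element name = { name; data = ""; children = []; parent = None }

(* Insert [text] before [before_node] (or at the end for [None]),
   merging into the previous sibling when that sibling is a text node. *)
let insert_text_at parent text before_node =
  (* Split children into (reversed prefix, suffix starting at the reference). *)
  let rec split acc = function
    | [] -> (
        match before_node with
        | None -> (acc, [])
        | Some _ -> raise Not_found (* assumption: reference must be a child *))
    | c :: rest as all -> (
        match before_node with
        | Some r when c == r -> (acc, all)
        | _ -> split (c :: acc) rest)
  in
  let rev_before, after = split [] parent.children in
  match rev_before with
  | prev :: _ when prev.name = "#text" ->
      (* Previous sibling is text: extend it instead of adding a node. *)
      prev.data <- prev.data ^ text
  | _ ->
      let t = text_node text in
      t.parent <- Some parent;
      parent.children <- List.rev_append rev_before (t :: after)

let () =
  let p = element "p" in
  insert_text_at p "Hello, " None;
  insert_text_at p "world" None;
  assert (List.length p.children = 1);
  assert ((List.hd p.children).data = "Hello, world")
```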
286684287685(** {1 Attribute Operations}
288686289289- Functions to read and modify element attributes.
687687+ Functions to read and modify element attributes. Attributes are
688688+ name-value pairs that provide additional information about elements.
689689+690690+ In HTML5, attribute names are case-insensitive and normalized to
691691+ lowercase by the parser.
692692+693693+ @see <https://html.spec.whatwg.org/multipage/dom.html#attributes>
694694+ WHATWG: Attributes
290695*)
291696292697val get_attr : node -> string -> string option
293293-(** [get_attr node name] returns the value of attribute [name], or [None]. *)
698698+(** [get_attr node name] returns the value of attribute [name], or [None]
699699+ if the attribute doesn't exist.
+
+ The lookup compares [name] exactly against the stored names. The parser
+ stores attribute names in lowercase, so pass lowercase names when
+ querying parsed documents.
702702+*)
294703295704val set_attr : node -> string -> string -> unit
296705(** [set_attr node name value] sets attribute [name] to [value].
297706298707 If the attribute already exists, it is replaced.
708708+ If it doesn't exist, it is added.
299709*)
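As a rough model of the lookup and replace-or-add behaviour documented above, attributes can be pictured as an association list keyed by lowercase names. The functions below are an assumption-laden sketch over plain lists; the real `get_attr` and `set_attr` operate on nodes and mutate them.

```ocaml
(* Sketch: attributes as an association list with lowercase keys. *)
let get_attr attrs name = List.assoc_opt name attrs

(* Replace the value if [name] exists, otherwise add the pair at the end. *)
let set_attr attrs name value =
  if List.mem_assoc name attrs then
    List.map (fun (k, v) -> if k = name then (k, value) else (k, v)) attrs
  else attrs @ [ (name, value) ]
```

Because lookup is case-sensitive on the stored names, a caller holding a mixed-case name would lowercase it first, for example with `String.lowercase_ascii`.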
300710301711val has_attr : node -> string -> bool
···310720(** [descendants node] returns all descendant nodes in document order.
311721312722 This performs a depth-first traversal, returning children before
313313- siblings at each level.
723723+ siblings at each level. The node itself is not included.
724724+725725+ {b Document order} is the order nodes appear in the HTML source:
726726+ parent before children, earlier siblings before later ones.
727727+728728+ {b Example:}
729729+ {[
730730+ (* For tree: div > (p > "hello", span > "world") *)
731731+ descendants div
732732+ (* Returns: [p; text("hello"); span; text("world")] *)
733733+ ]}
314734*)
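Document order corresponds to a pre-order depth-first walk. The following sketch uses a hypothetical immutable tree type (not the library's `node`) to show the shape of such a traversal.

```ocaml
(* Toy tree: a name plus ordered children. Illustration only. *)
type node = { name : string; children : node list }

(* Pre-order walk: each child, then that child's descendants, then the
   next sibling. The starting node itself is not included. *)
let rec descendants node =
  List.concat_map (fun child -> child :: descendants child) node.children
```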
315735316736val ancestors : node -> node list
317737(** [ancestors node] returns all ancestor nodes from parent to root.
318738319319- The first element is the immediate parent, the last is the root.
739739+ The first element is the immediate parent, the last is the root
740740+ (usually the Document node).
741741+742742+ {b Example:}
743743+ {[
744744+ (* For a text node inside: html > body > p > text *)
745745+ ancestors text_node
746746+ (* Returns: [p; body; html; #document] *)
747747+ ]}
320748*)
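Walking parent links is enough to produce the ancestor list. This sketch assumes a toy node with an optional `parent` field; the real record has more fields and may store the link differently.

```ocaml
(* Toy node with only a name and a parent link. Illustration only. *)
type node = { name : string; parent : node option }

(* Follow parent links upward: nearest ancestor first, root last. *)
let rec ancestors node =
  match node.parent with
  | None -> []
  | Some p -> p :: ancestors p
```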
321749322750val get_text_content : node -> string
323751(** [get_text_content node] returns the concatenated text content.
324752325325- For text nodes, returns the text data. For elements, recursively
326326- concatenates all descendant text content.
753753+ For text nodes, returns the text data directly.
754754+ For elements, recursively concatenates all descendant text content.
755755+ For other node types, returns an empty string.
756756+757757+ {b Example:}
758758+ {[
759759+ (* For: <p>Hello <b>world</b>!</p> *)
760760+ get_text_content p_element
761761+ (* Returns: "Hello world!" *)
762762+ ]}
327763*)
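The three cases listed for `get_text_content` map naturally onto a recursive function. The variant type here is a simplified assumption made for the sketch, not the module's single-record representation.

```ocaml
(* Simplified node variant for illustration only. *)
type node =
  | Text of string
  | Element of string * node list
  | Comment of string

(* Text nodes yield their data, elements concatenate their children,
   anything else contributes the empty string. *)
let rec text_content = function
  | Text s -> s
  | Element (_, children) -> String.concat "" (List.map text_content children)
  | Comment _ -> ""
```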
328764329765(** {1 Cloning} *)
···333769334770 @param deep If [true], recursively clone all descendants (default: [false])
335771336336- The cloned node has no parent. Attribute lists are copied by reference
337337- (the list itself is new, but attribute strings are shared).
772772+ The cloned node has no parent. With [deep:false], only the node itself
773773+ is copied (with its attributes, but not its children).
774774+775775+ {b Example:}
776776+ {[
777777+ let original = create_element "div" ~attrs:[("class", "box")] () in
778778+ let shallow = clone original in
779779+ let deep = clone ~deep:true original
780780+ ]}
338781*)
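The difference between shallow and deep cloning can be sketched on a toy record. The type and the `clone` below are hypothetical and only mirror the documented behaviour (attributes copied, children copied only when `deep` is set); parent links are left out.

```ocaml
(* Toy element record for illustration only. *)
type node = {
  name : string;
  attrs : (string * string) list;
  mutable children : node list;
}

(* Shallow clone keeps name and attributes but drops children;
   deep clone copies the whole subtree. *)
let rec clone ?(deep = false) n =
  { name = n.name;
    attrs = n.attrs;
    children =
      (if deep then List.map (fun c -> clone ~deep:true c) n.children
       else []) }
```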
+666-107
lib/html5rw/html5rw.mli
···5566(** Html5rw - Pure OCaml HTML5 Parser
7788- This module provides a complete HTML5 parsing solution following the
99- WHATWG specification. It uses bytesrw for streaming input/output.
88+ This library provides a complete HTML5 parsing solution that implements the
99+ {{:https://html.spec.whatwg.org/multipage/parsing.html} WHATWG HTML5
1010+ parsing specification}. It can parse any HTML document - well-formed or not -
1111+ and produce a DOM (Document Object Model) tree that matches browser behavior.
1212+1313+ {2 What is HTML?}
1414+1515+ HTML (HyperText Markup Language) is the standard markup language for creating
1616+ web pages. An HTML document consists of nested {i elements} that describe
1717+ the structure and content of the page:
1818+1919+ {v
2020+ <!DOCTYPE html>
2121+ <html>
2222+ <head>
2323+ <title>My Page</title>
2424+ </head>
2525+ <body>
2626+ <h1>Welcome</h1>
2727+ <p>Hello, <b>world</b>!</p>
2828+ </body>
2929+ </html>
3030+ v}
3131+3232+ Each element is written with a {i start tag} (like [<p>]), content, and an
3333+ {i end tag} (like [</p>]). Elements can have {i attributes} that provide
3434+ additional information: [<a href="https://example.com">].
3535+3636+ @see <https://html.spec.whatwg.org/multipage/introduction.html>
3737+ WHATWG: Introduction to HTML
3838+3939+ {2 The DOM}
4040+4141+ When this parser processes HTML, it doesn't just store the text. Instead,
4242+ it builds a tree structure called the DOM (Document Object Model). Each
4343+ element, text fragment, and comment becomes a {i node} in this tree:
4444+4545+ {v
4646+ Document
4747+ └── html
4848+ ├── head
4949+ │ └── title
5050+ │ └── #text "My Page"
5151+ └── body
5252+ ├── h1
5353+ │ └── #text "Welcome"
5454+ └── p
5555+ ├── #text "Hello, "
5656+ ├── b
5757+ │ └── #text "world"
5858+ └── #text "!"
5959+ v}
6060+6161+ This tree can be traversed, searched, and modified. The {!Dom} module
6262+ provides types and functions for working with DOM nodes.
6363+6464+ @see <https://html.spec.whatwg.org/multipage/dom.html>
6565+ WHATWG: Semantics, structure, and APIs of HTML documents (DOM chapter)
10661167 {2 Quick Start}
12681313- Parse HTML from a reader:
6969+ Parse HTML from a string:
1470 {[
1571 open Bytesrw
1672 let reader = Bytes.Reader.of_string "<p>Hello, world!</p>" in
···3288 let result = Html5rw.parse reader in
3389 let divs = Html5rw.query result "div.content"
3490 ]}
9191+9292+ {2 Error Handling}
9393+9494+ Unlike many parsers, HTML5 parsing {b never fails}. The WHATWG specification
9595+ defines error recovery rules for every possible malformed input, ensuring
9696+ all HTML documents produce a valid DOM tree (just as browsers do).
9797+9898+ For example, parsing [<p>Hello<p>World] produces two paragraphs, not an
9999+ error, because [<p>] implicitly closes the previous [<p>].
100100+101101+ If you need to detect malformed HTML (e.g., for validation), enable error
102102+ collection with [~collect_errors:true]. Errors are advisory - the parsing
103103+ still succeeds.
104104+105105+ @see <https://html.spec.whatwg.org/multipage/parsing.html#parse-errors>
106106+ WHATWG: Parse errors
107107+108108+ {2 HTML vs XHTML}
109109+110110+ This parser implements {b HTML5 parsing}, not XHTML parsing. Key differences:
111111+112112+ - Tag and attribute names are case-insensitive ([<DIV>] equals [<div>])
113113+ - Some end tags are optional ([<p>Hello] is valid)
114114+ - Void elements have no end tag ([<br>], not [<br></br>]); a trailing slash as in [<br/>] is accepted but has no effect
115115+ - Boolean attributes need no value ([<input disabled>])
116116+117117+ XHTML uses stricter XML rules. If you need XHTML parsing, use an XML parser.
118118+119119+ @see <https://html.spec.whatwg.org/multipage/syntax.html>
120120+ WHATWG: The HTML syntax
35121*)
3612237123(** {1 Sub-modules} *)
381243939-(** DOM types and manipulation functions *)
125125+(** DOM types and manipulation functions.
126126+127127+ This module provides the core types for representing HTML documents as
128128+ DOM trees. It includes:
129129+ - The {!Dom.node} type representing all kinds of DOM nodes
130130+ - Functions to create, modify, and traverse nodes
131131+ - Serialization functions to convert DOM back to HTML
132132+133133+ @see <https://html.spec.whatwg.org/multipage/dom.html>
134134+ WHATWG: Semantics, structure, and APIs of HTML documents *)
40135module Dom = Html5rw_dom
411364242-(** HTML5 tokenizer *)
137137+(** HTML5 tokenizer.
138138+139139+ The tokenizer is the first stage of HTML5 parsing. It converts a stream
140140+ of characters into a stream of {i tokens}: start tags, end tags, text,
141141+ comments, and DOCTYPEs.
142142+143143+ Most users don't need to use the tokenizer directly - the {!parse}
144144+ function handles everything. The tokenizer is exposed for advanced use
145145+ cases like syntax highlighting or partial parsing.
146146+147147+ @see <https://html.spec.whatwg.org/multipage/parsing.html#tokenization>
148148+ WHATWG: Tokenization *)
43149module Tokenizer = Html5rw_tokenizer
441504545-(** Encoding detection and decoding *)
151151+(** Encoding detection and decoding.
152152+153153+ HTML documents can use various character encodings (UTF-8, ISO-8859-1,
154154+ etc.). This module implements the WHATWG encoding sniffing algorithm
155155+ that browsers use to detect the encoding of a document:
156156+157157+ 1. Check for a BOM (Byte Order Mark)
158158+ 2. Look for a [<meta charset>] declaration
159159+ 3. Use HTTP Content-Type header hint (if available)
160160+ 4. Fall back to UTF-8
161161+162162+ @see <https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding>
163163+ WHATWG: Determining the character encoding
164164+ @see <https://encoding.spec.whatwg.org/>
165165+ WHATWG Encoding Standard *)
46166module Encoding = Html5rw_encoding
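The first sniffing step, BOM detection, is simple enough to sketch directly. The `bom` type and `detect_bom` function are illustrative names, not part of the `Encoding` module; the byte patterns are the standard UTF-8 and UTF-16 byte order marks.

```ocaml
(* Illustrative result type; the library has its own encoding type. *)
type bom = Utf8 | Utf16le | Utf16be

(* Inspect the first bytes of the input for a byte order mark. *)
let detect_bom (b : bytes) : bom option =
  (* -1 stands for "past the end" so short inputs never raise. *)
  let at i = if i < Bytes.length b then Char.code (Bytes.get b i) else -1 in
  match at 0, at 1, at 2 with
  | 0xEF, 0xBB, 0xBF -> Some Utf8
  | 0xFF, 0xFE, _ -> Some Utf16le
  | 0xFE, 0xFF, _ -> Some Utf16be
  | _ -> None
```

When no BOM is found, detection would continue with the remaining steps listed above.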
471674848-(** CSS selector engine *)
168168+(** CSS selector engine.
169169+170170+ This module provides CSS selector support for querying the DOM tree.
171171+ CSS selectors are patterns used to select HTML elements based on their
172172+ tag names, attributes, classes, IDs, and position in the document.
173173+174174+ Example selectors:
175175+ - [div] - all [<div>] elements
176176+ - [#header] - element with [id="header"]
177177+ - [.warning] - elements with [class="warning"]
178178+ - [div > p] - [<p>] elements that are direct children of [<div>]
179179+ - [[href]] - elements with an [href] attribute
180180+181181+ @see <https://www.w3.org/TR/selectors-4/>
182182+ W3C Selectors Level 4 specification *)
49183module Selector = Html5rw_selector
501845151-(** HTML entity decoding *)
185185+(** HTML entity decoding.
186186+187187+ HTML uses {i character references} to represent characters that are
188188+ hard to type or have special meaning:
189189+190190+ - Named references: [&] (ampersand), [<] (less than), [ ] (non-breaking space)
191191+ - Decimal references: [<] (less than as decimal 60)
192192+ - Hexadecimal references: [<] (less than as hex 3C)
193193+194194+ This module decodes all 2,231 named character references defined in
195195+ the WHATWG specification, plus numeric references.
196196+197197+ @see <https://html.spec.whatwg.org/multipage/named-characters.html>
198198+ WHATWG: Named character references *)
52199module Entities = Html5rw_entities
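Numeric references reduce to parsing a code point and encoding it as UTF-8. The function below is a deliberately lenient sketch with a made-up name (`decode_numeric_ref`); it is not the `Entities` API, and it omits the remapping the WHATWG algorithm applies to NUL, surrogates, out-of-range values, and the C1 control range.

```ocaml
(* Decode a complete numeric reference such as "&#60;" or "&#x3C;" into
   UTF-8. Returns None for anything that is not a numeric reference. *)
let decode_numeric_ref s =
  let n = String.length s in
  if n < 4 || s.[0] <> '&' || s.[1] <> '#' || s.[n - 1] <> ';' then None
  else
    let body = String.sub s 2 (n - 3) in
    let digits =
      if body.[0] = 'x' || body.[0] = 'X' then
        "0x" ^ String.sub body 1 (String.length body - 1)
      else body
    in
    match int_of_string_opt digits with
    | Some cp when Uchar.is_valid cp ->
        let b = Buffer.create 4 in
        Buffer.add_utf_8_uchar b (Uchar.of_int cp);
        Some (Buffer.contents b)
    | _ -> None
```

Named references need a lookup table instead, which is what the 2,231-entry list provides.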
532005454-(** Low-level parser access *)
201201+(** Low-level parser access.
202202+203203+ This module exposes the internals of the HTML5 parser for advanced use.
204204+ Most users should use the top-level {!parse} function instead.
205205+206206+ The parser exposes:
207207+ - Insertion modes for the tree construction algorithm
208208+ - The tree builder state machine
209209+ - Lower-level parsing functions
210210+211211+ @see <https://html.spec.whatwg.org/multipage/parsing.html#tree-construction>
212212+ WHATWG: Tree construction *)
55213module Parser = Html5rw_parser
5621457215(** {1 Core Types} *)
582165959-(** DOM node type. See {!Dom} for manipulation functions. *)
217217+(** DOM node type.
218218+219219+ A node represents one part of an HTML document. Nodes form a tree
220220+ structure with parent/child relationships. There are several kinds:
221221+222222+ - {b Element nodes}: HTML tags like [<div>], [<p>], [<a>]
223223+ - {b Text nodes}: Text content within elements
224224+ - {b Comment nodes}: HTML comments [<!-- ... -->]
225225+ - {b Document nodes}: The root of a document tree
226226+ - {b Document fragment nodes}: Lightweight containers
227227+ - {b Doctype nodes}: The [<!DOCTYPE html>] declaration
228228+229229+ See {!Dom} for manipulation functions.
230230+231231+ @see <https://html.spec.whatwg.org/multipage/dom.html>
232232+ WHATWG: The DOM *)
60233type node = Dom.node
612346262-(** Doctype information *)
235235+(** DOCTYPE information.
236236+237237+ The DOCTYPE declaration ([<!DOCTYPE html>]) appears at the start of HTML
238238+ documents. It tells browsers to use standards mode for rendering.
239239+240240+ In HTML5, the DOCTYPE is minimal - just [<!DOCTYPE html>] with no public
241241+ or system identifiers. Legacy DOCTYPEs may have additional fields.
242242+243243+ @see <https://html.spec.whatwg.org/multipage/syntax.html#the-doctype>
244244+ WHATWG: The DOCTYPE *)
63245type doctype_data = Dom.doctype_data = {
64246 name : string option;
247247+ (** DOCTYPE name, typically ["html"] *)
248248+65249 public_id : string option;
250250+ (** Public identifier for legacy DOCTYPEs (e.g., XHTML, HTML4) *)
251251+66252 system_id : string option;
253253+ (** System identifier (URL) for legacy DOCTYPEs *)
67254}
682556969-(** Quirks mode as determined during parsing *)
256256+(** Quirks mode as determined during parsing.
257257+258258+ {i Quirks mode} controls how browsers render CSS and compute layouts.
259259+ It exists for backwards compatibility with old web pages that relied
260260+ on browser bugs.
261261+262262+ - {b No_quirks}: Standards mode. The document is rendered according to
263263+ modern HTML5 and CSS specifications. Triggered by [<!DOCTYPE html>].
264264+265265+ - {b Quirks}: Full quirks mode. The browser emulates bugs from older
266266+ browsers (primarily IE5). Triggered by missing or malformed DOCTYPEs.
267267+ Affects CSS box model, table layout, font inheritance, and more.
268268+269269+ - {b Limited_quirks}: Almost standards mode. Only a few specific quirks
270270+ are applied, mainly affecting table cell vertical alignment.
271271+272272+ {b Recommendation:} Always use [<!DOCTYPE html>] to ensure standards mode.
273273+274274+ @see <https://quirks.spec.whatwg.org/>
275275+ Quirks Mode Standard
276276+ @see <https://html.spec.whatwg.org/multipage/parsing.html#the-initial-insertion-mode>
277277+ WHATWG: How quirks mode is determined *)
70278type quirks_mode = Dom.quirks_mode = No_quirks | Quirks | Limited_quirks
712797272-(** Character encoding detected or specified *)
280280+(** Character encoding detected or specified.
281281+282282+ HTML documents are sequences of bytes that must be decoded into characters.
283283+ Different encodings interpret the same bytes differently. For example:
284284+285285+ - UTF-8: The modern standard, supporting all Unicode characters
286286+ - Windows-1252: Common on older Western European web pages
287287+ - ISO-8859-2: Used for Central European languages
288288+ - UTF-16: Used by some Windows applications
289289+290290+ The parser detects encoding automatically when using {!parse_bytes}.
291291+ The detected encoding is available via {!val-encoding}.
292292+293293+ @see <https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding>
294294+ WHATWG: Determining the character encoding
295295+ @see <https://encoding.spec.whatwg.org/>
296296+ WHATWG Encoding Standard *)
73297type encoding = Encoding.encoding =
74298 | Utf8
299299+ (** UTF-8: The dominant encoding for the web, supporting all Unicode *)
300300+75301 | Utf16le
302302+ (** UTF-16 Little-Endian: 16-bit encoding, used by Windows *)
303303+76304 | Utf16be
305305+ (** UTF-16 Big-Endian: 16-bit encoding, network byte order *)
306306+77307 | Windows_1252
308308+ (** Windows-1252 (CP-1252): Western European, superset of ISO-8859-1 *)
309309+78310 | Iso_8859_2
311311+ (** ISO-8859-2: Central European (Polish, Czech, Hungarian, etc.) *)
312312+79313 | Euc_jp
314314+ (** EUC-JP: Extended Unix Code for Japanese *)
8031581316(** A parse error encountered during HTML5 parsing.
823178383- HTML5 parsing never fails - the specification defines error recovery
8484- for all malformed input. However, conformance checkers can report
8585- these errors. Enable error collection with [~collect_errors:true].
318318+ HTML5 parsing {b never fails} - the specification defines error recovery
319319+ for all malformed input. However, conformance checkers can report these
320320+ errors. Enable error collection with [~collect_errors:true] if you want
321321+ to detect malformed HTML.
322322+323323+ {b Common parse errors:}
324324+325325+ - ["unexpected-null-character"]: Null byte in the input
326326+ - ["eof-before-tag-name"]: File ended while reading a tag
327327+ - ["unexpected-character-in-attribute-name"]: Invalid attribute syntax
328328+ - ["missing-doctype"]: Document started without [<!DOCTYPE>]
329329+ - ["duplicate-attribute"]: Same attribute appears twice on an element
330330+331331+ The full list of parse error codes is defined in the WHATWG specification.
8633287333 @see <https://html.spec.whatwg.org/multipage/parsing.html#parse-errors>
8888- WHATWG parse error definitions
8989-*)
334334+ WHATWG: Complete list of parse errors *)
90335type parse_error = Parser.parse_error
913369292-(** Get the error code (e.g., "unexpected-null-character"). *)
337337+(** Get the error code string.
338338+339339+ Error codes are lowercase with hyphens, matching the WHATWG specification
340340+ names. Examples: ["unexpected-null-character"], ["eof-in-tag"],
341341+ ["missing-end-tag-name"].
342342+343343+ @see <https://html.spec.whatwg.org/multipage/parsing.html#parse-errors>
344344+ WHATWG: Parse error codes *)
93345val error_code : parse_error -> string
943469595-(** Get the line number where the error occurred (1-indexed). *)
347347+(** Get the line number where the error occurred (1-indexed).
348348+349349+ Line numbers count from 1 and increment at each newline character. *)
96350val error_line : parse_error -> int
973519898-(** Get the column number where the error occurred (1-indexed). *)
352352+(** Get the column number where the error occurred (1-indexed).
353353+354354+ Column numbers count from 1 and reset at each newline. *)
99355val error_column : parse_error -> int
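The 1-indexed line and column convention can be made concrete with a short sketch. `line_col` is a hypothetical helper that counts columns in bytes; whether the library counts bytes or characters is not stated here.

```ocaml
(* Position of byte [offset] in [s] as (line, column), both 1-indexed.
   The line increments and the column resets after each '\n'. *)
let line_col s offset =
  let line = ref 1 and col = ref 1 in
  for i = 0 to min offset (String.length s) - 1 do
    if s.[i] = '\n' then (incr line; col := 1) else incr col
  done;
  (!line, !col)
```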
100356101357(** Context element for HTML fragment parsing (innerHTML).
102358103103- When parsing HTML fragments, you must specify what element would
104104- contain the fragment. This affects how certain elements are handled.
359359+ When parsing HTML fragments (like the [innerHTML] of an element), you
360360+ must specify what element would contain the fragment. This affects how
361361+ the parser handles certain elements.
362362+363363+ {b Why context matters:}
364364+365365+ HTML parsing rules depend on where content appears. For example:
366366+ - [<td>] is valid inside [<tr>] but not inside [<div>]
367367+ - [<li>] is expected inside [<ul>] or [<ol>]; elsewhere the parser still inserts it, without implying a list
368368+ - Content inside [<table>] has special parsing rules
369369+370370+ {b Example:}
371371+ {[
372372+ (* Parse as if content were inside a <ul> *)
373373+ let ctx = make_fragment_context ~tag_name:"ul" () in
374374+ let result = parse ~fragment_context:ctx reader
375375+ (* Now <li> elements are parsed correctly *)
376376+ ]}
105377106378 @see <https://html.spec.whatwg.org/multipage/parsing.html#parsing-html-fragments>
107107- The fragment parsing algorithm
108108-*)
379379+ WHATWG: The fragment parsing algorithm *)
109380type fragment_context = Parser.fragment_context
110381111382(** Create a fragment parsing context.
112383113113- @param tag_name Tag name of the context element (e.g., "div", "tr")
114114- @param namespace Namespace: [None] for HTML, [Some "svg"], [Some "mathml"]
384384+ The context element determines how the parser interprets the fragment.
385385+ Choose a context that matches where the fragment would be inserted.
115386387387+ @param tag_name Tag name of the context element (e.g., ["div"], ["tr"],
388388+ ["ul"]). This is the element that would contain the fragment.
389389+ @param namespace Namespace of the context element:
390390+ - [None] (default): HTML namespace
391391+ - [Some "svg"]: SVG namespace
392392+ - [Some "mathml"]: MathML namespace
393393+394394+ {b Examples:}
116395 {[
117117- (* Parse as innerHTML of a <ul> *)
118118- let ctx = Html5rw.make_fragment_context ~tag_name:"ul" ()
396396+ (* Parse as innerHTML of a <div> (most common case) *)
397397+ let ctx = make_fragment_context ~tag_name:"div" ()
398398+399399+ (* Parse as innerHTML of a <ul> - <li> elements work correctly *)
400400+ let ctx = make_fragment_context ~tag_name:"ul" ()
119401120402 (* Parse as innerHTML of an SVG <g> element *)
121121- let ctx = Html5rw.make_fragment_context ~tag_name:"g" ~namespace:(Some "svg") ()
403403+ let ctx = make_fragment_context ~tag_name:"g" ~namespace:(Some "svg") ()
404404+405405+ (* Parse as innerHTML of a <table> - table-specific rules apply *)
406406+ let ctx = make_fragment_context ~tag_name:"table" ()
122407 ]}
123123-*)
408408+409409+ @see <https://html.spec.whatwg.org/multipage/parsing.html#parsing-html-fragments>
410410+ WHATWG: Fragment parsing algorithm *)
124411val make_fragment_context : tag_name:string -> ?namespace:string option ->
125412 unit -> fragment_context
126413···132419133420(** Result of parsing an HTML document.
134421135135- Contains the parsed DOM tree, any errors encountered, and the
136136- detected encoding (when parsing from bytes).
422422+ This record contains everything produced by parsing:
423423+ - The DOM tree (accessible via {!val-root})
424424+ - Any parse errors (accessible via {!val-errors})
425425+ - The detected encoding (accessible via {!val-encoding})
137426*)
138427type t = {
139428 root : node;
429429+ (** Root node of the parsed document tree.
430430+431431+ For full document parsing, this is a Document node containing the
432432+ DOCTYPE (if any) and [<html>] element.
433433+434434+ For fragment parsing, this is a Document Fragment containing the
435435+ parsed elements. *)
436436+140437 errors : parse_error list;
438438+ (** Parse errors encountered during parsing.
439439+440440+ This list is empty unless [~collect_errors:true] was passed to the
441441+ parse function. Errors are in the order they were encountered.
442442+443443+ @see <https://html.spec.whatwg.org/multipage/parsing.html#parse-errors>
444444+ WHATWG: Parse errors *)
445445+141446 encoding : encoding option;
447447+ (** Character encoding detected during parsing.
448448+449449+ This is [Some encoding] when using {!parse_bytes} with automatic
450450+ encoding detection, and [None] when using {!parse} (which expects
451451+ pre-decoded UTF-8 input). *)
142452}
143453144454(** {1 Parsing Functions} *)
145455146456(** Parse HTML from a [Bytes.Reader.t].
147457148148- This is the primary parsing function. Create a reader from any source:
149149- - [Bytes.Reader.of_string s] for strings
150150- - [Bytes.Reader.of_in_channel ic] for files
151151- - [Bytes.Reader.of_bytes b] for byte buffers
458458+ This is the primary parsing function. It reads bytes from the provided
459459+ reader and returns a DOM tree. The input should be valid UTF-8.
152460461461+ {b Creating readers:}
153462 {[
154463 open Bytesrw
155155- let reader = Bytes.Reader.of_string "<html><body>Hello</body></html>" in
464464+465465+ (* From a string *)
466466+ let reader = Bytes.Reader.of_string html_string
467467+468468+ (* From a file *)
469469+ let ic = open_in "page.html" in
470470+ let reader = Bytes.Reader.of_in_channel ic
471471+472472+ (* From a byte buffer *)
473473+ let reader = Bytes.Reader.of_bytes buf
474474+ ]}
475475+476476+ {b Parsing a complete document:}
477477+ {[
156478 let result = Html5rw.parse reader
479479+ let doc = Html5rw.root result
157480 ]}
158481159159- @param collect_errors If true, collect parse errors (default: false)
160160- @param fragment_context Context element for fragment parsing
161161-*)
162162-val parse : ?collect_errors:bool -> ?fragment_context:fragment_context -> Bytesrw.Bytes.Reader.t -> t
482482+ {b Parsing a fragment:}
483483+ {[
484484+ let ctx = Html5rw.make_fragment_context ~tag_name:"div" () in
485485+ let result = Html5rw.parse ~fragment_context:ctx reader
486486+ ]}
487487+488488+ @param collect_errors If [true], collect parse errors. Default: [false].
489489+ Error collection has some performance overhead.
490490+ @param fragment_context Context element for fragment parsing. If provided,
491491+ the input is parsed as a fragment (like innerHTML) rather than
492492+ a complete document.
493493+494494+ @see <https://html.spec.whatwg.org/multipage/parsing.html>
495495+ WHATWG: HTML parsing algorithm *)
496496+val parse : ?collect_errors:bool -> ?fragment_context:fragment_context ->
497497+ Bytesrw.Bytes.Reader.t -> t
163498164499(** Parse raw bytes with automatic encoding detection.
165500166166- This function implements the WHATWG encoding sniffing algorithm:
167167- 1. Check for BOM (Byte Order Mark)
168168- 2. Prescan for <meta charset>
169169- 3. Fall back to UTF-8
501501+ This function is useful when you have raw bytes and don't know the
502502+ character encoding. It implements the WHATWG encoding sniffing algorithm:
503503+504504+ 1. {b BOM detection}: Check for UTF-8, UTF-16LE, or UTF-16BE BOM
505505+ 2. {b Prescan}: Look for [<meta charset="...">] in the first 1024 bytes
506506+ 3. {b Transport hint}: Use the provided [transport_encoding] if any
507507+ 4. {b Fallback}: Use UTF-8 (the modern web default)
508508+509509+ The detected encoding is stored in the result's [encoding] field.
510510+511511+ {b Example:}
512512+ {[
513513+ let bytes = Bytes.of_string (really_input_string ic (in_channel_length ic)) in
514514+ let result = Html5rw.parse_bytes bytes in
515515+ match Html5rw.encoding result with
516516+ | Some Utf8 -> print_endline "UTF-8 detected"
517517+ | Some Windows_1252 -> print_endline "Windows-1252 detected"
518518+ | _ -> ()
519519+ ]}
170520171171- @param collect_errors If true, collect parse errors (default: false)
172172- @param transport_encoding Encoding from HTTP Content-Type header
173173- @param fragment_context Context element for fragment parsing
174174-*)
175175-val parse_bytes : ?collect_errors:bool -> ?transport_encoding:string -> ?fragment_context:fragment_context -> bytes -> t
521521+ @param collect_errors If [true], collect parse errors. Default: [false].
522522+ @param transport_encoding Encoding hint from HTTP Content-Type header.
523523+ For example, if the server sends [Content-Type: text/html; charset=utf-8],
524524+ pass [~transport_encoding:"utf-8"].
525525+ @param fragment_context Context element for fragment parsing.
526526+527527+ @see <https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding>
528528+ WHATWG: Determining the character encoding *)
529529+val parse_bytes : ?collect_errors:bool -> ?transport_encoding:string ->
530530+ ?fragment_context:fragment_context -> bytes -> t
176531177532(** {1 Querying} *)
178533179534(** Query the DOM tree with a CSS selector.
180535181181- Supported selectors:
182182- - Tag: [div], [p], [span]
183183- - ID: [#myid]
184184- - Class: [.myclass]
185185- - Universal: [*]
186186- - Attribute: [[attr]], [[attr="value"]], [[attr~="value"]], [[attr|="value"]]
187187- - Pseudo-classes: [:first-child], [:last-child], [:nth-child(n)]
188188- - Combinators: descendant (space), child (>), adjacent sibling (+), general sibling (~)
536536+ CSS selectors are patterns used to select elements in HTML documents.
537537+ This function returns all nodes matching the selector, in document order.
538538+539539+ {b Supported selectors:}
540540+541541+ {i Type selectors:}
542542+ - [div], [p], [span] - elements by tag name
543543+544544+ {i Class and ID selectors:}
545545+ - [#myid] - element with [id="myid"]
546546+ - [.myclass] - elements whose [class] attribute includes the class name [myclass]
547547+548548+ {i Attribute selectors:}
549549+ - [[attr]] - elements with the [attr] attribute
550550+ - [[attr="value"]] - attribute equals value
551551+ - [[attr~="value"]] - attribute contains word
552552+ - [[attr|="value"]] - attribute equals value or starts with value-
553553+ - [[attr^="value"]] - attribute starts with value
554554+ - [[attr$="value"]] - attribute ends with value
555555+ - [[attr*="value"]] - attribute contains value
556556+557557+ {i Pseudo-classes:}
558558+ - [:first-child], [:last-child] - first/last child of parent
559559+ - [:nth-child(n)] - nth child (1-indexed)
560560+ - [:only-child] - only child of parent
561561+ - [:empty] - elements with no children
562562+ - [:not(selector)] - elements not matching selector
189563564564+ {i Combinators:}
565565+ - [A B] - B descendants of A (any depth)
566566+ - [A > B] - B direct children of A
567567+ - [A + B] - B immediately after A (adjacent sibling)
568568+ - [A ~ B] - B after A (general sibling)
569569+570570+ {i Universal:}
571571+ - [*] - all elements
572572+573573+ {b Examples:}
190574 {[
191191- let divs = Html5rw.query result "div.content > p"
575575+ (* All paragraphs *)
576576+ let ps = query result "p"
577577+578578+ (* Elements with class "warning" inside a div *)
579579+ let warnings = query result "div .warning"
580580+581581+ (* Direct children of nav that are links *)
582582+ let nav_links = query result "nav > a"
583583+584584+ (* Complex selector *)
585585+ let items = query result "ul.menu > li:first-child a[href]"
192586 ]}
193587194194- @raise Selector.Selector_error if the selector is invalid
195195-*)
588588+ @raise Selector.Selector_error if the selector syntax is invalid
589589+590590+ @see <https://www.w3.org/TR/selectors-4/>
591591+ W3C: Selectors Level 4 *)
196592val query : t -> string -> node list
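The six attribute operators differ only in how the selector value is compared with the actual attribute value. The sketch below is an independent illustration of those comparison rules as defined in the Selectors specification; the `op` type and `attr_matches` are invented names, not the `Selector` module's internals, and word splitting is simplified to spaces.

```ocaml
(* One constructor per operator: =  ~=  |=  ^=  $=  *= *)
type op = Exact | Word | Dash | Prefix | Suffix | Substring

let starts_with ~prefix s =
  let lp = String.length prefix in
  String.length s >= lp && String.sub s 0 lp = prefix

let ends_with ~suffix s =
  let ls = String.length s and lx = String.length suffix in
  ls >= lx && String.sub s (ls - lx) lx = suffix

let contains ~sub s =
  let ls = String.length s and lb = String.length sub in
  let rec go i = i + lb <= ls && (String.sub s i lb = sub || go (i + 1)) in
  go 0

(* [value] comes from the selector, [actual] from the element. *)
let attr_matches op ~value actual =
  match op with
  | Exact -> actual = value
  | Word ->
      value <> ""
      && not (String.contains value ' ')
      && List.mem value (String.split_on_char ' ' actual)
  | Dash -> actual = value || starts_with ~prefix:(value ^ "-") actual
  | Prefix -> value <> "" && starts_with ~prefix:value actual
  | Suffix -> value <> "" && ends_with ~suffix:value actual
  | Substring -> value <> "" && contains ~sub:value actual
```

Note that `|=` is stricter than `^=`: `en` matches `en` and `en-US` but not `english`.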
197593198198-(** Check if a node matches a CSS selector. *)
594594+(** Check if a node matches a CSS selector.
595595+596596+ This is useful for filtering nodes or implementing custom traversals.
597597+598598+ {b Example:}
599599+ {[
600600+ let is_external_link node =
601601+ matches node "a[href^='http']"
602602+ ]}
603603+604604+ @raise Selector.Selector_error if the selector syntax is invalid *)
199605val matches : node -> string -> bool
200606201607(** {1 Serialization} *)
202608203609(** Write the DOM tree to a [Bytes.Writer.t].
204610611611+ This serializes the DOM back to HTML. The output is valid HTML5 that
612612+ can be re-parsed into an equivalent DOM tree (with [~pretty:true], up to the whitespace added for indentation).
613613+614614+ {b Example:}
205615 {[
206616 open Bytesrw
207617 let buf = Buffer.create 1024 in
···211621 let html = Buffer.contents buf
212622 ]}
213623214214- @param pretty If true, format with indentation (default: true)
215215- @param indent_size Number of spaces per indent level (default: 2)
216216-*)
217217-val to_writer : ?pretty:bool -> ?indent_size:int -> t -> Bytesrw.Bytes.Writer.t -> unit
624624+ @param pretty If [true] (default), add indentation for readability.
625625+ If [false], output compact HTML with no added whitespace.
626626+ @param indent_size Spaces per indentation level (default: 2).
627627+ Only used when [pretty] is [true].
628628+629629+ @see <https://html.spec.whatwg.org/multipage/parsing.html#serialising-html-fragments>
630630+ WHATWG: Serialising HTML fragments *)
631631+val to_writer : ?pretty:bool -> ?indent_size:int -> t ->
632632+ Bytesrw.Bytes.Writer.t -> unit
218633219634(** Serialize the DOM tree to a string.
220635221221- Convenience function when the output fits in memory.
636636+ Convenience function that serializes to a string instead of a writer.
637637+ Use {!to_writer} for large documents to avoid memory allocation.
222638223223- @param pretty If true, format with indentation (default: true)
224224- @param indent_size Number of spaces per indent level (default: 2)
225225-*)
639639+ @param pretty If [true] (default), add indentation for readability.
640640+ @param indent_size Spaces per indentation level (default: 2). *)
226641val to_string : ?pretty:bool -> ?indent_size:int -> t -> string
227642228643(** Extract text content from the DOM tree.
229644230230- @param separator String to insert between text nodes (default: " ")
231231- @param strip If true, trim whitespace (default: true)
232232-*)
645645+ This concatenates all text nodes in the document, producing a string
646646+ with just the readable text (no HTML tags).
647647+648648+ {b Example:}
649649+ {[
650650+ (* For document: <div><p>Hello</p><p>World</p></div> *)
651651+ let text = to_text result
652652+ (* Returns: "Hello World" *)
653653+ ]}
654654+655655+ @param separator String to insert between text nodes (default: [" "])
656656+ @param strip If [true] (default), trim leading/trailing whitespace *)
233657val to_text : ?separator:string -> ?strip:bool -> t -> string
234658235235-(** Serialize to html5lib test format (for testing). *)
659659+(** Serialize to html5lib test format.
660660+661661+ This produces the tree format used by the
662662+ {{:https://github.com/html5lib/html5lib-tests} html5lib-tests} suite.
663663+ Mainly useful for testing the parser against the reference tests. *)
236664val to_test_format : t -> string
237665238666(** {1 Result Accessors} *)
239667240240-(** Get the root node of the parsed document. *)
668668+(** Get the root node of the parsed document.
669669+670670+ For full document parsing, this returns a Document node. The structure is:
671671+ {v
672672+ #document
673673+ ├── !doctype (if present)
674674+ └── html
675675+ ├── head
676676+ └── body
677677+ v}
678678+679679+ For fragment parsing, this returns a Document Fragment node containing
680680+ the parsed elements directly. *)
241681val root : t -> node
242682243243-(** Get parse errors (if error collection was enabled). *)
683683+(** Get parse errors (if error collection was enabled).
684684+685685+ Returns an empty list if [~collect_errors:true] was not passed to the
686686+ parse function, or if the document was well-formed.
687687+688688+ Errors are returned in the order they were encountered during parsing.
689689+690690+ @see <https://html.spec.whatwg.org/multipage/parsing.html#parse-errors>
691691+ WHATWG: Parse errors *)
244692val errors : t -> parse_error list
245693246246-(** Get the detected encoding (if parsed from bytes). *)
694694+(** Get the detected encoding (if parsed from bytes).
695695+696696+ Returns [Some encoding] when {!parse_bytes} was used, indicating which
697697+ encoding was detected or specified. Returns [None] when {!parse} was
698698+ used, since it expects pre-decoded UTF-8 input.
699699+700700+ @see <https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding>
701701+ WHATWG: Determining the character encoding *)
247702val encoding : t -> encoding option
248703249704(** {1 DOM Utilities}
250705251251- Common DOM operations are available directly. For the full API,
252252- see the {!Dom} module.
706706+ Common DOM operations are available directly on this module. For the
707707+ full API including more advanced operations, see the {!Dom} module.
708708+709709+ @see <https://html.spec.whatwg.org/multipage/dom.html>
710710+ WHATWG: The elements of HTML
253711*)
254712255713(** Create an element node.
256256- @param namespace None for HTML, Some "svg" or Some "mathml" for foreign content
257257- @param attrs List of (name, value) attribute pairs
258258-*)
259259-val create_element : string -> ?namespace:string option -> ?attrs:(string * string) list -> unit -> node
714714+715715+ Elements are the building blocks of HTML documents. They represent tags
716716+ like [<div>], [<p>], [<a>], etc.
717717+718718+ @param name Tag name (e.g., ["div"], ["p"], ["span"])
719719+ @param namespace Element namespace:
720720+ - [None] (default): HTML namespace
721721+ - [Some "svg"]: SVG namespace for graphics
722722+ - [Some "mathml"]: MathML namespace for math notation
723723+ @param attrs Initial attributes as [(name, value)] pairs
724724+725725+ {b Example:}
726726+ {[
727727+ (* Simple element *)
728728+ let div = create_element "div" ()
260729261261-(** Create a text node. *)
730730+ (* Element with attributes *)
731731+ let link = create_element "a"
732732+ ~attrs:[("href", "/about"); ("class", "nav-link")]
733733+ ()
734734+ ]}
735735+736736+ @see <https://html.spec.whatwg.org/multipage/dom.html#elements-in-the-dom>
737737+ WHATWG: Elements in the DOM *)
738738+val create_element : string -> ?namespace:string option ->
739739+ ?attrs:(string * string) list -> unit -> node
740740+741741+(** Create a text node.
742742+743743+ Text nodes contain the readable text content of HTML documents.
744744+745745+ {b Example:}
746746+ {[
747747+ let text = create_text "Hello, world!"
748748+ ]} *)
262749val create_text : string -> node
263750264264-(** Create a comment node. *)
751751+(** Create a comment node.
752752+753753+ Comments are preserved in the DOM but not rendered. They're written
754754+ as [<!-- text -->] in HTML.
755755+756756+ @see <https://html.spec.whatwg.org/multipage/syntax.html#comments>
757757+ WHATWG: Comments *)
265758val create_comment : string -> node
266759267267-(** Create an empty document node. *)
760760+(** Create an empty document node.
761761+762762+ The Document node is the root of an HTML document tree.
763763+764764+ @see <https://html.spec.whatwg.org/multipage/dom.html#document>
765765+ WHATWG: The Document object *)
268766val create_document : unit -> node
269767270270-(** Create a document fragment node. *)
768768+(** Create a document fragment node.
769769+770770+ Document fragments are lightweight containers for holding nodes
771771+ without a parent document. Used for template contents and fragment
772772+ parsing results.
773773+774774+ @see <https://dom.spec.whatwg.org/#documentfragment>
775775+ DOM Standard: DocumentFragment *)
271776val create_document_fragment : unit -> node
272777273273-(** Create a doctype node. *)
274274-val create_doctype : ?name:string -> ?public_id:string -> ?system_id:string -> unit -> node
778778+(** Create a doctype node.
275779276276-(** Append a child node to a parent. *)
780780+ For HTML5 documents, use [create_doctype ~name:"html" ()].
781781+782782+ @param name DOCTYPE name (usually ["html"])
783783+ @param public_id Public identifier (legacy)
784784+ @param system_id System identifier (legacy)
785785+786786+ @see <https://html.spec.whatwg.org/multipage/syntax.html#the-doctype>
787787+ WHATWG: The DOCTYPE *)
788788+val create_doctype : ?name:string -> ?public_id:string ->
789789+ ?system_id:string -> unit -> node
790790+791791+(** Append a child node to a parent.
792792+793793+ The child is added as the last child of the parent. If the child
794794+ already has a parent, it is first removed from that parent. *)
277795val append_child : node -> node -> unit
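The reparenting rule for `append_child` (detach from the old parent, then append) can be sketched on a toy mutable tree. The record fields and the `toy_` names below are assumptions made for this example and do not reflect the library's internal representation.

```ocaml
(* Toy mutable node; field names are hypothetical, not the library's. *)
type toy = {
  name : string;
  mutable parent : toy option;
  mutable children : toy list;
}

let make name = { name; parent = None; children = [] }

(* Detach [child] from its current parent, if it has one.
   Physical inequality (!=) is used so that nodes are compared by identity. *)
let detach child =
  match child.parent with
  | None -> ()
  | Some p ->
    p.children <- List.filter (fun c -> c != child) p.children;
    child.parent <- None

(* Append [child] as the last child of [parent], removing it from any
   previous parent first. *)
let toy_append_child parent child =
  detach child;
  parent.children <- parent.children @ [ child ];
  child.parent <- Some parent

let a = make "a"
let b = make "b"
let span = make "span"

let () =
  toy_append_child a span;
  (* Appending to [b] moves [span] out of [a]. *)
  toy_append_child b span
```

Comparing by identity matters here: structural equality on records that point back to their parent could loop forever.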
278796279279-(** Insert a node before a reference node. *)
797797+(** Insert a node before a reference node.
798798+799799+ @param parent The parent node
800800+ @param new_child The node to insert
801801+ @param ref_child The existing child to insert before
802802+803803+ Raises [Not_found] if [ref_child] is not a child of [parent]. *)
280804val insert_before : node -> node -> node -> unit
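As a rough model of the `insert_before` contract, the following sketch works on a plain list of children and raises `Not_found` when the reference child is absent. The function name is hypothetical, and structural equality stands in for the node identity check a real DOM would use.

```ocaml
(* Insert [new_child] immediately before [ref_child] in a child list.
   Raises [Not_found] if [ref_child] does not occur in the list.
   Assumption: children are compared with (=); a real DOM compares identity. *)
let rec insert_before_in new_child ref_child = function
  | [] -> raise Not_found
  | c :: rest when c = ref_child -> new_child :: c :: rest
  | c :: rest -> c :: insert_before_in new_child ref_child rest

let children = insert_before_in "b" "c" [ "a"; "c" ]

(* A missing reference child is reported with [Not_found]. *)
let missing_raises =
  try
    ignore (insert_before_in "b" "z" [ "a"; "c" ]);
    false
  with Not_found -> true
```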
281805282282-(** Remove a child node from its parent. *)
806806+(** Remove a child node from its parent.
807807+808808+ Raises [Not_found] if [child] is not a child of [parent]. *)
283809val remove_child : node -> node -> unit
284810285285-(** Get an attribute value. *)
811811+(** Get an attribute value.
812812+813813+ Returns [Some value] if the attribute exists, [None] otherwise.
814814+ Lookup is case-sensitive; the parser lowercases attribute names, so use lowercase names for parsed nodes.
815815+816816+ @see <https://html.spec.whatwg.org/multipage/dom.html#attributes>
817817+ WHATWG: Attributes *)
286818val get_attr : node -> string -> string option
287819288288-(** Set an attribute value. *)
820820+(** Set an attribute value.
821821+822822+ If the attribute exists, it is replaced. If not, it is added. *)
289823val set_attr : node -> string -> string -> unit
290824291825(** Check if a node has an attribute. *)
292826val has_attr : node -> string -> bool
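The get/set/has semantics above can be illustrated with attributes modelled as an association list. This is only a sketch under that assumption; the `toy_` functions are hypothetical and return a new list rather than mutating a node as the real API does.

```ocaml
(* Attributes as (name, value) pairs; an assumption for illustration. *)
type attrs = (string * string) list

(* Case-sensitive lookup: [Some value] if present, [None] otherwise. *)
let toy_get_attr (attrs : attrs) name = List.assoc_opt name attrs

let toy_has_attr (attrs : attrs) name = List.mem_assoc name attrs

(* Replace the value if [name] exists (keeping its position), else append. *)
let toy_set_attr (attrs : attrs) name value : attrs =
  if List.mem_assoc name attrs then
    List.map (fun (k, v) -> if k = name then (k, value) else (k, v)) attrs
  else attrs @ [ (name, value) ]

let link = [ ("href", "/about"); ("class", "nav-link") ]
let link' = toy_set_attr link "href" "/home"
let link'' = toy_set_attr link' "title" "Home"
```

Because lookup is case-sensitive, `toy_get_attr link "HREF"` finds nothing even though `"href"` is present.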
293827294294-(** Get all descendant nodes. *)
828828+(** Get all descendant nodes in document order.
829829+830830+ Returns all nodes below this node in the tree, in the order they
831831+ appear in the HTML source (depth-first, pre-order). *)
295832val descendants : node -> node list
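Document order is a depth-first, pre-order walk: each node is listed before its own children. A minimal sketch on a toy tree (the `toy` type and function names are illustrative assumptions, not the library's types):

```ocaml
(* Toy tree: a name plus children. Not the library's [node] type. *)
type toy = Node of string * toy list

(* All nodes strictly below the given node, in document order. *)
let rec toy_descendants (Node (_, children)) =
  List.concat_map (fun c -> c :: toy_descendants c) children

let name (Node (n, _)) = n

(* Models <div><p>text</p><b></b></div> *)
let tree = Node ("div", [ Node ("p", [ Node ("#text", []) ]); Node ("b", []) ])

let names = List.map name (toy_descendants tree)
```

The root itself is not included, matching the description "all nodes below this node".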
296833297297-(** Get all ancestor nodes (from parent to root). *)
834834+(** Get all ancestor nodes from parent to root.
835835+836836+ Returns the chain of parent nodes, starting with the immediate parent
837837+ and ending with the Document node. *)
298838val ancestors : node -> node list
299839300300-(** Get text content of a node and its descendants. *)
840840+(** Get text content of a node and its descendants.
841841+842842+ For text nodes, returns the text directly. For elements, recursively
843843+ concatenates all descendant text content. *)
301844val get_text_content : node -> string
302845303846(** Clone a node.
304304- @param deep If true, also clone descendants (default: false)
305305-*)
847847+848848+ @param deep If [true], recursively clone all descendants.
849849+ If [false] (default), only clone the node itself. *)
306850val clone : ?deep:bool -> node -> node
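The difference between shallow and deep cloning can be sketched as follows. The record and `toy_clone` are hypothetical, and the sketch assumes a shallow clone has no children (as with `cloneNode(false)` in browsers); the source only says that the node itself is cloned.

```ocaml
(* Toy node; field names are assumptions for illustration. *)
type toy = { tag : string; mutable kids : toy list }

(* Shallow clone: copy the node only (assumed to have no children).
   Deep clone: recursively copy every descendant as well. *)
let rec toy_clone ?(deep = false) n =
  {
    tag = n.tag;
    kids = (if deep then List.map (fun k -> toy_clone ~deep:true k) n.kids else []);
  }

let original = { tag = "ul"; kids = [ { tag = "li"; kids = [] } ] }
let shallow = toy_clone original
let deep = toy_clone ~deep:true original
```

The deep copy is structurally equal to the original but shares no nodes with it, so mutating one leaves the other unchanged.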
307851308308-(** {1 Node Predicates} *)
852852+(** {1 Node Predicates}
309853310310-(** Test if a node is an element. *)
854854+ Functions to test what type of node you have.
855855+*)
856856+857857+(** Test if a node is an element.
858858+859859+ Elements are HTML tags like [<div>], [<p>], [<a>]. *)
311860val is_element : node -> bool
312861313313-(** Test if a node is a text node. *)
862862+(** Test if a node is a text node.
863863+864864+ Text nodes contain character content within elements. *)
314865val is_text : node -> bool
315866316316-(** Test if a node is a comment node. *)
867867+(** Test if a node is a comment node.
868868+869869+ Comment nodes represent HTML comments [<!-- ... -->]. *)
317870val is_comment : node -> bool
318871319319-(** Test if a node is a document node. *)
872872+(** Test if a node is a document node.
873873+874874+ The document node is the root of a complete HTML document tree. *)
320875val is_document : node -> bool
321876322322-(** Test if a node is a document fragment. *)
877877+(** Test if a node is a document fragment.
878878+879879+ Document fragments are lightweight containers for nodes. *)
323880val is_document_fragment : node -> bool
324881325325-(** Test if a node is a doctype node. *)
882882+(** Test if a node is a doctype node.
883883+884884+ Doctype nodes represent the [<!DOCTYPE>] declaration. *)
326885val is_doctype : node -> bool
327886328887(** Test if a node has children. *)
+431-93
lib/parser/html5rw_parser.mli
···33 SPDX-License-Identifier: MIT
44 ---------------------------------------------------------------------------*)
5566-(** HTML5 Parser
66+(** HTML5 Parser - Low-Level API
7788 This module provides the core HTML5 parsing functionality implementing
99- the WHATWG parsing specification. It handles tokenization, tree construction,
99+ the {{:https://html.spec.whatwg.org/multipage/parsing.html} WHATWG
1010+ HTML5 parsing specification}. It handles tokenization, tree construction,
1011 error recovery, and produces a DOM tree.
11121212- For most uses, prefer the top-level {!Html5rw} module which re-exports
1313- these functions with a simpler interface.
1313+ For most uses, prefer the top-level {!Html5rw} module which provides
1414+ a simpler interface. This module is for advanced use cases that need
1515+ access to parser internals.
1616+1717+ {2 How HTML5 Parsing Works}
1818+1919+ The HTML5 parsing algorithm is unusual compared to most parsers. It was
2020+ reverse-engineered from browser behavior rather than designed from a
2121+ formal grammar. This ensures the parser handles malformed HTML exactly
2222+ like web browsers do.
2323+2424+ The algorithm has three main phases:
2525+2626+ {3 1. Encoding Detection}
2727+2828+ Before parsing begins, the character encoding must be determined. The
2929+ WHATWG specification defines a "sniffing" algorithm:
3030+3131+ 1. Check for a BOM (Byte Order Mark) at the start
3232+ 2. Look for [<meta charset="...">] in the first 1024 bytes
3333+ 3. Use HTTP Content-Type header hint if available
3434+ 4. Fall back to UTF-8
3535+3636+ @see <https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding>
3737+ WHATWG: Determining the character encoding
3838+3939+ {3 2. Tokenization}
4040+4141+ The tokenizer converts the input stream into a sequence of tokens.
4242+ It implements a state machine with over 80 states to handle:
4343+4444+ - Data (text content)
4545+ - Tags (start tags, end tags, self-closing tags)
4646+ - Comments
4747+ - DOCTYPEs
4848+ - Character references ([&amp;], [&lt;], [&#60;])
4949+ - CDATA sections (in SVG/MathML)
5050+5151+ The tokenizer has special handling for:
5252+ - {b Raw text elements}: [<script>], [<style>] - no markup parsing inside
5353+ - {b Escapable raw text elements}: [<textarea>], [<title>] - only character references are parsed
5454+ - {b RCDATA}: The tokenizer state used for escapable raw text elements
5555+5656+ @see <https://html.spec.whatwg.org/multipage/parsing.html#tokenization>
5757+ WHATWG: Tokenization
5858+5959+ {3 3. Tree Construction}
6060+6161+ The tree builder receives tokens from the tokenizer and builds the DOM
6262+ tree. It uses {i insertion modes} - a state machine that determines how
6363+ each token should be processed based on the current document context.
6464+6565+ {b Insertion modes} include:
6666+ - [initial]: Before the DOCTYPE
6767+ - [before_html]: Before the [<html>] element
6868+ - [before_head]: Before the [<head>] element
6969+ - [in_head]: Inside [<head>]
7070+ - [in_body]: Inside [<body>] (the most complex mode)
7171+ - [in_table]: Inside [<table>] (special handling)
7272+ - [in_template]: Inside [<template>]
7373+ - And many more...
7474+7575+ The tree builder maintains:
7676+ - {b Stack of open elements}: Elements that have been opened but not closed
7777+ - {b List of active formatting elements}: For handling nested formatting
7878+ - {b The template insertion mode stack}: For [<template>] elements
7979+8080+ @see <https://html.spec.whatwg.org/multipage/parsing.html#tree-construction>
8181+ WHATWG: Tree construction
8282+8383+ {2 Error Recovery}
8484+8585+ A key feature of HTML5 parsing is that it {b never fails}. The specification
8686+ defines error recovery for every possible malformed input. For example:
8888+ - Elements with missing end tags are closed implicitly
8989+ - Misnested tags are handled via the "adoption agency algorithm"
9090+ - Invalid characters are replaced with U+FFFD
9191+ - Unexpected elements are either ignored or moved to valid positions
14921515- {2 Parsing Algorithm}
9393+ This ensures every HTML document produces a valid DOM tree.
16941717- The HTML5 parsing algorithm is defined by the WHATWG specification and
1818- consists of several phases:
9595+ @see <https://html.spec.whatwg.org/multipage/parsing.html#parse-errors>
9696+ WHATWG: Parse errors
19972020- 1. {b Encoding sniffing}: Detect character encoding from BOM, meta tags,
2121- or transport layer hints
2222- 2. {b Tokenization}: Convert the input stream into a sequence of tokens
2323- (start tags, end tags, character data, comments, etc.)
2424- 3. {b Tree construction}: Build the DOM tree using a state machine with
2525- multiple insertion modes
9898+ {2 The Adoption Agency Algorithm}
9999+100100+ One of the most complex parts of HTML5 parsing is handling misnested
101101+ formatting elements. For example:
102102+103103+ {v <p>Hello <b>world</p> <p>more</b> text</p> v}
261042727- The algorithm includes extensive error recovery to handle malformed HTML
2828- in a consistent way across browsers.
105105+ Browsers don't just error out - they use the "adoption agency algorithm"
106106+ to produce sensible results. This algorithm:
107107+ 1. Identifies formatting elements that span across other elements
108108+ 2. Reconstructs the tree to properly nest elements
109109+ 3. Moves nodes between parents as needed
291103030- @see <https://html.spec.whatwg.org/multipage/parsing.html>
3131- The WHATWG HTML Parsing specification
111111+ @see <https://html.spec.whatwg.org/multipage/parsing.html#adoption-agency-algorithm>
112112+ WHATWG: The adoption agency algorithm
32113*)
3311434115(** {1 Sub-modules} *)
35116117117+(** DOM types and manipulation. *)
36118module Dom = Html5rw_dom
119119+120120+(** HTML5 tokenizer.
121121+122122+ The tokenizer implements the tokenization stage of HTML5 parsing, converting
123123+ the decoded input stream into a sequence of tokens (start tags, end tags,
124124+ text, comments, DOCTYPEs).
125125+126126+ @see <https://html.spec.whatwg.org/multipage/parsing.html#tokenization>
127127+ WHATWG: Tokenization *)
37128module Tokenizer = Html5rw_tokenizer
129129+130130+(** Character encoding detection and conversion.
131131+132132+ @see <https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding>
133133+ WHATWG: Determining the character encoding *)
38134module Encoding = Html5rw_encoding
135135+136136+(** HTML element constants and categories.
137137+138138+ This module provides lists of element names that have special handling
139139+ in the HTML5 parser:
140140+141141+ - {b Void elements}: Elements that cannot have children and have no end
142142+ tag ([area], [base], [br], [col], [embed], [hr], [img], [input],
143143+ [link], [meta], [source], [track], [wbr])
144144+145145+ - {b Formatting elements}: Elements tracked in the list of active
146146+ formatting elements for the adoption agency algorithm ([a], [b], [big],
147147+ [code], [em], [font], [i], [nobr], [s], [small], [strike], [strong],
148148+ [tt], [u])
149149+150150+ - {b Special elements}: Elements with special parsing rules that affect
151151+ scope and formatting reconstruction
152152+153153+ @see <https://html.spec.whatwg.org/multipage/syntax.html#void-elements>
154154+ WHATWG: Void elements
155155+ @see <https://html.spec.whatwg.org/multipage/parsing.html#formatting>
156156+ WHATWG: Formatting elements *)
39157module Constants : sig
40158 val void_elements : string list
159159+ (** Elements that cannot have children: [area], [base], [br], [col],
160160+ [embed], [hr], [img], [input], [link], [meta], [source], [track], [wbr].
161161+162162+ @see <https://html.spec.whatwg.org/multipage/syntax.html#void-elements>
163163+ WHATWG: Void elements *)
164164+41165 val formatting_elements : string list
166166+ (** Elements tracked for the adoption agency algorithm: [a], [b], [big],
167167+ [code], [em], [font], [i], [nobr], [s], [small], [strike], [strong],
168168+ [tt], [u].
169169+170170+ @see <https://html.spec.whatwg.org/multipage/parsing.html#formatting>
171171+ WHATWG: Formatting elements *)
172172+42173 val special_elements : string list
174174+ (** Elements with special parsing behavior that affect scope checking.
175175+176176+ @see <https://html.spec.whatwg.org/multipage/parsing.html#special>
177177+ WHATWG: Special elements *)
43178end
179179+180180+(** Parser insertion modes.
181181+182182+ Insertion modes are the states of the tree construction state machine.
183183+ They determine how each token from the tokenizer should be processed
184184+ based on the current document context.
185185+186186+ For example, a [<td>] tag is handled differently depending on whether
187187+ the parser is currently in a table context or in the body.
188188+189189+ @see <https://html.spec.whatwg.org/multipage/parsing.html#insertion-mode>
190190+ WHATWG: Insertion mode *)
44191module Insertion_mode : sig
45192 type t
193193+ (** The insertion mode type. Values include modes like [initial],
194194+ [before_html], [in_head], [in_body], [in_table], etc. *)
46195end
196196+197197+(** Tree builder state.
198198+199199+ The tree builder maintains the state needed for tree construction:
200200+ - Stack of open elements
201201+ - List of active formatting elements
202202+ - Template insertion mode stack
203203+ - Current insertion mode
204204+ - Foster parenting flag
205205+206206+ @see <https://html.spec.whatwg.org/multipage/parsing.html#tree-construction>
207207+ WHATWG: Tree construction *)
47208module Tree_builder : sig
48209 type t
210210+ (** The tree builder state. *)
49211end
5021251213(** {1 Types} *)
5221453215(** A parse error encountered during parsing.
542165555- HTML5 parsing never fails - it always produces a DOM tree. However,
5656- the specification defines many error conditions that conformance
5757- checkers should report. Error collection is optional and disabled
5858- by default for performance.
217217+ HTML5 parsing {b never fails} - it always produces a DOM tree. However,
218218+ the WHATWG specification defines 92 specific error conditions that
219219+ conformance checkers should report. These errors indicate malformed
220220+ HTML that browsers will still render (with error recovery).
221221+222222+ {b Error categories:}
223223+224224+ {i Tokenizer errors} (detected during tokenization):
226226+ - [abrupt-closing-of-empty-comment]: Empty comment closed abruptly, as in [<!-->] or [<!--->]
226226+ - [abrupt-doctype-public-identifier]: DOCTYPE public ID ended unexpectedly
227227+ - [eof-before-tag-name]: End of file while reading a tag name
228228+ - [eof-in-tag]: End of file inside a tag
229229+ - [missing-attribute-value]: Attribute has [=] but no value
230230+ - [unexpected-null-character]: Null byte in the input
231231+ - [unexpected-question-mark-instead-of-tag-name]: [<?] where a tag name was expected (e.g., an XML processing instruction)
232232+233233+ {i Tree construction errors} (detected during tree building):
234234+ - [missing-doctype]: No DOCTYPE before first element
235235+ - [unexpected-token-*]: Token appeared in wrong context
236236+ - [foster-parenting]: Content moved outside table due to invalid position
592376060- Error codes follow the WHATWG specification naming convention,
6161- e.g., "unexpected-null-character", "eof-in-tag".
238238+ Enable error collection with [~collect_errors:true]. Error collection
239239+ has some performance overhead, so it's disabled by default.
6224063241 @see <https://html.spec.whatwg.org/multipage/parsing.html#parse-errors>
6464- The list of HTML5 parse errors
6565-*)
242242+ WHATWG: Complete list of parse errors *)
66243type parse_error
6724468245(** Get the error code string.
692467070- Error codes are lowercase with hyphens, matching the WHATWG spec names
7171- like "unexpected-null-character" or "eof-before-tag-name".
7272-*)
247247+ Error codes are lowercase with hyphens, exactly matching the WHATWG
248248+ specification naming. Examples:
249249+ - ["unexpected-null-character"]
250250+ - ["eof-before-tag-name"]
251251+ - ["missing-end-tag-name"]
252252+ - ["duplicate-attribute"]
253253+ - ["missing-doctype"]
254254+255255+ @see <https://html.spec.whatwg.org/multipage/parsing.html#parse-errors>
256256+ WHATWG: Parse error codes *)
73257val error_code : parse_error -> string
742587575-(** Get the line number where the error occurred (1-indexed). *)
259259+(** Get the line number where the error occurred.
260260+261261+ Line numbers are 1-indexed (first line is 1). Line breaks are
262262+ detected at LF (U+000A), CR (U+000D), and CR+LF sequences. *)
76263val error_line : parse_error -> int
772647878-(** Get the column number where the error occurred (1-indexed). *)
265265+(** Get the column number where the error occurred.
266266+267267+ Column numbers are 1-indexed (first column is 1). Columns reset
268268+ to 1 after each line break. Column counting uses code points,
269269+ not bytes or grapheme clusters. *)
79270val error_column : parse_error -> int
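The line and column rules above (1-indexed, CR, LF and CR+LF each counting as one break, columns counted in code points) can be sketched as a position calculator. This is not the library's code: `position` is a hypothetical helper, it assumes UTF-8 input, and it needs OCaml 4.14 or later for `String.get_utf_8_uchar`.

```ocaml
(* Return the 1-indexed (line, column) of byte offset [target] in the
   UTF-8 string [s]. Columns count code points, not bytes. *)
let position s target =
  let line = ref 1 and col = ref 1 in
  let i = ref 0 in
  let prev_cr = ref false in
  while !i < target && !i < String.length s do
    let d = String.get_utf_8_uchar s !i in
    let u = Uchar.to_int (Uchar.utf_decode_uchar d) in
    (match u with
     | 0x0D ->
       incr line;
       col := 1;
       prev_cr := true
     | 0x0A ->
       (* An LF directly after a CR belongs to the same CR+LF break. *)
       if not !prev_cr then (incr line; col := 1);
       prev_cr := false
     | _ ->
       incr col;
       prev_cr := false);
    i := !i + Uchar.utf_decode_length d
  done;
  (!line, !col)

(* "b" follows a CR+LF pair, so it starts line 2. *)
let after_crlf = position "a\r\nb" 3

(* "é" is two bytes but one code point, so the first "l" is in column 3. *)
let after_accent = position "h\xc3\xa9llo" 3
```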
8027181272(** Context element for HTML fragment parsing.
822738383- When parsing an HTML fragment (innerHTML), you need to specify the
8484- context element that would contain the fragment. This affects how
8585- the parser handles certain elements.
274274+ When parsing HTML fragments (the content that would be assigned to
275275+ an element's [innerHTML]), the parser needs to know what element
276276+ would contain the fragment. This affects parsing in several ways:
862778787- For example, parsing [<td>] as a fragment of a [<tr>] works differently
8888- than parsing it as a fragment of a [<div>].
278278+ {b Parser state initialization:}
279279+ - For [<title>] or [<textarea>]: Tokenizer starts in RCDATA state
280280+ - For [<style>], [<xmp>], [<iframe>], [<noembed>], [<noframes>]:
281281+ Tokenizer starts in RAWTEXT state
282282+ - For [<script>]: Tokenizer starts in script data state
283283+ - For [<noscript>]: Tokenizer starts in RAWTEXT state (if scripting enabled)
284284+ - For [<plaintext>]: Tokenizer starts in PLAINTEXT state
285285+ - Otherwise: Tokenizer starts in data state
286286+287287+ {b Insertion mode:}
288288+ The initial insertion mode depends on the context element:
289289+ - [<template>]: "in template" mode
290290+ - [<html>]: "before head" mode
291291+ - [<head>]: "in head" mode
292292+ - [<body>], [<div>], etc.: "in body" mode
293293+ - [<table>]: "in table" mode
294294+ - And so on...
8929590296 @see <https://html.spec.whatwg.org/multipage/parsing.html#parsing-html-fragments>
9191- The HTML fragment parsing algorithm
9292-*)
297297+ WHATWG: The fragment parsing algorithm *)
93298type fragment_context
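The tokenizer-state initialization listed above amounts to a lookup on the context element's tag name. The sketch below mirrors that list for HTML-namespace context elements; the variant and function names are assumptions for illustration and are not the library's internal types.

```ocaml
(* Initial tokenizer states relevant to fragment parsing (hypothetical names). *)
type tokenizer_state = Data | Rcdata | Rawtext | Script_data | Plaintext

(* Choose the initial tokenizer state from the context element's tag name.
   [scripting] models the scripting flag that affects <noscript>. *)
let initial_state ?(scripting = true) tag =
  match tag with
  | "title" | "textarea" -> Rcdata
  | "style" | "xmp" | "iframe" | "noembed" | "noframes" -> Rawtext
  | "script" -> Script_data
  | "noscript" when scripting -> Rawtext
  | "plaintext" -> Plaintext
  | _ -> Data
```

For example, a `textarea` context starts in RCDATA, while a `div` or `tr` context starts in the ordinary data state.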
9429995300(** Create a fragment parsing context.
963019797- @param tag_name The tag name of the context element (e.g., "div", "tr")
9898- @param namespace Namespace: [None] for HTML, [Some "svg"], [Some "mathml"]
302302+ @param tag_name Tag name of the context element. This should be the
303303+ tag name of the element that would contain the fragment.
304304+ Common choices:
305305+ - ["div"]: General-purpose (most common)
306306+ - ["body"]: For full body content
307307+ - ["tr"]: For table row content ([<td>] elements)
308308+ - ["ul"], ["ol"]: For list content ([<li>] elements)
309309+ - ["select"]: For [<option>] elements
99310311311+ @param namespace Element namespace:
312312+ - [None]: HTML namespace (default)
313313+ - [Some "svg"]: SVG namespace
314314+ - [Some "mathml"]: MathML namespace
315315+316316+ {b Examples:}
100317 {[
101101- (* Parse as innerHTML of a table row *)
318318+ (* Parse innerHTML of a table row - <td> works correctly *)
102319 let ctx = make_fragment_context ~tag_name:"tr" ()
103320104104- (* Parse as innerHTML of an SVG element *)
321321+ (* Parse innerHTML of an SVG group element *)
105322 let ctx = make_fragment_context ~tag_name:"g" ~namespace:(Some "svg") ()
323323+324324+ (* Parse innerHTML of a select element - <option> works correctly *)
325325+ let ctx = make_fragment_context ~tag_name:"select" ()
106326 ]}
107107-*)
327327+328328+ @see <https://html.spec.whatwg.org/multipage/parsing.html#parsing-html-fragments>
329329+ WHATWG: Fragment parsing algorithm *)
108330val make_fragment_context : tag_name:string -> ?namespace:string option ->
109331 unit -> fragment_context
110332111333(** Get the tag name of a fragment context. *)
112334val fragment_context_tag : fragment_context -> string
113335114114-(** Get the namespace of a fragment context. *)
336336+(** Get the namespace of a fragment context ([None] for HTML). *)
115337val fragment_context_namespace : fragment_context -> string option
116338117339(** Result of parsing an HTML document or fragment.
118340119119- Contains the parsed DOM tree, any errors encountered (if error
120120- collection was enabled), and the detected encoding (for byte input).
341341+ This opaque type contains:
342342+ - The DOM tree (access via {!root})
343343+ - Parse errors if collection was enabled (access via {!errors})
344344+ - Detected encoding for byte input (access via {!encoding})
121345*)
122346type t
123347124348(** {1 Parsing Functions} *)
125349350350+(** Parse HTML from a byte stream reader.
351351+352352+ This function implements the complete HTML5 parsing algorithm:
353353+354354+ 1. Reads bytes from the provided reader
355355+ 2. Tokenizes the input into HTML tokens
356356+ 3. Constructs a DOM tree using the tree construction algorithm
357357+ 4. Returns the parsed result
358358+359359+ The input should be valid UTF-8. For automatic encoding detection
360360+ from raw bytes, use {!parse_bytes} instead.
361361+362362+ {b Parser behavior:}
363363+364364+ For {b full document parsing} (no fragment context), the parser:
365365+ - Creates a Document node as the root
366366+ - Processes any DOCTYPE declaration
367367+ - Creates [<html>], [<head>], and [<body>] elements as needed
368368+ - Builds the full document tree
369369+370370+ For {b fragment parsing} (with fragment context), the parser:
371371+ - Creates a Document Fragment as the root
372372+ - Initializes tokenizer state based on context element
373373+ - Initializes insertion mode based on context element
374374+ - Does not create implicit [<html>], [<head>], [<body>]
375375+376376+ @param collect_errors If [true], collect parse errors in the result.
377377+ Default: [false]. Enabling error collection adds overhead.
378378+ @param fragment_context Context for fragment parsing. If provided,
379379+ the input is parsed as fragment content (like innerHTML).
380380+381381+ @see <https://html.spec.whatwg.org/multipage/parsing.html>
382382+ WHATWG: HTML parsing *)
126383val parse : ?collect_errors:bool -> ?fragment_context:fragment_context ->
127384 Bytesrw.Bytes.Reader.t -> t
128128-(** Parse HTML from a byte stream reader.
385385+386386+(** Parse HTML bytes with automatic encoding detection.
387387+388388+ This function wraps {!parse} with encoding detection, implementing the
389389+ WHATWG encoding sniffing algorithm:
390390+391391+ {b Detection order:}
392392+ 1. {b BOM}: Check first 2-3 bytes for UTF-8, UTF-16LE, or UTF-16BE BOM
393393+ 2. {b Prescan}: Look for [<meta charset="...">] or
394394+ [<meta http-equiv="Content-Type" content="...charset=...">]
395395+ in the first 1024 bytes
396396+ 3. {b Transport hint}: Use [transport_encoding] if provided
397397+ 4. {b Fallback}: Use UTF-8
398398+399399+ The detected encoding is stored in the result (access via {!encoding}).
129400130130- This is the primary parsing function. The input must be valid UTF-8
131131- (or will be converted from detected encoding when using {!parse_bytes}).
401401+ {b Prescan details:}
132402133133- @param collect_errors If [true], collect parse errors (default: [false])
134134- @param fragment_context Context for fragment parsing (innerHTML)
403403+ The prescan algorithm parses just enough of the document to find a
404404+ charset declaration. It handles:
405405+ - [<meta charset="utf-8">]
406406+ - [<meta http-equiv="Content-Type" content="text/html; charset=utf-8">]
407407+ - Comments and other markup are skipped
408408+ - Parsing stops after 1024 bytes
135409136136- {[
137137- open Bytesrw
138138- let reader = Bytes.Reader.of_string "<p>Hello</p>" in
139139- let result = parse reader
140140- ]}
141141-*)
410410+ @param collect_errors If [true], collect parse errors. Default: [false].
411411+ @param transport_encoding Encoding hint from HTTP Content-Type header.
412412+ For example: ["utf-8"], ["iso-8859-1"], ["windows-1252"].
413413+ @param fragment_context Context for fragment parsing.
142414415415+ @see <https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding>
416416+ WHATWG: Determining the character encoding
417417+ @see <https://html.spec.whatwg.org/multipage/parsing.html#prescan-a-byte-stream-to-determine-its-encoding>
418418+ WHATWG: Prescan algorithm *)
143419val parse_bytes : ?collect_errors:bool -> ?transport_encoding:string ->
144420 ?fragment_context:fragment_context -> bytes -> t
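The first detection step, the BOM check, compares the leading two or three bytes against three fixed signatures. A minimal sketch follows; the `bom` type and `detect_bom` are hypothetical names, and this is not the library's `Encoding` module.

```ocaml
(* Encodings that can be identified by a byte order mark. *)
type bom = Utf8 | Utf16le | Utf16be

(* Inspect the first bytes of [b] for a BOM.
   UTF-8: EF BB BF, UTF-16LE: FF FE, UTF-16BE: FE FF. *)
let detect_bom (b : bytes) : bom option =
  let byte i = Char.code (Bytes.get b i) in
  let n = Bytes.length b in
  if n >= 3 && byte 0 = 0xEF && byte 1 = 0xBB && byte 2 = 0xBF then Some Utf8
  else if n >= 2 && byte 0 = 0xFF && byte 1 = 0xFE then Some Utf16le
  else if n >= 2 && byte 0 = 0xFE && byte 1 = 0xFF then Some Utf16be
  else None

let with_utf8_bom = detect_bom (Bytes.of_string "\xEF\xBB\xBF<p>Hi</p>")
let without_bom = detect_bom (Bytes.of_string "<p>Hi</p>")
```

When no BOM is found, detection would move on to the prescan and transport-hint steps described above.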
145145-(** Parse HTML bytes with automatic encoding detection.
146421147147- Implements the WHATWG encoding sniffing algorithm:
148148- 1. Check for BOM (UTF-8, UTF-16LE, UTF-16BE)
149149- 2. Prescan for [<meta charset>] declaration
150150- 3. Use transport encoding hint if provided
151151- 4. Fall back to UTF-8
422422+(** {1 Result Accessors} *)
423423+424424+(** Get the root node of the parsed document.
152425153153- @param collect_errors If [true], collect parse errors (default: [false])
154154- @param transport_encoding Encoding from HTTP Content-Type header
155155- @param fragment_context Context for fragment parsing (innerHTML)
156156-*)
426426+ For full document parsing, returns a Document node with structure:
427427+ {v
428428+ #document
429429+ ├── !doctype (if DOCTYPE was present)
430430+ └── html
431431+ ├── head
432432+ │ └── ... (title, meta, link, script, style)
433433+ └── body
434434+ └── ... (page content)
435435+ v}
157436158158-(** {1 Result Accessors} *)
437437+ For fragment parsing, returns a Document Fragment node containing
438438+ the parsed elements directly (no implicit html/head/body).
159439440440+ @see <https://html.spec.whatwg.org/multipage/dom.html#document>
441441+ WHATWG: The Document object *)
160442val root : t -> Dom.node
161161-(** Get the root node of the parsed document.
162443163163- For full document parsing, this is a document node.
164164- For fragment parsing, this is a document fragment node.
165165-*)
444444+(** Get parse errors collected during parsing.
445445+446446+ Returns an empty list if error collection was not enabled
447447+ ([~collect_errors:false] or omitted) or if no parse errors occurred.
448448+449449+ Errors are returned in the order they were encountered.
166450451451+ {b Example:}
452452+ {[
453453+ let result = parse ~collect_errors:true reader in
454454+ List.iter (fun e ->
455455+     Printf.printf "Line %d, col %d: %s\n"
456456+       (error_line e) (error_column e) (error_code e)
457457+   ) (errors result)
458458+ ]}
459459+460460+ @see <https://html.spec.whatwg.org/multipage/parsing.html#parse-errors>
461461+ WHATWG: Parse errors *)
167462val errors : t -> parse_error list
168168-(** Get parse errors (empty if error collection was disabled). *)
463463+464464+(** Get the detected encoding.
465465+466466+ Returns [Some encoding] when {!parse_bytes} was used, indicating which
467467+ encoding was detected or specified.
468468+469469+ Returns [None] when {!parse} was used (it expects pre-decoded UTF-8).
169470471471+ @see <https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding>
472472+ WHATWG: Determining the character encoding *)
170473val encoding : t -> Encoding.encoding option
171171-(** Get the detected encoding (only set when using {!parse_bytes}). *)
172474173475(** {1 Querying} *)
174476175175-val query : t -> string -> Dom.node list
176477(** Query the DOM with a CSS selector.
177478479479+ Returns all elements matching the selector in document order.
480480+481481+ {b Supported selectors:}
482482+483483+ See {!Html5rw_selector} for the complete list. Key selectors include:
484484+ - Type: [div], [p], [a]
485485+ - ID: [#myid]
486486+ - Class: [.myclass]
487487+ - Attribute: [[href]], [[type="text"]]
488488+ - Pseudo-class: [:first-child], [:nth-child(2)]
489489+ - Combinators: [div p] (descendant), [div > p] (child)
490490+178491 @raise Html5rw_selector.Selector_error if the selector is invalid
179492180180- See {!Html5rw_selector} for supported selector syntax.
181181-*)
493493+ @see <https://www.w3.org/TR/selectors-4/>
494494+ W3C: Selectors Level 4 *)
495495+val query : t -> string -> Dom.node list
182496183497(** {1 Serialization} *)
184498499499+(** Serialize the DOM tree to a byte writer.
500500+501501+ Outputs valid HTML5 that can be parsed to produce an equivalent DOM tree.
502502+ The output follows the WHATWG serialization algorithm.
503503+504504+ {b Serialization rules:}
505505+ - Void elements are written without end tags
506506+ - Attributes are quoted with double quotes
507507+ - Special characters in text/attributes are escaped
508508+ - Comments preserve their content
509509+ - DOCTYPE is serialized as [<!DOCTYPE html>]
510510+511511+ @param pretty If [true] (default), add indentation for readability.
512512+ @param indent_size Spaces per indent level (default: 2).
513513+514514+ @see <https://html.spec.whatwg.org/multipage/parsing.html#serialising-html-fragments>
515515+ WHATWG: Serialising HTML fragments *)
185516val to_writer : ?pretty:bool -> ?indent_size:int -> t ->
186517 Bytesrw.Bytes.Writer.t -> unit
187187-(** Serialize the DOM tree to a byte stream writer.
188518189189- @param pretty If [true], format with indentation (default: [true])
190190- @param indent_size Spaces per indent level (default: [2])
191191-*)
519519+(** Serialize the DOM tree to a string.
192520521521+ Convenience wrapper around {!to_writer} that returns a string.
522522+523523+ @param pretty If [true] (default), add indentation for readability.
524524+ @param indent_size Spaces per indent level (default: 2). *)
193525val to_string : ?pretty:bool -> ?indent_size:int -> t -> string
194194-(** Serialize the DOM tree to a string.
195526196196- @param pretty If [true], format with indentation (default: [true])
197197- @param indent_size Spaces per indent level (default: [2])
198198-*)
199199-200200-val to_text : ?separator:string -> ?strip:bool -> t -> string
201527(** Extract text content from the DOM tree.
202528203203- @param separator String between text nodes (default: [" "])
204204- @param strip If [true], trim whitespace (default: [true])
205205-*)
529529+ Returns the concatenation of all text node content in document order,
530530+ with no HTML markup.
206531207207-val to_test_format : t -> string
532532+ @param separator String to insert between text nodes (default: [" "]).
533533+ @param strip If [true] (default), trim leading/trailing whitespace. *)
534534+val to_text : ?separator:string -> ?strip:bool -> t -> string
535535+208536(** Serialize to html5lib test format.
209537210210- This format is used by the html5lib test suite and shows the tree
211211- structure with indentation and node type prefixes.
212212-*)
538538+ This produces the tree representation format used by the
539539+ {{:https://github.com/html5lib/html5lib-tests} html5lib-tests} suite.
540540+541541+ The format shows the tree structure with:
542542+ - Indentation indicating depth (2 spaces per level)
543543+ - Prefixes indicating node type:
544544+   - [<!DOCTYPE ...>] for DOCTYPE
545545+   - [<tagname>] for elements (with attributes on same line)
546546+   - ["text"] for text nodes
547547+   - [<!-- comment -->] for comments
548548+549549+ Mainly useful for testing the parser against the reference test suite. *)
550550+val to_test_format : t -> string