···11+(*---------------------------------------------------------------------------
22+ Copyright (c) 2025 Anil Madhavapeddy <anil@recoil.org>. All rights reserved.
33+ SPDX-License-Identifier: MIT
44+ ---------------------------------------------------------------------------*)
55+66+(** HTML5 DOM Types and Operations
77+88+ This module provides the DOM (Document Object Model) node representation
99+ used by the HTML5 parser. The DOM is a programming interface that
1010+ represents an HTML document as a tree of nodes, where each node represents
1111+ part of the document (an element, text content, comment, etc.).
1212+1313+ {2 What is the DOM?}
1414+1515+ When an HTML parser processes markup like [<p>Hello <b>world</b></p>], it
1616+ doesn't store the text directly. Instead, it builds a tree structure in
1717+ memory:
1818+1919+ {v
2020+ Document
2121+ └── html
2222+ └── body
2323+ └── p
2424+ ├── #text "Hello "
2525+ └── b
2626+ └── #text "world"
2727+ v}
2828+2929+ This tree is the DOM. Each box in the tree is a {i node}. Programs can
3030+ traverse and modify this tree to read or change the document.
3131+3232+ @see <https://html.spec.whatwg.org/multipage/dom.html>
3333+ WHATWG: The elements of HTML (DOM chapter)
3434+3535+ {2 Node Types}
3636+3737+ The HTML5 DOM includes several node types, all represented by the same
3838+ record type with different field usage:
3939+4040+ - {b Element nodes}: HTML elements like [<div>], [<p>], [<a href="...">].
4141+ Elements are the building blocks of HTML documents. They can have
4242+ attributes and contain other nodes.
4343+4444+ - {b Text nodes}: The actual text content within elements. For example,
4545+ in [<p>Hello</p>], "Hello" is a text node that is a child of the [<p>]
4646+ element.
4747+4848+ - {b Comment nodes}: HTML comments written as [<!-- comment text -->].
4949+ Comments are preserved in the DOM but not rendered.
5050+5151+ - {b Document nodes}: The root of the entire document tree. Every HTML
5252+ document has exactly one Document node at the top.
5353+5454+ - {b Document fragment nodes}: Lightweight containers that hold a
5555+ collection of nodes without a parent. Used for efficient batch DOM
5656+ operations and [<template>] element contents.
5757+5858+ - {b Doctype nodes}: The [<!DOCTYPE html>] declaration at the start of
5959+ HTML5 documents. This declaration tells browsers to render the page
6060+ in standards mode.
6161+6262+ @see <https://html.spec.whatwg.org/multipage/dom.html#kinds-of-content>
6363+ WHATWG: Kinds of content
6464+6565+ {2 Namespaces}
6666+6767+ HTML5 can embed content from other XML vocabularies. Elements belong to
6868+ one of three {i namespaces}:
6969+7070+ - {b HTML namespace} ([None] or implicit): Standard HTML elements like
7171+ [<div>], [<p>], [<table>]. This is the default for all elements.
7272+7373+ - {b SVG namespace} ([Some "svg"]): Scalable Vector Graphics for drawing.
7474+ When the parser encounters an [<svg>] tag, all elements inside it
7575+ (like [<rect>], [<circle>], [<path>]) are placed in the SVG namespace.
7676+7777+ - {b MathML namespace} ([Some "mathml"]): Mathematical Markup Language
7878+ for equations. When the parser encounters a [<math>] tag, elements
7979+ inside it are placed in the MathML namespace.
8080+8181+ The parser automatically switches namespaces when entering and leaving
8282+ these foreign content islands.
8383+8484+ @see <https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inforeign>
8585+ WHATWG: Parsing foreign content
8686+8787+ {2 Tree Structure}
8888+8989+ Nodes form a bidirectional tree: each node has a list of children and
9090+ an optional parent reference. Modification functions in this module
9191+ maintain these references automatically.
9292+9393+ The tree is always well-formed: a node can only have one parent, and
9494+ circular references are not possible.
9595+*)
9696+9797+(** {1 Types} *)
9898+9999+(** Information associated with a DOCTYPE node.
100100+101101+ The {i document type declaration} (DOCTYPE) tells browsers what version
102102+ of HTML the document uses. In HTML5, the standard declaration is simply:
103103+104104+ {v <!DOCTYPE html> v}
105105+106106+ This minimal DOCTYPE triggers {i standards mode} (no quirks). The DOCTYPE
107107+ can optionally include a public identifier and system identifier for
108108+ legacy compatibility with SGML-based tools, but these are rarely used
109109+ in modern HTML5 documents.
110110+111111+ {b Historical context:} In HTML4 and XHTML, DOCTYPEs were verbose and
112112+ referenced DTD files. For example:
113113+ {v <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
114114+ "http://www.w3.org/TR/html4/strict.dtd"> v}
115115+116116+ HTML5 simplified this to just [<!DOCTYPE html>] because:
117117+ - Browsers never actually fetched or validated against DTDs
118118+ - The DOCTYPE's only real purpose is triggering standards mode
119119+ - A minimal DOCTYPE achieves this goal
120120+121121+ {b Field meanings:}
122122+ - [name]: The document type name, almost always ["html"] for HTML documents
123123+ - [public_id]: A public identifier (legacy); [None] for HTML5
124124+ - [system_id]: A system identifier/URL (legacy); [None] for HTML5
125125+126126+ @see <https://html.spec.whatwg.org/multipage/syntax.html#the-doctype>
127127+ WHATWG: The DOCTYPE
128128+ @see <https://html.spec.whatwg.org/multipage/parsing.html#the-initial-insertion-mode>
129129+ WHATWG: DOCTYPE handling during parsing
130130+*)
131131+type doctype_data = Node.doctype_data = {
132132+ name : string option; (** The DOCTYPE name, e.g., "html" *)
133133+ public_id : string option; (** Public identifier (legacy, rarely used) *)
134134+ system_id : string option; (** System identifier (legacy, rarely used) *)
135135+}
136136+137137+(** Quirks mode setting for the document.
138138+139139+ {i Quirks mode} is a browser rendering mode that emulates bugs and
140140+ non-standard behaviors from older browsers (primarily Internet Explorer 5).
141141+ Modern HTML5 documents should always render in {i standards mode}
142142+ (no quirks) for consistent, predictable behavior.
143143+144144+ The HTML5 parser determines quirks mode based on the DOCTYPE declaration:
145145+146146+ - {b No_quirks} (Standards mode): The document renders according to modern
147147+ HTML5 and CSS specifications. This is triggered by [<!DOCTYPE html>].
148148+ CSS box model, table layout, and other features work as specified.
149149+150150+ - {b Quirks} (Full quirks mode): The document renders with legacy browser
151151+ bugs emulated. This happens when:
152152+ {ul
153153+ {- DOCTYPE is missing entirely}
154154+ {- DOCTYPE has certain legacy public identifiers}
155155+ {- DOCTYPE has the wrong format}}
156156+157157+ In quirks mode, many CSS properties behave differently:
158158+ {ul
159159+ {- Tables don't inherit font properties}
160160+ {- Box model uses non-standard width calculations}
161161+ {- Certain CSS selectors don't work correctly}}
162162+163163+ - {b Limited_quirks} (Almost standards mode): A middle ground that applies
164164+ only a few specific quirks, primarily affecting table cell vertical
165165+ sizing. Triggered by XHTML DOCTYPEs and certain HTML4 DOCTYPEs.
166166+167167+ {b Recommendation:} Always use [<!DOCTYPE html>] at the start of HTML5
168168+ documents to ensure {b No_quirks} mode.
169169+170170+ @see <https://quirks.spec.whatwg.org/>
171171+ Quirks Mode Standard - detailed specification
172172+ @see <https://html.spec.whatwg.org/multipage/parsing.html#the-initial-insertion-mode>
173173+ WHATWG: How the parser determines quirks mode
174174+*)
175175+type quirks_mode = Node.quirks_mode = No_quirks | Quirks | Limited_quirks
176176+177177+(** A DOM node in the parsed document tree.
178178+179179+ All node types use the same record structure. The [name] field determines
180180+ the node type:
181181+ - Element: the tag name (e.g., "div", "p", "span")
182182+ - Text: "#text"
183183+ - Comment: "#comment"
184184+ - Document: "#document"
185185+ - Document fragment: "#document-fragment"
186186+ - Doctype: "!doctype"
187187+188188+ {3 Understanding Node Fields}
189189+190190+ Different node types use different combinations of fields:
191191+192192+ {v
193193+ Node Type | name | namespace | attrs | data | template_content | doctype
194194+ ------------------|------------------|-----------|-------|------|------------------|--------
195195+ Element | tag name | Yes | Yes | No | If <template> | No
196196+ Text | "#text" | No | No | Yes | No | No
197197+ Comment | "#comment" | No | No | Yes | No | No
198198+ Document | "#document" | No | No | No | No | No
199199+ Document Fragment | "#document-frag" | No | No | No | No | No
200200+ Doctype | "!doctype" | No | No | No | No | Yes
201201+ v}
202202+203203+ {3 Element Tag Names}
204204+205205+ For element nodes, the [name] field contains the lowercase tag name.
206206+ HTML5 defines many elements with specific meanings:
207207+208208+ {b Structural elements:} [html], [head], [body], [header], [footer],
209209+ [main], [nav], [article], [section], [aside]
210210+211211+ {b Text content:} [p], [div], [span], [h1]-[h6], [pre], [blockquote]
212212+213213+ {b Lists:} [ul], [ol], [li], [dl], [dt], [dd]
214214+215215+ {b Tables:} [table], [tr], [td], [th], [thead], [tbody], [tfoot]
216216+217217+ {b Forms:} [form], [input], [button], [select], [textarea], [label]
218218+219219+ {b Media:} [img], [audio], [video], [canvas], [svg]
220220+221221+ @see <https://html.spec.whatwg.org/multipage/indices.html#elements-3>
222222+ WHATWG: Index of HTML elements
223223+224224+ {3 Void Elements}
225225+226226+ Some elements are {i void elements} - they cannot have children and have
227227+ no end tag. These include: [area], [base], [br], [col], [embed], [hr],
228228+ [img], [input], [link], [meta], [source], [track], [wbr].
229229+230230+ @see <https://html.spec.whatwg.org/multipage/syntax.html#void-elements>
231231+ WHATWG: Void elements
232232+233233+ {3 The Template Element}
234234+235235+ The [<template>] element is special: its children are not rendered
236236+ directly but stored in a separate document fragment accessible via
237237+ the [template_content] field. Templates are used for client-side
238238+ templating where content is cloned and inserted via JavaScript.
239239+240240+ @see <https://html.spec.whatwg.org/multipage/scripting.html#the-template-element>
241241+ WHATWG: The template element
242242+*)
243243+type node = Node.node = {
244244+ mutable name : string;
245245+ (** Tag name for elements, or special name for other node types.
246246+247247+ For elements, this is the lowercase tag name (e.g., "div", "span").
248248+ For other node types, use the constants {!document_name},
249249+ {!text_name}, {!comment_name}, etc. *)
250250+251251+ mutable namespace : string option;
252252+ (** Element namespace: [None] for HTML, [Some "svg"], [Some "mathml"].
253253+254254+ Most elements are in the HTML namespace ([None]). The SVG and MathML
255255+ namespaces are only used when content appears inside [<svg>] or
256256+ [<math>] elements respectively.
257257+258258+ @see <https://html.spec.whatwg.org/multipage/dom.html#elements-in-the-dom>
259259+ WHATWG: Elements in the DOM *)
260260+261261+ mutable attrs : (string * string) list;
262262+ (** Element attributes as (name, value) pairs.
263263+264264+ Attributes provide additional information about elements. Common
265265+ global attributes include:
266266+ - [id]: Unique identifier for the element
267267+ - [class]: Space-separated list of CSS class names
268268+ - [style]: Inline CSS styles
269269+ - [title]: Advisory text (shown as tooltip)
270270+ - [lang]: Language of the element's content
271271+ - [hidden]: Whether the element should be hidden
272272+273273+ Element-specific attributes include:
274274+ - [href] on [<a>]: The link destination URL
275275+ - [src] on [<img>]: The image source URL
276276+ - [type] on [<input>]: The input control type
277277+ - [disabled] on form controls: Whether the control is disabled
278278+279279+ In HTML5, attribute names are case-insensitive and are normalized
280280+ to lowercase by the parser.
281281+282282+ @see <https://html.spec.whatwg.org/multipage/dom.html#global-attributes>
283283+ WHATWG: Global attributes
284284+ @see <https://html.spec.whatwg.org/multipage/indices.html#attributes-3>
285285+ WHATWG: Index of attributes *)
286286+287287+ mutable children : node list;
288288+ (** Child nodes in document order.
289289+290290+ For most elements, this list contains the nested elements and text.
291291+ For void elements (like [<br>], [<img>]), this is always empty.
292292+ For [<template>] elements, the actual content is in
293293+ [template_content], not here. *)
294294+295295+ mutable parent : node option;
296296+ (** Parent node, [None] for root nodes.
297297+298298+ Every node except the Document node has a parent. This back-reference
299299+ enables traversing up the tree. *)
300300+301301+ mutable data : string;
302302+ (** Text content for text and comment nodes.
303303+304304+ For text nodes, this contains the actual text. For comment nodes,
305305+ this contains the comment text (without the [<!--] and [-->]
306306+ delimiters). For other node types, this field is empty. *)
307307+308308+ mutable template_content : node option;
309309+ (** Document fragment for [<template>] element contents.
310310+311311+ The [<template>] element holds "inert" content that is not
312312+ rendered but can be cloned and inserted elsewhere. This field
313313+ contains a document fragment with the template's content.
314314+315315+ For non-template elements, this is [None].
316316+317317+ @see <https://html.spec.whatwg.org/multipage/scripting.html#the-template-element>
318318+ WHATWG: The template element *)
319319+320320+ mutable doctype : doctype_data option;
321321+ (** DOCTYPE information for doctype nodes.
322322+323323+ Only doctype nodes use this field; for all other nodes it is [None]. *)
324324+}
325325+326326+(** {1 Node Name Constants}
327327+328328+ These constants identify special node types. Compare with [node.name]
329329+ to determine the node type.
330330+*)
331331+332332+val document_name : string
333333+(** ["#document"] - name for document nodes.
334334+335335+ The Document node is the root of every HTML document tree. It represents
336336+ the entire document and is the parent of the [<html>] element.
337337+338338+ @see <https://html.spec.whatwg.org/multipage/dom.html#document>
339339+ WHATWG: The Document object *)
340340+341341+val document_fragment_name : string
342342+(** ["#document-fragment"] - name for document fragment nodes.
343343+344344+ Document fragments are lightweight container nodes used to hold a
345345+ collection of nodes without a parent document. They are used:
346346+ - To hold [<template>] element contents
347347+ - As results of fragment parsing (innerHTML)
348348+ - For efficient batch DOM operations
349349+350350+ @see <https://dom.spec.whatwg.org/#documentfragment>
351351+ DOM Standard: DocumentFragment *)
352352+353353+val text_name : string
354354+(** ["#text"] - name for text nodes.
355355+356356+ Text nodes contain the character data within elements. When the
357357+ parser encounters text between tags like [<p>Hello world</p>],
358358+ it creates a text node with data ["Hello world"] as a child of
359359+ the [<p>] element.
360360+361361+ Adjacent text nodes are automatically merged by the parser. *)
362362+363363+val comment_name : string
364364+(** ["#comment"] - name for comment nodes.
365365+366366+ Comment nodes represent HTML comments: [<!-- comment text -->].
367367+ Comments are preserved in the DOM but not rendered to users.
368368+ They're useful for development notes or conditional content. *)
369369+370370+val doctype_name : string
371371+(** ["!doctype"] - name for doctype nodes.
372372+373373+ The DOCTYPE node represents the [<!DOCTYPE html>] declaration.
374374+ It is always the first child of the Document node (if present).
375375+376376+ @see <https://html.spec.whatwg.org/multipage/syntax.html#the-doctype>
377377+ WHATWG: The DOCTYPE *)
378378+379379+(** {1 Constructors}
380380+381381+ Functions to create new DOM nodes. All nodes start with no parent and
382382+ no children. Use {!append_child} or {!insert_before} to build a tree.
383383+*)
384384+385385+val create_element :
386386+ string ->
387387+ ?namespace:string option ->
388388+ ?attrs:(string * string) list ->
389389+ unit ->
390390+ node
391391+(** Create an element node.
392392+393393+ Elements are the primary building blocks of HTML documents. Each
394394+ element represents a component of the document with semantic meaning.
395395+396396+ @param name The tag name (e.g., "div", "p", "span"). Tag names are
397397+ case-insensitive in HTML; by convention, use lowercase.
398398+ @param namespace Element namespace:
399399+ - [None] (default): HTML namespace for standard elements
400400+ - [Some "svg"]: SVG namespace for graphics elements
401401+ - [Some "mathml"]: MathML namespace for mathematical notation
402402+ @param attrs Initial attributes as [(name, value)] pairs
403403+404404+ {b Examples:}
405405+ {[
406406+ (* Simple HTML element *)
407407+ let div = create_element "div" ()
408408+409409+ (* Element with attributes *)
410410+ let link = create_element "a"
411411+ ~attrs:[("href", "https://example.com"); ("class", "external")]
412412+ ()
413413+414414+ (* SVG element *)
415415+ let rect = create_element "rect"
416416+ ~namespace:(Some "svg")
417417+ ~attrs:[("width", "100"); ("height", "50"); ("fill", "blue")]
418418+ ()
419419+ ]}
420420+421421+ @see <https://html.spec.whatwg.org/multipage/dom.html#elements-in-the-dom>
422422+ WHATWG: Elements in the DOM
423423+*)
424424+425425+val create_text : string -> node
426426+(** Create a text node with the given content.
427427+428428+ Text nodes contain the readable content of HTML documents. They
429429+ appear as children of elements and represent the characters that
430430+ users see.
431431+432432+ {b Note:} Text content is stored as-is. Character references like
433433+ [&] should already be decoded to their character values.
434434+435435+ {b Example:}
436436+ {[
437437+ let text = create_text "Hello, world!"
438438+ (* To put text in a paragraph: *)
439439+ let p = create_element "p" () in
440440+ append_child p text
441441+ ]}
442442+*)
443443+444444+val create_comment : string -> node
445445+(** Create a comment node with the given content.
446446+447447+ Comments are human-readable notes in HTML that don't appear in
448448+ the rendered output. They're written as [<!-- comment -->] in HTML.
449449+450450+ @param data The comment text (without the [<!--] and [-->] delimiters)
451451+452452+ {b Example:}
453453+ {[
454454+ let comment = create_comment " TODO: Add navigation "
455455+ (* Represents: <!-- TODO: Add navigation --> *)
456456+ ]}
457457+458458+ @see <https://html.spec.whatwg.org/multipage/syntax.html#comments>
459459+ WHATWG: HTML comments
460460+*)
461461+462462+val create_document : unit -> node
463463+(** Create an empty document node.
464464+465465+ The Document node is the root of an HTML document tree. It represents
466466+ the entire document and serves as the parent for the DOCTYPE (if any)
467467+ and the root [<html>] element.
468468+469469+ In a complete HTML document, the structure is:
470470+ {v
471471+ #document
472472+ ├── !doctype
473473+ └── html
474474+ ├── head
475475+ └── body
476476+ v}
477477+478478+ @see <https://html.spec.whatwg.org/multipage/dom.html#document>
479479+ WHATWG: The Document object
480480+*)
481481+482482+val create_document_fragment : unit -> node
483483+(** Create an empty document fragment.
484484+485485+ Document fragments are lightweight containers that can hold multiple
486486+ nodes without being part of the main document tree. They're useful for:
487487+488488+ - {b Template contents:} The [<template>] element stores its children
489489+ in a document fragment, keeping them inert until cloned
490490+491491+ - {b Fragment parsing:} When parsing HTML fragments (like innerHTML),
492492+ the result is placed in a document fragment
493493+494494+ - {b Batch operations:} Build a subtree in a fragment, then insert it
495495+ into the document in one operation for better performance
496496+497497+ @see <https://dom.spec.whatwg.org/#documentfragment>
498498+ DOM Standard: DocumentFragment
499499+*)
500500+501501+val create_doctype :
502502+ ?name:string -> ?public_id:string -> ?system_id:string -> unit -> node
503503+(** Create a DOCTYPE node.
504504+505505+ The DOCTYPE declaration tells browsers to use standards mode for
506506+ rendering. For HTML5 documents, use:
507507+508508+ {[
509509+ let doctype = create_doctype ~name:"html" ()
510510+ (* Represents: <!DOCTYPE html> *)
511511+ ]}
512512+513513+ @param name DOCTYPE name (usually ["html"] for HTML documents)
514514+ @param public_id Public identifier (legacy, rarely needed)
515515+ @param system_id System identifier (legacy, rarely needed)
516516+517517+ {b Legacy example:}
518518+ {[
519519+ (* HTML 4.01 Strict DOCTYPE - not recommended for new documents *)
520520+ let legacy = create_doctype
521521+ ~name:"HTML"
522522+ ~public_id:"-//W3C//DTD HTML 4.01//EN"
523523+ ~system_id:"http://www.w3.org/TR/html4/strict.dtd"
524524+ ()
525525+ ]}
526526+527527+ @see <https://html.spec.whatwg.org/multipage/syntax.html#the-doctype>
528528+ WHATWG: The DOCTYPE
529529+*)
530530+531531+val create_template :
532532+ ?namespace:string option -> ?attrs:(string * string) list -> unit -> node
533533+(** Create a [<template>] element with its content document fragment.
534534+535535+ The [<template>] element holds inert HTML content that is not
536536+ rendered directly. The content is stored in a separate document
537537+ fragment and can be:
538538+ - Cloned and inserted into the document via JavaScript
539539+ - Used as a stamping template for repeated content
540540+ - Pre-parsed without affecting the page
541541+542542+ {b How templates work:}
543543+544544+ Unlike normal elements, a [<template>]'s children are not rendered.
545545+ Instead, they're stored in the [template_content] field. This means:
546546+ - Images inside won't load
547547+ - Scripts inside won't execute
548548+ - The content is "inert" until explicitly activated
549549+550550+ {b Example:}
551551+ {[
552552+ let template = create_template () in
553553+ let div = create_element "div" () in
554554+ let text = create_text "Template content" in
555555+ append_child div text;
556556+ (* Add to template's content fragment, not children *)
557557+ match template.template_content with
558558+ | Some fragment -> append_child fragment div
559559+ | None -> ()
560560+ ]}
561561+562562+ @see <https://html.spec.whatwg.org/multipage/scripting.html#the-template-element>
563563+ WHATWG: The template element
564564+*)
565565+566566+(** {1 Node Type Predicates}
567567+568568+ Functions to test what type of node you have. Since all nodes use the
569569+ same record type, these predicates check the [name] field to determine
570570+ the actual node type.
571571+*)
572572+573573+val is_element : node -> bool
574574+(** [is_element node] returns [true] if the node is an element node.
575575+576576+ Elements are HTML tags like [<div>], [<p>], [<a>]. They are
577577+ identified by having a tag name that doesn't match any of the
578578+ special node name constants.
579579+*)
580580+581581+val is_text : node -> bool
582582+(** [is_text node] returns [true] if the node is a text node.
583583+584584+ Text nodes contain the character content within elements.
585585+ They have [name = "#text"]. *)
586586+587587+val is_comment : node -> bool
588588+(** [is_comment node] returns [true] if the node is a comment node.
589589+590590+ Comment nodes represent HTML comments [<!-- ... -->].
591591+ They have [name = "#comment"]. *)
592592+593593+val is_document : node -> bool
594594+(** [is_document node] returns [true] if the node is a document node.
595595+596596+ The document node is the root of the DOM tree.
597597+ It has [name = "#document"]. *)
598598+599599+val is_document_fragment : node -> bool
600600+(** [is_document_fragment node] returns [true] if the node is a document fragment.
601601+602602+ Document fragments are lightweight containers.
603603+ They have [name = "#document-fragment"]. *)
604604+605605+val is_doctype : node -> bool
606606+(** [is_doctype node] returns [true] if the node is a DOCTYPE node.
607607+608608+ DOCTYPE nodes represent the [<!DOCTYPE>] declaration.
609609+ They have [name = "!doctype"]. *)
610610+611611+val has_children : node -> bool
612612+(** [has_children node] returns [true] if the node has any children.
613613+614614+ Note: For [<template>] elements, this checks the direct children list,
615615+ not the template content fragment. *)
616616+617617+(** {1 Tree Manipulation}
618618+619619+ Functions to modify the DOM tree structure. These functions automatically
620620+ maintain parent/child references, ensuring the tree remains consistent.
621621+*)
622622+623623+val append_child : node -> node -> unit
624624+(** [append_child parent child] adds [child] as the last child of [parent].
625625+626626+ The child's parent reference is updated to point to [parent].
627627+ If the child already has a parent, it is first removed from that parent.
628628+629629+ {b Example:}
630630+ {[
631631+ let body = create_element "body" () in
632632+ let p = create_element "p" () in
633633+ let text = create_text "Hello!" in
634634+ append_child p text;
635635+ append_child body p
636636+ (* Result:
637637+ body
638638+ └── p
639639+ └── #text "Hello!"
640640+ *)
641641+ ]}
642642+*)
643643+644644+val insert_before : node -> node -> node -> unit
645645+(** [insert_before parent new_child ref_child] inserts [new_child] before
646646+ [ref_child] in [parent]'s children.
647647+648648+ @param parent The parent node
649649+ @param new_child The node to insert
650650+ @param ref_child The existing child to insert before
651651+652652+ Raises [Not_found] if [ref_child] is not a child of [parent].
653653+654654+ {b Example:}
655655+ {[
656656+ let ul = create_element "ul" () in
657657+ let li1 = create_element "li" () in
658658+ let li3 = create_element "li" () in
659659+ append_child ul li1;
660660+ append_child ul li3;
661661+ let li2 = create_element "li" () in
662662+ insert_before ul li2 li3
663663+ (* Result: ul contains li1, li2, li3 in that order *)
664664+ ]}
665665+*)
666666+667667+val remove_child : node -> node -> unit
668668+(** [remove_child parent child] removes [child] from [parent]'s children.
669669+670670+ The child's parent reference is set to [None].
671671+672672+ Raises [Not_found] if [child] is not a child of [parent].
673673+*)
674674+675675+val insert_text_at : node -> string -> node option -> unit
676676+(** [insert_text_at parent text before_node] inserts text content.
677677+678678+ If [before_node] is [None], appends at the end. If the previous sibling
679679+ is a text node, the text is merged into it (text nodes are coalesced).
680680+ Otherwise, a new text node is created.
681681+682682+ This implements the HTML5 parser's text insertion algorithm which
683683+ ensures adjacent text nodes are always merged, matching browser behavior.
684684+685685+ @see <https://html.spec.whatwg.org/multipage/parsing.html#appropriate-place-for-inserting-a-node>
686686+ WHATWG: Inserting text in the DOM
687687+*)
688688+689689+(** {1 Attribute Operations}
690690+691691+ Functions to read and modify element attributes. Attributes are
692692+ name-value pairs that provide additional information about elements.
693693+694694+ In HTML5, attribute names are case-insensitive and normalized to
695695+ lowercase by the parser.
696696+697697+ @see <https://html.spec.whatwg.org/multipage/dom.html#attributes>
698698+ WHATWG: Attributes
699699+*)
700700+701701+val get_attr : node -> string -> string option
702702+(** [get_attr node name] returns the value of attribute [name], or [None]
703703+ if the attribute doesn't exist.
704704+705705+ Attribute lookup is case-sensitive on the stored (lowercase) names.
706706+*)
707707+708708+val set_attr : node -> string -> string -> unit
709709+(** [set_attr node name value] sets attribute [name] to [value].
710710+711711+ If the attribute already exists, it is replaced.
712712+ If it doesn't exist, it is added.
713713+*)
714714+715715+val has_attr : node -> string -> bool
716716+(** [has_attr node name] returns [true] if the node has attribute [name]. *)
717717+718718+(** {1 Tree Traversal}
719719+720720+ Functions to navigate the DOM tree.
721721+*)
722722+723723+val descendants : node -> node list
724724+(** [descendants node] returns all descendant nodes in document order.
725725+726726+ This performs a depth-first traversal, returning children before
727727+ siblings at each level. The node itself is not included.
728728+729729+ {b Document order} is the order nodes appear in the HTML source:
730730+ parent before children, earlier siblings before later ones.
731731+732732+ {b Example:}
733733+ {[
734734+ (* For tree: div > (p > "hello", span > "world") *)
735735+ descendants div
736736+ (* Returns: [p; text("hello"); span; text("world")] *)
737737+ ]}
738738+*)
739739+740740+val ancestors : node -> node list
741741+(** [ancestors node] returns all ancestor nodes from parent to root.
742742+743743+ The first element is the immediate parent, the last is the root
744744+ (usually the Document node).
745745+746746+ {b Example:}
747747+ {[
748748+ (* For a text node inside: html > body > p > text *)
749749+ ancestors text_node
750750+ (* Returns: [p; body; html; #document] *)
751751+ ]}
752752+*)
753753+754754+val get_text_content : node -> string
755755+(** [get_text_content node] returns the concatenated text content.
756756+757757+ For text nodes, returns the text data directly.
758758+ For elements, recursively concatenates all descendant text content.
759759+ For other node types, returns an empty string.
760760+761761+ {b Example:}
762762+ {[
763763+ (* For: <p>Hello <b>world</b>!</p> *)
764764+ get_text_content p_element
765765+ (* Returns: "Hello world!" *)
766766+ ]}
767767+*)
768768+769769+(** {1 Cloning} *)
770770+771771+val clone : ?deep:bool -> node -> node
772772+(** [clone ?deep node] creates a copy of the node.
773773+774774+ @param deep If [true], recursively clone all descendants (default: [false])
775775+776776+ The cloned node has no parent. With [deep:false], only the node itself
777777+ is copied (with its attributes, but not its children).
778778+779779+ {b Example:}
780780+ {[
781781+ let original = create_element "div" ~attrs:[("class", "box")] () in
782782+ let shallow = clone original in
783783+ let deep = clone ~deep:true original
784784+ ]}
785785+*)
786786+787787+(** {1 Serialization} *)
788788+789789+val to_html : ?pretty:bool -> ?indent_size:int -> ?indent:int -> node -> string
790790+(** [to_html ?pretty ?indent_size ?indent node] converts a DOM node to an
791791+ HTML string.
792792+793793+ @param pretty If [true] (default), format with indentation and newlines
794794+ @param indent_size Number of spaces per indentation level (default: 2)
795795+ @param indent Starting indentation level (default: 0)
796796+ @return The HTML string representation of the node
797797+*)
798798+799799+val to_writer :
800800+ ?pretty:bool ->
801801+ ?indent_size:int ->
802802+ ?indent:int ->
803803+ Bytesrw.Bytes.Writer.t ->
804804+ node ->
805805+ unit
806806+(** [to_writer ?pretty ?indent_size ?indent writer node] streams a DOM node
807807+ as HTML to a bytes writer.
808808+809809+ This is more memory-efficient than {!to_html} for large documents as it
810810+ doesn't build intermediate strings.
811811+812812+ @param pretty If [true] (default), format with indentation and newlines
813813+ @param indent_size Number of spaces per indentation level (default: 2)
814814+ @param indent Starting indentation level (default: 0)
815815+ @param writer The bytes writer to output to
816816+*)
817817+818818+val to_test_format : ?indent:int -> node -> string
819819+(** [to_test_format ?indent node] converts a DOM node to the html5lib test
820820+ format.
821821+822822+ This format is used by the html5lib test suite for comparing parser
823823+ output. It represents the DOM tree in a human-readable, line-based format.
824824+825825+ @param indent Starting indentation level (default: 0)
826826+ @return The test format string representation
827827+*)
828828+829829+val to_text : ?separator:string -> ?strip:bool -> node -> string
830830+(** [to_text ?separator ?strip node] extracts all text content from a node.
831831+832832+ Recursively collects text from all descendant text nodes.
833833+834834+ @param separator String to insert between text nodes (default: [" "])
835835+ @param strip If [true] (default), trim whitespace from result
836836+ @return The concatenated text content
837837+*)
+101
lib/encoding/html5rw_encoding.mli
···11+(*---------------------------------------------------------------------------
22+ Copyright (c) 2025 Anil Madhavapeddy <anil@recoil.org>. All rights reserved.
33+ SPDX-License-Identifier: MIT
44+ ---------------------------------------------------------------------------*)
55+66+(** HTML5 Encoding Detection and Decoding
77+88+ This module implements the WHATWG encoding sniffing and decoding
99+ algorithms for HTML5 documents. It handles automatic character
1010+ encoding detection from byte order marks (BOM), meta charset
1111+ declarations, and transport layer hints.
1212+1313+ {2 Encoding Detection Algorithm}
1414+1515+ The encoding detection follows the WHATWG specification:
1616+ 1. Check for a BOM (UTF-8, UTF-16LE, UTF-16BE)
1717+ 2. Prescan for [<meta charset>] or [<meta http-equiv="content-type">]
1818+ 3. Use transport layer encoding hint if provided
1919+ 4. Fall back to UTF-8 as the default
2020+2121+ @see <https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding>
2222+ WHATWG encoding sniffing algorithm
2323+*)
2424+2525+(** {1 Types} *)
2626+2727+(** Character encodings supported by the parser.
2828+2929+ The HTML5 specification requires support for a large number of
3030+ encodings, but this implementation focuses on the most common ones.
3131+ Other encodings are mapped to their closest equivalent.
3232+*)
3333+type encoding = Encoding.t =
3434+ | Utf8 (** UTF-8 encoding (default) *)
3535+ | Utf16le (** UTF-16 little-endian *)
3636+ | Utf16be (** UTF-16 big-endian *)
3737+ | Windows_1252 (** Windows-1252 (Latin-1 superset) *)
3838+ | Iso_8859_2 (** ISO-8859-2 (Central European) *)
3939+ | Euc_jp (** EUC-JP (Japanese) *)
4040+4141+(** {1 Encoding Utilities} *)
4242+4343+val encoding_to_string : encoding -> string
4444+(** Convert an encoding to its canonical label string.
4545+4646+ Returns the WHATWG canonical name, e.g., ["utf-8"], ["utf-16le"].
4747+*)
4848+4949+val sniff_bom : bytes -> (encoding * int) option
5050+(** Detect encoding from a byte order mark.
5151+5252+ Examines the first bytes of the input for a BOM and returns the
5353+ detected encoding with the number of bytes to skip.
5454+5555+ @return [(Some (encoding, skip_bytes))] if a BOM is found,
5656+ [None] otherwise.
5757+*)
5858+5959+val normalize_label : string -> encoding option
6060+(** Normalize an encoding label to its canonical form.
6161+6262+ Maps encoding labels (case-insensitive, with optional whitespace)
6363+ to the supported encoding types.
6464+6565+ @return [Some encoding] if the label is recognized, [None] otherwise.
6666+6767+ {[
6868+ normalize_label "UTF-8" (* Some Utf8 *)
6969+ normalize_label "utf8" (* Some Utf8 *)
7070+ normalize_label "latin1" (* Some Windows_1252 *)
7171+ ]}
7272+*)
7373+7474+val prescan_for_meta_charset : bytes -> encoding option
7575+(** Prescan bytes to find a meta charset declaration.
7676+7777+ Implements the WHATWG prescan algorithm that looks for encoding
7878+ declarations in the first 1024 bytes of an HTML document.
7979+8080+ @return [Some encoding] if a meta charset is found, [None] otherwise.
8181+*)
8282+8383+(** {1 Decoding} *)
8484+8585+val decode : bytes -> ?transport_encoding:string -> unit -> string * encoding
8686+(** Decode raw bytes to a UTF-8 string with automatic encoding detection.
8787+8888+ This function implements the full encoding sniffing algorithm:
8989+ 1. Check for BOM
9090+ 2. Prescan for meta charset
9191+ 3. Use transport encoding hint if provided
9292+ 4. Fall back to UTF-8
9393+9494+ @param transport_encoding Encoding hint from HTTP Content-Type header
9595+ @return [(decoded_string, detected_encoding)]
9696+9797+ {[
9898+ let (html, enc) = decode raw_bytes ()
9999+ (* html is now a UTF-8 string, enc is the detected encoding *)
100100+ ]}
101101+*)
+155
lib/selector/html5rw_selector.mli
···11+(*---------------------------------------------------------------------------
22+ Copyright (c) 2025 Anil Madhavapeddy <anil@recoil.org>. All rights reserved.
33+ SPDX-License-Identifier: MIT
44+ ---------------------------------------------------------------------------*)
55+66+(** CSS Selector Engine
77+88+ This module provides CSS selector parsing and matching for querying
99+ the HTML5 DOM. It supports a subset of CSS3 selectors suitable for
1010+ common web scraping and DOM manipulation tasks.
1111+1212+ {2 Supported Selectors}
1313+1414+ {3 Simple Selectors}
1515+ - Tag: [div], [p], [span]
1616+ - ID: [#myid]
1717+ - Class: [.myclass]
1818+ - Universal: [*]
1919+2020+ {3 Attribute Selectors}
2121+ - Presence: [[attr]]
2222+ - Exact match: [[attr="value"]]
2323+ - Contains word: [[attr~="value"]]
2424+ - Starts with: [[attr^="value"]]
2525+ - Ends with: [[attr$="value"]]
2626+ - Contains: [[attr*="value"]]
2727+ - Hyphen-separated: [[attr|="value"]]
2828+2929+ {3 Pseudo-classes}
3030+ - [:first-child], [:last-child]
3131+ - [:nth-child(n)], [:nth-last-child(n)]
3232+ - [:only-child]
3333+ - [:empty]
3434+ - [:not(selector)]
3535+3636+ {3 Combinators}
3737+ - Descendant: [div p] (p anywhere inside div)
3838+ - Child: [div > p] (p direct child of div)
3939+ - Adjacent sibling: [div + p] (p immediately after div)
4040+ - General sibling: [div ~ p] (p after div, same parent)
4141+4242+ {2 Usage}
4343+4444+ {[
4545+ let doc = Html5rw.parse reader in
4646+4747+ (* Find all paragraphs *)
4848+ let paragraphs = Html5rw.query doc "p" in
4949+5050+ (* Find links with specific class *)
5151+ let links = Html5rw.query doc "a.external" in
5252+5353+ (* Find table cells in rows *)
5454+ let cells = Html5rw.query doc "tr > td" in
5555+5656+ (* Check if a node matches *)
5757+ let is_active = Html5rw.matches node ".active"
5858+ ]}
5959+*)
6060+6161+(** {1 Exceptions} *)
6262+6363+exception Selector_error of string
6464+(** Raised when a selector string is malformed.
6565+6666+ The exception contains an error message describing the parse error.
6767+*)
6868+6969+(** {1 Sub-modules} *)
7070+7171+(** Abstract syntax tree for parsed selectors. *)
7272+module Ast : sig
7373+ type simple_selector_type = Selector_ast.simple_selector_type =
7474+ | Type_tag
7575+ | Type_id
7676+ | Type_class
7777+ | Type_universal
7878+ | Type_attr
7979+ | Type_pseudo
8080+8181+ type simple_selector = Selector_ast.simple_selector = {
8282+ selector_type : simple_selector_type;
8383+ name : string option;
8484+ operator : string option;
8585+ value : string option;
8686+ arg : string option;
8787+ }
8888+8989+ type compound_selector = Selector_ast.compound_selector = {
9090+ selectors : simple_selector list;
9191+ }
9292+9393+ type complex_selector = Selector_ast.complex_selector = {
9494+ parts : (string option * compound_selector) list;
9595+ }
9696+9797+ type selector_list = Selector_ast.selector_list = {
9898+ selectors : complex_selector list;
9999+ }
100100+101101+ type selector = Selector_ast.selector =
102102+ | Simple of simple_selector
103103+ | Compound of compound_selector
104104+ | Complex of complex_selector
105105+ | List of selector_list
106106+107107+ val make_simple :
108108+ simple_selector_type ->
109109+ ?name:string ->
110110+ ?operator:string ->
111111+ ?value:string ->
112112+ ?arg:string ->
113113+ unit ->
114114+ simple_selector
115115+116116+ val make_compound : simple_selector list -> compound_selector
117117+ val make_complex : (string option * compound_selector) list -> complex_selector
118118+ val make_list : complex_selector list -> selector_list
119119+end
120120+121121+(** Token types for the selector lexer. *)
122122+module Token : sig
123123+ type t = Selector_token.t
124124+end
125125+126126+(** {1 Functions} *)
127127+128128+val parse : string -> Ast.selector
129129+(** Parse a CSS selector string.
130130+131131+ @raise Selector_error if the selector is malformed.
132132+*)
133133+134134+val query : Html5rw_dom.node -> string -> Html5rw_dom.node list
135135+(** Query the DOM tree with a CSS selector.
136136+137137+ Returns all nodes matching the selector in document order.
138138+139139+ @raise Selector_error if the selector is malformed.
140140+141141+ {[
142142+ let divs = query root_node "div.content > p"
143143+ ]}
144144+*)
145145+146146+val matches : Html5rw_dom.node -> string -> bool
147147+(** Check if a node matches a CSS selector.
148148+149149+ @raise Selector_error if the selector is malformed.
150150+151151+ {[
152152+ if matches node ".active" then
153153+ (* node has class "active" *)
154154+ ]}
155155+*)
+223
lib/tokenizer/html5rw_tokenizer.mli
···11+(*---------------------------------------------------------------------------
22+ Copyright (c) 2025 Anil Madhavapeddy <anil@recoil.org>. All rights reserved.
33+ SPDX-License-Identifier: MIT
44+ ---------------------------------------------------------------------------*)
55+66+(** HTML5 Tokenizer
77+88+ This module implements the WHATWG HTML5 tokenization algorithm. The
99+ tokenizer converts an input byte stream into a sequence of tokens
1010+ (start tags, end tags, text, comments, doctypes) that can be consumed
1111+ by a tree builder.
1212+*)
1313+1414+(** {1 Sub-modules} *)
1515+1616+(** Token types produced by the tokenizer. *)
1717+module Token : sig
1818+ type tag_kind = Token.tag_kind = Start | End
1919+2020+ type doctype = Token.doctype = {
2121+ name : string option;
2222+ public_id : string option;
2323+ system_id : string option;
2424+ force_quirks : bool;
2525+ }
2626+2727+ type tag = Token.tag = {
2828+ kind : tag_kind;
2929+ name : string;
3030+ attrs : (string * string) list;
3131+ self_closing : bool;
3232+ }
3333+3434+ type t = Token.t =
3535+ | Tag of tag
3636+ | Character of string
3737+ | Comment of string
3838+ | Doctype of doctype
3939+ | EOF
4040+4141+ val make_start_tag : string -> (string * string) list -> bool -> t
4242+ val make_end_tag : string -> t
4343+ val make_doctype :
4444+ ?name:string ->
4545+ ?public_id:string ->
4646+ ?system_id:string ->
4747+ ?force_quirks:bool ->
4848+ unit ->
4949+ t
5050+ val make_comment : string -> t
5151+ val make_character : string -> t
5252+ val eof : t
5353+end
5454+5555+(** Tokenizer states. *)
5656+module State : sig
5757+ type t = State.t =
5858+ | Data
5959+ | Rcdata
6060+ | Rawtext
6161+ | Script_data
6262+ | Plaintext
6363+ | Tag_open
6464+ | End_tag_open
6565+ | Tag_name
6666+ | Rcdata_less_than_sign
6767+ | Rcdata_end_tag_open
6868+ | Rcdata_end_tag_name
6969+ | Rawtext_less_than_sign
7070+ | Rawtext_end_tag_open
7171+ | Rawtext_end_tag_name
7272+ | Script_data_less_than_sign
7373+ | Script_data_end_tag_open
7474+ | Script_data_end_tag_name
7575+ | Script_data_escape_start
7676+ | Script_data_escape_start_dash
7777+ | Script_data_escaped
7878+ | Script_data_escaped_dash
7979+ | Script_data_escaped_dash_dash
8080+ | Script_data_escaped_less_than_sign
8181+ | Script_data_escaped_end_tag_open
8282+ | Script_data_escaped_end_tag_name
8383+ | Script_data_double_escape_start
8484+ | Script_data_double_escaped
8585+ | Script_data_double_escaped_dash
8686+ | Script_data_double_escaped_dash_dash
8787+ | Script_data_double_escaped_less_than_sign
8888+ | Script_data_double_escape_end
8989+ | Before_attribute_name
9090+ | Attribute_name
9191+ | After_attribute_name
9292+ | Before_attribute_value
9393+ | Attribute_value_double_quoted
9494+ | Attribute_value_single_quoted
9595+ | Attribute_value_unquoted
9696+ | After_attribute_value_quoted
9797+ | Self_closing_start_tag
9898+ | Bogus_comment
9999+ | Markup_declaration_open
100100+ | Comment_start
101101+ | Comment_start_dash
102102+ | Comment
103103+ | Comment_less_than_sign
104104+ | Comment_less_than_sign_bang
105105+ | Comment_less_than_sign_bang_dash
106106+ | Comment_less_than_sign_bang_dash_dash
107107+ | Comment_end_dash
108108+ | Comment_end
109109+ | Comment_end_bang
110110+ | Doctype
111111+ | Before_doctype_name
112112+ | Doctype_name
113113+ | After_doctype_name
114114+ | After_doctype_public_keyword
115115+ | Before_doctype_public_identifier
116116+ | Doctype_public_identifier_double_quoted
117117+ | Doctype_public_identifier_single_quoted
118118+ | After_doctype_public_identifier
119119+ | Between_doctype_public_and_system_identifiers
120120+ | After_doctype_system_keyword
121121+ | Before_doctype_system_identifier
122122+ | Doctype_system_identifier_double_quoted
123123+ | Doctype_system_identifier_single_quoted
124124+ | After_doctype_system_identifier
125125+ | Bogus_doctype
126126+ | Cdata_section
127127+ | Cdata_section_bracket
128128+ | Cdata_section_end
129129+ | Character_reference
130130+ | Named_character_reference
131131+ | Ambiguous_ampersand
132132+ | Numeric_character_reference
133133+ | Hexadecimal_character_reference_start
134134+ | Decimal_character_reference_start
135135+ | Hexadecimal_character_reference
136136+ | Decimal_character_reference
137137+ | Numeric_character_reference_end
138138+end
139139+140140+(** Parse error types. *)
141141+module Errors : sig
142142+ type t = Errors.t = {
143143+ code : string;
144144+ line : int;
145145+ column : int;
146146+ }
147147+148148+ val make : code:string -> line:int -> column:int -> t
149149+ val to_string : t -> string
150150+end
151151+152152+(** Input stream with position tracking. *)
153153+module Stream : sig
154154+ type t = Stream.t
155155+156156+ val create : string -> t
157157+ val create_from_reader : Bytesrw.Bytes.Reader.t -> t
158158+ val set_error_callback : t -> (string -> unit) -> unit
159159+ val position : t -> int * int
160160+end
161161+162162+(** {1 Token Sink Interface} *)
163163+164164+(** Interface for token consumers.
165165+166166+ The tokenizer calls [process] for each token it produces. The sink
167167+ can return [`Continue] to keep tokenizing, or [`SwitchTo state] to
168168+ change the tokenizer state (used by the tree builder for things like
169169+ [<script>] and [<textarea>]).
170170+*)
171171+module type SINK = sig
172172+ type t
173173+ val process : t -> Token.t -> [ `Continue | `SwitchTo of State.t ]
174174+ val adjusted_current_node_in_html_namespace : t -> bool
175175+end
176176+177177+(** {1 Tokenizer} *)
178178+179179+(** The tokenizer type, parameterized by the sink type. *)
180180+type 'sink t
181181+182182+val create :
183183+ (module SINK with type t = 'sink) ->
184184+ 'sink ->
185185+ ?collect_errors:bool ->
186186+ ?xml_mode:bool ->
187187+ unit ->
188188+ 'sink t
189189+(** Create a new tokenizer.
190190+191191+ @param sink The token sink that will receive tokens
192192+ @param collect_errors If [true], collect parse errors (default: [false])
193193+ @param xml_mode If [true], apply XML compatibility transformations
194194+*)
195195+196196+val run :
197197+ 'sink t ->
198198+ (module SINK with type t = 'sink) ->
199199+ Bytesrw.Bytes.Reader.t ->
200200+ unit
201201+(** Run the tokenizer on the given input.
202202+203203+ The tokenizer will read from the reader and call the sink's [process]
204204+ function for each token until EOF is reached.
205205+*)
206206+207207+val get_errors : 'sink t -> Errors.t list
208208+(** Get the list of parse errors encountered during tokenization.
209209+210210+ Only populated if [collect_errors:true] was passed to {!create}.
211211+*)
212212+213213+val set_state : 'sink t -> State.t -> unit
214214+(** Set the tokenizer state.
215215+216216+ Used by the tree builder to switch states for raw text elements.
217217+*)
218218+219219+val set_last_start_tag : 'sink t -> string -> unit
220220+(** Set the last start tag name.
221221+222222+ Used by the tree builder to track the context for end tag matching.
223223+*)