anil.recoil.org/ocaml-html5rw
OCaml HTML5 parser/serialiser based on Python's JustHTML
anil.recoil.org, 3 months ago
2115317b dc8c8b6c
Pipelines 0/1: build.yml failed (2m 48s)
1 changed file, +71 -3
README.md (+71 -3)
-# html5rw - Pure OCaml HTML5 Parser
+# html5rw - Pure OCaml HTML5 Parser and Conformance Checker
 
-A pure OCaml HTML5 parser implementing the WHATWG HTML5 parsing specification.
+A pure OCaml HTML5 parser and validator implementing the WHATWG HTML5 specification.
 This library passes the html5lib-tests suite and provides full support for
-tokenization, tree construction, encoding detection, and CSS selector queries.
+tokenization, tree construction, encoding detection, CSS selector queries, and
+conformance checking.
 This library was ported from [JustHTML](https://github.com/EmilStenstrom/justhtml/).
 
 ## Key Features
 
 - **WHATWG Compliant**: Implements the full HTML5 parsing algorithm with proper error recovery
+- **Conformance Checker**: Validates HTML5 documents against the WHATWG specification
 - **CSS Selectors**: Query the DOM using standard CSS selector syntax
 - **Streaming I/O**: Uses bytesrw for efficient streaming input/output
 - **Encoding Detection**: Automatic character encoding detection following the WHATWG algorithm
 - **Entity Decoding**: Complete HTML5 named character reference support
+- **Multiple Output Formats**: Text, JSON (Nu validator compatible), and GNU-style output
+
+## Libraries
+
+- `html5rw` - Core HTML5 parser
+- `html5rw.check` - Conformance checker library
+
+## Command Line Tool
+
+The `html5check` CLI validates HTML5 documents:
+
+```bash
+# Validate a file
+html5check index.html
+
+# Validate from stdin
+cat page.html | html5check -
+
+# JSON output (Nu validator compatible)
+html5check --format=json page.html
+
+# GNU-style output for IDE integration
+html5check --format=gnu page.html
+
+# Show only errors (suppress warnings)
+html5check --errors-only page.html
+
+# Quiet mode - show only counts
+html5check --quiet page.html
+```
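For the IDE-integration case, GNU-style diagnostics are conventionally one line each in the form `file:line:column: message` (the layout described in the GNU Coding Standards). The exact lines that `html5check --format=gnu` emits are not shown in this diff, so the sketch below assumes that layout. The `diagnostic` record and `parse_gnu_line` are illustrative names, not part of html5rw.

```ocaml
(* Sketch: parse one GNU-style diagnostic line, assumed to look like
   "page.html:12:5: error: some message". Not part of html5rw. *)
type diagnostic = {
  file : string;
  line : int;
  column : int;
  message : string;
}

let parse_gnu_line (s : string) : diagnostic option =
  match String.split_on_char ':' s with
  | file :: line :: column :: rest when rest <> [] ->
    (match int_of_string_opt (String.trim line),
           int_of_string_opt (String.trim column) with
     | Some line, Some column ->
       (* The message may itself contain ':', so rejoin the tail and trim it. *)
       let message = String.trim (String.concat ":" rest) in
       Some { file; line; column; message }
     | _ -> None)
  | _ -> None

let () =
  match parse_gnu_line "page.html:12:5: error: stray end tag" with
  | Some d ->
    Printf.printf "%s line %d col %d: %s\n" d.file d.line d.column d.message
  | None -> print_endline "not a GNU-style diagnostic"
```

Splitting on `:` keeps the sketch short but would mis-parse file names that contain a colon, such as Windows drive-letter paths.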
+
+Exit codes: 0 = valid, 1 = validation errors, 2 = I/O error.
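A caller scripting the CLI can branch on these three exit codes. The following is a minimal sketch using only the OCaml standard library. The `outcome` type and the function names are hypothetical, and it assumes `html5check` is on `PATH`.

```ocaml
(* Sketch: interpret html5check's documented exit codes (0, 1, 2).
   The outcome type and function names are illustrative, not from html5rw. *)
type outcome =
  | Valid              (* exit code 0 *)
  | Validation_errors  (* exit code 1 *)
  | Io_error           (* exit code 2 *)
  | Unexpected of int  (* anything else, e.g. the tool is not installed *)

let outcome_of_exit_code = function
  | 0 -> Valid
  | 1 -> Validation_errors
  | 2 -> Io_error
  | n -> Unexpected n

let describe = function
  | Valid -> "valid"
  | Validation_errors -> "validation errors"
  | Io_error -> "I/O error"
  | Unexpected n -> Printf.sprintf "unexpected exit code %d" n

(* Run the CLI on one file; Filename.quote guards against shell metacharacters. *)
let check_file (path : string) : outcome =
  outcome_of_exit_code
    (Sys.command ("html5check --quiet " ^ Filename.quote path))

let () =
  (* Only invoke the CLI when a path is given on the command line. *)
  if Array.length Sys.argv > 1 then
    print_endline (describe (check_file Sys.argv.(1)))
```

Keeping a catch-all `Unexpected` case means a missing binary or a crash is reported distinctly instead of being mistaken for a validation failure.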
 
 ## Usage
+
+### Parsing HTML
 
 ```ocaml
 open Bytesrw
···
 let reader = Bytes.Reader.of_string "<p>Fragment content</p>"
 let doc = Html5rw.parse ~fragment_context:ctx reader
 ```
+
+### Validating HTML
+
+```ocaml
+open Bytesrw
+
+(* Check HTML from a string *)
+let html = "<html><body><p>Hello</p></body></html>"
+let reader = Bytes.Reader.of_string html
+let result = Htmlrw_check.check reader
+
+(* Check for errors *)
+if Htmlrw_check.has_errors result then
+  print_endline "Document has errors";
+
+(* Get all messages *)
+let messages = Htmlrw_check.messages result in
+List.iter (fun msg ->
+  Format.printf "%a@." Htmlrw_check.pp_message msg
+) messages;
+
+(* Get formatted output *)
+let text_output = Htmlrw_check.to_text result in
+let json_output = Htmlrw_check.to_json result in
+let gnu_output = Htmlrw_check.to_gnu result
+```
+
+The checker validates:
+- Parse errors (malformed HTML syntax)
+- Content model violations (invalid element nesting)
+- Attribute errors (invalid or missing required attributes)
+- Structural issues (other conformance problems)
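The four kinds of finding listed above, together with the CLI's `--errors-only` and `--quiet` flags, suggest a message model along the following lines. This is a hypothetical sketch for illustration only: the `severity`, `category` and `message` types and both helper functions are assumptions, not the types exported by `Htmlrw_check`.

```ocaml
(* Hypothetical model of checker messages; not the html5rw API. *)
type severity = Sev_error | Sev_warning

type category =
  | Parse_error     (* malformed HTML syntax *)
  | Content_model   (* invalid element nesting *)
  | Attribute       (* invalid or missing required attributes *)
  | Structural      (* other conformance problems *)

type message = { severity : severity; category : category; text : string }

(* What an "errors only" view would keep: drop everything that is not an error. *)
let errors_only (msgs : message list) : message list =
  List.filter (fun m -> m.severity = Sev_error) msgs

(* What a "quiet" summary would report: (error count, warning count). *)
let counts (msgs : message list) : int * int =
  List.fold_left
    (fun (e, w) m ->
       match m.severity with
       | Sev_error -> (e + 1, w)
       | Sev_warning -> (e, w + 1))
    (0, 0) msgs

let () =
  let msgs = [
    { severity = Sev_error; category = Content_model; text = "example error" };
    { severity = Sev_warning; category = Attribute; text = "example warning" };
  ] in
  let errors, warnings = counts msgs in
  Printf.printf "%d error(s), %d warning(s)\n" errors warnings
```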
 
 ## Installation