Phase 3, Issue 3: html5lib Tokenizer Test Harness
Wire up the html5lib-tests tokenizer test suite to validate our HTML5 tokenizer.
Requirements
The tests/html5lib-tests/ submodule contains JSON test files for HTML tokenizer conformance. Implement a test harness that:
- Reads JSON test files from tests/html5lib-tests/tokenizer/
- Parses each test case (input, expected output tokens, optional initial state, optional errors)
- Runs our tokenizer against the input
- Compares output tokens to expected results
- Reports pass/fail per test case
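The steps above can be sketched as a small driver loop. This is only an illustration under stated assumptions: `TestCase`, `Summary`, and `run_cases` are hypothetical names, and the tokenizer is passed in as a closure because this issue does not describe the real tokenizer API. The token type is left generic so the sketch does not depend on the actual `Token` definition.

```rust
use std::fmt::Debug;

/// One parsed test case (hypothetical shape, not the crate's real type).
struct TestCase<T> {
    description: String,
    input: String,
    expected: Vec<T>,
}

/// Pass/fail counts for a batch of cases.
#[derive(Debug, Default, PartialEq)]
struct Summary {
    passed: usize,
    failed: usize,
}

/// Runs every case through `tokenize`, prints one line per case
/// (plus details on failure), and returns the counts.
fn run_cases<T, F>(cases: &[TestCase<T>], tokenize: F) -> Summary
where
    T: PartialEq + Debug,
    F: Fn(&str) -> Vec<T>,
{
    let mut summary = Summary::default();
    for case in cases {
        let actual = tokenize(&case.input);
        if actual == case.expected {
            summary.passed += 1;
            println!("PASS: {}", case.description);
        } else {
            summary.failed += 1;
            println!("FAIL: {}", case.description);
            println!("  input:    {:?}", case.input);
            println!("  expected: {:?}", case.expected);
            println!("  actual:   {:?}", actual);
        }
    }
    summary
}
```

Returning counts instead of panicking on the first mismatch keeps the run going across all cases, which suits a harness whose pass rate is not expected to be 100% yet.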
Test file format (JSON):
{
  "tests": [
    {
      "description": "test name",
      "input": "<html>",
      "output": [["StartTag", "html", {}]],
      "errors": [{"code": "...", "line": 1, "col": 1}]
    }
  ]
}
Token mapping:
- ["DOCTYPE", name, public_id, system_id, correctness] → Token::Doctype
- ["StartTag", name, attrs, self_closing?] → Token::StartTag
- ["EndTag", name] → Token::EndTag
- ["Character", data] → Token::Character
- ["Comment", data] → Token::Comment
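A minimal sketch of this mapping, assuming a hand-rolled JSON value type. The `Json` enum stands in for whatever the existing stub parser produces, and the field layout of `Token` is a guess; only the variant names come from this issue. Numbers are left out of `Json` because token entries do not use them. The sketch also assumes that `correctness` is the inverse of the tokenizer's force-quirks flag, as the html5lib-tests format describes it.

```rust
/// Minimal JSON value (hypothetical; adapt to the existing stub parser).
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Null,
    Bool(bool),
    Str(String),
    Array(Vec<Json>),
    Object(Vec<(String, Json)>),
}

/// Hypothetical token shape; only the variant names come from the issue.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Doctype {
        name: Option<String>,
        public_id: Option<String>,
        system_id: Option<String>,
        force_quirks: bool,
    },
    StartTag { name: String, attrs: Vec<(String, String)>, self_closing: bool },
    EndTag { name: String },
    Character(String),
    Comment(String),
}

fn as_str(value: &Json) -> Option<&str> {
    match value {
        Json::Str(s) => Some(s.as_str()),
        _ => None,
    }
}

/// JSON string or null becomes Some/None; any other value is an error (outer None).
fn as_opt_string(value: &Json) -> Option<Option<String>> {
    match value {
        Json::Null => Some(None),
        Json::Str(s) => Some(Some(s.clone())),
        _ => None,
    }
}

/// Converts one entry of the "output" array into a token.
/// Returns None for malformed or unrecognized entries.
fn token_from_json(entry: &Json) -> Option<Token> {
    let items = match entry {
        Json::Array(items) => items.as_slice(),
        _ => return None,
    };
    let kind = as_str(items.first()?)?;
    match (kind, &items[1..]) {
        ("DOCTYPE", [name, public_id, system_id, Json::Bool(correct)]) => Some(Token::Doctype {
            name: as_opt_string(name)?,
            public_id: as_opt_string(public_id)?,
            system_id: as_opt_string(system_id)?,
            // Assumption: "correctness" is the inverse of force-quirks.
            force_quirks: !*correct,
        }),
        ("StartTag", [Json::Str(name), Json::Object(attrs), rest @ ..]) => {
            // The self-closing flag is optional and defaults to false.
            let self_closing = match rest {
                [] => false,
                [Json::Bool(flag)] => *flag,
                _ => return None,
            };
            let mut mapped = Vec::new();
            for (key, value) in attrs {
                mapped.push((key.clone(), as_str(value)?.to_string()));
            }
            Some(Token::StartTag { name: name.clone(), attrs: mapped, self_closing })
        }
        ("EndTag", [Json::Str(name)]) => Some(Token::EndTag { name: name.clone() }),
        ("Character", [Json::Str(data)]) => Some(Token::Character(data.clone())),
        ("Comment", [Json::Str(data)]) => Some(Token::Comment(data.clone())),
        _ => None,
    }
}
```

The expected output in html5lib-tests coalesces adjacent character tokens into a single ["Character", data] entry, so depending on how our tokenizer emits characters, the harness may need to merge consecutive Token::Character values before comparing.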
Location: tests/html5lib_tokenizer.rs (already exists with a stub JSON parser)
Acceptance criteria
- Test harness loads and parses all html5lib tokenizer JSON files
- Each test case runs our tokenizer and compares output
- Test results are reported clearly (which tests pass/fail)
- Initial pass rate is tracked (doesn't need to be 100% yet)
- cargo test -p we-html --test html5lib_tokenizer runs successfully
- No unsafe code
- No external dependencies (use the existing JSON parser or extend it)
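One possible way to track the initial pass rate is to collect counts per test file and print a short report at the end of the run. The sketch below uses only the standard library; `FileResult`, `pass_rate`, and `format_report` are hypothetical names, and the file labels in the usage example are placeholders.

```rust
/// Pass/fail counts for one test file (hypothetical struct).
struct FileResult {
    file: String,
    passed: usize,
    failed: usize,
}

/// Percentage of passing cases; 0.0 when there are no cases,
/// which avoids dividing by zero.
fn pass_rate(passed: usize, failed: usize) -> f64 {
    let total = passed + failed;
    if total == 0 {
        0.0
    } else {
        passed as f64 * 100.0 / total as f64
    }
}

/// Builds a report with one line per file and a final total line.
fn format_report(results: &[FileResult]) -> String {
    let mut out = String::new();
    let (mut passed, mut failed) = (0, 0);
    for r in results {
        passed += r.passed;
        failed += r.failed;
        out.push_str(&format!(
            "{}: {}/{} passed ({:.1}%)\n",
            r.file,
            r.passed,
            r.passed + r.failed,
            pass_rate(r.passed, r.failed)
        ));
    }
    out.push_str(&format!(
        "total: {}/{} passed ({:.1}%)\n",
        passed,
        passed + failed,
        pass_rate(passed, failed)
    ));
    out
}
```

For example, two files with 3 of 4 and 1 of 4 passing cases would produce:

```rust
let results = vec![
    FileResult { file: "file_a".to_string(), passed: 3, failed: 1 },
    FileResult { file: "file_b".to_string(), passed: 1, failed: 3 },
];
print!("{}", format_report(&results));
// file_a: 3/4 passed (75.0%)
// file_b: 1/4 passed (25.0%)
// total: 4/8 passed (50.0%)
```

Printing the report without asserting on the failure count lets the cargo test command succeed while the pass rate is still below 100%.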