html5lib tokenizer test harness #33

Phase 3, Issue 3: html5lib Tokenizer Test Harness#

Wire up the html5lib-tests tokenizer test suite to validate our HTML5 tokenizer.

Requirements#

The tests/html5lib-tests/ submodule contains JSON test files for HTML tokenizer conformance. Implement a test harness that:

Reads JSON test files from tests/html5lib-tests/tokenizer/
Parses each test case (input, expected output tokens, optional initial state, optional errors)
Runs our tokenizer against the input
Compares output tokens to expected results
Reports pass/fail per test case

Test file format (JSON):

{
  "tests": [
    {
      "description": "test name",
      "input": "<html>",
      "output": [["StartTag", "html", {}]],
      "errors": [{"code": "...", "line": 1, "col": 1}]
    }
  ]
}

Token mapping:

["DOCTYPE", name, public_id, system_id, correctness] → Token::Doctype
["StartTag", name, attrs, self_closing?] → Token::StartTag
["EndTag", name] → Token::EndTag
["Character", data] → Token::Character
["Comment", data] → Token::Comment

Location: tests/html5lib_tokenizer.rs (already exists with a stub JSON parser)

Acceptance criteria#

Test harness loads and parses all html5lib tokenizer JSON files
Each test case runs our tokenizer and compares output
Test results are reported clearly (which tests pass/fail)
Initial pass rate is tracked (doesn't need to be 100% yet)
cargo test -p we-html --test html5lib_tokenizer runs successfully
No unsafe code
No external dependencies (use the existing JSON parser or extend it)