web engine - experimental web browser

HTML5 tokenizer state machine #32

Open, opened by pierrelf.com

Phase 3, Issue 2: HTML5 Tokenizer

Implement a spec-compliant HTML5 tokenizer in the html crate, replacing the current stub.

Requirements

Implement the HTML5 tokenizer as a state machine per the WHATWG HTML spec (§13.2.5). The tokenizer reads input characters and emits Token values.

States to implement (minimum for Phase 3 subset):

  • Data state
  • Tag open state
  • End tag open state
  • Tag name state
  • Before attribute name state
  • Attribute name state
  • After attribute name state
  • Before attribute value state
  • Attribute value (double-quoted) state
  • Attribute value (single-quoted) state
  • Attribute value (unquoted) state
  • After attribute value (quoted) state
  • Self-closing start tag state
  • Bogus comment state
  • Markup declaration open state (for <!-- and <!DOCTYPE)
  • Comment start state, Comment state, Comment end state
  • DOCTYPE state, Before DOCTYPE name, DOCTYPE name, After DOCTYPE name
  • Character reference state (numeric + named, at least &amp;, &lt;, &gt;, &quot;)
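To make the state-machine shape concrete, here is a minimal sketch covering only four of the states listed above (Data, Tag open, End tag open, Tag name). The `Token` enum, the `State` enum, and the `tokenize_simple` function are hypothetical names for illustration; the real `Token` type is the one in lib.rs and probably looks different. Attributes, comments, DOCTYPE, and character references are out of scope here, and the places where the sketch deviates from the spec are marked in comments.

```rust
/// Hypothetical stand-in for the Token type defined in lib.rs.
#[derive(Debug, PartialEq)]
pub enum Token {
    StartTag(String),
    EndTag(String),
    Character(char),
    Eof,
}

/// A small subset of the tokenizer states from the list above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Data,
    TagOpen,
    EndTagOpen,
    TagName,
}

/// Tokenizes text and attribute-free tags. Assumes well-formed tag names.
pub fn tokenize_simple(input: &str) -> Vec<Token> {
    let chars: Vec<char> = input.chars().collect();
    let mut pos = 0;
    let mut state = State::Data;
    let mut tokens = Vec::new();
    let mut name = String::new();
    let mut is_end_tag = false;

    while pos < chars.len() {
        let c = chars[pos];
        pos += 1;
        match state {
            State::Data => {
                if c == '<' {
                    state = State::TagOpen;
                } else {
                    tokens.push(Token::Character(c));
                }
            }
            State::TagOpen => {
                if c == '/' {
                    state = State::EndTagOpen;
                } else if c.is_ascii_alphabetic() {
                    is_end_tag = false;
                    pos -= 1; // reconsume in the tag name state
                    state = State::TagName;
                } else {
                    // Simplification: '!' and '?' are not handled specially.
                    tokens.push(Token::Character('<'));
                    pos -= 1; // reconsume in the data state
                    state = State::Data;
                }
            }
            State::EndTagOpen => {
                if c.is_ascii_alphabetic() {
                    is_end_tag = true;
                    pos -= 1; // reconsume in the tag name state
                    state = State::TagName;
                } else {
                    // Simplification: the spec enters the bogus comment state
                    // for anything other than '>'; this sketch drops the char.
                    state = State::Data;
                }
            }
            State::TagName => {
                if c == '>' {
                    let tag = std::mem::take(&mut name);
                    tokens.push(if is_end_tag {
                        Token::EndTag(tag)
                    } else {
                        Token::StartTag(tag)
                    });
                    state = State::Data;
                } else {
                    // Tag names are lowercased as they are collected.
                    name.push(c.to_ascii_lowercase());
                }
            }
        }
    }

    // End of input: flush the characters that were consumed tentatively.
    match state {
        State::TagOpen => tokens.push(Token::Character('<')),
        State::EndTagOpen => {
            tokens.push(Token::Character('<'));
            tokens.push(Token::Character('/'));
        }
        _ => {}
    }
    tokens.push(Token::Eof);
    tokens
}
```

The "reconsume" steps are modelled by stepping `pos` back by one, which keeps each state arm close to the wording of the spec. A full implementation would add one arm per remaining state.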

Token types (already defined in lib.rs):

  • Doctype, StartTag, EndTag, Character, Comment, Eof

API:

  • Tokenizer::new(input: &str) -> Tokenizer
  • Tokenizer::next_token() -> Token (or implement Iterator<Item = Token>)
  • Convenience: tokenize(input: &str) -> Vec<Token> (existing signature)
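One possible shape for this API is sketched below. It is an assumption, not the existing code: the two-variant `Token` enum is a hypothetical stand-in for the type in lib.rs, `next_token` is assumed to take `&mut self`, and the body only emits Character tokens so the example stays short. The point is how `next_token`, the `Iterator` impl, and `tokenize` can share one code path.

```rust
/// Hypothetical stand-in for the Token type already defined in lib.rs.
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    Character(char),
    Eof,
}

pub struct Tokenizer {
    chars: Vec<char>,
    pos: usize,
    /// Set once Eof has been yielded through the Iterator impl.
    done: bool,
}

impl Tokenizer {
    /// Copies the input so the tokenizer carries no lifetime parameter.
    pub fn new(input: &str) -> Tokenizer {
        Tokenizer {
            chars: input.chars().collect(),
            pos: 0,
            done: false,
        }
    }

    /// Returns the next token; keeps returning Eof once input is exhausted.
    pub fn next_token(&mut self) -> Token {
        match self.chars.get(self.pos) {
            Some(&c) => {
                self.pos += 1;
                Token::Character(c)
            }
            None => Token::Eof,
        }
    }
}

impl Iterator for Tokenizer {
    type Item = Token;

    /// Yields every token including a single Eof, then stops.
    fn next(&mut self) -> Option<Token> {
        if self.done {
            return None;
        }
        let token = self.next_token();
        if token == Token::Eof {
            self.done = true;
        }
        Some(token)
    }
}

/// Convenience wrapper matching the signature given above.
pub fn tokenize(input: &str) -> Vec<Token> {
    Tokenizer::new(input).collect()
}
```

Storing a `Vec<char>` matches the `-> Tokenizer` signature as written; borrowing the input instead would require a lifetime parameter such as `Tokenizer<'a>`.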

Error handling:

  • Parse errors should be collected but not fatal (per spec)
  • Emit tokens even in error conditions
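A minimal sketch of non-fatal error collection, assuming errors are accumulated in a `Vec` next to the tokens. The `ParseError`, `ErrorKind`, `Output`, and `tokenize_text` names are hypothetical, and only the data-state rule for U+0000 (the spec's unexpected-null-character error) is modelled; markup is not handled here.

```rust
/// Hypothetical error kinds; the spec defines many more.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ErrorKind {
    UnexpectedNullCharacter,
}

/// A recorded parse error with the character offset where it occurred.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ParseError {
    pub kind: ErrorKind,
    pub position: usize,
}

/// Hypothetical stand-in for the Token type in lib.rs.
#[derive(Debug, PartialEq)]
pub enum Token {
    Character(char),
    Eof,
}

pub struct Output {
    pub tokens: Vec<Token>,
    pub errors: Vec<ParseError>,
}

/// Data-state handling for plain text only: every character is emitted,
/// and a NUL is recorded as an error without stopping tokenization.
pub fn tokenize_text(input: &str) -> Output {
    let mut tokens = Vec::new();
    let mut errors = Vec::new();
    for (position, c) in input.chars().enumerate() {
        if c == '\0' {
            errors.push(ParseError {
                kind: ErrorKind::UnexpectedNullCharacter,
                position,
            });
        }
        // The token is emitted whether or not an error was recorded.
        tokens.push(Token::Character(c));
    }
    tokens.push(Token::Eof);
    Output { tokens, errors }
}
```

Keeping errors in a side channel means callers that only care about tokens can ignore them, while tests can still assert on the exact errors reported.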

Acceptance criteria#

  • Tokenizes basic HTML: <html><head><title>Test</title></head><body><p>Hello</p></body></html>
  • Handles self-closing tags: <br/>, <img/>
  • Handles attributes: <a href="url" class="link">
  • Handles comments: <!-- comment -->
  • Handles DOCTYPE: <!DOCTYPE html>
  • Handles character references: &amp;, &lt;, &gt;, &#65;, &#x41;
  • Emits Character tokens for text content
  • cargo clippy -p we-html -- -D warnings passes
  • cargo test -p we-html passes with comprehensive unit tests
  • No unsafe code
  • No external dependencies
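For the character reference criterion, decoding could look roughly like the sketch below. `decode_reference` and `decode_numeric` are hypothetical helper names. The sketch takes the text between `&` and `;`, covers only the four named references listed in the requirements, and omits the spec's remapping table for code points 0x80 to 0x9F as well as references without a trailing semicolon.

```rust
/// Decodes a run of digits in the given radix. Returns None when the
/// digits are missing or invalid, so the caller can fall back to text.
fn decode_numeric(digits: &str, radix: u32) -> Option<char> {
    if digits.is_empty() || !digits.chars().all(|c| c.is_digit(radix)) {
        return None;
    }
    // Overflow is folded into 0, which maps to U+FFFD below.
    let code = u32::from_str_radix(digits, radix).unwrap_or(0);
    Some(match code {
        0 => '\u{FFFD}',
        // Surrogates and values above U+10FFFF also become U+FFFD.
        n => char::from_u32(n).unwrap_or('\u{FFFD}'),
    })
}

/// Decodes the body of one reference, e.g. "amp", "#65", or "#x41".
pub fn decode_reference(body: &str) -> Option<char> {
    if let Some(num) = body.strip_prefix('#') {
        let hex = num.strip_prefix('x').or_else(|| num.strip_prefix('X'));
        return match hex {
            Some(digits) => decode_numeric(digits, 16),
            None => decode_numeric(num, 10),
        };
    }
    match body {
        "amp" => Some('&'),
        "lt" => Some('<'),
        "gt" => Some('>'),
        "quot" => Some('"'),
        _ => None,
    }
}
```

Returning `None` for unknown or malformed references leaves it to the caller to emit the original characters unchanged, which is what the tokenizer would need to do in that case.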