Phase 3, Issue 2: HTML5 Tokenizer
Implement a spec-compliant HTML5 tokenizer in the html crate, replacing the current stub.
Requirements
Implement the HTML5 tokenizer as a state machine per the WHATWG HTML spec (§13.2.5). The tokenizer reads input characters and emits Token values.
States to implement (minimum for Phase 3 subset):
- Data state
- Tag open state
- End tag open state
- Tag name state
- Before attribute name state
- Attribute name state
- After attribute name state
- Before attribute value state
- Attribute value (double-quoted) state
- Attribute value (single-quoted) state
- Attribute value (unquoted) state
- After attribute value (quoted) state
- Self-closing start tag state
- Bogus comment state
- Markup declaration open state (for <!-- and <!DOCTYPE)
- Comment start state, Comment state, Comment end state
- DOCTYPE state, Before DOCTYPE name, DOCTYPE name, After DOCTYPE name
- Character reference state (numeric + named, at least &amp;, &lt;, &gt;, &quot;)
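As a rough illustration of the state-machine shape, the sketch below covers only four of the states listed above (Data, Tag open, End tag open, Tag name) and handles attribute-less tags only. The names State, SketchToken and tokenize_tags are hypothetical and are not the crate's actual types; "reconsume" is modelled here by stepping the position back by one.

```rust
/// Hypothetical subset of the tokenizer states; the full list is longer.
#[derive(Debug, Clone, Copy)]
enum State {
    Data,
    TagOpen,
    EndTagOpen,
    TagName,
}

/// Simplified token type for this sketch only (not the crate's `Token`).
#[derive(Debug, Clone, PartialEq)]
enum SketchToken {
    StartTag(String),
    EndTag(String),
    Character(char),
    Eof,
}

/// Tokenizes text and attribute-less tags. Attributes, comments, DOCTYPE and
/// character references are out of scope for this sketch.
fn tokenize_tags(input: &str) -> Vec<SketchToken> {
    let chars: Vec<char> = input.chars().collect();
    let mut pos = 0;
    let mut state = State::Data;
    let mut name = String::new();
    let mut is_end_tag = false;
    let mut tokens = Vec::new();

    while pos < chars.len() {
        let c = chars[pos];
        pos += 1; // consume; a state "reconsumes" by stepping `pos` back
        match state {
            State::Data => {
                if c == '<' {
                    state = State::TagOpen;
                } else {
                    tokens.push(SketchToken::Character(c));
                }
            }
            State::TagOpen => {
                if c == '/' {
                    state = State::EndTagOpen;
                } else if c.is_ascii_alphabetic() {
                    is_end_tag = false;
                    pos -= 1; // reconsume in the tag name state
                    state = State::TagName;
                } else {
                    // Parse error per spec: emit '<' as text, reconsume in Data.
                    tokens.push(SketchToken::Character('<'));
                    pos -= 1;
                    state = State::Data;
                }
            }
            State::EndTagOpen => {
                if c.is_ascii_alphabetic() {
                    is_end_tag = true;
                    pos -= 1; // reconsume in the tag name state
                    state = State::TagName;
                } else {
                    // A full tokenizer would switch to the bogus comment state
                    // for anything but '>'; this sketch just returns to Data.
                    state = State::Data;
                }
            }
            State::TagName => {
                if c == '>' {
                    let tag = std::mem::take(&mut name);
                    tokens.push(if is_end_tag {
                        SketchToken::EndTag(tag)
                    } else {
                        SketchToken::StartTag(tag)
                    });
                    state = State::Data;
                } else {
                    name.push(c.to_ascii_lowercase());
                }
            }
        }
    }

    // End of input: flush what the spec emits before EOF in each state.
    match state {
        State::TagOpen => tokens.push(SketchToken::Character('<')),
        State::EndTagOpen => {
            tokens.push(SketchToken::Character('<'));
            tokens.push(SketchToken::Character('/'));
        }
        State::Data | State::TagName => {}
    }
    tokens.push(SketchToken::Eof);
    tokens
}
```

Calling tokenize_tags("<p>Hi</p>") in this sketch yields a start tag, two character tokens, an end tag, and Eof. The remaining states would extend the same match with further arms.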
Token types (already defined in lib.rs):
Doctype, StartTag, EndTag, Character, Comment, Eof
API:
- Tokenizer::new(input: &str) -> Tokenizer
- Tokenizer::next_token() -> Token (or implement Iterator<Item = Token>)
- Convenience: tokenize(input: &str) -> Vec<Token> (existing signature)
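One possible shape for this API is sketched below. The Token enum here is a two-variant placeholder (the real one lives in lib.rs and its fields are not shown in this issue), and next_token simply turns every character into a Character token. Borrowing the input through a lifetime parameter, and having the iterator yield Eof once before stopping, are assumptions of this sketch rather than requirements stated above.

```rust
use std::str::Chars;

/// Placeholder for the crate's `Token`; only two variants are modelled.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Character(char),
    Eof,
}

/// Borrows the input; the lifetime parameter is an assumption of this sketch.
struct Tokenizer<'a> {
    chars: Chars<'a>,
    done: bool,
}

impl<'a> Tokenizer<'a> {
    fn new(input: &'a str) -> Self {
        Tokenizer { chars: input.chars(), done: false }
    }

    /// Returns the next token; keeps returning `Eof` once input is exhausted.
    fn next_token(&mut self) -> Token {
        match self.chars.next() {
            Some(c) => Token::Character(c),
            None => Token::Eof,
        }
    }
}

impl<'a> Iterator for Tokenizer<'a> {
    type Item = Token;

    /// Yields every token including a single `Eof`, then ends the iteration.
    fn next(&mut self) -> Option<Token> {
        if self.done {
            return None;
        }
        let token = self.next_token();
        if token == Token::Eof {
            self.done = true;
        }
        Some(token)
    }
}

/// Convenience wrapper matching the existing `tokenize` signature.
fn tokenize(input: &str) -> Vec<Token> {
    Tokenizer::new(input).collect()
}
```

Implementing Iterator on top of next_token keeps both entry points available, so tokenize reduces to a single collect call.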
Error handling:
- Parse errors should be collected rather than treated as fatal (per spec)
- Emit tokens even in error conditions
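A minimal sketch of the "collect, don't abort" approach, using the Data state's handling of U+0000 as the example: the spec records an unexpected-null-character parse error and still emits the character. The types ParseError, ParseErrorKind and Emitted, the function tokenize_text, and the choice to record a byte offset are all hypothetical; the sketch treats every character as text and does not recognise tags.

```rust
/// Hypothetical error kinds; the spec defines many more.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ParseErrorKind {
    UnexpectedNullCharacter,
}

/// A recorded, non-fatal parse error with the byte offset where it occurred.
#[derive(Debug, Clone, PartialEq)]
struct ParseError {
    kind: ParseErrorKind,
    offset: usize,
}

/// Simplified output type for this sketch only (not the crate's `Token`).
#[derive(Debug, Clone, PartialEq)]
enum Emitted {
    Character(char),
    Eof,
}

/// Data-state text handling only: every character becomes a token, and
/// errors are appended to a list instead of stopping tokenization.
fn tokenize_text(input: &str) -> (Vec<Emitted>, Vec<ParseError>) {
    let mut tokens = Vec::new();
    let mut errors = Vec::new();

    for (offset, c) in input.char_indices() {
        if c == '\0' {
            // Record the error, then fall through and emit the token anyway.
            errors.push(ParseError {
                kind: ParseErrorKind::UnexpectedNullCharacter,
                offset,
            });
        }
        tokens.push(Emitted::Character(c));
    }

    tokens.push(Emitted::Eof);
    (tokens, errors)
}
```

Returning the errors alongside the tokens is only one option; storing them on the tokenizer and exposing an accessor would satisfy the same requirement.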
Acceptance criteria
- Tokenizes basic HTML: <html><head><title>Test</title></head><body><p>Hello</p></body></html>
- Handles self-closing tags: <br/>, <img/>
- Handles attributes: <a href="url" class="link">
- Handles comments: <!-- comment -->
- Handles DOCTYPE: <!DOCTYPE html>
- Handles character references: &amp;, &lt;, &gt;, &#65;, &#x41;
- Emits Character tokens for text content
- cargo clippy -p we-html -- -D warnings passes
- cargo test -p we-html passes with comprehensive unit tests
- No unsafe code
- No external dependencies
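For the character reference criterion, the decoding step could look roughly like the helper below, which uses only the standard library. The function name decode_char_ref and its contract (it receives the text between '&' and ';') are hypothetical. It covers the four named references required here plus decimal and hexadecimal numeric references, and it maps zero, surrogates and out-of-range values to U+FFFD; the spec's remapping of the 0x80 to 0x9F range is deliberately left out of this sketch.

```rust
/// Decodes the body of a character reference, i.e. the text between '&' and
/// ';'. Returns `None` when the body is not a reference this sketch knows.
fn decode_char_ref(body: &str) -> Option<char> {
    if let Some(num) = body.strip_prefix('#') {
        // "#x41" / "#X41" are hexadecimal, "#65" is decimal.
        let (digits, radix) = match num.strip_prefix(|c: char| c == 'x' || c == 'X') {
            Some(hex) => (hex, 16u32),
            None => (num, 10u32),
        };
        // Reject empty bodies and stray characters (including a leading '+',
        // which `from_str_radix` would otherwise accept).
        if digits.is_empty() || !digits.chars().all(|c| c.is_digit(radix)) {
            return None;
        }
        // Overflow is folded into 0 so it takes the replacement path below.
        let code = u32::from_str_radix(digits, radix).unwrap_or(0);
        return Some(match code {
            0 => '\u{FFFD}',
            // `char::from_u32` is `None` for surrogates and values > 0x10FFFF.
            n => char::from_u32(n).unwrap_or('\u{FFFD}'),
        });
    }

    // Minimum set of named references required for the Phase 3 subset.
    match body {
        "amp" => Some('&'),
        "lt" => Some('<'),
        "gt" => Some('>'),
        "quot" => Some('"'),
        _ => None,
    }
}
```

A helper of this shape is easy to unit test in isolation, which fits the requirement for comprehensive tests without external dependencies.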