6.3
HTML: Parsing Library
The html library provides
functions to read HTML documents and structures to represent them.
procedure
(read-xhtml port) → html?
port : input-port? 
procedure
(read-html port) → html?
port : input-port? 
Reads (X)HTML from a port, producing an html instance.
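As a minimal sketch of reading from a port, the snippet below parses a small well-formed document held in a string. The input string and the name doc are made up for illustration, and the h: prefix is only a precaution against name clashes with the base language.

```racket
#lang racket/base
(require (prefix-in h: html))

;; Hypothetical input; any input port would work the same way.
(define doc
  (h:read-xhtml
   (open-input-string
    (string-append
     "<html><head><title>My title</title></head>"
     "<body><p>Hello world</p></body></html>"))))

;; The reader is documented to produce an html instance.
(h:html? doc)
```

Reading from a file instead would presumably only change how the port is opened, for example with open-input-file.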
procedure
(read-html-as-xml port) → (listof content/c)
port : input-port? 
Reads HTML from a port, producing a list of XML content, each of which could be
turned into an X-expression, if necessary, with xml->xexpr.
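The conversion mentioned above can be sketched as follows. The input string and the names contents and xexprs are hypothetical; the exact shape of the resulting X-expressions depends on the input and on the reader's settings, so no output is shown.

```racket
#lang racket/base
(require (prefix-in h: html)
         (prefix-in x: xml))

;; Hypothetical input document.
(define contents
  (h:read-html-as-xml
   (open-input-string
    (string-append
     "<html><head><title>My title</title></head>"
     "<body><p>Hello <b>world</b></p></body></html>"))))

;; Convert each piece of XML content to an X-expression.
(define xexprs (map x:xml->xexpr contents))

;; Print one X-expression per line.
(for ([xe (in-list xexprs)])
  (printf "~s\n" xe))
```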
parameter
(read-html-comments v) → void?
v : any/c 
If v is not #f, then comments are read and returned. Defaults to #f.
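Because read-html-comments is a parameter, one way to enable it for a single read is parameterize, as in this sketch. The input string and the names source, without-comments, and with-comments are assumptions made for the example.

```racket
#lang racket/base
(require (prefix-in h: html))

;; Hypothetical input containing one comment.
(define source
  (string-append
   "<html><head><title>T</title></head>"
   "<body><!-- note --><p>Hi</p></body></html>"))

;; Default setting: comments are not returned.
(define without-comments
  (h:read-html-as-xml (open-input-string source)))

;; Comments are read only during this call; the default is
;; restored once parameterize returns.
(define with-comments
  (parameterize ([h:read-html-comments #t])
    (h:read-html-as-xml (open-input-string source))))
```

Using parameterize rather than calling (read-html-comments #t) directly keeps the change local to one read.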
parameter
(use-html-spec v) → void?
v : any/c 
If v is not #f, then the HTML must respect the HTML specification
with regard to which elements are allowed to be the children of
other elements. For example, the top-level "<html>"
element may contain only a "<head>" and a "<body>"
element. Defaults to #t.
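The following sketch reads the same input once with the default setting and once with use-html-spec set to #f. The input, which nests a div inside a p, and the names source, strict, and lenient are hypothetical; whether and how the two results differ depends on the input, so no output is claimed here.

```racket
#lang racket/base
(require (prefix-in h: html))

;; Hypothetical input that nests a block element inside a paragraph.
(define source
  "<html><body><p><div>nested</div></p></body></html>")

;; Default: nesting is checked against the HTML specification.
(define strict
  (h:read-html-as-xml (open-input-string source)))

;; Nesting rules are relaxed for this one read only.
(define lenient
  (parameterize ([h:use-html-spec #f])
    (h:read-html-as-xml (open-input-string source))))
```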
1 Example
(module html-example racket

  ; Some of the symbols in html and xml conflict with
  ; each other and with racket/base language, so we prefix
  ; to avoid namespace conflict.
  (require (prefix-in h: html)
           (prefix-in x: xml))

  (define an-html
    (h:read-xhtml
     (open-input-string
      (string-append
       "<html><head><title>My title</title></head><body>"
       "<p>Hello world</p><p><b>Testing</b>!</p>"
       "</body></html>"))))

  ; extract-pcdata: html-content/c -> (listof string)
  ; Pulls out the pcdata strings from some-content.
  (define (extract-pcdata some-content)
    (cond [(x:pcdata? some-content)
           (list (x:pcdata-string some-content))]
          [(x:entity? some-content)
           (list)]
          [else
           (extract-pcdata-from-element some-content)]))

  ; extract-pcdata-from-element: html-element -> (listof string)
  ; Pulls out the pcdata strings from an-html-element.
  (define (extract-pcdata-from-element an-html-element)
    (match an-html-element
      [(struct h:html-full (attributes content))
       (apply append (map extract-pcdata content))]
      [(struct h:html-element (attributes))
       '()]))

  (printf "~s\n" (extract-pcdata an-html)))

> (require 'html-example)
("My title" "Hello world" "Testing" "!")
2 HTML Structures
pcdata, entity, and attribute are defined in the xml documentation.
value
html-content/c : contract?
An html-content/c is either
- html-element
- pcdata
- entity
struct
(struct html-element (attributes)
    #:extra-constructor-name make-html-element)
attributes : (listof attribute) 
Every structure below inherits from html-element.
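Since every element carries an attributes field, attributes can be read uniformly through html-element-attributes. The sketch below builds an img element by hand; the attribute values and the helper attribute-pairs are hypothetical, and make-attribute, attribute-name, and attribute-value come from the xml library.

```racket
#lang racket/base
(require (prefix-in h: html)
         (prefix-in x: xml))

;; Build an img element by hand; the attribute values are made up.
(define picture
  (h:make-img
   (list (x:make-attribute #f #f 'src "logo.png")
         (x:make-attribute #f #f 'alt "Logo"))))

;; Hypothetical helper: collect attributes as (name . value) pairs.
(define (attribute-pairs element)
  (for/list ([a (in-list (h:html-element-attributes element))])
    (cons (x:attribute-name a) (x:attribute-value a))))

(attribute-pairs picture) ; => ((src . "logo.png") (alt . "Logo"))
```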
struct
(struct html-full html-element (content)
    #:extra-constructor-name make-html-full)
content : (listof html-content/c) 
Every HTML tag that may include content also inherits from
html-full without adding any additional fields.
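To illustrate the difference between the two base structures, this sketch constructs one element of each kind with the make- constructors. The names line-break and emphasis and the text "bold" are made up; make-pcdata and pcdata-string come from the xml library.

```racket
#lang racket/base
(require (prefix-in h: html)
         (prefix-in x: xml))

;; br holds attributes only; strong also holds content.
(define line-break (h:make-br '()))
(define emphasis
  (h:make-strong '() (list (x:make-pcdata #f #f "bold"))))

(h:html-element? line-break) ; => #t
(h:html-full? line-break)    ; => #f
(h:html-element? emphasis)   ; => #t
(h:html-full? emphasis)      ; => #t

;; Content is available only on html-full instances.
(map x:pcdata-string (h:html-full-content emphasis)) ; => ("bold")
```

Code that walks a document can therefore test html-full? first and treat everything else as a leaf.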
struct
(struct mzscheme html-full () #:extra-constructor-name make-mzscheme) 
A mzscheme is a special legacy value for the old documentation system.
A Contents-of-html is either
struct
(struct center html-full () #:extra-constructor-name make-center) 
struct
(struct blockquote html-full () #:extra-constructor-name make-blockquote) 
struct
(struct iframe html-full () #:extra-constructor-name make-iframe) 
struct
(struct noframes html-full () #:extra-constructor-name make-noframes) 
struct
(struct noscript html-full () #:extra-constructor-name make-noscript) 
struct
(struct style html-full () #:extra-constructor-name make-style) 
struct
(struct script html-full () #:extra-constructor-name make-script) 
struct
(struct basefont html-element () #:extra-constructor-name make-basefont) 
struct
(struct br html-element () #:extra-constructor-name make-br) 
struct
(struct area html-element () #:extra-constructor-name make-area) 
struct
(struct alink html-element () #:extra-constructor-name make-alink) 
struct
(struct img html-element () #:extra-constructor-name make-img) 
struct
(struct param html-element () #:extra-constructor-name make-param) 
struct
(struct hr html-element () #:extra-constructor-name make-hr) 
struct
(struct input html-element () #:extra-constructor-name make-input) 
struct
(struct col html-element () #:extra-constructor-name make-col) 
struct
(struct isindex html-element () #:extra-constructor-name make-isindex) 
struct
(struct base html-element () #:extra-constructor-name make-base) 
struct
(struct meta html-element () #:extra-constructor-name make-meta) 
struct
(struct option html-full () #:extra-constructor-name make-option) 
struct
(struct textarea html-full () #:extra-constructor-name make-textarea) 
struct
(struct title html-full () #:extra-constructor-name make-title) 
A Contents-of-head is either
A Contents-of-tr is either
struct
(struct colgroup html-full () #:extra-constructor-name make-colgroup) 
struct
(struct thead html-full () #:extra-constructor-name make-thead) 
struct
(struct tfoot html-full () #:extra-constructor-name make-tfoot) 
struct
(struct tbody html-full () #:extra-constructor-name make-tbody) 
struct
(struct strike html-full () #:extra-constructor-name make-strike) 
struct
(struct small html-full () #:extra-constructor-name make-small) 
struct
(struct strong html-full () #:extra-constructor-name make-strong) 
struct
(struct acronym html-full () #:extra-constructor-name make-acronym) 
struct
(struct legend html-full () #:extra-constructor-name make-legend) 
struct
(struct caption html-full () #:extra-constructor-name make-caption) 
struct
(struct table html-full () #:extra-constructor-name make-table) 
A Contents-of-table is either
struct
(struct button html-full () #:extra-constructor-name make-button) 
struct
(struct fieldset html-full () #:extra-constructor-name make-fieldset) 
A Contents-of-fieldset is either
- G2 
struct
(struct optgroup html-full () #:extra-constructor-name make-optgroup) 
struct
(struct select html-full () #:extra-constructor-name make-select) 
A Contents-of-select is either
struct
(struct label html-full () #:extra-constructor-name make-label) 
A Contents-of-dl is either
A Contents-of-pre is either
- G9 
- G11 
struct
(struct object html-full () #:extra-constructor-name make-object) 
struct
(struct applet html-full () #:extra-constructor-name make-applet) 
A Contents-of-object-applet is either
- G2 
A Contents-of-map is either
A Contents-of-a is either
- G7 
struct
(struct address html-full () #:extra-constructor-name make-address) 
A Contents-of-address is either
- G5 
A Contents-of-body is either
A G12 is either
A G11 is either
A G10 is either
A G9 is either
A G8 is either
A G7 is either
- G8 
- G12 
A G6 is either
- G7 
A G5 is either
- G6 
A G4 is either
- G8 
- G10 
A G3 is either
A G2 is either
- G3