How Browsers Parse the DOM to Render Text and Media

The DOM as the Blueprint for Rendering

When you open a web page, the browser’s rendering engine does not display raw HTML directly. Instead, it constructs an internal tree structure called the Document Object Model (DOM). This tree represents every element – paragraphs, headings, images, videos – as nodes. The DOM is the foundation for all subsequent rendering steps. Without it, the browser has no logical map of the content.

The DOM is built through a process called parsing. The browser decodes the incoming bytes into characters, groups those characters into tokens (such as start tags, end tags, and text), and then assembles the tokens into a tree. This tree is not static; it can be modified by JavaScript after initial load. For example, a script might insert a new image node, which triggers style recalculation and layout for the affected part of the page rather than a re-parse of the HTML. The speed of the initial parse directly affects how quickly a user sees content.

Tokenization and Tree Construction

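The HTML tokenizer is a state machine, and the parsing rules specify how malformed markup is recovered rather than rejected. The tokenizer identifies start tags, end tags, and void elements such as `<img>` that never take an end tag. Element tokens become element nodes, and character data becomes text nodes. The parser tracks the current context through a stack of open elements, so while a parent element is open it knows that the tags nested inside it are that element's children. This process is incremental: the browser can start rendering parts of the page even before the full HTML is downloaded.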
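The two steps described above can be sketched in a few lines of TypeScript. This is a toy model, not the algorithm from the HTML Standard: it ignores attributes, comments, and most error recovery, and the names `tokenize`, `buildTree`, and `VOID_TAGS` are hypothetical. It only illustrates how a stack of open elements gives the parser its current context.

```typescript
// Toy tokenizer and tree builder (illustrative only, not spec-compliant).
type Token =
  | { kind: "open"; tag: string }
  | { kind: "close"; tag: string }
  | { kind: "text"; data: string };

interface ElementNode {
  type: "element";
  tag: string;
  children: TreeNode[];
}
interface TextNode {
  type: "text";
  data: string;
}
type TreeNode = ElementNode | TextNode;

// Hypothetical, abbreviated list of void elements (they never take children).
const VOID_TAGS: string[] = ["img", "br", "hr", "input"];

function tokenize(html: string): Token[] {
  const tokens: Token[] = [];
  // Matches a simple tag such as <p>, </p>, or <br/>, or a run of text.
  const pattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)\s*\/?>|([^<]+)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    if (match[3] !== undefined) {
      const data = match[3].trim();
      if (data.length > 0) tokens.push({ kind: "text", data });
    } else {
      const tag = (match[2] || "").toLowerCase();
      tokens.push(match[1] === "/" ? { kind: "close", tag } : { kind: "open", tag });
    }
  }
  return tokens;
}

function buildTree(tokens: Token[]): ElementNode {
  const root: ElementNode = { type: "element", tag: "#document", children: [] };
  // The stack of open elements is the parser's "current context".
  const stack: ElementNode[] = [root];
  for (const token of tokens) {
    const parent = stack[stack.length - 1];
    if (token.kind === "text") {
      parent.children.push({ type: "text", data: token.data });
    } else if (token.kind === "open") {
      const el: ElementNode = { type: "element", tag: token.tag, children: [] };
      parent.children.push(el);
      // Void elements are never pushed, so nothing can nest inside them.
      if (VOID_TAGS.indexOf(token.tag) === -1) stack.push(el);
    } else {
      // Pop back to the matching open element; stray close tags are ignored.
      for (let i = stack.length - 1; i > 0; i--) {
        if (stack[i].tag === token.tag) {
          stack.length = i;
          break;
        }
      }
    }
  }
  return root;
}

// The second <p> is never closed, yet a tree is still produced.
const parsedTree = buildTree(tokenize("<div><p>Hello<img><b>world</b></div><p>Bye"));
```

In this example `parsedTree` has two children, a `div` and a `p`. The first `p` is closed implicitly when `</div>` arrives, which loosely mirrors how real parsers tolerate missing end tags.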

From DOM to Render Tree: Filtering for Visibility

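Once the DOM is ready, the browser combines it with the CSS Object Model (CSSOM) to create the render tree. This tree excludes elements that generate no boxes: anything with `display: none` is dropped along with its children, as are non-visual elements such as the contents of `<head>`. Elements with `visibility: hidden` stay in the tree because they still occupy space. The render tree is therefore close to a subset of the DOM, though not strictly one, since pseudo-elements such as `::before` produce boxes that have no DOM node. It is optimized for layout calculations.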
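The filtering step can be illustrated with a small recursive function. The sketch below assumes each node already carries its computed `display` value (in a real engine that comes from style resolution against the CSSOM); the `StyledNode` and `RenderNode` shapes and the `buildRenderTree` name are hypothetical.

```typescript
// Simplified model: a DOM-like node annotated with its computed display value.
interface StyledNode {
  name: string; // tag name, e.g. "div"
  display: string; // assumed to be pre-computed by style resolution
  children: StyledNode[];
}

interface RenderNode {
  name: string;
  children: RenderNode[];
}

// Returns null for a node that generates no box, which drops its whole subtree.
function buildRenderTree(node: StyledNode): RenderNode | null {
  if (node.display === "none") return null;
  const children: RenderNode[] = [];
  for (const child of node.children) {
    const rendered = buildRenderTree(child);
    if (rendered !== null) children.push(rendered);
  }
  return { name: node.name, children };
}

const styledPage: StyledNode = {
  name: "html",
  display: "block",
  children: [
    { name: "head", display: "none", children: [{ name: "title", display: "none", children: [] }] },
    {
      name: "body",
      display: "block",
      children: [
        { name: "h1", display: "block", children: [] },
        // The inner <p> is display: block, but its hidden parent removes it too.
        { name: "div", display: "none", children: [{ name: "p", display: "block", children: [] }] },
      ],
    },
  ],
};

const pageRenderTree = buildRenderTree(styledPage);
```

Here `pageRenderTree` keeps only `html`, `body`, and `h1`. The hidden `div` takes its visible-looking `p` child with it, which is the behavior the paragraph above describes.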

Text nodes in the DOM become text runs in the render tree. Media elements like `<img>` or `<video>` are represented as replaced elements, whose intrinsic dimensions (width and height from the file's metadata) are used once they are known. External resources are fetched while this work proceeds: images load asynchronously and do not block rendering, whereas stylesheets are render-blocking and web fonts can delay the painting of text. Modern browsers use preload scanners to discover such resources early, before the parser reaches them.

Handling Async Media

For images and videos, the browser does not wait for the full file to download before continuing layout. It reserves space based on the `width` and `height` attributes in HTML or the intrinsic size from the file header. If these attributes are missing, the layout shifts once the media loads – a common cause of Cumulative Layout Shift (CLS). Developers should always specify dimensions to avoid jank.
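The arithmetic behind space reservation is simple enough to show directly. The sketch below is a simplified model under stated assumptions: the media is rendered at a known width, the reserved height follows the aspect ratio of the declared attributes, and a missing attribute means no space is reserved. The function names are hypothetical, and `verticalShift` measures a pixel distance, not the CLS score itself.

```typescript
// Declared HTML attributes; either may be missing.
interface MediaAttrs {
  width?: number;
  height?: number;
}

// Height reserved before the file loads, or 0 when the ratio is unknown.
function reservedHeight(attrs: MediaAttrs, renderedWidth: number): number {
  if (attrs.width === undefined || attrs.height === undefined || attrs.width <= 0) {
    return 0;
  }
  return renderedWidth * (attrs.height / attrs.width);
}

// How far content below the media moves once the real size is known.
function verticalShift(
  attrs: MediaAttrs,
  renderedWidth: number,
  intrinsic: { width: number; height: number }
): number {
  const finalHeight = renderedWidth * (intrinsic.height / intrinsic.width);
  return finalHeight - reservedHeight(attrs, renderedWidth);
}

// A 1600x900 image displayed 800 pixels wide.
const withAttrs = verticalShift({ width: 1600, height: 900 }, 800, { width: 1600, height: 900 });
const withoutAttrs = verticalShift({}, 800, { width: 1600, height: 900 });
```

With the attributes present, 450 pixels are reserved up front and `withAttrs` is 0. Without them nothing is reserved, so `withoutAttrs` is 450: everything below the image jumps down by that amount when it loads.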

Layout, Paint, and Compositing

After the render tree is built, the browser calculates the exact position and size of each node. This is the layout phase. Text is shaped into glyphs based on font metrics and line-breaking rules. Media elements are placed within their containing blocks. The output is a set of boxes with coordinates, ready for painting.
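As a rough illustration of how text turns into positioned boxes, here is a greedy line breaker. It rests on strong simplifying assumptions that real engines do not make: every character has the same advance width, breaks happen only at spaces, and there is no shaping, hyphenation, or bidirectional text. The names `breakLines` and `layoutLines` are hypothetical.

```typescript
// Greedy line breaking: keep adding words until the next one would overflow.
function breakLines(text: string, maxWidth: number, charWidth: number): string[] {
  const words = text.split(" ").filter((w) => w.length > 0);
  const lines: string[] = [];
  let current = "";
  for (const word of words) {
    const candidate = current.length === 0 ? word : current + " " + word;
    if (current.length === 0 || candidate.length * charWidth <= maxWidth) {
      // Either it fits, or a single overlong word gets a line to itself.
      current = candidate;
    } else {
      lines.push(current);
      current = word;
    }
  }
  if (current.length > 0) lines.push(current);
  return lines;
}

interface LineBox {
  text: string;
  x: number;
  y: number;
  width: number;
  height: number;
}

// Stack the lines vertically to get boxes with coordinates.
function layoutLines(lines: string[], lineHeight: number, charWidth: number): LineBox[] {
  return lines.map((text, index) => ({
    text,
    x: 0,
    y: index * lineHeight,
    width: text.length * charWidth,
    height: lineHeight,
  }));
}

// An 80px-wide container with 8px-wide characters holds 10 characters per line.
const brokenLines = breakLines("the quick brown fox jumps", 80, 8);
const lineBoxes = layoutLines(brokenLines, 20, 8);
```

For this input `brokenLines` is `["the quick", "brown fox", "jumps"]`, and `lineBoxes` places them at `y` values 0, 20, and 40. That list of boxes is the kind of output the paint step consumes.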

Painting converts the layout boxes into actual pixels. Text is rasterized, images are decoded, and videos are composited as separate layers. Modern browsers split the page into multiple GPU layers (e.g., one for a fixed header, another for scrolling content). Compositing merges these layers without repainting everything on each frame, which is critical for smooth 60fps playback of video or animated media.
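The benefit of compositing can be seen in a small simulation. At 60fps each frame has a budget of roughly 16.7 ms (1000 / 60), so skipping repaint work matters. The model below is a sketch, not how any particular engine is structured: each layer keeps a cached bitmap (represented only by a paint counter), and a frame repaints a layer only when its content is marked dirty. The `Layer` shape and `renderFrame` name are hypothetical.

```typescript
interface Layer {
  name: string;
  scrolls: boolean; // moves with the page, or stays fixed on screen
  dirty: boolean; // content changed since the last paint
  paintCount: number; // how many times this layer was rasterized
}

interface PlacedLayer {
  name: string;
  y: number; // vertical offset at which the cached bitmap is drawn
}

function renderFrame(layers: Layer[], scrollY: number): PlacedLayer[] {
  const placed: PlacedLayer[] = [];
  for (const layer of layers) {
    if (layer.dirty) {
      // Paint: expensive, done only when the layer's content changed.
      layer.paintCount += 1;
      layer.dirty = false;
    }
    // Composite: cheap, just reposition the already-painted bitmap.
    placed.push({ name: layer.name, y: layer.scrolls ? 0 - scrollY : 0 });
  }
  return placed;
}

const pageLayers: Layer[] = [
  { name: "fixed-header", scrolls: false, dirty: true, paintCount: 0 },
  { name: "content", scrolls: true, dirty: true, paintCount: 0 },
];

// Three frames of scrolling: both layers are painted on the first frame only.
renderFrame(pageLayers, 0);
renderFrame(pageLayers, 100);
const lastFrame = renderFrame(pageLayers, 200);
```

After three frames each layer's `paintCount` is still 1. Only the content layer's offset changed (to -200 in `lastFrame`), which is why scrolling can stay smooth without repainting the whole page.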

FAQ:

Does the DOM include CSS styles?

No. The DOM only represents the HTML structure. Styles are stored in a separate CSSOM tree, which is merged with the DOM to create the render tree.

Can the DOM be modified after a page loads?

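Yes. JavaScript APIs such as `appendChild` (typically paired with `document.createElement`) or the `innerHTML` property mutate the DOM directly. Each change can trigger style recalculation, layout, and paint for the affected parts of the page.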

Why do images sometimes cause layout shifts?

If an image lacks explicit width/height attributes, the browser cannot reserve space until the file header is parsed. This shifts subsequent content when the image loads.

Is the DOM the same as the source HTML?

Not exactly. The browser normalizes the HTML: it recovers from missing close tags and adds implicit elements such as `<tbody>` inside a table. The DOM is a cleaned, live version of the source.

