Element Capture #73

Closed
eladalon1983 opened this issue Oct 19, 2022 · 29 comments

Comments

@eladalon1983

eladalon1983 commented Oct 19, 2022

TL;DR

The proposed API will allow a website to capture an HTML element as a video stream. Only the target element and its descendant elements will be captured. Parent and sibling elements will not be captured, even if they draw over/under the target element.

partial interface HTMLElement {
    Promise<MediaStream> capture(optional double frameRequestRate);
};

In a way, this can be thought of as similar to HTMLCanvasElement.captureStream(), expanded to any HTMLElement, with some additional security gating added to address the concern of leaking cross-origin content.
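For illustration, a minimal sketch of how a page might call the method if it shipped as written above. `capture()` is the method proposed in this issue, not something a browser is assumed to expose, so the sketch feature-detects it first. The helper name `captureElementToVideo` and the `preview` parameter (a `<video>` element) are made up for this example.

```javascript
// Sketch only: `capture()` is the method PROPOSED in this issue. It is not
// assumed to exist in any browser, hence the feature detection.
async function captureElementToVideo(target, preview, frameRequestRate = 30) {
  if (typeof target.capture !== "function") {
    throw new Error("element.capture() is not available");
  }
  // Per the proposal, the promise resolves to a MediaStream containing only
  // the target element and its descendants.
  const stream = await target.capture(frameRequestRate);
  // srcObject is the standard way to play a MediaStream in a <video>.
  preview.srcObject = stream;
  return stream;
}
```

In a page this would be called as `captureElementToVideo(document.querySelector("#player"), document.querySelector("video"))`, where both selectors are placeholders.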

Introduction

State of the Art

The Web Platform currently offers the ability to screen-capture using getDisplayMedia(). The resulting MediaStream, composed of video and potentially also audio, can be manipulated locally and/or transmitted remotely. Some common use-cases include:

  1. Video conferencing
  2. Filing feedback
  3. Storing the client-side rendering of a Web-app's audio/visual output

It is also possible to crop the resulting video using Region Capture. This is useful in multiple scenarios:

  1. When video conferencing, cropping allows embedding the call in the same tab as captured content, without capturing the video conference itself.
  2. When filing feedback, cropping helps remove irrelevant, and potentially private information, from the report.
  3. When producing local recordings of an app's output, cropping lets the app remove its own UI elements if it deems them irrelevant to the captured content. For example, video-editing software can crop away the progress bar.
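As a rough sketch of the state of the art described above, this is how tab capture and Region Capture are typically combined. `getDisplayMedia()`, `CropTarget.fromElement()` and `track.cropTo()` are existing APIs (Region Capture is Chromium-only at the time of writing). The helper name `captureCroppedTo` and the injected `mediaDevices` / `CropTargetCtor` parameters are assumptions made so the function can be exercised outside a browser.

```javascript
// Sketch: capture a tab, then crop the video track to one element.
// In a page, pass navigator.mediaDevices and the global CropTarget.
async function captureCroppedTo(element, mediaDevices, CropTargetCtor) {
  const stream = await mediaDevices.getDisplayMedia({
    video: true,
    // Chrome-specific hint that nudges the user towards the current tab;
    // other browsers may ignore it.
    preferCurrentTab: true,
  });
  const [track] = stream.getVideoTracks();
  // Both calls below return promises.
  const cropTarget = await CropTargetCtor.fromElement(element);
  await track.cropTo(cropTarget);
  return track;
}
```

Cropping only removes pixels outside the element's bounding box. Anything drawn over or under the element inside that box is still captured, which is the limitation this issue is about.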

Issues

The main technique available nowadays is to use getDisplayMedia() and Region Capture; in some ways, it's the only technique. One major downside is that this captures both occluding and occluded content.

Consider this example, where the progress bar is occluding content which is unintentionally captured:

Unintentionally capturing occluding content is usually the problem, but occluded content is also a concern if the target element is partially transparent or has bevelled edges.

Proposed Solution

Add a method along the lines of:

partial interface HTMLElement {
    Promise<MediaStream> capture(optional double frameRequestRate);
};

Initially, scope the discussion to a MediaStream with a single video MediaStreamTrack. Audio is an obvious follow-up, to be discussed later.

Successful invocations of this method are only possible when:

  1. The calling context is cross-origin isolated.
  2. All captured content (that is, the target element and its descendants) has opted in to being captured via a bespoke Required Document Policy.
  3. Possibly, the user has approved some prompt; to be discussed.

The second requirement could be tricky, as sibling elements could affect the shape of the target element, and so information could leak. As a start, we could go with requiring opt-in from all of the content in the page, and later explore if this requirement can be safely reduced in scope for elements whose rendering is unaffected by adjacent elements, perhaps via some CSS property or some other sandboxing property.

The necessity of a user prompt is debatable. Ideally, requirements 1 and 2 should ensure that any content captured is either already known to the capturing page or can be known to it via communication with embedded documents. (If those documents opted in to being captured, it is arguable that they would be willing to transmit such information.) However, some non-DOM information could leak, like link-purpling. It is therefore best to start with a required user prompt, and re-examine that requirement later.
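A small sketch of what a page could check up front under these requirements. `crossOriginIsolated` is an existing global, and it is only true when the document is served with `Cross-Origin-Opener-Policy: same-origin` and a suitable `Cross-Origin-Embedder-Policy` such as `require-corp`. The Required Document Policy opt-in is not specified yet, so this sketch cannot check it. The helper name is hypothetical, and `HTMLElement.prototype.capture` is the proposed method, not an existing one.

```javascript
// Returns a list of preconditions from the proposal that are not met.
// `scope` defaults to the global object; it is a parameter only so the
// function can be tried with a stand-in object.
function missingCapturePreconditions(scope = globalThis) {
  const missing = [];
  // Requirement 1: cross-origin isolation (existing global boolean).
  if (scope.crossOriginIsolated !== true) {
    missing.push("cross-origin isolation");
  }
  // The proposed method itself (hypothetical, from this issue).
  const proto = scope.HTMLElement && scope.HTMLElement.prototype;
  if (!proto || typeof proto.capture !== "function") {
    missing.push("HTMLElement.prototype.capture");
  }
  // Requirement 2 (document policy opt-in) has no defined shape yet,
  // so there is nothing to test for here.
  return missing;
}
```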

Privacy & Security Considerations

Pending.

@eladalon1983
Author

@florentpergoud

It could also help for tools like https://www.chromatic.com/ for visual regression testing !

@maxweber

maxweber commented Oct 20, 2022

We use the browser to create videos for Storrito.com and Audiocado.com; the described feature would have saved me many headaches.

I would prefer access to an element's raw pixel buffer. Converting this to JPG, PNG, or compressed video frame beforehand would significantly slow down all video rendering efforts.

@JonnyBurger

I am also very interested in this API and second that it will be important to get the frame in an uncompressed way.
For Remotion, we will be able to provide in-browser video rendering if this API ships and generally, this would open the door for web-based offline video editors to become more powerful (the status quo is that only canvas-based graphics can be exported as images).

@arpitmittaldev

arpitmittaldev commented Oct 21, 2022

This is going to be a big thing, because many websites are trying to capture elements in order to generate images, as I am doing at https://startupholic.com/post-thumbnail-maker. (For now, I have been capturing images through a canvas-based workaround.)

@Awendel

Awendel commented Oct 23, 2022

The proposed API would also work very well together with my pixel-density property proposal:
#74

This would allow an element to be captured as a raw pixel buffer, but also having the ability to scale the resolution up or down.
(by setting the pixel density before capturing the element)
So an element that has only 200x500 screen pixels could be captured at a much higher resolution (e.g. 600x3000) for much higher quality exports.

This would also work well with designers' workflow of exporting assets at different resolutions for different DPI environments (usually 0.5x, 1x, 2x, 4x).

@yoavweiss
Collaborator

A repo is now live at https://github.com/WICG/element-capture

Happy incubating!!

@esprehn

esprehn commented Oct 25, 2022

It's a small detail, but this should probably be a global function (e.g. getElementCapture) or have a much less generic (and longer) name. Adding a property to HTMLElement prevents adding a reflected attribute named capture to any element forever. In general, adding things high up in the class hierarchy is not great for the long-term evolution of the web.

My suggestion is to extend getDisplayMedia to accept an element option. Capturing an element or the screen/tab is pretty much the same thing security wise. Since this will prompt anyway, that's a small surface area change.

@AdamJaggard

I'm a fan of extending getDisplayMedia from a DX and UX perspective. Users are familiar with it from a permissions flow. I have use cases where I want to constrain the user's choice to only share a specific DOM tree; providing a reference to an element to getDisplayMedia feels very natural.

@eladalon1983
Author

eladalon1983 commented Oct 28, 2022

It's a small detail, but this should probably be a global function

One possibility I would like would be navigator.mediaDevices.getElementMedia(), to keep it close to getDisplayMedia.

I'm a fan of extending getDisplayMedia from a DX and UX perspective. Users are familiar with it from a permissions flow. I have use cases where I want to constrain the user's choice to only share a specific DOM tree; providing a reference to an element to getDisplayMedia feels very natural.

Communicating to the user what kind of permissions they would be giving is going to be quite tricky whichever way we choose, because the nature and contents of an Element can change; think of the contents of a div or iframe, for instance. For that reason, I am mostly thinking about prompting the user for permission to capture the entire tab, and then extrapolating from that permission a permission to capture occluded elements. The delta in permission seems manageable to me. I intend to do a write-up on that. If we do end up going with that approach, then after prompting the user once to share their entire tab, you'd be able to capture any and all elements.

@Awendel

Awendel commented Oct 29, 2022

Why would this API even need permissions in the first place?

Unless it is for like a 3rd party iframe / cross origin resource.

Since I could just as well replicate the functionality with a library like html2canvas and get the underlying pixel data without having to ask for permissions.

I understand it when you share your desktop etc., since that goes beyond just the current browser window, but since all HTML data etc. is already accessible, why would getting the pixel data of an element suddenly be an elevated security risk?

For a lot of use cases, like capturing a preview or something, it would be very strange for a user to be suddenly prompted with a security warning, which might induce FUD for no reason.

@Awendel

Awendel commented Oct 29, 2022

Just to clarify further, one of the main use cases for the API would be to generate previews of content
(think Google Docs, Figma File etc.)
This currently is only possible server side.

If this API should truly replace that need for doing it server side, then permissions are the wrong approach.
What if a user denies it and this breaks your workflow?
Then you'd still have to implement it server side, which defeats the whole point of an API.

To summarize:
Permissions only make sense for cross origin content.
Since all DOM data is already accessible to the developer and could be used in a custom rasterizer (such as html2canvas), adding permissions to this makes no sense, because a developer could circumvent them anyway going the latter approach, albeit with more effort.

@AdamJaggard

I could just as well replicate the functionality with a library like html2canvas and get the underlying pixel data

For anyone not aware, it's worth mentioning that html2canvas only attempts to re-draw the DOM using its own heuristics and the drawing tools available in the canvas. It has no way of asking the browser for an image of any portion of the page. It is imperfect for many edge cases and requires work-arounds if the content it's drawing has anything cross-origin anywhere in the tree it's trying to draw (you can taint the canvas very easily).

From the user's perspective, this feature adds the ability to create a live video stream of any part of the page and stream it out to anywhere. As a user, if I visit a webpage, I don't expect a livestream of what I'm looking at to start happening without my awareness.

So for me the crux of it is that

  • No solution currently exists to perfectly capture any portion of the page, even with permission. Only approximations can be made, and realistically only static images.
  • This feature would break most casual users' expectations of what is possible without their awareness: a live video stream capturing what they are looking at.

With the prevalence of 3rd party scripts being included on websites these days, including the trend of serving these from 1st party domains to avoid blockers, giving all of these scripts the ability to flip on a livestream without intervention feels like a step too far.

@Awendel

Awendel commented Oct 29, 2022

A script could just as easily record snapshots of document.body.innerHTML and "replay" the session on a backend if it had malicious intent, without the user ever knowing.

I really fail to see any added security benefit from asking for permissions.

Any data point that would go into creating a screenshot / video is already available to the browser (via parsing .getBoundingClientRect and computed style, as well as capturing event Listeners).

@Awendel

Awendel commented Oct 29, 2022

The modern internet is already a nightmare in terms of forced cookie banners on every site that require user consent.
Adding on top of that more popups / prompts for security permissions should only be done if it opens up capabilities otherwise sandboxed by the browser and not for things that are already available / could be easily replicated.

@AdamJaggard

Understand where you're coming from, I think it's a very good point. From a pure security perspective this doesn't expose data not already exposed in some other fashion to the code running on the page (I think?). In that sense, I'm on the same page.

For me it comes down to the user being aware of what is happening / understanding what the browser is doing on their behalf (facilitating the video capture). That could be a permission flow, it could also be a box around the streamed area or a recording icon in the tab.

Without any indication that it's happening I would personally just disable the feature. I realise I might be an outlier here though 😛

@Awendel

Awendel commented Oct 30, 2022

If the tab indicates that a capture is happening (especially for longer time frames such as a video stream), then that could be a good non-invasive compromise without disrupting user / developer flow.

Similar to how it's done for tabs that play audio / capture webcam etc.

In terms of API, the other main important thing is compression.

It will be crucial to expose an API that is able to capture an uncompressed screenshot (png / uint8Array).
In terms of video, it becomes more difficult, since holding long uncompressed videos in memory can quickly overflow system memory (e.g. Blender exports all video frames as png to disk in a first step before "assembling" them to a video with whatever level of compression chosen).

Forcing compression on the video stream would be a solution, but it would break use cases that need high-quality video exports (think video editors etc.)

@JonnyBurger

JonnyBurger commented Oct 30, 2022

Regarding the security model:
Safeguards need to be put in place so that a website cannot, for example, load a third party <iframe> into the DOM and then capture a screenshot from the perspective of the user.

For the canvas element, as soon as a third party resource gets used to draw on a canvas (for example loading a network image without CORS), the canvas gets "tainted" and it is not possible to grab an image from it anymore.

If it's hard for browsers to handle these security concerns, then a permission prompt should be introduced. If the browser already has a notion of whether the DOM is tainted by third party content, then it would be cool to use it and allow capture of locally generated content without a permission prompt.

@eladalon1983
Author

@Awendel:

  • The presence of cross-origin content is quite common, and must be accounted for. Furthermore, cross-origin content can be loaded after the capture starts.
  • You can capture a single frame by plugging the track into a MediaStreamTrackProcessor.

@JonnyBurger, the proposal assumes (1) cross-origin isolation and (2) an opt-in header, likely in the form of a Required Document Policy. At the moment, this is briefly covered in the original message in this thread. I will expand on this later.

To ensure good velocity, I am starting out with the assumption of a prompt.
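A minimal sketch of the single-frame approach mentioned above, assuming a video track has already been obtained by some means. `MediaStreamTrackProcessor` is an existing API (Chromium-only at the time of writing) whose `readable` stream yields `VideoFrame` objects. The helper name `grabSingleFrame` and the injected `ProcessorCtor` parameter are assumptions made for this example; in a page you would pass the global `MediaStreamTrackProcessor`.

```javascript
// Reads exactly one VideoFrame from a video MediaStreamTrack.
// The caller owns the returned frame and must call frame.close() on it.
async function grabSingleFrame(track, ProcessorCtor) {
  const processor = new ProcessorCtor({ track });
  const reader = processor.readable.getReader();
  try {
    const { value: frame, done } = await reader.read();
    if (done || !frame) {
      throw new Error("track ended before a frame arrived");
    }
    return frame;
  } finally {
    // Stop pulling frames from the processor; the track itself keeps running.
    await reader.cancel();
  }
}
```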

@eladalon1983
Author

Having thought about this in the meantime, another approach occurs to me, which is relevant if the user has already initiated capture using pre-existing means such as getDisplayMedia(). In that case, the app already holds a video track, and that track can be mutated.

So where we already have Region Capture mutate an existing video track:

track.cropTo(cropTarget);  // Region Capture

We could now have:

track.restrictCaptureTo(cropTarget);  // Element Capture

The resulting track would behave the same as previously, but:

  1. Adoption by applications will be much easier, since no new requirements would be introduced (no requirement of cross-origin isolation, etc.)
  2. There is a clear path for transitioning a given capture between different targets without reprompting the user.
  3. Element Capture of an element in another tab becomes possible.
  4. Audio is initially out of scope, but the path forward is clear: mutate the audio track in similar fashion. This path is more flexible, as the video and audio don't have to be "focused" on the same target.
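A hedged sketch of how an application might adopt this shape on top of an existing capture. `restrictCaptureTo()` is the name used in this comment for a proposed method, so it may not exist (or may end up with a different name) in any given browser. The sketch therefore feature-detects it and reports whether the restriction was applied. The helper name `restrictToElement` is made up for this example.

```javascript
// Tries to restrict an already-captured video track to a single element.
// Returns true if the (proposed) method exists and was applied,
// false if the caller should keep using the unrestricted track.
async function restrictToElement(track, cropTarget) {
  if (typeof track.restrictCaptureTo !== "function") {
    return false;
  }
  // Awaited in case it returns a promise, as cropTo() does today;
  // awaiting a non-promise value is harmless.
  await track.restrictCaptureTo(cropTarget);
  return true;
}
```

Because the track comes from a capture the user has already approved, no new prompt or isolation requirement appears in this flow, which is the first point in the list above.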

@AdamJaggard

This path is more flexible, as the video and audio don't have to be "focused" on the same target.

This bit is very interesting. I'm imagining capturing an element that represents a video game being played, say a canvas, but I need to define where the audio should be captured from. That could be a separate DOM tree that contains some <audio> elements or it could be "capture audio from the whole tab" and I'll play sounds programmatically with JS.

Maybe audio source can be the second argument? track.restrictCaptureTo(cropTarget, audioSource); with audioSource being either a reference to an element, (an existing AudioTrack?) or the window (or whatever makes more sense to say "everything").

@eladalon1983
Author

eladalon1983 commented Jan 16, 2023

Maybe audio source can be the second argument? track.restrictCaptureTo(cropTarget, audioSource); with audioSource being either a reference to an element, (an existing AudioTrack?) or the window (or whatever makes more sense to say "everything").

There are separate tracks for audio and video, so I think it's enough to audioTrack.restrictCaptureTo(x) and videoTrack.restrictCaptureTo(y), with either distinct or identical x and y as the need arises. (Implementation-wise, I foresee Chrome focusing on video first.)

@AdamJaggard

Ah of course, I forgot for a sec we are operating on the videoTrack

@gut4

gut4 commented Jan 29, 2023

@eladalon1983

track.restrictCaptureTo(cropTarget); // Element Capture

It's interesting, but with this approach it will be difficult to capture multiple elements.

There is a draft of getDisplayMediaSet for capturing multiple display surfaces, but it's about capturing multiple tabs or windows, not multiple cropTargets in the current tab.

@eladalon1983
Author

eladalon1983 commented Jan 30, 2023

It's interesting but with this approach it will be difficult to capture multiple elements.

You could:

const track2 = track1.clone();
track1.restrictCaptureTo(cropTarget1);
track2.restrictCaptureTo(cropTarget2);

Implementation-wise, this will require some work in Chrome to support. But in terms of API shape, this should work.

@Zubnix

Zubnix commented Mar 27, 2023

There's also a more general requirement to be able to rasterize DOM elements without the output format being video.
By being able to rasterize to eg ImageBitmap first, one can still feed the image to a MediaStream using other existing APIs to achieve the same effect as this proposal.

I realise semantics and use cases can be different as the rasterized dom elements can be updated at any time which is automatically handled by the MediaStream but I guarantee that offering just a MediaStream API it will be (ab)used for the [snapshot] rasterizing use-case given the plethora of hacky solutions that exist:

Something to think about?

@eladalon1983
Author

There's also a more general requirement to be able to rasterize DOM elements without the output format being video. By being able to rasterize to eg ImageBitmap first, one can still feed the image to a MediaStream using other existing APIs to achieve the same effect as this proposal.

I realise semantics and use cases can be different as the rasterized dom elements can be updated at any time which is automatically handled by the MediaStream but I guarantee that offering just a MediaStream API it will be (ab)used for the [snapshot] rasterizing use-case given the plethora of hacky solutions that exist:

Something to think about?

Definitely screenshot-production is an important use-case that interests many (me included). I think we should keep in mind that:

  1. The currently proposed API shape relies on pre-existing means for obtaining the user's permission to capture cross-origin content. Moving away from this pre-existing means would require coming up with something new. The complexity involved should not be underestimated.
  2. There is nothing "abusive" in grabbing individual frames out of a video stream. It's a legitimate use of the API.
  3. It is much easier to grab single frames out of a video stream, than to compose multiple images into a video stream. Think of the individual timestamps, a/v sync, and the requirement to obtain the user's permission 30 times per second... :-P

@Zubnix

Zubnix commented Apr 10, 2023

There's also a more general requirement to be able to rasterize DOM elements without the output format being video. By being able to rasterize to eg ImageBitmap first, one can still feed the image to a MediaStream using other existing APIs to achieve the same effect as this proposal.
I realise semantics and use cases can be different as the rasterized dom elements can be updated at any time which is automatically handled by the MediaStream but I guarantee that offering just a MediaStream API it will be (ab)used for the [snapshot] rasterizing use-case given the plethora of hacky solutions that exist:

Something to think about?

Definitely screenshot-production is an important use-case that interests many (me included). I think we should keep in mind that:

  1. The currently proposed API shape relies on pre-existing means for obtaining the user's permission to capture cross-origin content. Moving away from this pre-existing means would require coming up with something new. The complexity involved should not be underestimated.
  2. There is nothing "abusive" in grabbing individual frames out of a video stream. It's a legitimate use of the API.
  3. It is much easier to grab single frames out of a video stream, than to compose multiple images into a video stream. Think of the individual timestamps, a/v sync, and the requirement to obtain the user's permission 30 times per second... :-P

1 is a valid point, however for 2 & 3:
It would be a waste of CPU and slow as well to encode a frame x times/second, just to immediately decode just a single frame, and with visual loss. You'd also lose the alpha channel because video codecs don't support this. That's the 'abusive' part of this, you'd be using something for something it was never designed to do.

Perhaps the option of a 'raw' video stream with raw rgba pixel 'encoding' and a single frame (0fps options) would allow for greater freedom of use here? It would adhere to the spec proposal, would be element capture api compatible and solve the rasterization use-case.
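For what it's worth, a rough sketch of reading uncompressed pixels out of a single frame, assuming the frame is a WebCodecs `VideoFrame` (for example, one read from a `MediaStreamTrackProcessor`). `allocationSize()`, `copyTo()`, `format`, `codedWidth`, `codedHeight` and `close()` are existing `VideoFrame` members. The pixel format is chosen by the browser and is not necessarily RGBA, so the sketch returns `frame.format` alongside the bytes. The helper name `readRawPixels` is made up for this example.

```javascript
// Copies the raw pixel data of a VideoFrame into a Uint8Array and
// closes the frame. The layout describes offset/stride per plane.
async function readRawPixels(frame) {
  const pixels = new Uint8Array(frame.allocationSize());
  const layout = await frame.copyTo(pixels);
  // Read the metadata before close(), after which it is no longer valid.
  const result = {
    pixels,
    layout,
    format: frame.format,
    width: frame.codedWidth,
    height: frame.codedHeight,
  };
  frame.close();
  return result;
}
```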

@eladalon1983
Author

eladalon1983 commented May 4, 2023

It would be a waste of CPU and slow as well to encode a frame x times/second

Indeed. And that's why I had previously proposed a dedicated API for screenshots. And such an API could be extended to support capturing an element. Until such a screenshot-grabbing API is introduced, the current Element Capture API can be used to polyfill, but it's of course imperfect when tackling missions other than that to which it is tailored (video).

and with visual loss

Why would there be visual loss? There should be no lossy encoding involved.

You'd also lose the alpha channel because video codecs don't support this.

That's an interesting consideration. Thanks for raising it. I'll keep it in mind.

That's the 'abusive' part of this, you'd be using something for something it was never designed to do.

I don't think the word "abusive" applies here. I would associate it more with the idea that the way the API is used is harmful to the user, which implies the misuse should somehow be blocked.
