But what about the ringbuffer control registers (head, tail, etc.)? When you want to submit a workload to the GPU you: A) choose your context, B) find its appropriate virtualized ring, C) write commands to it and then, finally, D) tell the GPU to switch to that context. Now that ringbuffers belong per-context (and not per-engine, like before) and that contexts are uniquely tied to a given engine (and not reusable, like before), we need one ringbuffer and one backing object per engine inside each context.
The global default context starts its life with these new objects fully allocated and populated. Local contexts cannot be populated up front, since we do not know at creation time which engines they will use; to handle this, we have implemented a deferred creation of LR contexts: the local context starts its life as a hollow or blank holder that only gets populated for a given engine once we receive an execbuffer for it.
This method works as follows: when a request is committed, its commands (the BB start and any leading or trailing commands, like the seqno breadcrumbs) are placed in the ringbuffer for the appropriate context. The tail pointer in the hardware context is not updated at this time, but is instead kept by the driver in the ringbuffer structure.
Otherwise the queue will be processed during a context switch interrupt. When execution of a request completes, the GPU updates the context status buffer with a context complete event and generates a context switch interrupt. During the interrupt handling, the driver examines the events in the buffer: for each context complete event, if the announced ID matches the one at the head of the request queue, then that request is retired and removed from the queue.
After processing, if any requests were retired and the queue is not empty, then a new execution list can be submitted. The two requests at the front of the queue are next to be submitted, but since a context may not occur twice in an execution list, if subsequent requests have the same ID as the first then the requests must be combined. This is done simply by discarding requests at the head of the queue until either only one request is left (in which case we use a NULL second context) or the first two requests have unique IDs.
By always executing the first two requests in the queue the driver ensures that the GPU is kept as busy as possible. In the case where a single context completes but a second context is still executing, the request for this second context will be at the head of the queue when we remove the first one. This request will then be resubmitted along with a new request for a different context, which will cause the hardware to continue executing the second request and queue the new request (the GPU detects the condition of a context getting preempted with the same context and optimizes the context switch flow by not doing preemption, but just sampling the new tail pointer).
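As a rough illustration of the queue handling described above, here is a minimal sketch in plain C; the request structure and the two-port execution list are simplified stand-ins for the driver's real bookkeeping, not its actual types:

    #include <stddef.h>

    /* Simplified stand-in for a queued request: only the context ID and
     * the queue linkage matter for this sketch (the ringbuffer tail is
     * implied to travel with the request). */
    struct request {
        unsigned int ctx_id;
        struct request *next;
    };

    /* Pick the next execution list (up to two context slots).  Requests at
     * the head that share the first context's ID are superseded by the
     * newer request, matching the "discard until unique or single" rule. */
    static void pick_execlist(struct request **queue,
                              struct request **port0, struct request **port1)
    {
        struct request *first = *queue;

        if (!first) {
            *port0 = *port1 = NULL;
            return;
        }

        /* Discard head requests whose successor has the same context ID. */
        while (first->next && first->next->ctx_id == first->ctx_id)
            first = first->next;

        *queue = first;          /* older same-context requests are dropped */
        *port0 = first;
        *port1 = first->next;    /* NULL second context if only one is left */
    }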
The context descriptor encodes various attributes of a context, including its GTT address and some flags. Historically, an object could be bound in the global GTT only as a single instance, with a view representing all of the object's backing pages in a linear fashion; this view will be called a normal view.
To support multiple views of the same object, where the number of mapped pages is not equal to the backing store, or where the layout of the pages is not linear, the concept of a GGTT view was added.
One example of an alternative view is a stereo display driven by a single image. In this case we would have a framebuffer looking like this (2x2 pages):

    12
    34

Above would represent a normal GGTT view as normally mapped for GPU or CPU rendering. In contrast, fed to the display engine would be an alternative view which could look something like this:

    1212
    3434

In this example both the size and layout of pages in the alternative view are different from the normal view. This table is stored in the vma.
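Conceptually, a GGTT view can be thought of as a remap table from mapping positions to backing-store pages. The sketch below models the stereo example above in plain C; the arrays and the PTE encoding are illustrative assumptions, not the driver's actual view structures:

    #include <stdint.h>
    #include <stddef.h>

    /* Entry i of a view names which backing page appears at position i of
     * the mapping.  For the 2x2-page stereo example, the normal view is the
     * identity mapping while the alternative view repeats each row. */
    static const unsigned int normal_view[]      = { 0, 1, 2, 3 };       /* 12 / 34 */
    static const unsigned int alternative_view[] = { 0, 1, 0, 1,
                                                     2, 3, 2, 3 };        /* 1212 / 3434 */

    /* Bind a view: write one PTE per view entry, pointing at the chosen
     * backing page.  The "valid" bit here is purely illustrative. */
    static void bind_view(uint64_t *ptes, const uint64_t *page_dma_addrs,
                          const unsigned int *view, size_t view_pages)
    {
        for (size_t i = 0; i < view_pages; i++)
            ptes[i] = page_dma_addrs[view[i]] | 1;
    }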
The function searches for an existing PPAT entry that matches the required value. If a perfect match is found, the existing PPAT entry is used. If only a partial match is found, it checks whether a free PPAT index is available; if not, the partially matched entry is used. When an entry whose PPAT index was dynamically allocated is released, its reference count is decreased; once the reference count reaches zero, the PPAT index becomes free again.
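The lookup policy is easier to follow as code. The following is a simplified model of that policy; the entry layout, the table size and the partial-match test are assumptions for illustration, not the actual PPAT implementation:

    #include <stdbool.h>
    #include <stdint.h>

    #define PPAT_ENTRIES 8   /* illustrative size, not the hardware value */

    struct ppat_entry {
        uint32_t value;      /* cacheability/target bits programmed in the PPAT */
        unsigned int refcnt; /* 0 means the index is free */
    };

    static struct ppat_entry ppat[PPAT_ENTRIES];

    /* Return an index for @value: prefer a perfect match, then a free
     * index, then the closest partial match. */
    static int ppat_get(uint32_t value)
    {
        int partial = -1, free_idx = -1;

        for (int i = 0; i < PPAT_ENTRIES; i++) {
            if (ppat[i].refcnt && ppat[i].value == value) {
                ppat[i].refcnt++;          /* perfect match: reuse it */
                return i;
            }
            if (!ppat[i].refcnt && free_idx < 0)
                free_idx = i;              /* remember a free index */
            if (ppat[i].refcnt && partial < 0 &&
                (ppat[i].value & 0x3) == (value & 0x3))
                partial = i;               /* assumed partial-match criterion */
        }

        if (free_idx >= 0) {               /* program a new entry */
            ppat[free_idx].value = value;
            ppat[free_idx].refcnt = 1;
            return free_idx;
        }
        if (partial >= 0) {                /* settle for the partial match */
            ppat[partial].refcnt++;
            return partial;
        }
        return -1;                         /* no usable entry */
    }

    /* Drop a reference; a dynamically allocated index becomes free at zero. */
    static void ppat_put(int idx)
    {
        if (idx >= 0 && ppat[idx].refcnt)
            ppat[idx].refcnt--;
    }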
If the node does not fit, it tries to evict any overlapping nodes from the GTT, including any neighbouring nodes if the colors do not match (to ensure guard pages between differing domains).
The hole address is aligned to alignment and its size must then fit entirely within the [start, end] bounds. The nodes on either side of the hole must match color, or else a guard page will be inserted between the two nodes (or the node evicted). If no suitable hole is found, first a victim is randomly selected and tested for eviction; otherwise the LRU list of objects within the GTT is scanned to find the first set of replacement nodes that would create the hole.
Those old overlapping nodes are evicted from the GTT and so must be rebound before any future use. This function force-removes any fence from the given object, which is useful if the kernel wants to do untiled GTT access.
When mapping objects through the GTT, userspace wants to be able to write to them without having to worry about swizzling if the object is tiled. This function is used to reserve a fence register for vGPU use. Removes all GTT mmappings via the fence registers; this forces any user of the fence to reacquire that fence before continuing with their access. One use is during GPU reset, where the fence register is lost and we need to revoke concurrent userspace access via GTT mmaps until the hardware has been reset and the fence registers have been restored.
Restore the hw fence state to match the software tracking again, to be called after a gpu reset and on resume. Note that on runtime suspend we only cancel the fences, to be reacquired by the user later.
This is called when pinning backing storage again, since the kernel is free to move unpinned backing storage around either by directly moving pages or by swapping them out and back in again. This must be called before the backing storage can be unpinned. Each platform has only a fairly limited set of these objects.
Fences are used to detile GTT memory mappings. Furthermore on older platforms fences are required for tiled objects used by the display engine. The idea behind tiling is to increase cache hit rates by rearranging pixel data so that a group of pixel accesses are in the same cacheline.
Intel architectures make this somewhat more complicated, though, by adjustments made to addressing of data when the memory is in interleaved mode (matched pairs of DIMMs) to improve memory bandwidth. For interleaved memory, the CPU sends every sequential 64 bytes to an alternate memory channel so it can get the bandwidth from both. The GPU also rearranges its accesses for increased bandwidth to interleaved memory, and it matches what the CPU does for non-tiled buffers.
However, when tiled it does it a little differently, since one walks addresses not just in the X direction but also Y. So, along with alternating channels when bit 6 of the address flips, it also alternates when other bits flip: bits 9 (every 512 bytes, an X tile scanline) and 10 (every two X tile scanlines) are common to both the 915 and 965 classes of hardware.
The CPU also sometimes XORs in higher bits as well, to improve bandwidth doing strided access like we do so frequently in graphics. All of this bit 6 XORing has an effect on our memory management, as we need to make sure that the 3d driver can correctly address object contents. When bit 17 is XORed in, we simply refuse to tile at all.
Bit 17 is not just a page offset, so as we page an object out and back in, individual pages in it will have different bit 17 addresses, resulting in each 64 bytes being swapped with its neighbor! Return the required global GTT size for a fenced view of a tiled object, taking into account potential fence register mapping. Return the required global GTT alignment for a fenced view of a tiled object, taking into account potential fence register mapping.
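For reference, the bit-6 swizzling described above amounts to a small address fixup when software copies tiled data. This sketch shows the two common schemes (bit 6 XORed with bit 9, or with bits 9 and 10); it is a standalone illustration, not the driver's helper functions:

    #include <stdint.h>

    /* "9/10" scheme: the channel-select bit 6 of a tiled address is XORed
     * with bits 9 and 10, so flip bit 6 whenever bit 9 and bit 10 differ. */
    static uint64_t swizzle_bit6_9_10(uint64_t addr)
    {
        uint64_t bit9  = (addr >> 9)  & 1;
        uint64_t bit10 = (addr >> 10) & 1;

        return addr ^ ((bit9 ^ bit10) << 6);
    }

    /* Simpler "9" scheme: only bit 9 is folded into bit 6. */
    static uint64_t swizzle_bit6_9(uint64_t addr)
    {
        return addr ^ (((addr >> 9) & 1) << 6);
    }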
Sets the tiling mode of an object, returning the required swizzling of bit 6 of addresses in the object. Since none of this applies to the new tiling layouts on modern platforms (like W, Ys and Yf tiling), GEM only allows object tiling to be set to X or Y tiled. This section documents the interface functions for evicting buffer objects to make space available in the virtual GPU address spaces.
Note that this is mostly orthogonal to shrinking buffer object caches, which has the goal of making main memory (shared with the GPU through the unified memory architecture) available. This function will try to evict vmas until free space satisfying the requirements is found. Callers must check first whether any such hole exists already before calling this function. Since this function is only used to free up virtual address space, it only ignores pinned vmas, and not objects where the backing storage itself is pinned.
To clarify: this is for freeing up virtual address space, not for freeing memory in e.g. the shrinker. This section documents the interface function for shrinking memory usage of buffer object caches. Shrinking is used to make main memory available.
Note that this is mostly orthogonal to evicting buffer objects, which has the goal of making space in GPU virtual address spaces. This function is the main interface to the shrinker; it will try to release up to target pages of main memory backing storage from buffer objects. Selection of the specific caches can be done with flags; this is e.g. useful when purgeable objects should be removed from caches preferentially. Note that it is not guaranteed that the released amount is actually available as free system memory, so code that needs to explicitly shrink buffer object caches (e.g. to avoid deadlocks in memory reclaim) must fall back to the shrink-all variant described below. Also note that any kind of pinning (both per-vma address space pins and backing storage pins at the buffer object level) results in the shrinker code having to skip the object.
It also first waits for and retires all outstanding requests in order to release backing storage for active objects. This should only be used in code that intentionally quiesces the GPU, or as a last-ditch effort when memory seems to have run out.
The main action required here is to load the GuC uCode into the device. This struct is the owner of a doorbell, a process descriptor and a workqueue (all of them inside a single gem object that contains all required pages for these elements). GuC stage descriptor: during initialization, the driver allocates a static pool of such descriptors and shares them with the GuC.
This stage descriptor lets the GuC know about the doorbell, workqueue and process descriptor. It then triggers an interrupt on the GuC via another register write (0xC4C8). The kernel driver polls waiting for this update and then proceeds.
Doorbells: doorbells are interrupts to the GuC uKernel. A doorbell is a single cache line (QW) mapped into process space. Work items: there are several types of work items that the host may place into a workqueue, each with its own requirements and limitations.
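A hedged sketch of how a work item submission and a doorbell ring fit together is shown below. The structure layouts and field names are placeholders (the real ones live in the GuC ABI definitions); the point is the ordering: the work item and the updated tail must be visible in memory before the doorbell cacheline is written:

    #include <stdint.h>
    #include <string.h>

    /* Placeholder layouts, only to show the sequence of steps. */
    struct wq_item   { uint32_t header; uint32_t ctx_desc; uint32_t ring_tail; uint32_t fence; };
    struct proc_desc { uint32_t wq_head; uint32_t wq_tail; uint32_t wq_size; };
    struct doorbell  { uint32_t db_status; uint32_t cookie; };

    /* Queue one work item and ring the doorbell.  Wrap-around of the item
     * itself across the end of the workqueue is ignored in this sketch. */
    static int submit_work_item(void *wq_base, struct proc_desc *pd,
                                struct doorbell *db, const struct wq_item *item)
    {
        uint32_t tail = pd->wq_tail;

        if ((tail + sizeof(*item)) % pd->wq_size == pd->wq_head)
            return -1;                          /* workqueue full */

        memcpy((char *)wq_base + tail, item, sizeof(*item));
        pd->wq_tail = (tail + sizeof(*item)) % pd->wq_size;

        __sync_synchronize();                   /* order stores before the doorbell */
        db->cookie++;                           /* touching this cacheline wakes the uKernel */
        return 0;
    }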
The firmware may or may not have modulus key and exponent data. The header, uCode and RSA signature are must-have components that will be used by the driver. The length of each component, which is all in dwords, can be found in the header.
In the case that modulus and exponent are not present in the firmware (a.k.a. a truncated image), the length values still appear in the header. The HuC firmware css header is different; however, the only difference is where the version information is saved. With full ppgtt enabled, each process using drm will allocate at least one translation table. These tracepoints are used to track creation and deletion of contexts; if full ppgtt is enabled, they also print the address of the vm assigned to the context. This tracepoint allows tracking of the mm switch, which is an important point in the lifetime of the vm in the legacy submission path.
This tracepoint is called only if full ppgtt is enabled. Gen graphics supports a large number of performance counters that can help driver and application developers understand and optimize their use of the GPU.
This i915 perf interface enables userspace to configure and open a file descriptor representing a stream of GPU metrics which can then be read as a stream of sample records.
Streams representing a single context are accessible to applications with a corresponding drm file descriptor, such that OpenGL can use the interface without special privileges.
Access to system-wide metrics requires root privileges by default, unless changed via the dev.i915.perf_stream_paranoid sysctl option. The interface was initially inspired by the core perf infrastructure, but there are some notable differences. Samples for an i915 perf stream capturing OA metrics will include a set of counter values packed in a compact HW specific format. The OA unit supports a number of different packing formats which can be selected by the user opening the stream.
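As a usage sketch, opening a system-wide OA stream from userspace goes through the i915 perf open ioctl with an array of property pairs. The property and flag names below come from the i915 uapi header, but the metric set ID, OA format and exponent values are placeholders that have to be chosen for the actual hardware:

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <drm/i915_drm.h>

    /* Open a system-wide OA stream.  metrics_set_id is advertised under
     * /sys/class/drm/cardN/metrics/, and oa_format/oa_exponent are HW
     * specific; all three are caller-supplied placeholders here. */
    static int open_oa_stream(int drm_fd, uint64_t metrics_set_id,
                              uint64_t oa_format, uint64_t oa_exponent)
    {
        uint64_t properties[] = {
            DRM_I915_PERF_PROP_SAMPLE_OA,      1,
            DRM_I915_PERF_PROP_OA_METRICS_SET, metrics_set_id,
            DRM_I915_PERF_PROP_OA_FORMAT,      oa_format,
            DRM_I915_PERF_PROP_OA_EXPONENT,    oa_exponent,  /* periodic sampling rate */
        };
        struct drm_i915_perf_open_param param = {
            .flags = I915_PERF_FLAG_FD_CLOEXEC | I915_PERF_FLAG_FD_NONBLOCK,
            .num_properties = sizeof(properties) / (2 * sizeof(uint64_t)),
            .properties_ptr = (uintptr_t)properties,
        };

        /* On success this returns a stream fd that can be read() for
         * sample records. */
        return ioctl(drm_fd, DRM_IOCTL_I915_PERF_OPEN, &param);
    }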
This function is called at the i915 driver unloading stage, to shut down GVT components and release the related resources.
This file is intended as a central place to implement most [1] of the required workarounds for hardware to work as originally intended. Keep things in this file ordered by WA type, as per the above (context, GT, display, register whitelist, batchbuffer).
Then, inside each type, keep the workarounds grouped and ordered consistently. This section covers everything related to the display hardware, including the mode setting infrastructure, plane, sprite and cursor handling, and display output probing and related topics.
The i915 driver is thus far the only DRM driver which doesn't use the common DRM helper code to implement mode setting sequences. Thus it has its own tailor-made infrastructure for executing a display configuration change. Many features require us to track changes to the currently active frontbuffer, especially rendering targeted at the frontbuffer.
The functions in this file are then called when the contents of the frontbuffer are invalidated, when frontbuffer rendering has stopped again (to flush out all the changes) and when the frontbuffer is exchanged with a flip. Subsystems interested in frontbuffer changes (e.g. PSR or FBC) hook into these notifications. On a high level there are two types of powersaving features.
The first type works like a special cache (FBC and PSR) and is interested in when it should stop caching and when to restart caching. This is done by placing callbacks into the invalidate and the flush functions: at invalidate time the caching must be stopped, and at flush time it can be restarted.
They may also need to know when the frontbuffer changes (e.g. through a page flip). The other type of display power saving feature only cares about busyness (e.g. DRRS); in that case all three events (invalidate, flush and flip) indicate busyness. There is no direct way to detect idleness. Instead, an idle timer work (delayed work) should be started from the flush and flip functions and cancelled as soon as busyness is detected.
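The busyness-only case described above is commonly built on a delayed work item. Below is a minimal kernel-style sketch, assuming illustrative names (it is not the driver's actual DRRS or FBC code): invalidate cancels any pending idleness declaration, while flush and flip restart the idle timer:

    #include <linux/kernel.h>
    #include <linux/workqueue.h>
    #include <linux/jiffies.h>

    struct busy_tracker {
        struct delayed_work idle_work;
        void (*enter_low_power)(struct busy_tracker *bt);
    };

    #define IDLE_TIMEOUT msecs_to_jiffies(1000)  /* illustrative timeout */

    /* Invalidate hook: rendering started, so stop any pending idleness
     * declaration. */
    static void busy_tracker_invalidate(struct busy_tracker *bt)
    {
        cancel_delayed_work(&bt->idle_work);
    }

    /* Flush/flip hooks: activity just completed, (re)start the idle timer. */
    static void busy_tracker_flush(struct busy_tracker *bt)
    {
        mod_delayed_work(system_wq, &bt->idle_work, IDLE_TIMEOUT);
    }

    static void busy_tracker_idle_work(struct work_struct *work)
    {
        struct busy_tracker *bt =
            container_of(work, struct busy_tracker, idle_work.work);

        /* No invalidate/flush/flip for a full timeout: declare idleness. */
        bt->enter_low_power(bt);
    }

    static void busy_tracker_init(struct busy_tracker *bt)
    {
        INIT_DELAYED_WORK(&bt->idle_work, busy_tracker_idle_work);
    }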
This function gets called every time rendering on the given object starts, and frontbuffer caching (FBC, low refresh rate for DRRS, panel self refresh) must be invalidated. This function gets called every time rendering on the given object has completed and frontbuffer caching can be started again. This function gets called every time rendering on the given planes has completed and frontbuffer caching can be started again.
Flushes will get delayed if they're blocked by some outstanding asynchronous rendering. This function gets called after scheduling a flip on obj. If an invalidate happens in between this flush will be cancelled. This function gets called after the flip has been latched and will complete on the next vblank. It will execute the flush if it hasn't been cancelled yet. This is for synchronous plane updates which will happen on the next vblank and which will not get delayed by pending gpu rendering.
Both old and new can be NULL. The i915 driver checks for display fifo underruns using the interrupt signals provided by the hardware. This is enabled by default and fairly useful to debug display issues, especially watermark settings. If an underrun is detected this is logged into dmesg. To avoid flooding the logs and occupying the cpu, underrun interrupts are disabled after the first occurrence until the next modeset on a given pipe.
Note that underrun detection on gmch platforms is a bit more ugly since there is no interrupt despite that the signalling bit is in the PIPESTAT pipe interrupt register.
Also on some other platforms underrun interrupts are shared, which means that if we detect an underrun we need to disable underrun reporting on all pipes.
This function sets the fifo underrun state for pipe. It is used in the modeset code to avoid false positives since on many platforms underruns are expected when disabling or enabling the pipe. Notice that on some platforms disabling underrun reports for one pipe disables for all due to shared interrupts.
Actual reporting is still per-pipe though. Notice that a similar restriction applies on some PCHs, where underrun interrupts are shared. This handles a CPU fifo underrun interrupt, generating an underrun warning into dmesg if underrun reporting is enabled, and then disables the underrun interrupt to avoid an irq storm. This handles a PCH fifo underrun interrupt, generating an underrun warning into dmesg if underrun reporting is enabled, and then disables the underrun interrupt to avoid an irq storm.
Check for CPU fifo underruns immediately. Check for PCH fifo underruns immediately. This section covers plane configuration and composition with the primary plane, sprites, cursors and overlays. This includes the infrastructure to do atomic vsync'ed updates of all this state and also tightly coupled topics like watermark setup and computation, framebuffer compression and panel self refresh.
The functions here are used by the atomic plane helper functions to implement legacy plane updates (i.e., the legacy update-plane and disable-plane entry points). Allocates and returns a copy of the plane state (both common and Intel-specific) for the specified plane. This section covers output probing and related infrastructure like the hotplug interrupt storm detection and mitigation code.
Note that the i915 driver still uses most of the common DRM helper code for output probing, so those sections fully apply. Simply put, hotplug occurs when a display is connected to or disconnected from the system. However, there may be adapters, docking stations, DisplayPort short pulses and MST devices involved, complicating matters. The interrupt handlers gather the hotplug detect (HPD) information from relevant registers into a platform independent mask of hotplug pins that have fired. Finally, the userspace is responsible for triggering a modeset upon receiving the hotplug uevent, disabling or enabling the crtc as needed.
The hotplug interrupt storm detection and mitigation code keeps track of the number of interrupts per hotplug pin per a period of time, and if the number of interrupts exceeds a certain threshold, the interrupt is disabled for a while before being re-enabled. The intention is to mitigate issues arising from broken hardware triggering massive amounts of interrupts and grinding the system to a halt. Only the pin specific stats and state are changed; the caller is responsible for further action.
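The per-pin accounting described above can be sketched as a counter within a time window. The window length and threshold below are illustrative values, not the driver's tunables:

    #include <stdbool.h>
    #include <stdint.h>

    #define STORM_PERIOD_MS 1000   /* accounting window, illustrative */
    #define STORM_THRESHOLD 5      /* interrupts per window, illustrative */

    struct hpd_pin_stats {
        uint64_t window_start_ms;
        unsigned int count;
        bool storm;                /* interrupt left disabled until re-enabled later */
    };

    /* Returns true if this interrupt pushed the pin into "storm" state; the
     * caller is then responsible for masking the interrupt for a while. */
    static bool hpd_irq_storm_detect(struct hpd_pin_stats *pin, uint64_t now_ms)
    {
        if (now_ms - pin->window_start_ms > STORM_PERIOD_MS) {
            pin->window_start_ms = now_ms;   /* start a new accounting window */
            pin->count = 0;
        }

        if (++pin->count > STORM_THRESHOLD) {
            pin->storm = true;
            return true;
        }
        return false;
    }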
However, some older systems also suffer from short IRQ storms and must also track these. This is the main hotplug irq handler for all platforms. Here, we do hotplug irq storm detection and mitigation, and pass further processing to appropriate bottom halves. This function enables the hotplug support. From this point on hotplug and poll request can run concurrently to other code, so locking rules must be obeyed.
This is a separate step from interrupt enabling to simplify the locking rules in the driver load and resume code. This function enables polling for all connectors, regardless of whether or not they support hotplug detection. Under certain conditions HPD may not be functional. On most Intel GPUs, this happens when we enter runtime suspend. On Valleyview and Cherryview systems, this also happens when we shut off all of the powerwells.
The audio programming sequences are divided into audio codec and controller enable and disable sequences. The graphics driver handles the audio codec sequences, while the audio driver handles the audio controller sequences. The disable sequences must be performed before disabling the transcoder or port.
The enable sequences may only be performed after enabling the transcoder and port, and after completed link training. Indeed, most of the co-operation between the graphics and audio drivers is handled via audio related registers.
The notable exception is the power management, not covered here. The master can then start to use the interface defined by this struct. Each side can break the binding at any point by deregistering its own component, after which each side's component unbind callback is called. We ignore any error during registration and continue with reduced functionality (i.e. without HDMI audio). Motivation: Atom platforms (e.g. Valleyview and Cherrytrail) include an HDMI LPE audio interface as part of the GPU.
The interface is handled by a separate standalone driver maintained in the ALSA subsystem for simplicity. To minimize the interaction between the two subsystems, a bridge is set up between the hdmi-lpe-audio driver and i915; among other things, the platform device is made a child of the i915 device for runtime PM.
Threats: due to a restriction in the Linux platform device model, the user needs to manually unload the hdmi-lpe-audio driver before unloading the i915 module, otherwise we might run into use-after-free issues after i915 removes the platform device: even though the hdmi-lpe-audio driver is released, the module is still in "installed" status.
The PSR feature allows the display to go to lower standby states when the system is idle but the display is on, as it eliminates display refresh requests to DDR memory completely as long as the frame buffer for that display is unchanged. PSR saves power by caching the framebuffer in the panel RFB, which allows us to power down the link and memory controller. For DSI panels the same idea is called "manual mode". The hardware takes care of sending the required DP aux message and could even retrain the link (that part isn't enabled yet though).
The hardware also keeps track of any frontbuffer changes to know when to exit self-refresh mode again. Unfortunately that part doesn't work too well, which is why the i915 PSR support uses the software frontbuffer tracking to make sure it doesn't miss a screen update. Since the hardware frontbuffer tracking has gaps, we need to integrate with the software frontbuffer tracking.
This function gets called every time frontbuffer rendering starts and a buffer gets dirtied. This function gets called every time frontbuffer rendering has completed and flushed out to memory. FBC tries to save memory bandwidth (and so power consumption) by compressing the amount of memory used by the display. It is totally transparent to user space and completely handled in the kernel.
The benefits of FBC are mostly visible with solid backgrounds and variation-less patterns. It comes from keeping the memory footprint small and having fewer memory pages opened and accessed for refreshing the display.
However there are many known cases where we have to forcibly disable it to allow proper screen updates. FIXME: This should be tracked in the plane config eventually instead of queried at runtime for most callers. Notice that it doesn't activate FBC.
Without FBC, most underruns are harmless and don't really cause too many problems, except for an annoying message on dmesg.
With FBC, underruns can become black screens or even worse, especially when paired with bad watermarks. An underrun on any pipe already suggests that watermarks may be bad, so try to be as safe as possible. This function does the initial setup at driver load to make sure FBC is matching the real hardware. Display Refresh Rate Switching (DRRS) is a power conservation feature which enables switching between low and high refresh rates, dynamically, based on the usage scenario.
This feature is applicable for internal panels. DRRS is of 2 types: static and seamless. Static DRRS involves changing the refresh rate (RR) by doing a full modeset (which may appear as a blink on the screen) and is used in dock-undock scenarios.
Seamless DRRS, in contrast, switches RR without a full modeset; this is done by programming certain registers. The implementation is based on the frontbuffer tracking implementation: when there is no movement on screen, after a timeout of 1 second, a switch to low RR is made. DRRS can be further extended to support other internal panels and also the scenario of video playback, wherein RR is set based on the rate requested by userspace.
This function gets called when the refresh rate (RR) has to be changed from one frequency to another. Switches can be between the high and low RR supported by the panel, or to any other RR based on media playback (in this case, the RR value needs to be passed from user space).
This function gets called every time rendering on the given planes starts. This function gets called every time rendering on the given planes has completed, or a flip on a crtc is completed. Idleness detection should also be started again if no other planes are dirty. Returns the downclock mode if the panel supports it, else NULL. Each display PHY is made up of one or two channels.
Each channel houses a common lane part which contains the PLL and other common logic. In addition to having their own registers, the PHYs are also controlled through some dedicated signals from the display controller.
This is especially important when we cross the streams. That means the PLL is also now associated with the port rather than the pipe, and so the clock needs to be routed to the appropriate transcoder.
Display Context Save and Restore (CSR) firmware support was added from gen9 onwards to drive the newly added DMC (Display Microcontroller) in the display engine, which saves and restores the state of the display engine when it enters a low-power state and comes back to normal.
CSR firmware is read from a .bin file. Every time the display comes back from a low power state this function is called to copy the firmware from internal memory to registers. This function is called at the time of loading the display driver to read the firmware from the .bin file. Prepare the DMC firmware before entering system suspend.
This includes flushing pending work items and releasing any resources acquired during init. The configuration is mostly related to display hardware. The data blocks are concatenated after the BDB Header.
The driver parses the VBT during load. The relevant information is stored in driver private data for ease of use, and the actual VBT is not read after that. Also initialize some defaults if the VBT is not present at all. Return true if LVDS is present. Return true if the device in port is present. Return true if the device in port is eDP.
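As an illustration of how the concatenated data blocks can be walked, here is a sketch that assumes the common framing of a one-byte block ID followed by a two-byte little-endian payload size; real VBTs have exceptions to this framing, so treat it as a model rather than a parser for every block:

    #include <stdint.h>
    #include <stddef.h>

    /* Walk the data blocks following the BDB header and return a pointer
     * to the payload of the block with @section_id, or NULL if absent. */
    static const uint8_t *find_bdb_section(const uint8_t *bdb, size_t bdb_size,
                                           size_t header_size, uint8_t section_id)
    {
        size_t off = header_size;

        while (off + 3 <= bdb_size) {
            uint8_t id = bdb[off];
            uint16_t len = bdb[off + 1] | (bdb[off + 2] << 8);

            if (off + 3 + len > bdb_size)
                return NULL;                 /* truncated or corrupt block */
            if (id == section_id)
                return bdb + off + 3;        /* payload of the wanted block */
            off += 3 + len;
        }
        return NULL;
    }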
Return true if DSI is present, and return the port in port. Return true if HPD should be inverted for port. The display engine uses several different clocks to do its work. CDCLK clocks most of the display pipe logic, and thus its frequency must be high enough to support the rate at which pixels are flowing through the pipes.
Downscaling must also be accounted for, as it increases the effective pixel rate.
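As a rough sketch of that constraint, the effective pixel rate of a pipe can be scaled by the downscale ratio and the minimum CDCLK taken as the largest effective rate over all active pipes. Real platforms add guardbands and platform-specific factors that are ignored here:

    #include <stdint.h>

    /* Scale the pipe pixel rate by the source/destination area ratio when
     * downscaling; upscaling does not increase the load. */
    static uint32_t effective_pixel_rate_khz(uint32_t pixel_rate_khz,
                                             uint32_t src_w, uint32_t src_h,
                                             uint32_t dst_w, uint32_t dst_h)
    {
        uint64_t rate = pixel_rate_khz;

        if (dst_w && dst_h) {
            uint64_t scaled = rate * src_w * src_h /
                              ((uint64_t)dst_w * dst_h);
            if (scaled > rate)
                rate = scaled;
        }
        return (uint32_t)rate;
    }

    /* Minimum CDCLK (ignoring platform headroom): the largest effective
     * pixel rate among the active pipes. */
    static uint32_t min_cdclk_khz(const uint32_t *pipe_rates_khz, int n)
    {
        uint32_t max = 0;

        for (int i = 0; i < n; i++)
            if (pipe_rates_khz[i] > max)
                max = pipe_rates_khz[i];
        return max;
    }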