Simon Bisson
Contributing Writer

Taking advantage of Microsoft Edge's built-in AI

analysis
Jun 19, 2025 | 8 mins

Why use expensive AI inferencing services in the cloud when you can use a small language model in your web browser?

Large language models are a useful tool, but they're overkill for much of what we do with services like ChatGPT. Summarizing text, rewriting our words, even responding to basic chatbot prompts are tasks that don't need the power of an LLM and the associated compute, power, and cooling of a modern inferencing data center.

There is an alternative: small language models. SLMs like Microsoft's Phi can produce reliable results with far fewer resources, as they're trained with fewer parameters. One of the latest Phi models, Phi-4-mini-instruct, has 3.8 billion parameters trained on five trillion tokens. SLMs like Phi-4-mini-instruct are designed to run on edge hardware, taking language generation to PCs and small servers.

Microsoft has been investing in the development and deployment of SLMs, building its PC-based inferencing architecture on them and using ONNX runtimes with GPUs and NPUs. The downside is that downloading and installing a new model can take time, and the one you want your code to use may not be installed on a user's PC. This can be quite the hurdle to overcome, even with Windows bundling Phi Silica with its Copilot+ PCs.

What's needed is a way to deliver AI functions in a trusted form that offers the same APIs and features wherever you want to run it. The logical place for this is in the browser, as we do much of our day-to-day work with one, filling in forms, editing text, and working with content from inside and outside our businesses.

An AI model in the browser

A new feature being trialed in the Dev and Canary builds of Microsoft's Edge browser provides new AI APIs for working with text content, hosting Phi-4-mini in the browser. There's no need for users to spend time setting up WebNN, WebGPU, or WebAssembly, or to preload models and configure the security permissions your code needs to call the model and run a local inferencing instance.

There are other advantages. Running models locally saves money: you don't need an expensive cloud inferencing subscription for GPT or a similar model. Keeping inferencing local also keeps user data private; it's not transferred over the network and it's not used to train models (a process that can lead to accidental leaks of personally identifiable information).

The browser itself hosts the model, downloading it and updating it as needed. Your code simply needs to initialize the model (the browser automatically downloads it if necessary) and then call JavaScript APIs to manage basic AI functions. Currently the preview APIs offer four text-based services: summarizing text, writing text, rewriting text, and basic prompt evaluation. There are plans to add support for translation services in a future release.

Getting started with Phi in Edge

Getting started is easy enough. You need to set Edge feature flags in either the Canary or Dev builds of Edge for each of the four services, restarting the browser once they're enabled. You can then open the sample playground web application to first download the Phi model and then start experimenting with the APIs. It can take some time to download the model, so be prepared for a wait.

Be aware that there are a few bugs at this stage of development. The sample web application stopped updating the download progress counter roughly halfway through the process, but switching to a different API view showed that the installation was complete and I could try out the samples.

Once downloaded, the model is available to all AI API applications and only downloads again when an updated version is released. It runs locally, so there's no dependency on the network; it can be used with little or no connectivity.

The test pages are basic HTML forms. The Prompt API sample has two fields for setting up user and system prompts, as well as a JSON format constraint schema. For example, the initial sample produces a sentiment analysis for a review web application. The sample constraints ensure that the output is a JSON document containing only the sentiment and the confidence level.
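
As an illustration, a constraint of that kind can be expressed as a JSON schema. The sketch below, written as a JavaScript object, uses the sentiment and confidence fields the sample describes; the exact schema in the playground may differ.

const sentimentSchema = {
  // Constrain the model's output to a JSON object with exactly two fields.
  type: "object",
  properties: {
    sentiment: { type: "string", enum: ["positive", "neutral", "negative"] },
    confidence: { type: "number", minimum: 0, maximum: 1 },
  },
  required: ["sentiment", "confidence"],
  additionalProperties: false,
};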

With the model running in the browser, without the same level of protection as the larger-scale LLMs running in Azure AI Foundry, a well-written system prompt and an associated constraint schema are essential to building a trustworthy in-browser AI application. You should avoid open-ended prompts, which can lead to errors. By focusing on specific queries (for example, determining sentiment), it's possible to keep risk to a minimum, ensuring the model operates in a constrained semantic space.

Using constraints to restrict the format of the SLM output makes it easier to use in-browser AI as part of an application, for example, using numeric values or simple text responses as the basis for a graphical UI. Our hypothetical sentiment application could perhaps display a red icon beside negative sentiment content, allowing a worker to analyze it further.
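
A minimal sketch of that idea, assuming the constrained JSON output from the schema above and a hypothetical icon element in the review page:

// Example constrained output from the model (hypothetical values).
const modelResponse = '{"sentiment":"negative","confidence":0.87}';
const { sentiment, confidence } = JSON.parse(modelResponse);

// Flag negative reviews; style the .negative class as a red icon in CSS.
const icon = document.querySelector("#sentiment-icon"); // assumed element
icon.classList.toggle("negative", sentiment === "negative");
icon.title = `${sentiment} (${Math.round(confidence * 100)}% confidence)`;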

Using Edge's experimental AI APIs

Edge's AI APIs are experimental, so expect them to change, especially if they become part of the Chromium browser platform. For now, however, you're able to quickly add support in your pages, using JavaScript and the Edge-specific LanguageModel object.

Any code needs to first check for API support before checking that the Phi model is available. The same call reports whether the model is already present, needs to be downloaded, or is currently downloading. Once a download has completed, you can load the model into memory and start inference. Creating a new session is an asynchronous process that lets you monitor download progress, ensuring the model is in place and that users are aware of how long it will take to download several gigabytes of model and data.
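
In outline, that flow looks something like the sketch below. It assumes the experimental API surface in the current preview builds (LanguageModel.availability(), LanguageModel.create(), and a downloadprogress event), all of which may change before release.

// Check for the Prompt API before asking about the model.
if ("LanguageModel" in self) {
  const availability = await LanguageModel.availability();
  // Reported states include "unavailable", "downloadable", "downloading", and "available".
  if (availability !== "unavailable") {
    const session = await LanguageModel.create({
      monitor(m) {
        // Show progress so users know how far the multi-gigabyte download has got.
        m.addEventListener("downloadprogress", (e) => {
          console.log(`Model download: ${Math.round(e.loaded * 100)}%`);
        });
      },
    });
  }
}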

Once the model is downloaded, start by defining a session and giving it a system prompt. This sets the baseline for any interactions and establishes the overall context for an inference. At the same time, you can use a technique called "N-shot prompting" to give structure to outputs by supplying a set of defined prompts and their expected responses. Other tuning options define limits for how text is generated and how random the outputs are. Sessions can be cloned if you need to reuse the prompts without reloading a page. You should destroy any sessions when the host page is closed.
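
A sketch of that setup, again assuming the preview option names (initialPrompts, temperature, topK); the prompts themselves are hypothetical:

const session = await LanguageModel.create({
  initialPrompts: [
    // The system prompt sets the baseline context for every inference.
    { role: "system", content: "You classify product reviews as positive, neutral, or negative." },
    // N-shot examples: a sample prompt and its expected response shape the output.
    { role: "user", content: "Arrived late and the box was crushed." },
    { role: "assistant", content: '{"sentiment":"negative","confidence":0.9}' },
  ],
  temperature: 0.2, // lower values make output less random
  topK: 3,          // limits how many candidate tokens are considered at each step
});

const reviewSession = await session.clone(); // reuse the prompts without rebuilding them
session.destroy(); // release a session when the host page closes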

With the model configured, you can now deliver a user prompt. This can be streamed so you can watch output tokens being generated or simply delivered via an asynchronous call. This last option is the most likely, especially if you will be processing the output for display. If you are using response constraints, these are delivered alongside the prompt. Constraints can be JSON or regular expressions.
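
Both delivery styles look roughly like this, with responseConstraint as the assumed option name for passing the schema shown earlier:

// Asynchronous call: the full, schema-constrained response arrives when generation finishes.
const result = await session.prompt(
  "The keyboard feels great but the battery died after a day.",
  { responseConstraint: sentimentSchema }
);

// Streaming call: tokens can be shown as they are generated.
const stream = session.promptStreaming("Summarize this review thread.");
for await (const chunk of stream) {
  console.log(chunk);
}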

If you intend to use the Writing Assistant APIs, the process is similar. Again, you need to check if the API feature flags have been enabled. Opening a new session either uses the copy of Phi thatโ€™s already been downloaded or starts the download process. Each API has a different set of options, such as setting the type and length of a summary or the tone of a piece of writing. You can choose the output type, either plain text or markdown.
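
A summarizer session, for example, might be opened along these lines; the Summarizer object and its option values follow the draft writing assistance APIs and are likely to change:

if ("Summarizer" in self && (await Summarizer.availability()) !== "unavailable") {
  const summarizer = await Summarizer.create({
    type: "tl;dr",        // the kind of summary to produce
    length: "short",      // target summary length
    format: "plain-text", // or "markdown"
  });
  const reviewText = "...full text of the content to summarize...";
  const summary = await summarizer.summarize(reviewText);
}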

CPU, GPU, or NPU?

Testing the sample Prompt API playground on a Copilot+ PC shows that, for now at least, Edge is not using Windows' NPU support. Instead, the Windows Task Manager performance indicators show that Edge's Phi model runs on the device's GPU. At this early stage in development, it makes sense to take a GPU-only approach as more PCs will support it, especially the PCs used by the target developer audience.

It's likely that Microsoft will move to supporting both GPU and NPU inference as more PCs add inferencing accelerators and once the Windows ML APIs are finished. Windows ML's common ONNX APIs for CPU, GPU, and NPU are a logical target for Edge's APIs, especially if Microsoft prepares its models for all the target environments, including Arm, Intel, and AMD NPUs.

Windows ML provides tools for Edge's developers to first test for appropriate inferencing hardware and then download optimized models. As this process can be automated, it seems ideal for web-based AI applications whose developers have no visibility into the underlying hardware.

Microsoft's Windows-based AI announcements at Build 2025 provide enough of the necessary scaffolding that bundling AI tools in the browser makes a lot of sense. You need a trusted, secure platform to host edge inferencing, one where you know that the hardware is able to support a model and where one standard set of APIs ensures you only have to write code once to have it run anywhere your target browser runs.

Simon Bisson

Author of InfoWorld's Enterprise Microsoft blog, Simon Bisson prefers to think of "career" as a verb rather than a noun, having worked in academic and telecoms research, as well as having been the CTO of a startup, running the technical side of UK Online (the first national ISP with content as well as connections), before moving into consultancy and technology strategy. He's built plenty of large-scale web applications, designed architectures for multi-terabyte online image stores, implemented B2B information hubs, and come up with next generation mobile network architectures and knowledge management solutions. In between doing all that, he's been a freelance journalist since the early days of the web and writes about everything from enterprise architecture down to gadgets. He is the author of Azure AI Services at Scale for Cloud, Mobile, and Edge: Building Intelligent Apps with Azure Cognitive Services and Machine Learning.
