Data sources: server logs, cookies.
Ad data: advertisement text and titles, bid terms or keywords.
Queries may be separated into different types (via Dan Russell):
Have some inventory of advertisements available, and want to optimally match advertisements to users. Upon serving a webpage, need to decide which ads are most relevant to return, and possibly where on the page to place them.
Between web publishers and advertisers, there is a “whole ecosystem” of companies that optimize different parts of the ad-matching process.
In this domain, it is typical to reason in terms of “expected loss”.
Types of ads:
Simple expected revenue model: $E[\text{revenue}] = \mathrm{CTR}_{\text{ad}} \cdot \mathrm{CPC}_{\text{ad}} = p(\text{click} \mid \text{ad}) \cdot \mathrm{CPC}_{\text{ad}}$
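A minimal sketch of this model (the `Ad` fields and all numbers below are invented for illustration): ads are ranked by expected revenue per impression rather than by raw CTR or raw cost.

```python
from dataclasses import dataclass

@dataclass
class Ad:
    name: str   # hypothetical identifier, for illustration only
    ctr: float  # estimated p(click | ad)
    cpc: float  # cost-per-click the advertiser pays

def expected_revenue(ad: Ad) -> float:
    # E[revenue] = p(click | ad) * CPC_ad
    return ad.ctr * ad.cpc

ads = [Ad("shoes", 0.031, 0.40), Ad("hotels", 0.012, 2.10), Ad("vpn", 0.020, 0.90)]
# Serve ads in order of expected revenue, not raw CTR or raw bid.
for ad in sorted(ads, key=expected_revenue, reverse=True):
    print(f"{ad.name}: E[revenue] per impression = {expected_revenue(ad):.4f}")
```

Note that the lower-CTR “hotels” ad ranks first here because its CPC is high enough to dominate: both factors matter.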
It is critical to predict each ad's click-through rate (CTR) quickly. This can be modeled as a multi-armed bandit problem.
There is frequently a real-time auction for ad positions.
Old auctions had volatile pricing with rapid swings, which discouraged advertisers from participating. In 2002, Google introduced the idea of 2nd Price Auctions to the internet for its keyword bidding. Advertisers bid on K ranked positions; each winner is then charged the bid of the bidder immediately below them in the ranking, with the last being charged a fixed minimum amount. This results in a more stable auction market because it automates what bidders would otherwise try to do by hand: pay as little as possible above the bidder below them.
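A minimal sketch of that payment rule (bids and the minimum price are invented, and the quality scores real systems fold into the ranking are omitted):

```python
def gsp_payments(bids: dict[str, float], k: int, minimum: float = 0.05) -> dict[str, float]:
    """Rank bids descending; the top k bidders win the k slots. Each winner
    pays the bid of the bidder ranked immediately below them; a winner with
    nobody below pays the fixed minimum."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    payments = {}
    for i, (bidder, _bid) in enumerate(ranked[:k]):
        payments[bidder] = ranked[i + 1][1] if i + 1 < len(ranked) else minimum
    return payments

bids = {"A": 1.50, "B": 1.20, "C": 0.80, "D": 0.30}
print(gsp_payments(bids, k=3))  # {'A': 1.2, 'B': 0.8, 'C': 0.3}
```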
May use results of web log analytics to restructure flow or design of the website.
Given $N$ documents containing a query string $Q$, we want to rank them so the most “relevant” documents come first for the user. This is the information retrieval problem.
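One classic (if simplistic) way to produce such a ranking is to score documents by the TF-IDF weights of the query terms they contain. The sketch below uses whitespace tokenization and made-up documents, with no stemming or smoothing:

```python
import math
from collections import Counter

def tf_idf_scores(query: str, docs: list[str]) -> list[float]:
    # Score each document by summing TF-IDF weights of the query terms
    # it contains.
    tokenized = [doc.lower().split() for doc in docs]
    n = len(docs)
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))  # document frequency of each term
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        scores.append(sum(
            (tf[t] / len(toks)) * math.log(n / df[t])
            for t in query.lower().split() if t in tf
        ))
    return scores

docs = ["cheap flights to tokyo", "tokyo travel guide", "cheap cheap hotels"]
for score, doc in sorted(zip(tf_idf_scores("cheap tokyo", docs), docs), reverse=True):
    print(f"{score:.3f}  {doc}")
```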
With $K$ different ads with click-through rates $p_1, \dots, p_K$, we want to learn these CTRs so we can maximize expected revenue, but don't want to lose too much potential revenue while doing so.
This is a multi-armed bandit problem.
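One common bandit strategy is Thompson sampling, sketched below with a Beta posterior per ad. The true CTRs are invented for the simulation; a revenue-maximizing variant would multiply each sampled CTR by the ad's CPC before taking the argmax.

```python
import random

def thompson_ctr_bandit(true_ctrs: list[float], rounds: int = 10_000, seed: int = 0):
    # Beta(1, 1) prior per ad: sample a CTR from each ad's posterior,
    # show the ad with the highest sample, observe a click or not,
    # and update that ad's posterior counts.
    rng = random.Random(seed)
    k = len(true_ctrs)
    clicks = [0] * k  # posterior alpha - 1
    skips = [0] * k   # posterior beta - 1
    for _ in range(rounds):
        samples = [rng.betavariate(1 + clicks[i], 1 + skips[i]) for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        if rng.random() < true_ctrs[arm]:  # simulate the user's click
            clicks[arm] += 1
        else:
            skips[arm] += 1
    return clicks, [clicks[i] + skips[i] for i in range(k)]

# The best ad (true CTR 0.05) should soak up most of the impressions.
clicks, shown = thompson_ctr_bandit([0.02, 0.05, 0.03])
for i in range(3):
    print(f"ad {i}: shown {shown[i]} times, empirical CTR {clicks[i] / max(shown[i], 1):.3f}")
```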
A lot of exploratory data analysis on web logs.
Data can be skewed by robot traffic.
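A crude first-pass filter might drop hits whose user-agent self-identifies as a crawler or whose request rate is implausibly high for a human; the markers and threshold below are illustrative, not a real detection rule.

```python
BOT_MARKERS = ("bot", "crawler", "spider")  # crude, illustrative list

def is_probable_robot(user_agent: str, requests_per_minute: float) -> bool:
    # Flag hits whose user-agent self-identifies as a crawler, or whose
    # request rate is implausibly high for a human reader.
    ua = user_agent.lower()
    return any(m in ua for m in BOT_MARKERS) or requests_per_minute > 60

print(is_probable_robot("Googlebot/2.1 (+http://www.google.com/bot.html)", 5.0))  # True
print(is_probable_robot("Mozilla/5.0 (Windows NT 10.0; Win64; x64)", 2.0))        # False
```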
Use web logs to generate statistics on:
May use EM to fit Markov clusters (a mixture of Markov chains) to traffic flows.
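A sketch of that model, assuming sessions are sequences of integer page-state ids: each cluster is a first-order Markov chain (an initial distribution plus a transition matrix), and EM alternates soft assignment of sessions to clusters with re-estimation of the chains. Function and variable names here are my own.

```python
import numpy as np

def em_markov_clusters(sessions, n_states, k, iters=50, seed=0):
    # Mixture of first-order Markov chains: cluster j has mixing weight
    # w[j], initial distribution init[j], and transition matrix trans[j].
    rng = np.random.default_rng(seed)
    w = np.full(k, 1.0 / k)
    init = rng.dirichlet(np.ones(n_states), size=k)
    trans = rng.dirichlet(np.ones(n_states), size=(k, n_states))
    for _ in range(iters):
        # E-step: responsibility of each cluster for each session.
        loglik = np.zeros((len(sessions), k))
        for n, s in enumerate(sessions):
            for j in range(k):
                ll = np.log(w[j]) + np.log(init[j, s[0]])
                for a, b in zip(s, s[1:]):
                    ll += np.log(trans[j, a, b])
                loglik[n, j] = ll
        resp = np.exp(loglik - loglik.max(axis=1, keepdims=True))
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from responsibility-weighted counts.
        w = resp.mean(axis=0)
        init = np.full((k, n_states), 1e-6)  # small smoothing
        trans = np.full((k, n_states, n_states), 1e-6)
        for n, s in enumerate(sessions):
            for j in range(k):
                init[j, s[0]] += resp[n, j]
                for a, b in zip(s, s[1:]):
                    trans[j, a, b] += resp[n, j]
        init /= init.sum(axis=1, keepdims=True)
        trans /= trans.sum(axis=2, keepdims=True)
    return w, init, trans, resp

sessions = [[0, 1, 2], [0, 1, 1, 2], [2, 1, 0], [2, 2, 1, 0]]
w, init, trans, resp = em_markov_clusters(sessions, n_states=3, k=2)
print(resp.round(2))  # soft cluster assignment for each session
```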
Some ad clicks are made by non-humans. A combination of human analysis and machine learning algorithms is used to detect this click fraud.
This is a controversial topic, with advertisers claiming fraud rates over 20%.
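One of the simplest machine signals is repeated clicks on the same ad from one IP. The sketch below flags such IPs with an invented threshold, standing in for the much richer feature sets and manual review that real detectors combine.

```python
from collections import Counter

def suspicious_ips(click_log, max_clicks_per_ad=10):
    # Flag IPs that clicked the same ad more often than the threshold
    # within the log window.
    counts = Counter((ip, ad) for ip, ad in click_log)
    return sorted({ip for (ip, ad), c in counts.items() if c > max_clicks_per_ad})

log = [("10.0.0.9", "ad42")] * 30 + [("10.0.0.7", "ad42"), ("10.0.0.8", "ad13")]
print(suspicious_ips(log))  # ['10.0.0.9']
```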
Given query data: