# Cory's Wiki

##### Data Sources

* Advertisement text and titles
* Bid terms or keywords

Search queries may be separated into different types (via Dan Russell):

* Informational
  * User wants to learn about a specific topic
* Navigational
  * e.g. [bring me to] Bank of America (.com)
* Transactional
* Other
  * e.g. exploring a new topic

##### Applications

Have some inventory of advertisements available, and want to optimally match advertisements to users. Upon serving a webpage, need to decide what ads are most relevant to return, and possibly what location to place them.

Between web publishers and advertisers, there is a “whole ecosystem” of companies that optimize different parts of the ad-matching process.

In this domain, it is typical to work in terms of “expected loss”.

Common types of online advertisements:

* Display or banner
  * fixed content, usually visual
* Sponsored search
  * triggered by search results
  * ad selection based on search query terms, user features, click-through rates, etc.
* Context-based text
  * ad selection is based on the content of the web page being browsed

Simple expected revenue model: $E[\text{revenue}] = \text{CTR}_{ad} \times \text{CPC}_{ad} = p(\text{click} \mid \text{ad}) \times \text{CPC}_{ad}$

* CTR is click-through rate
* CPC is cost-per-click
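
As a rough illustration of the expected revenue model, the sketch below ranks a few ads by $\text{CTR} \times \text{CPC}$. The `Ad` class, the ad names, and the numbers are all hypothetical and chosen only to show the arithmetic; a real system would estimate the CTRs from data.

```python
from dataclasses import dataclass


@dataclass
class Ad:
    name: str
    ctr: float  # estimated p(click | ad); made-up values below
    cpc: float  # cost-per-click paid by the advertiser


def expected_revenue(ad: Ad) -> float:
    # E[revenue] = CTR * CPC for a single impression of this ad.
    return ad.ctr * ad.cpc


def rank_ads(ads):
    # Highest expected revenue first.
    return sorted(ads, key=expected_revenue, reverse=True)


ads = [Ad("A", 0.02, 1.00), Ad("B", 0.05, 0.30), Ad("C", 0.01, 2.50)]
ranked = rank_ads(ads)
for ad in ranked:
    print(ad.name, round(expected_revenue(ad), 4))
```

Note that the ad with the highest click-through rate (B) ranks last here, because its low cost-per-click outweighs its higher chance of being clicked.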

It is critical to quickly predict the click-through rate. This can be modeled as a multi-armed bandit problem.

There is frequently a real-time auction for ad positions.

Early auctions had volatile pricing with rapid swings, which discouraged advertisers from participating. In 2002, Google introduced the idea of second-price auctions to the internet for its keyword bidding. Advertisers bid on K ranked positions. Each winner is then charged the bid of the advertiser immediately below them in the ranking, with the last being charged a fixed minimum amount. This results in a more stable auction market because it automates the bidders' task of trying to pay only slightly more than the advertiser below them.
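
The pricing rule described above can be sketched in a few lines. This is a minimal illustration, not Google's actual implementation: the function name `second_price_slots`, the input shape, and the example bids are assumptions, and real auctions also weigh factors such as ad quality that are ignored here.

```python
def second_price_slots(bids, num_slots, min_price):
    """Assign ranked ad positions and compute what each winner pays.

    bids: dict mapping advertiser name -> bid amount (hypothetical input shape).
    Returns a list of (advertiser, price) tuples, best position first.
    """
    # Rank advertisers from highest to lowest bid.
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)
    results = []
    for position, (advertiser, _own_bid) in enumerate(ranked[:num_slots]):
        if position + 1 < len(ranked):
            # Pay the bid of the advertiser immediately below in the ranking.
            price = ranked[position + 1][1]
        else:
            # Nobody is ranked below: pay the fixed minimum amount.
            price = min_price
        results.append((advertiser, price))
    return results


outcome = second_price_slots(
    {"a": 3.00, "b": 2.00, "c": 1.00}, num_slots=3, min_price=0.10
)
print(outcome)
```

In this example advertiser `a` bids 3.00 but pays only 2.00, so raising or lowering its bid changes its price only if doing so changes its rank.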

#### Web Development

May use results of web log analytics to restructure flow or design of the website.

#### Ranking Retrieved Documents for a Query

Given $N$ documents containing a query string $Q$, we want to rank them so that the most “relevant” are shown to the user first. This is the information retrieval problem.

#### Online Learning of Click-Through Rates

With $K$ different ads with click-through rates $p_1, \dots, p_K$, we want to learn these CTRs so we can maximize expected revenue, but don't want to lose too much potential revenue while doing so.

This is a multi-armed bandit problem.
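
One simple strategy for this trade-off is epsilon-greedy: show the ad with the best estimated CTR most of the time, and a random ad a small fraction of the time. The sketch below is one possible approach, not necessarily the one intended here. The clicks are simulated, and the true CTRs, `epsilon`, and the function name are assumptions chosen for illustration.

```python
import random


def epsilon_greedy(true_ctrs, n_rounds, epsilon=0.1, seed=0):
    """Simulate showing ads and learning their CTRs with epsilon-greedy.

    true_ctrs: the hidden click probability of each ad (simulation only).
    Returns (pulls, estimates): impressions per ad and estimated CTRs.
    """
    rng = random.Random(seed)
    k = len(true_ctrs)
    pulls = [0] * k   # number of times each ad was shown
    clicks = [0] * k  # number of clicks each ad received

    def estimate(i, unseen):
        return clicks[i] / pulls[i] if pulls[i] else unseen

    for _ in range(n_rounds):
        if rng.random() < epsilon:
            # Explore: show a uniformly random ad.
            arm = rng.randrange(k)
        else:
            # Exploit: show the ad with the best estimate so far.
            # Unseen ads count as infinitely good so each is tried once.
            arm = max(range(k), key=lambda i: estimate(i, float("inf")))
        clicked = rng.random() < true_ctrs[arm]  # simulated user click
        pulls[arm] += 1
        clicks[arm] += int(clicked)

    return pulls, [estimate(i, 0.0) for i in range(k)]


pulls, estimates = epsilon_greedy([0.1, 0.3, 0.6], n_rounds=10000)
print(pulls, [round(e, 3) for e in estimates])
```

The revenue lost to exploration is controlled by `epsilon`: a larger value learns all CTRs faster but shows weaker ads more often.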

#### Web Log Analytics

A lot of exploratory data analysis on web logs.

Data can be skewed by robot traffic.

Use web logs to generate statistics on:

* page visits
* where users came from (referrals)
* geographic distribution of visitors via IP addresses
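
A minimal sketch of the first two statistics, assuming the logs are in the Apache "combined" log format. The sample lines are invented, and the robot filter (skipping user agents containing "bot") is a crude placeholder heuristic. Geographic statistics would need an IP-to-location database, which is not shown.

```python
import re
from collections import Counter

# Combined log format: host ident user [time] "request" status bytes "referrer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def summarize(lines):
    """Count page visits and referrers, skipping obvious robot traffic."""
    page_visits = Counter()
    referrers = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that do not parse
        if "bot" in match["agent"].lower():
            continue  # crude robot filter (assumption, easy to evade)
        page_visits[match["path"]] += 1
        if match["referrer"] != "-":  # "-" means no referrer was sent
            referrers[match["referrer"]] += 1
    return page_visits, referrers


# Hypothetical sample log lines.
sample = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326 "https://example.com/" "Mozilla/5.0"',
    '203.0.113.9 - - [10/Oct/2023:13:56:01 +0000] "GET /about.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
    '198.51.100.7 - - [10/Oct/2023:13:56:40 +0000] "GET /index.html HTTP/1.1" 200 2326 "-" "ExampleBot/1.0"',
    '203.0.113.5 - - [10/Oct/2023:13:57:12 +0000] "GET /index.html HTTP/1.1" 200 2326 "https://example.org/links" "Mozilla/5.0"',
]
page_visits, referrers = summarize(sample)
print(page_visits.most_common())
print(referrers.most_common())
```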

### Learning Models

May use EM to fit Markov clusters on traffic flows, i.e. to group visitors by the sequences of pages they move through.

#### Click Fraud

Some ad clicks are made by non-humans. A combination of human analysis and machine learning algorithms is used to detect click fraud.

This is a controversial topic, with advertisers claiming fraud rates over 20%.

#### Information Retrieval

Given query data:

* Can create behavioral profiles for individual users
* Infer user attributes or intents, i.e. “intent classification” (purchase, etc.)
* Use aggregate data for forecasting or trend detection