Class website. For each week, write summaries of the listed reading material: a 1-page summary for each article, including an objective summary of the article's content followed by my commentary.

Information Retrieval: its modern development has been driven mainly by the world-wide web. Classic assumptions: a static corpus; retrieve information relevant to a query.

- Use some relevance score function $R(Q,D)$ between each query and document.

Web assumptions: dynamic content, adversarial entities, advertising, non-text documents

- Results must be ranked
  - Today's standard is to present only the top 10
- Results must be found quickly

Real-time search: parse potential strings for a match at query time. On the web: pre-process potential strings and prepare an index in advance.

- There are millions of words online
- There are on the order of 100 million active websites

Search engines need to

- Figure out what the user wants, not rely on the user to learn a system
- Balance paid entries with relevant entries

Search Engine Optimization: the practice of making content rank higher in search engines by tailoring it to their crawlers. This can be done legitimately or illegitimately. Search engines provide guidelines.

- Legitimate
  - Make relevant content
  - Update relevant keywords and tags
- Unethical
  - Spam fake pages and links within the local site
  - Spam fake comments on other sites pointing to the local site
  - Spam irrelevant keywords or tags

Old-generation search engines relied heavily on text term-frequency/inverse-document-frequency ratios. This was exploited through "keyword stuffing" in website SEO. Scripts are used to exploit systems that count the number of clicks on a search result or on an advertisement. To exploit link-based rankings, unrelated sites may do a "link exchange," where they link to each other.

[http://www.fleiner.com/bots/ How to create crawler traps]

Web data collection: data dumps, URL downloads, web APIs, crawling.

Crawler traps: a web server responds with ever-changing URLs and content. May be intentional or unintentional.

Duplicate detection: studies find that 30% of web content is a duplicate of some of the other 70%. Exact duplicates online are very rare. Near-duplicate detection is an open problem, presently addressed through `fingerprinting` techniques.
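As an illustration of the fingerprinting idea (a minimal sketch, not the exact scheme from lecture): hash word shingles, keep the smallest hashes as a compact fingerprint, and compare fingerprints by overlap. The function names, shingle size, and threshold are all my own assumptions.

```python
import hashlib

def shingles(text, k=3):
    """Return the set of k-word shingles in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def fingerprint(text, k=3, keep=4):
    """Fingerprint: the `keep` smallest shingle hashes (a min-hash-style sample)."""
    hashes = sorted(int(hashlib.md5(s.encode()).hexdigest(), 16)
                    for s in shingles(text, k))
    return set(hashes[:keep])

def near_duplicate(a, b, threshold=0.5):
    """High Jaccard overlap between fingerprints suggests near-duplication."""
    fa, fb = fingerprint(a), fingerprint(b)
    return len(fa & fb) / len(fa | fb) >= threshold
```

Comparing fixed-size fingerprints avoids comparing every pair of documents word-by-word.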

Crawlers don't go to the **Deep Web**: content behind login forms, JavaScript, not linked from elsewhere. The Deep Web is estimated to be larger than the shallow web.

The web is almost 20 years old, so laws that apply to it tend to be outdated or untested. There are more useful conventions than actual laws.

- Privacy policies help communicate the tracking and data habits of a site
- Sites frequently define their own `Terms of Use` or `Terms of Service`
- **Fair Use**
Guest lecture by Chen Li. Where traditional databases have an index architecture for queries, web search relies on dynamic content that is not as easily indexed. Databases are too slow for web search, and unoptimized for ranking.

Vertical search: a search engine focused on specific content. Horizontal search: a general search engine that tries to cover everything.

Requirements imposed by web searches: instant results, fuzzy search, personalization.

See http://labs.google.com/people/jeff Optimization opportunities: don't store documents in separate files, to lessen open/close times. The total time to read and then decompress compressed data is usually less than the time to read the uncompressed data.

Sparse matrices represent bag-of-words models for each known document.
The **index** is another vector-space representation of term-document pairs.
A (term, [list]) pair is called a **posting**, where [list] is the list of all documents in which the term occurs.
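The postings above can be sketched in a few lines; the toy corpus and function name are my own.

```python
from collections import defaultdict

def build_index(docs):
    """Build an inverted index: term -> sorted list of doc IDs (a posting)."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = ["new home sales top forecasts",
        "home sales rise in july",
        "increase in home sales in july"]
index = build_index(docs)
# index["home"] -> [0, 1, 2]; index["july"] -> [1, 2]
```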

The index must then be sorted in order to optimize future searches. Block merge sort strategies handle this task when system memory is limited.

**BSBI** (Blocked Sort-Based Indexing) uses this sorted list of postings to answer queries [http://swtch.com/~rsc/regexp/regexp4.html RE2]. Sorting the term-document pairs is $O(T \log T)$ where $T$ is the number of term-document pairs.

Quiz #2

Distributed indexing: “Map-reduce”. Phase one (map) splits the input workload among multiple parsers, which create independent indexes. Phase two (reduce) uses “inverters” over all the resulting indexes to combine them. It is a parallel implementation of merge sort.
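A single-process sketch of the two phases: "parsers" emit (term, doc_id) pairs, and an "inverter" groups the sorted pairs into postings. The function names and toy data are my own.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(doc_id, text):
    """Parser: emit (term, doc_id) pairs for one document."""
    return [(term, doc_id) for term in text.lower().split()]

def reduce_phase(pairs):
    """Inverter: group sorted (term, doc_id) pairs into postings."""
    pairs.sort()  # the shuffle/sort step brings equal terms together
    return {term: sorted({d for _, d in group})
            for term, group in groupby(pairs, key=itemgetter(0))}

docs = {0: "big data big ideas", 1: "big indexes"}
pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]
index = reduce_phase(pairs)
# index["big"] -> [0, 1]
```

In a real map-reduce run, the map calls and the reduce groups would execute on different machines; the sort is what lets independent inverters each own a disjoint range of terms.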

Within the context of the query, we want to rank *all* documents in the corpus. (In practice, heuristics are used to narrow down the set that is actually processed.)

Index structure: greater-length n-grams create larger lexicons. 3-grams tend to work for smaller data sets (e.g., Google Code Search). Support proximity matching by ordering words in the index by their order within the document.

Compression: when doing mostly reads, reading compressed data from disk and decompressing it tends to be faster than reading the uncompressed data from disk.

Lisp style: Partition the file into groups, using one inverter.

Hadoop style: two mappings. The first splits the sorting workload; the second maps to a set of inverters which create inverted indices. In this case, each inverter is responsible for only a subset of the "types".

Hadoop is an open-source architecture for large-scale distributed and data-intensive processing.

The file system uses a "namenode" in addition to distributed "data nodes". The namenode acts as a server of data locations, returning the data node and block location of the target file.

Boolean retrieval: retrieve a bit-vector for each query word over documents (1 if the document contains the word; 0 otherwise). Perform Boolean AND/OR over the resulting bit-vectors to produce the list of primary documents to return. Performing a Boolean operation over two sorted posting lists is linear in the number of postings, in contrast to being linear in the total number of documents. Starting with the most restrictive expressions helps eliminate documents early in the process.
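The linear-time AND over two sorted posting lists is a merge; a minimal sketch (the function name is mine):

```python
def intersect(p1, p2):
    """AND of two sorted posting lists in O(len(p1) + len(p2))."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller doc ID
        else:
            j += 1
    return out

# intersect([1, 3, 5, 8], [2, 3, 8, 9]) -> [3, 8]
```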

Frequency ranking: now, instead of a bit vector over documents, each document returns an integer count. These values must be mapped to some ranking score that allows them to be compared; log(frequency) is usually used. Inverse document frequency is used to counter-balance words that occur frequently throughout the corpus. IDF is only useful for ranking terms relative to each other in multi-word queries.
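A minimal sketch of the log-frequency and IDF weights described above, assuming base-10 logarithms and the common $1 + \log \mathrm{tf}$ form (conventions I've chosen, not stated in the notes):

```python
import math

def tf_weight(count):
    """Log-scaled term frequency: 1 + log10(count); 0 if the term is absent."""
    return 1 + math.log10(count) if count > 0 else 0.0

def idf(doc_freq, n_docs):
    """Inverse document frequency: log10(N / df)."""
    return math.log10(n_docs / doc_freq)

# A term appearing in 1 of 1,000,000 docs is far more discriminative
# than one appearing in 100,000 of them:
# idf(1, 10**6) -> 6.0; idf(100000, 10**6) -> 1.0
```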

[http://en.wikipedia.org/wiki/Jaccard_coefficient Jaccard Coefficient] is an alternative ranking method that doesn't account for frequency.

[http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html#fig:ROC-curve Normalized discounted cumulative gain]: $NDCG(Q,k) = \frac{1}{|Q|} \sum_{j=1}^{|Q|} Z_{kj} \sum_{m=1}^k \frac{2^{R(j,m)} - 1}{\log_2(1+m)}$

Precision: the fraction of retrieved documents that are relevant.

Recall: the fraction of relevant documents that are retrieved.

Plot documents in a vector space defined over term frequencies. In this space, a "similarity" measure can be computed easily using the cosine distance between vectors. Cosine is ideal because it decreases monotonically from +1 to -1 as the angle goes from 0° to 180°. $\cos(\theta) = \frac{V(d_1) \cdot V(d_2)}{|V(d_1)|\,|V(d_2)|}$. In this case, vector lengths should be normalized.

[http://en.wikipedia.org/wiki/Cosine_similarity Cosine Similarity]: $a \cdot b = \|a\|\,\|b\| \cos\theta$, so $\cos\theta = \frac{\sum_{i=1}^n A_i B_i}{\sqrt{\sum_{i=1}^n A_i^2}\,\sqrt{\sum_{i=1}^n B_i^2}}$

To perform queries, treat the query as a short document. Return documents ranked by distance to the query in the vector space. This is a very fast operation relative to other ranking methods.
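A minimal sketch of query-as-document ranking by cosine similarity, using raw term counts for brevity (the vectors would normally hold TF-IDF weights); the toy corpus and names are my own.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse term -> weight vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def vectorize(text):
    """Treat a text (or a query) as a bag-of-words count vector."""
    return Counter(text.lower().split())

docs = ["sweet sorrow parting", "sweet sweet nurse", "parting is such sorrow"]
query = vectorize("sweet sorrow")
ranked = sorted(range(len(docs)),
                key=lambda i: cosine(query, vectorize(docs[i])),
                reverse=True)
```

Dividing by both vector norms is what makes the score length-normalized, so long documents don't win merely by repeating terms.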

`Gold standard` for evaluation: does relevance have to be determined through human judgement? Results depend on the task evaluated, are usually restricted to binary responses, and there may not be good agreement on the "answer".

Popular trend is to use the [http://en.wikipedia.org/wiki/Harmonic_Mean Harmonic Mean] of precision and recall: $F = \frac{2PR}{P+R}$. A more general form allows weighting between the two: $F_\beta = \frac{(\beta^2 + 1)PR}{\beta^2 P + R}$
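The two formulas can be written directly; `f_beta` is a name I chose.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 gives F1)."""
    if precision == 0 and recall == 0:
        return 0.0
    return (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)

# f_beta(0.5, 0.5) -> 0.5; beta > 1 weights recall more heavily
```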

Another method is to use `precision-at-N` or `recall-at-N` as a way of focusing on the primary results. A simplified and weaker method is to compute the average precision over some top-N range. When this average is averaged over multiple queries, the result is called **mean average precision** (**MAP**).

Discounted gain: $DG_i = R_i / \log_2(i)$

Discounted cumulative gain: $DCG_p = \sum_{i=1}^p \frac{2^{R_i} - 1}{\log_2(1+i)}$

Normalized discounted cumulative gain:
$NDCG_p = \sum_{i=1}^p \frac{2^{R_i} - 1}{\log_2(1+i)} \Big/ \sum_{i=1}^p \frac{2^{R_i^{ideal}} - 1}{\log_2(1+i)}$
where $R_i^{ideal}$ is the relevance score of the document at rank $i$ in an ideal ranking.

Assign relevance scores inversely to the Google result rank, so the top 5 Google results get relevance scores of 5, 4, 3, 2, 1 in that order.
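Using that relevance assignment, a minimal NDCG sketch (function names are mine):

```python
import math

def dcg(relevances):
    """DCG with the 2^R - 1 gain and log2(1 + i) discount (i is 1-based rank)."""
    return sum((2**r - 1) / math.log2(1 + i)
               for i, r in enumerate(relevances, start=1))

def ndcg(relevances):
    """Normalize by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

# A ranking that matches the ideal order scores exactly 1.0:
# ndcg([5, 4, 3, 2, 1]) -> 1.0
```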

Cosine similarity scores are usually computed over dimensions defined by the TF-IDF score of each term. A heap helps when retrieving the top K results.
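A sketch of heap-based top-K retrieval using Python's `heapq`; the scores here are made-up values.

```python
import heapq

# scores: doc_id -> cosine score (hypothetical values)
scores = {0: 0.81, 1: 0.63, 2: 0.35, 3: 0.90, 4: 0.12}

# A heap retrieves the top K in O(N log K) instead of sorting all N scores.
top_k = heapq.nlargest(3, scores.items(), key=lambda kv: kv[1])
# -> [(3, 0.90), (0, 0.81), (1, 0.63)]
```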

Heuristics for improving performance: we want to avoid computing the cosine distance for every document.

- Find a set $A$ of contenders such that $K < |A| \ll N$.
  - $A$ doesn't necessarily contain the top $K$, but has a good chance to.
- Precompute a **champions list** for each term: the docs in which it has high frequency.
- Do a stopword filter on input terms before using any; the filter can be based on an IDF threshold.
- Filter documents that contain fewer of the distinct query terms.
- Use query-independent scores related to the `authority` of the source; up-weight authoritative sources such as Wikipedia.

A recent trend is to use machine learning over these features to better tune document rankings.

Graph heuristics:

- PageRank: take advantage of human-generated links between websites to infer the "goodness" of a site. For a known good site, if another site links to it, then the linker is probably also good.
- Can estimate link density between sites by performing a random walk and counting visits to each node.
- Can unify different words used as anchor text based on common hyperlink references.
- Can try to identify a "hub" which is good at connecting to experts.
  - $h(x) \leftarrow \sum_{x \to y} a(y)$: the hub score for page $x$ is the sum of the authority scores of all pages that $x$ points to.
  - $a(x) \leftarrow \sum_{y \to x} h(y)$: the authority score for page $x$ is the sum of the hub scores of the pages that point to it.
  - In practice, about 5 iterations through the graph tend to stabilize the scores, which are guaranteed to converge if well-scaled.
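The hub/authority iteration can be sketched as below; the toy graph and function name are my own, and the per-iteration normalization stands in for the "well-scaled" condition mentioned above.

```python
def hits(links, iterations=5):
    """links: page -> list of pages it links to. Returns (hub, authority) scores."""
    nodes = set(links) | {y for ys in links.values() for y in ys}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # a(x) <- sum of h(y) over pages y that point to x
        auth = {n: sum(hub[y] for y, outs in links.items() if n in outs)
                for n in nodes}
        # h(x) <- sum of a(y) over pages y that x points to
        hub = {n: sum(auth[y] for y in links.get(n, ())) for n in nodes}
        # rescale so scores stay bounded across iterations
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5
            if norm:
                for n in scores:
                    scores[n] /= norm
    return hub, auth

links = {"hub1": ["expert1", "expert2"], "hub2": ["expert1"]}
hub, auth = hits(links)
# expert1 has two in-links, so its authority exceeds expert2's
```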

Latent semantic analysis (LSA) and latent semantic indexing (LSI).

`LSI`, developed around 1989, is the use of `LSA` for information retrieval indexing. LSA always reduces the dimensions of the data; the goal is to do so in a way that preserves useful trends in the data. Singular value decomposition is a useful principal component analysis tool which extracts the eigenvectors and eigenvalues from the data matrix.

[http://en.wikipedia.org/wiki/Singular_value_decomposition SVD]: $D = U \Sigma V^T$
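A minimal LSA sketch with NumPy (assumed available): decompose a toy term-document matrix and keep only the top-$k$ singular values, giving a low-rank "latent" approximation.

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
D = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

U, s, Vt = np.linalg.svd(D, full_matrices=False)

# Keep only the k largest singular values: the best rank-k approximation
# of D in the least-squares sense.
k = 2
D_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```

Queries are then compared against documents in the reduced k-dimensional space rather than the full term space.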

Industry stuff. Web IR is big $$$.