1. How does ranked retrieval differ from SQL access to a database?

2. Explain the following concepts in relation to the operation of Google: term, document, indexing, querying.

3. The vector model is generally thought to be superior to the boolean model. Why?

4. Why are boolean models more popular than vector models in practice?

5. Assume the following document collection, with each document prefixed by its id:

   1: a b c a b c a b c
   2: a a a a a
   3: b c b c b c b c
   4: a b c
   5: c

   a) Build inverted files for all terms.
   b) Using the boolean model, calculate how filtering would be applied to the document collection for each of the queries "a b c" and "b", assuming terms are ORed.
   c) Calculate the (extended boolean model) rank of each document for the queries "a b c" and "b", assuming terms are ORed during filtering. Use a reasonable formula and state your assumptions.

6. What is LSI and why is it useful?

7. Explain what tf.idf is.

8. How does each of the following affect recall versus precision?
   - stemming
   - stopping
   - case folding
   - full text instead of just metadata
   - thesauri

9. Consider the following link graph:

   +-------> a ---------+ | | | | | | \/ \/ d -------> e <------- f /\ |/\ /\ | | | | | | | | | | | \/ <------+ +------- g <---------------- i

   a) Apply PageRank to the graph.
   b) Apply HITS to the graph.

   (In both cases stop after 5 iterations, and start with equal values of 1/6 for each node.)

10. What is an authority and what is a hub?

11. HITS works at runtime but PageRank works at index time. Discuss the implications.
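
For checking hand-worked answers to questions 5(a) and 5(b), the following is a minimal sketch of one possible approach, not a model answer. It assumes whitespace tokenisation and that each posting stores a term frequency alongside the document id (a plain list of ids would also count as an inverted file). The names `DOCS`, `build_inverted_file` and `boolean_or` are made up for illustration.

```python
from collections import defaultdict

# Document collection from question 5, keyed by document id.
DOCS = {
    1: "a b c a b c a b c",
    2: "a a a a a",
    3: "b c b c b c b c",
    4: "a b c",
    5: "c",
}


def build_inverted_file(docs):
    """Map each term to {doc_id: term frequency} (assumed posting layout)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Assumption: terms are separated by whitespace, no stemming or stopping.
        for term in text.split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


def boolean_or(index, query):
    """Return sorted ids of documents containing at least one query term."""
    hits = set()
    for term in query.split():
        # Iterating a posting dict yields its document ids.
        hits.update(index.get(term, {}))
    return sorted(hits)


index = build_inverted_file(DOCS)

for term in sorted(index):
    print(term, index[term])

for query in ("a b c", "b"):
    print(repr(query), "->", boolean_or(index, query))
```

The stored term frequencies are not needed for the boolean filter itself, but they are the kind of input a ranking formula in 5(c) would typically draw on.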