From each site, I first need to extract the title. My initial idea was to parse the HTML and process each sentence (that is, each paragraph) separately. The title is always a sentence that describes the offer, so I also put together a probability table that estimates how likely a given sentence (paragraph) is to match that pattern.
Here are some examples of such sentences:
http://www.groupon.com/deals/g...e-resort-and-spa?c=all&p=1 ---> "One-Night Stay for Four in a Superior Room with Waterpark Passes and Casino Credit. Up to Two Kids Stay Free."
http://www.groupon.com/deals/icons-4?c=all&p=0 -----> "Icons – St. John's Men's or Women's Salon Package, or Manicure with Paraffin Treatment (Up to 51% Off)"
http://www.livingsocial.com/es...918-south-hampton-roads-resort -----> "Channeling Our Forefathers in Virginia"
http://www.kolektiva.rs/beogra...bazenima-tenesi-popust-50.html -------> "Ovog leta uživajte i rashladite se na bazenima "Tenesi"! Popust 50%! Celodnevni boravak na bazenima, uživanje na ležaljkama + obrok!"
Code example:
p = get_page('some url')  # fetch the page source from some URL into the string p
main_sentence(p)  # procedure still to be written: extract the main sentence from the string p
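As a starting point, here is a rough sketch of how `main_sentence` might work, assuming Python and only the standard library. The candidate tags, the weights in `TAG_WEIGHT`, and the discount pattern are guesses made up for illustration, not the real probability table, and the parser assumes reasonably well-formed HTML with closed tags.

```python
from html.parser import HTMLParser
import re

# Tags whose text is treated as a candidate sentence (an assumption).
CANDIDATE_TAGS = {"title", "h1", "h2", "p"}

# Hypothetical weights standing in for the probability table.
TAG_WEIGHT = {"h1": 3.0, "title": 2.0, "h2": 1.5, "p": 0.5}

# Hypothetical signal: offer titles often mention a percentage or a discount word.
DISCOUNT_RE = re.compile(r"\d+\s*%|\boff\b|\bpopust\b", re.IGNORECASE)


class CandidateCollector(HTMLParser):
    """Collects (tag, text) pairs for every candidate tag in the page."""

    def __init__(self):
        super().__init__()
        self.candidates = []
        self._tag = None   # candidate tag currently open, if any
        self._buf = []     # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag in CANDIDATE_TAGS and self._tag is None:
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._buf).split())
            if text:
                self.candidates.append((tag, text))
            self._tag = None


def score(tag, text):
    """Adds up the illustrative weights for one candidate."""
    s = TAG_WEIGHT[tag]
    if DISCOUNT_RE.search(text):
        s += 2.0
    if 4 <= len(text.split()) <= 30:   # titles are rarely very short or very long
        s += 1.0
    return s


def main_sentence(p):
    """Returns the highest-scoring candidate text in p, or None if there is none."""
    collector = CandidateCollector()
    collector.feed(p)
    collector.close()
    if not collector.candidates:
        return None
    return max(collector.candidates, key=lambda c: score(*c))[1]
```

Each extra parameter from the probability table would become one more term in `score`, so the weights can be tuned against pages where the correct title is already known.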
Does anyone have an idea how to pick out, from the full text, the sentence that describes the offer, i.e. the main sentence?
Probability table (which other parameters should I add?):