PART I: SCIENTIFIC ARTICLE
4.1 VALIDITY AND REPRESENTATIVENESS
As future work, we intend to study ways to eliminate or automatically configure the parameters used by the proposed method. In addition, the process for removing outlier SRRs needs to be improved so that elements that do represent SRRs, but have many more or many fewer attributes than the other SRRs, are not discarded by mistake. Regarding attribute extraction, we intend to develop a method for labeling attributes automatically and for detecting and splitting multi-valued attribute values (values that hold several concepts within their content, separated by some element or character).
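The splitting of multi-valued attributes mentioned above can be illustrated with a minimal sketch. It is not the method the authors plan to develop: the candidate separator list, the `min_share` threshold, and the function names `detect_separator` and `split_multivalued` are all assumptions made for illustration. The idea is to pick the separator that appears in the largest share of the values of one attribute and split on it.

```python
from collections import Counter

# Assumed candidate separators; a real method would likely learn these.
CANDIDATE_SEPARATORS = [";", "|", ",", "/"]


def detect_separator(values, min_share=0.5):
    """Return the candidate separator found in the largest share of values,
    or None when no candidate reaches min_share (hypothetical threshold)."""
    if not values:
        return None
    counts = Counter()
    for value in values:
        for sep in CANDIDATE_SEPARATORS:
            if sep in value:
                counts[sep] += 1
    if not counts:
        return None
    sep, hits = counts.most_common(1)[0]
    return sep if hits / len(values) >= min_share else None


def split_multivalued(values):
    """Split every value of one attribute on the detected separator.
    Values are returned as single-item lists when no separator is detected."""
    sep = detect_separator(values)
    if sep is None:
        return [[v.strip()] for v in values]
    return [[part.strip() for part in v.split(sep) if part.strip()] for v in values]


genres = ["rock; pop", "jazz", "blues; soul; funk"]
split_genres = split_multivalued(genres)
```

A frequency threshold keeps single-valued attributes that merely contain punctuation from being split, at the cost of missing attributes that are only occasionally multi-valued.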
To reduce the time needed to extract records and their attributes from certain pages, the method can be improved to generate extraction rules, called a wrapper, that can be used to extract data from several pages of the same website without running the entire algorithm again.
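The wrapper idea can be sketched as follows, under stated assumptions. The `Wrapper` class, the `apply_wrapper` function, the paths, and the sample pages are hypothetical. The sketch assumes the full method has already run once on one page and produced a path for the records and one path per attribute. It also assumes the pages are well-formed XML so that the standard-library `xml.etree.ElementTree` parser can read them; real HTML usually needs a more tolerant parser.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Wrapper:
    """Extraction rules assumed to be produced once by the full method."""
    record_path: str       # path from the page root to each record
    attribute_paths: dict  # attribute label -> path relative to a record


def apply_wrapper(wrapper, page_source):
    """Extract records from one page using only the stored rules."""
    root = ET.fromstring(page_source)
    records = []
    for node in root.findall(wrapper.record_path):
        record = {}
        for label, path in wrapper.attribute_paths.items():
            found = node.find(path)
            # A missing attribute is stored as None instead of failing.
            record[label] = (found.text or "").strip() if found is not None else None
        records.append(record)
    return records


# Hypothetical rules for a site whose result pages share one template.
wrapper = Wrapper(
    record_path=".//li[@class='result']",
    attribute_paths={"title": "a", "price": "span[@class='price']"},
)

PAGE_1 = (
    "<html><body><ul>"
    "<li class='result'><a>Book A</a><span class='price'>10</span></li>"
    "<li class='result'><a>Book B</a><span class='price'>12</span></li>"
    "</ul></body></html>"
)
PAGE_2 = (
    "<html><body><ul>"
    "<li class='result'><a>Book C</a></li>"
    "</ul></body></html>"
)

page_1_records = apply_wrapper(wrapper, PAGE_1)
page_2_records = apply_wrapper(wrapper, PAGE_2)
```

Both pages are processed with the same stored rules, so the cost per page is a single parse and a few path lookups instead of a full run of the extraction algorithm.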