furas

Python - get text from HTML (FB: learnpython.org)

Oct 19th, 2016
#
# https://www.facebook.com/groups/learnpython.org/permalink/1224436087621006/
#
# http://stackoverflow.com/a/925630/1832058
# https://docs.python.org/2/library/codecs.html
#

from html.parser import HTMLParser

data = [
    '<div class="more"> <h3>Key Responsibilities</h3><ul><li>Product development: assist in new product development process, from concept through commercialization, by translating market needs into product requirements and establishing the added value,</li><li>Product launches: prepare marketing deliverables, develop scientific and technical product information and training for sales and customers, monitor product revenues, make sales forecasts,</li><li>Guiding sales: develop product positioning and marketing tools to effectively promote products and train sales teams accordingly. Collaborate with sales team to drive growth for the business segment(s).</li><li>Customer&nbsp; contacts:&nbsp; develop&nbsp; a&nbsp; customer&nbsp; network&nbsp; by&nbsp; performing&nbsp; regular&nbsp; site&nbsp; visits,&nbsp; training initiatives and congress participation.</li><li>Product regulatory and quality: assure internal procedures are followed and products are commercialized according to the company’s quality requirements.&nbsp;</li></ul><p>&nbsp;</p><h3>Profile</h3><ul><li>Master’s degree in life sciences (ideally molecular biology) or equivalent experience,</li><li>Understanding of molecular diagnostic markets (preferably genetics or pathology),</li><li>Customer-oriented and problem-solving,</li><li>Motivated to work in - and being flexible to adapt to an entrepreneurial environment,</li><li>Multilingual (Dutch, English, French),</li><li>Prepared to travel on a regular basis (up to 25%, in Europe).</li></ul><p>&nbsp;</p><h3>We offer</h3><p>Multiplicom offers a competitive compensation and benefits package, and strong leadership commitment to&nbsp;individual learning and personal development in an entrepreneurial environment.</p><p>&nbsp;</p><h3>How to apply?&nbsp;</h3> <div> </div> </div>',

    # HTML entities - the parser converts them to characters
    '&lt; &amp; &copy; &reg; &pound; &gt;',

    # <br/> is dropped without inserting a newline - prints 'HelloWorld'
    'Hello<br/>World',

    # OK - byte 0x92 decoded with cp1250 gives U+2019 (right single quote)
    b'\x92'.decode('cp1250'),

    # OK - wrongly decoded character converted back to bytes, then decoded with cp1250
    '\x92'.encode('raw_unicode_escape').decode('cp1250'),

    # error - U+0092 has no cp1250 mapping, so encode('cp1250') raises UnicodeEncodeError
    '\x92',
]


class MLStripper(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        # Let the base class set up its state; `strict` no longer exists in Python 3.5+
        super().__init__(convert_charrefs=True)
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # flush text the parser may still be buffering (e.g. after a trailing '&')
    return s.get_data()


for row in data:
    text = strip_tags(row)
    print(text)
    try:
        # simulate print() in Windows where cmd.exe uses 'cp1250'
        print(text.encode('cp1250'))
    except UnicodeEncodeError as e:
        print('ERROR:', e)
    print('-----')
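
The 'Hello<br/>World' row above loses its line break, and the last row cannot be encoded to cp1250 at all. A possible variation, offered only as a sketch: the names `NewlineStripper`, `BLOCK_TAGS`, `strip_tags_keep_lines` and `encode_for_console` are hypothetical and not part of the original paste, and the choice of which tags count as line breaks is an assumption.

```python
from html.parser import HTMLParser

# Assumption: tags that should start a new line in the plain-text output.
BLOCK_TAGS = {'br', 'p', 'div', 'li', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}


class NewlineStripper(HTMLParser):
    """Like MLStripper, but emits '\n' when a block-level tag opens."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <br/>, because the
        # default handle_startendtag() delegates to handle_starttag().
        if tag in BLOCK_TAGS:
            self.parts.append('\n')

    def handle_data(self, data):
        self.parts.append(data)

    def get_text(self):
        return ''.join(self.parts)


def strip_tags_keep_lines(html):
    parser = NewlineStripper()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.get_text()


def encode_for_console(text, encoding='cp1250'):
    # errors='replace' substitutes '?' for unencodable characters
    # instead of raising UnicodeEncodeError.
    return text.encode(encoding, errors='replace')


print(repr(strip_tags_keep_lines('Hello<br/>World')))  # 'Hello\nWorld'
print(encode_for_console('\x92'))                      # b'?'
```

Replacing characters with '?' loses information, so this only suits display on a limited console; for storage it is probably better to keep the text as Unicode and write it out as UTF-8.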