[C++11] Extracting specific strings with regular expressions using std::regex

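C++11, the standard usually referred to as Modern C++, added regular expressions to the standard library. Searches that used to require std::string::find combined with complicated conditions and control flow can now be written as a single regular expression.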

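Using regular expressions in C++ right away is still not easy, though. The feature is only a few years old, and compilers have not supported it properly for long, so there is not yet much reference material online covering the situations you may run into. Since this post deals with regular expressions anyway, I will start with a few notes that are specific to C++.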

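When I used regular expressions in C#, I tested them on [Rubular], but std::regex in C++11 follows the ECMAScript grammar by default, so some patterns that worked there did not work in C++. I therefore switched to [regex101], a tester that supports that grammar.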

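Note that on regex101, if you want to find every string that matches the expression, you have to enter "g" in the flags box to the right of the pattern field (the box that shows gmi in gray).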

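Pasting a pattern from a tester straight into C++ code can be annoying, because C++ string escaping has to be applied on top of the regex escaping. With the [raw string literals] supported since C++11, however, the pattern can be used exactly as written. See the code below.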

The following regular expression retrieves all xmlns attributes of a given XML node.

(xmlns:\w*)="([\w:\/\-_#.]*)"

The code that initializes a std::regex object with this expression looks like this.

// requires the <regex> header
std::regex regexp(R"~((xmlns:\w*)="([\w:\/\-_#.]*)")~");

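You have probably seen L"", but R"~()~" may be new. That is how a raw string literal is written; the basic form is R"(content)". A raw string literal ends at the first )" sequence, though, and this pattern itself ends with )". In such a case you can put a delimiter of your choice between the quote and the parenthesis on both sides to mark where the string really ends; I used ~ to avoid the clash.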
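
To make the difference concrete, here is a minimal sketch that writes the same pattern once as an ordinary string literal and once as a raw string literal. The function name xmlns_pattern is made up for this illustration; the assert only documents that the two spellings produce the same characters.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not from the original post): returns the xmlns
// pattern and checks that both spellings describe the same string.
std::string xmlns_pattern() {
    // Ordinary literal: every backslash and double quote must be escaped.
    const std::string escaped = "(xmlns:\\w*)=\"([\\w:\\/\\-_#.]*)\"";

    // Raw literal with the custom delimiter ~ : the pattern is pasted as-is,
    // even though it ends with the )" sequence.
    const std::string raw = R"~((xmlns:\w*)="([\w:\/\-_#.]*)")~";

    assert(escaped == raw);
    return raw;
}
```

Either string can be passed to the std::regex constructor; the raw form is simply easier to compare against what was typed into the tester.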

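I will skip the regular expression syntax itself. For the ECMAScript regular expression grammar, see [this page]; if regular expressions are entirely new to you, you can study them [here].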

Applying the expression to the following XML node string gives the result below. (Copied from regex101.)

<?xml version="1.0" encoding="UTF-8"?>
<SOAP:Envelope xmlns:SOAP="http://schemas.xmlsoap.org/soap/envelope/" xmlns:b2b="http://www.kcfc.co.kr/schema/" xmlns:eb="http://www.oasis-open.org/committees/ebxml-msg/schema/smg-heder-2_0.xsd" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schemas.xmlsoap.org/soap/envelope    http://www.oasis-open.org/committees/ebxml-msg/schema/msg-header-2_0.xsd">
</SOAP:Envelope>

MATCH 1
1.	`xmlns:SOAP`
2.	`http://schemas.xmlsoap.org/soap/envelope/`
MATCH 2
1.	`xmlns:b2b`
2.	`http://www.kcfc.co.kr/schema/`
MATCH 3
1.	`xmlns:eb`
2.	`http://www.oasis-open.org/committees/ebxml-msg/schema/smg-heder-2_0.xsd`
MATCH 4
1.	`xmlns:xlink`
2.	`http://www.w3.org/1999/xlink`
MATCH 5
1.	`xmlns:xsi`
2.	`http://www.w3.org/2001/XMLSchema-instance`

In the listing above, each match contains two extracted members. That is because I wrapped the parts I wanted to extract in parentheses in the pattern: (xmlns:\w*) and ([\w:\/\-_#.]*).

Now that the expression has been tested, it is time to do the extraction in C++. See the code below.

#include <iostream>
#include <regex>
#include <string>

int main()
{
    // XML to search; the raw string literal keeps the quotes unescaped
    std::string str(R"~(<?xml version="1.0" encoding="UTF-8"?>
<SOAP:Envelope xmlns:SOAP="http://schemas.xmlsoap.org/soap/envelope/" xmlns:b2b="http://www.kcfc.co.kr/schema/" xmlns:eb="http://www.oasis-open.org/committees/ebxml-msg/schema/smg-heder-2_0.xsd" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schemas.xmlsoap.org/soap/envelope    http://www.oasis-open.org/committees/ebxml-msg/schema/msg-header-2_0.xsd">
</SOAP:Envelope>)~");
    std::regex regexp(R"~((xmlns:\w*)="([\w:\/\-_#.]*)")~");

    const std::sregex_iterator itEnd;  // default-constructed: end of sequence
    for (std::sregex_iterator it(str.begin(), str.end(), regexp); it != itEnd; ++it)
    {
        // prints the whole match first, then each capture group
        for (const auto& elem : *it) { std::cout << elem << std::endl; }
        std::cout << "==" << std::endl;
        std::cout << (*it)[1].str() << "=" << (*it)[2].str() << std::endl;
        std::cout << "===============================================" << std::endl;
    }
}

First, the string to search is created as a std::string. Next, the expression to search for is created as a std::regex.

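Then a search loop is built with std::sregex_iterator. This iterator is used differently from ordinary STL container iterators, which you get from begin() and end(), so read that part of the code carefully. A default-constructed sregex_iterator is the end-of-sequence iterator, so constructing one as in the code above gives you the single end iterator that every search is compared against.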
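
As a small illustration of the end-iterator convention, the sketch below counts matches by measuring the distance from a search iterator to a default-constructed one. The helper name count_matches is hypothetical and not part of the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: counts how many times `pattern` matches in `text`.
std::size_t count_matches(const std::string& text, const std::regex& pattern) {
    const std::sregex_iterator begin(text.begin(), text.end(), pattern);
    const std::sregex_iterator end;  // default-constructed: end of sequence

    // sregex_iterator is a forward iterator, so std::distance walks the
    // matches one by one until it reaches the end iterator.
    const auto n = std::distance(begin, end);
    assert(n >= 0);
    return static_cast<std::size_t>(n);
}
```

For example, count_matches("a1 b22 c333", std::regex(R"(\d+)")) walks over the three digit runs. Note that the std::string passed in has to outlive the iterators, because they refer into it.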

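The inner for loop is a [range-based for] loop, introduced in C++11; other languages call the same construct a for-each loop. The lines below it dereference the iterator it and use the subscript operator on the resulting match to print members 1 and 2. Member 0 is the string that matched the whole expression, and members 1 and up are the strings captured with the parentheses () in the pattern. The output below makes this easy to see.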

xmlns:SOAP="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:SOAP
http://schemas.xmlsoap.org/soap/envelope/
==
xmlns:SOAP=http://schemas.xmlsoap.org/soap/envelope/
===============================================
xmlns:b2b="http://www.kcfc.co.kr/schema/"
xmlns:b2b
http://www.kcfc.co.kr/schema/
==
xmlns:b2b=http://www.kcfc.co.kr/schema/
===============================================
xmlns:eb="http://www.oasis-open.org/committees/ebxml-msg/schema/smg-heder-2_0.xsd"
xmlns:eb
http://www.oasis-open.org/committees/ebxml-msg/schema/smg-heder-2_0.xsd
==
xmlns:eb=http://www.oasis-open.org/committees/ebxml-msg/schema/smg-heder-2_0.xsd
===============================================
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:xlink
http://www.w3.org/1999/xlink
==
xmlns:xlink=http://www.w3.org/1999/xlink
===============================================
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsi
http://www.w3.org/2001/XMLSchema-instance
==
xmlns:xsi=http://www.w3.org/2001/XMLSchema-instance
===============================================
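
If the captured pairs are needed as data rather than printed, the same loop can fill a container. The following is only a sketch under that assumption; collect_namespaces is a made-up name, and std::map is just one reasonable choice of container.

```cpp
#include <cassert>
#include <map>
#include <regex>
#include <string>

// Hypothetical helper: maps each xmlns attribute name (for example
// "xmlns:b2b") to the URI captured by the second group.
std::map<std::string, std::string> collect_namespaces(const std::string& xml) {
    static const std::regex pattern(R"~((xmlns:\w*)="([\w:\/\-_#.]*)")~");

    std::map<std::string, std::string> result;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(xml.begin(), xml.end(), pattern); it != end; ++it) {
        const std::smatch& m = *it;
        assert(m.size() == 3);  // whole match plus two capture groups
        result[m[1].str()] = m[2].str();
    }
    return result;
}
```

Calling collect_namespaces(str) with the XML string from the post would give a map with five entries, one per MATCH in the regex101 listing.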

With the code above I was able to extract exactly the result I wanted.

The code above retrieves "all" strings that match the expression; if you only need the first match, use [std::regex_search()].
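
A minimal sketch of the first-match case, reusing the pattern from this post. The helper name first_namespace_prefix is an assumption made for the example.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: returns the first xmlns attribute name found in
// `xml`, or an empty string when nothing matches.
std::string first_namespace_prefix(const std::string& xml) {
    static const std::regex pattern(R"~((xmlns:\w*)="([\w:\/\-_#.]*)")~");

    std::smatch m;
    if (!std::regex_search(xml, m, pattern)) {
        return {};
    }
    assert(m.size() == 3);  // whole match plus two capture groups
    return m[1].str();      // copy the capture out before `m` goes away
}
```

std::regex_search stops at the first match, so there is no loop and no end iterator here; the std::smatch it fills is indexed the same way as *it in the loop version.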

If you only want to check whether an entire string matches an expression, use [std::regex_match()]; if you want to replace the matching parts with another string, use [std::regex_replace()]. (Rather than walk through each one in detail, I have simply linked the references.)
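
For orientation only, a rough sketch of both calls follows. The function names, the stricter xmlns:\w+ pattern, and the "..." replacement text are all choices made for this example rather than anything from the linked references.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true only when the WHOLE string looks like an
// xmlns attribute name such as "xmlns:b2b".
bool is_namespace_prefix(const std::string& s) {
    static const std::regex pattern(R"(xmlns:\w+)");
    return std::regex_match(s, pattern);
}

// Hypothetical helper: replaces every xmlns URI with "..." while keeping
// the attribute name. $1 in the format string is the first capture group.
std::string mask_namespace_uris(const std::string& xml) {
    static const std::regex pattern(R"~((xmlns:\w*)="([\w:\/\-_#.]*)")~");
    assert(pattern.mark_count() == 2);  // $1 relies on the capture groups
    return std::regex_replace(xml, pattern, R"($1="...")");
}
```

The key difference from std::regex_search is that std::regex_match succeeds only when the pattern covers the entire input, so a string like xmlns:b2b="x" is rejected by is_namespace_prefix.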

Keep in mind that regular expressions in C++ follow the ECMAScript grammar by default, so syntax that works elsewhere is not guaranteed to work in C++.
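
One way to find out early is to try constructing the std::regex and catch the exception. The sketch below assumes that an unsupported construct is reported as std::regex_error at construction time; lookbehind such as (?<=...) is a typical example of syntax that other engines accept but the grammar used by std::regex does not include. The helper name is made up.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: reports whether std::regex accepts `pattern`
// under its default (ECMAScript) grammar.
bool compiles_as_std_regex(const std::string& pattern) {
    try {
        const std::regex re(pattern);
        (void)re;  // only the construction matters here
        return true;
    } catch (const std::regex_error&) {
        return false;
    }
}
```

With this, a lookahead pattern like a(?=b) should be accepted, while a lookbehind pattern like (?<=a)b is expected to be rejected, which is a quick check to run on anything copied from a tester set to a different flavor.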
